2022-04-27 16:35:45

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 00/16] support non power of 2 zoned devices

- Background and Motivation:

The zone storage implementation in Linux, introduced since v4.10, first
targetted SMR drives which have a power of 2 (po2) zone size alignment
requirement. The po2 zone size was further imposed implicitly by the
block layer's blk_queue_chunk_sectors(), used to prevent IO merging
across chunks beyond the specified size, since v3.16 through commit
762380ad9322 ("block: add notion of a chunk size for request merging").
But this same general block layer po2 requirement for blk_queue_chunk_sectors()
was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
to be non-power-of-2"). NAND, which is the media used in newer zoned storage
devices, does not naturally align to po2, and so the po2 requirement
does not make sense for those type of zone storage devices.

Removing the po2 requirement from zone storage should therefore be possible
now provided that no userspace regression and no performance regressions are
introduced. Stop-gap patches have been already merged into f2fs-tools to
proactively not allow npo2 zone sizes until proper support is added [0].
Additional kernel stop-gap patches are provided in this series for dm-zoned.
Support for npo2 zonefs and btrfs support is addressed in this series.

There was an effort previously [1] to add support to non po2 devices via
device level emulation but that was rejected with a final conclusion
to add support for non po2 zoned device in the complete stack[2].

- Patchset description:
This patchset aims at adding support to non power of 2 zoned devices in
the block layer, nvme layer, null blk and adds support to btrfs and
zonefs.

This round of patches **will not** support f2fs and DM layer for non
power of 2 zoned devices. More about this in the future work section.

Patches 1-4 deals with removing the po2 constraint from the
block layer. Note that the patches have been split for clarity and it can
be squashed together if the community feels that is better.

Patches 5-6 deals with removing the constraint from nvme zns.

Patches 7-11 adds support to btrfs for non po2 zoned devices.

Patch 12 removes the po2 constraint in ZoneFS

Patch 13 removes the po2 contraint in null blk

Patches 14-16 adds conditions to not allow non power of 2 devices in
f2fs and DM.

- Performance:
PO2 zone sizes utilizes log and shifts instead of division when
determing alignment, zone number, etc. The same math cannot be used when
using a zoned device with non po2 zone size. Hence, to avoid any performance
regression on zoned devices with po2 zone sizes, the optimized math in the
hot paths has been retained with branching.

The performance was measured using null blk for regression
and the results have been posted in the appropriate commit log. No
performance regression was noticed.

- Testing
With respect to testing we need to tackle two things: one for regression
on po2 zoned device and progression on non po2 zoned devices.

kdevops (https://github.com/mcgrof/kdevops) was extensively used to
automate the testing for blktests and (x)fstests for btrfs changes. The
known failures were excluded during the test based on the baseline
v5.17.0-rc7

-- regression
Emulated zoned device with zone size =128M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Changes were tested with the following profile in QEMU:
[btrfs_simple_zns]
TEST_DIR=<dir>
SCRATCH_MNT=<mnt>
FSTYP=btrfs
MKFS_OPTIONS="-f -d single -m single"
TEST_DEV=<dev>
SCRATCH_DEV_POOL=<dev-pool>

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests-nullblk.sh and zonefs-tests.sh from zonefs-tools were run
with no failures.

nullblk:
t/zbd/run-tests-against-nullb from fio was run with no failures.

F2FS and DM:
It was verified if f2fs and dm-zoned successfully mounts without any
error.

-- progression
Emulated zoned device with zone size = 96M , nr_zones = 10000

Block and nvme zns:
blktests were run with no new failures

Btrfs:
Same profile as po2 zone size was used.

Many tests in xfstests for btrfs included dm-flakey and some tests
required dm-linear. As they are not supported at the moment for non
po2 devices, those **tests were excluded for non po2 devices**.

No new failures were observed in btrfs, generic and shared test suite

ZoneFS:
zonefs-tests.sh from zonefs-tools were run with no failures.

nullblk:
A new section was added to cover non po2 devices:

section14()
{
conv_pcnt=10
zone_size=3
zone_capacity=3
max_open=${set_max_open}
zbd_test_opts+=("-o ${max_open}")
}
t/zbd/run-tests-against-nullb from fio was run with no failures.

F2FS and DM:
It was verified that f2fs and dm-zoned does not mount.

- Open issue:
* btrfs superblock location for zoned devices is expected to be in 0,
512GB(mirror) and 4TB(mirror) in the device. Zoned devices with po2
zone size will naturally align with these superblock location but non
po2 devices will not align with 512GB and 4TB offset.

The current approach for npo2 devices is to place the superblock mirror
zones near 512GB and 4TB that is **aligned to the zone size**. This
is of no issue for normal operation as we keep track where the superblock
mirror are placed but this can cause an issue with recovery tools for
zoned devices as they expect mirror superblock to be in 512GB and 4TB.

Note that ATM, recovery tools such as `btrfs check` does not work for
image dumps for zoned devices even for po2 zone sizes.

I hope this issue could be discussed as a part of the BoF on ZNS
during the upcoming LSFMM.

- Tools:
Some tools had to be updated to support non po2 devices. Once these
patches are accepted in the kernel, these tool updates will also be
upstreamed.
* btrfs-prog: https://github.com/Panky-codes/btrfs-progs/tree/remove-po2-btrfs
* blkzone: https://github.com/Panky-codes/util-linux/tree/remove-po2

- Future work
To reduce the amount of changes and testing, support for F2FS and DM was
excluded in this round of patches. The plan is to add support to them in
the forthcoming future.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git/commit/?h=dev-test&id=6afcf6493578e77528abe65ab8b12f3e1c16749f
[1] https://lore.kernel.org/all/[email protected]/T/
[2] https://lore.kernel.org/all/20220315135245.eqf4tqngxxb7ymqa@unifi/

Luis Chamberlain (4):
nvmet: use blk_queue_zone_no()
f2fs: call bdev_zone_sectors() only once on init_blkz_info()
f2fs: ensure only power of 2 zone sizes are allowed
dm-zoned: ensure only power of 2 zone sizes are allowed

Pankaj Raghav (12):
block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2
zsze
block: add blk_queue_zone_aligned and bdev_zone_aligned helper
block: add bdev_zone_no helper
block: allow blk-zoned devices to have non-power-of-2 zone size
nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
btrfs: zoned: Cache superblock location in btrfs_zoned_device_info
btrfs: zoned: add generic btrfs helpers for zoned devices
btrfs: zoned: Make sb_zone_number function non power of 2 compatible
btrfs: zoned: use btrfs zone helpers to support non po2 zoned devices
btrfs: zoned: relax the alignment constraint for zoned devices
zonefs: allow non power of 2 zoned devices
null_blk: allow non power of 2 zoned devices

block/blk-core.c | 3 +-
block/blk-zoned.c | 20 ++++--
drivers/block/null_blk/main.c | 4 +-
drivers/block/null_blk/zoned.c | 14 ++--
drivers/md/dm-zone.c | 12 ++++
drivers/nvme/host/zns.c | 11 +---
drivers/nvme/target/zns.c | 2 +-
fs/btrfs/volumes.c | 24 +++----
fs/btrfs/zoned.c | 115 +++++++++++++++++----------------
fs/btrfs/zoned.h | 41 ++++++++++--
fs/f2fs/super.c | 15 +++--
fs/zonefs/super.c | 6 +-
fs/zonefs/zonefs.h | 1 -
include/linux/blkdev.h | 48 +++++++++++++-
14 files changed, 208 insertions(+), 108 deletions(-)

--
2.25.1


2022-04-27 16:37:05

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
block/blk-zoned.c | 8 +++++++-
include/linux/blkdev.h | 8 +++++++-
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8838..1dff4a8bd51d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
unsigned int blkdev_nr_zones(struct gendisk *disk)
{
sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
+ sector_t capacity = get_capacity(disk);

if (!blk_queue_is_zoned(disk->queue))
return 0;
- return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
+
+ if (is_power_of_2(zone_sectors))
+ return (capacity + zone_sectors - 1) >>
+ ilog2(zone_sectors);
+
+ return div64_u64(capacity + zone_sectors - 1, zone_sectors);
}
EXPORT_SYMBOL_GPL(blkdev_nr_zones);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 60d016138997..c4e4c7071b7b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
static inline unsigned int blk_queue_zone_no(struct request_queue *q,
sector_t sector)
{
+ sector_t zone_sectors = blk_queue_zone_sectors(q);
+
if (!blk_queue_is_zoned(q))
return 0;
- return sector >> ilog2(q->limits.chunk_sectors);
+
+ if (is_power_of_2(zone_sectors))
+ return sector >> ilog2(zone_sectors);
+
+ return div64_u64(sector, zone_sectors);
}

static inline bool blk_queue_zone_is_seq(struct request_queue *q,
--
2.25.1

2022-04-27 16:37:18

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/nvme/host/zns.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index 9f81beb4df4e..2087de0768ee 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
}

ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
- if (!is_power_of_2(ns->zsze)) {
- dev_warn(ns->ctrl->device,
- "invalid zone size:%llu for namespace:%u\n",
- ns->zsze, ns->head->ns_id);
- status = -ENODEV;
- goto free_data;
- }

blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
@@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
sizeof(struct nvme_zone_descriptor);

nr_zones = min_t(unsigned int, nr_zones,
- get_capacity(ns->disk) >> ilog2(ns->zsze));
+ div64_u64(get_capacity(ns->disk), ns->zsze));

bufsize = sizeof(struct nvme_zone_report) +
nr_zones * sizeof(struct nvme_zone_descriptor);
@@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;

- sector &= ~(ns->zsze - 1);
+ sector = rounddown(sector, ns->zsze);
while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
memset(report, 0, buflen);

--
2.25.1

2022-04-27 16:37:19

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 08/16] btrfs: zoned: add generic btrfs helpers for zoned devices

Add helpers to calculate alignment, round up and round down
for zoned devices. These helpers encapsulates the necessary handling for
power_of_2 and non-power_of_2 zone sizes. Optimized calculations are
performed for zone sizes that are power_of_2 with log and shifts.

btrfs_zoned_is_aligned() is added instead of reusing bdev_zone_aligned()
helper is due to some use cases in btrfs where zone alignment is checked
before having access to the underlying block device such as in this
function: btrfs_load_block_group_zone_info().

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/btrfs/zoned.h | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 49317524e9a6..b9c435961361 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -9,6 +9,7 @@
#include "disk-io.h"
#include "block-group.h"
#include "btrfs_inode.h"
+#include "misc.h"

/*
* Block groups with more than this value (percents) of unusable space will be
@@ -34,6 +35,33 @@ struct btrfs_zoned_device_info {
u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
};

+static inline bool btrfs_zoned_is_aligned(u64 pos, u64 zone_size)
+{
+ u64 remainder = 0;
+
+ if (is_power_of_two_u64(zone_size))
+ return IS_ALIGNED(pos, zone_size);
+
+ div64_u64_rem(pos, zone_size, &remainder);
+ return remainder == 0;
+}
+
+static inline u64 btrfs_zoned_roundup(u64 pos, u64 zone_size)
+{
+ if (is_power_of_two_u64(zone_size))
+ return ALIGN(pos, zone_size);
+
+ return roundup(pos, zone_size);
+}
+
+static inline u64 btrfs_zoned_rounddown(u64 pos, u64 zone_size)
+{
+ if (is_power_of_two_u64(zone_size))
+ return round_down(pos, zone_size);
+
+ return rounddown(pos, zone_size);
+}
+
#ifdef CONFIG_BLK_DEV_ZONED
int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
struct blk_zone *zone);
--
2.25.1

2022-04-27 16:37:19

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 09/16] btrfs: zoned: Make sb_zone_number function non power of 2 compatible

Make the calculation in sb_zone_number function to be generic and work
for both power-of-2 and non power-of-2 zone sizes.

The function signature has been modified to take block device and mirror
as input as this function is only invoked from callers that have access
to the block device. This enables to use the generic bdev_zone_no
function provided by the block layer to calculate the zone number.

Even though division is used to calculate the zone index for non
power-of-2 zone sizes, this function will not be used in the fast path as
the sb_zone_location cache is used for the superblock zone location.

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/btrfs/zoned.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 6f76942d0ea5..8f574a474420 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -34,9 +34,6 @@
#define BTRFS_SB_LOG_FIRST_OFFSET (512ULL * SZ_1G)
#define BTRFS_SB_LOG_SECOND_OFFSET (4096ULL * SZ_1G)

-#define BTRFS_SB_LOG_FIRST_SHIFT const_ilog2(BTRFS_SB_LOG_FIRST_OFFSET)
-#define BTRFS_SB_LOG_SECOND_SHIFT const_ilog2(BTRFS_SB_LOG_SECOND_OFFSET)
-
/* Number of superblock log zones */
#define BTRFS_NR_SB_LOG_ZONES 2

@@ -153,15 +150,23 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
/*
* Get the first zone number of the superblock mirror
*/
-static inline u32 sb_zone_number(int shift, int mirror)
+static inline u32 sb_zone_number(struct block_device *bdev, int mirror)
{
u64 zone;

ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
switch (mirror) {
- case 0: zone = 0; break;
- case 1: zone = 1ULL << (BTRFS_SB_LOG_FIRST_SHIFT - shift); break;
- case 2: zone = 1ULL << (BTRFS_SB_LOG_SECOND_SHIFT - shift); break;
+ case 0:
+ zone = 0;
+ break;
+ case 1:
+ zone = bdev_zone_no(bdev,
+ BTRFS_SB_LOG_FIRST_OFFSET >> SECTOR_SHIFT);
+ break;
+ case 2:
+ zone = bdev_zone_no(bdev,
+ BTRFS_SB_LOG_SECOND_OFFSET >> SECTOR_SHIFT);
+ break;
}

ASSERT(zone <= U32_MAX);
@@ -515,7 +520,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
/* Cache the sb zone number */
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
zone_info->sb_zone_location[i] =
- sb_zone_number(zone_info->zone_size_shift, i);
+ sb_zone_number(bdev, i);
}
/* Validate superblock log */
nr_zones = BTRFS_NR_SB_LOG_ZONES;
@@ -840,7 +845,7 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
nr_sectors = bdev_nr_sectors(bdev);
nr_zones = nr_sectors >> zone_sectors_shift;

- sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+ sb_zone = sb_zone_number(bdev, mirror);
if (sb_zone + 1 >= nr_zones)
return -ENOENT;

@@ -964,7 +969,7 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
nr_sectors = bdev_nr_sectors(bdev);
nr_zones = nr_sectors >> zone_sectors_shift;

- sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror);
+ sb_zone = sb_zone_number(bdev, mirror);
if (sb_zone + 1 >= nr_zones)
return -ENOENT;

--
2.25.1

2022-04-27 16:37:25

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper

Checking if a given sector is aligned to a zone is a very common
operation that is performed for zoned devices. Add
blk_queue_zone_aligned helper to check for this instead of opencoding it
everywhere.

The helper is made to be generic so that it can also check for alignment
for non non-power-of-2 zone size devices.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is done for devices with
power-of-2 zone size

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
include/linux/blkdev.h | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c4e4c7071b7b..f8f2d2998afb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -676,6 +676,22 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
return div64_u64(sector, zone_sectors);
}

+static inline bool blk_queue_zone_aligned(struct request_queue *q, sector_t sec)
+{
+ sector_t zone_sectors = blk_queue_zone_sectors(q);
+ u64 remainder = 0;
+
+ if (!blk_queue_is_zoned(q))
+ return false;
+
+ if (is_power_of_2(zone_sectors))
+ return IS_ALIGNED(sec, zone_sectors);
+
+ div64_u64_rem(sec, zone_sectors, &remainder);
+ /* if there is a remainder, then the sector is not aligned */
+ return remainder == 0;
+}
+
static inline bool blk_queue_zone_is_seq(struct request_queue *q,
sector_t sector)
{
@@ -722,6 +738,12 @@ static inline unsigned int blk_queue_zone_no(struct request_queue *q,
{
return 0;
}
+
+static inline bool blk_queue_zone_aligned(struct request_queue *q, sector_t sec)
+{
+ return false;
+}
+
static inline unsigned int queue_max_open_zones(const struct request_queue *q)
{
return 0;
@@ -1361,6 +1383,15 @@ static inline sector_t bdev_zone_sectors(struct block_device *bdev)
return 0;
}

+static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec)
+{
+ struct request_queue *q = bdev_get_queue(bdev);
+
+ if (q)
+ return blk_queue_zone_aligned(q, sec);
+ return false;
+}
+
static inline unsigned int bdev_max_open_zones(struct block_device *bdev)
{
struct request_queue *q = bdev_get_queue(bdev);
--
2.25.1

2022-04-27 16:37:36

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size

Convert the calculations on zone size to be generic instead of relying on
power_of_2 based logic in the block layer using the helpers wherever
possible.

The only hot path affected by this change for power_of_2 zoned devices
is in blk_check_zone_append() but the effects should be negligible as the
helper blk_queue_zone_aligned() optimizes the calculation for those
devices. Note that the append path cannot be accessed by direct raw access
to the block device but only through a filesystem abstraction.

Finally, remove the check for power_of_2 zone size requirement in
blk-zoned.c

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
block/blk-core.c | 3 +--
block/blk-zoned.c | 12 ++++++------
2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 937bb6b86331..850caf311064 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
return BLK_STS_NOTSUPP;

/* The bio sector must point to the start of a sequential zone */
- if (pos & (blk_queue_zone_sectors(q) - 1) ||
- !blk_queue_zone_is_seq(q, pos))
+ if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))
return BLK_STS_IOERR;

/*
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 1dff4a8bd51d..f7c7c3bd148d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -288,10 +288,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
return -EINVAL;

/* Check alignment (handle eventual smaller last zone) */
- if (sector & (zone_sectors - 1))
+ if (!blk_queue_zone_aligned(q, sector))
return -EINVAL;

- if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
+ if (!blk_queue_zone_aligned(q, nr_sectors) && end_sector != capacity)
return -EINVAL;

/*
@@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
* smaller last zone.
*/
if (zone->start == 0) {
- if (zone->len == 0 || !is_power_of_2(zone->len)) {
- pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
- disk->disk_name, zone->len);
+ if (zone->len == 0) {
+ pr_warn("%s: Invalid zoned device size",
+ disk->disk_name);
return -ENODEV;
}

args->zone_sectors = zone->len;
- args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
+ args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
} else if (zone->start + args->zone_sectors < capacity) {
if (zone->len != args->zone_sectors) {
pr_warn("%s: Invalid zoned device with non constant zone size\n",
--
2.25.1

2022-04-27 16:37:36

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed

From: Luis Chamberlain <[email protected]>

F2FS zoned support has power of 2 zone size assumption in many places
such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
requirement has been removed from the block layer, explicitly add a
condition in f2fs to allow only power of 2 zone size devices.

This condition will be relaxed once those calculation based on power of
2 is made generic.

Signed-off-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/f2fs/super.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index f64761a15df7..db79abf30002 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -3685,6 +3685,10 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
return 0;

zone_sectors = bdev_zone_sectors(bdev);
+ if (!is_power_of_2(zone_sectors)) {
+ f2fs_err(sbi, "F2FS does not support non power of 2 zone sizes\n");
+ return -EINVAL;
+ }

if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
SECTOR_TO_BLOCK(zone_sectors))
--
2.25.1

2022-04-27 16:37:37

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed

From: Luis Chamberlain <[email protected]>

Today dm-zoned relies on the assumption that you have a zone size
with a power of 2. Even though the block layer today enforces this
requirement, these devices do exist and so provide a stop-gap measure
to ensure these devices cannot be used by mistake

Signed-off-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/md/dm-zone.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 57daa86c19cf..221e0aa0f1a7 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
struct request_queue *q = md->queue;
unsigned int noio_flag;
int ret;
+ struct block_device *bdev = md->disk->part0;
+ sector_t zone_sectors;
+ char bname[BDEVNAME_SIZE];
+
+ zone_sectors = bdev_zone_sectors(bdev);
+
+ if (!is_power_of_2(zone_sectors)) {
+ DMWARN("%s: %s only power of two zone size supported\n",
+ dm_device_name(md),
+ bdevname(bdev, bname));
+ return 1;
+ }

/*
* Check if something changed. If yes, cleanup the current resources
--
2.25.1

2022-04-27 16:37:49

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 10/16] btrfs: zoned: use btrfs zone helpers to support non po2 zoned devices

Use the generic btrfs zone helpers to calculate zone index, check zone
alignment, round up and round down operations.

The zone_size_shift field is not needed anymore as generic helpers are
used for calculation.

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/btrfs/volumes.c | 24 +++++++++-------
fs/btrfs/zoned.c | 72 ++++++++++++++++++++++------------------------
fs/btrfs/zoned.h | 12 ++++----
3 files changed, 54 insertions(+), 54 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8cc736731fd..0c6d1600d8b3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1421,7 +1421,7 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start)
* allocator, because we anyway use/reserve the first two zones
* for superblock logging.
*/
- return ALIGN(start, device->zone_info->zone_size);
+ return btrfs_zoned_roundup(start, device->zone_info->zone_size);
default:
BUG();
}
@@ -1436,7 +1436,7 @@ static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
int ret;
bool changed = false;

- ASSERT(IS_ALIGNED(*hole_start, zone_size));
+ ASSERT(btrfs_zoned_is_aligned(*hole_start, zone_size));

while (*hole_size > 0) {
pos = btrfs_find_allocatable_zones(device, *hole_start,
@@ -1573,7 +1573,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
search_start = dev_extent_search_start(device, search_start);

WARN_ON(device->zone_info &&
- !IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+ !btrfs_zoned_is_aligned(num_bytes, device->zone_info->zone_size));

path = btrfs_alloc_path();
if (!path)
@@ -5131,8 +5131,8 @@ static void init_alloc_chunk_ctl_policy_zoned(

ctl->max_stripe_size = zone_size;
if (type & BTRFS_BLOCK_GROUP_DATA) {
- ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
- zone_size);
+ ctl->max_chunk_size = btrfs_zoned_rounddown(
+ BTRFS_MAX_DATA_CHUNK_SIZE, zone_size);
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
ctl->max_chunk_size = ctl->max_stripe_size;
} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
@@ -5144,9 +5144,10 @@ static void init_alloc_chunk_ctl_policy_zoned(
}

/* We don't want a chunk larger than 10% of writable space */
- limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1),
- zone_size),
- min_chunk_size);
+ limit = max(
+ btrfs_zoned_rounddown(div_factor(fs_devices->total_rw_bytes, 1),
+ zone_size),
+ min_chunk_size);
ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
ctl->dev_extent_min = zone_size * ctl->dev_stripes;
}
@@ -6725,7 +6726,8 @@ static void submit_stripe_bio(struct btrfs_io_context *bioc, struct bio *bio,
*/
if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
if (btrfs_dev_is_sequential(dev, physical)) {
- u64 zone_start = round_down(physical, fs_info->zone_size);
+ u64 zone_start = btrfs_zoned_rounddown(physical,
+ fs_info->zone_size);

bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT;
} else {
@@ -8119,8 +8121,8 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
if (dev->zone_info) {
u64 zone_size = dev->zone_info->zone_size;

- if (!IS_ALIGNED(physical_offset, zone_size) ||
- !IS_ALIGNED(physical_len, zone_size)) {
+ if (!btrfs_zoned_is_aligned(physical_offset, zone_size) ||
+ !btrfs_zoned_is_aligned(physical_len, zone_size)) {
btrfs_err(fs_info,
"zoned: dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
devid, physical_offset, physical_len);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 8f574a474420..8f3f542e174c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -177,13 +177,13 @@ static inline u32 sb_zone_number(struct block_device *bdev, int mirror)
static inline sector_t zone_start_sector(u32 zone_number,
struct block_device *bdev)
{
- return (sector_t)zone_number << ilog2(bdev_zone_sectors(bdev));
+ return zone_number * bdev_zone_sectors(bdev);
}

static inline u64 zone_start_physical(u32 zone_number,
struct btrfs_zoned_device_info *zone_info)
{
- return (u64)zone_number << zone_info->zone_size_shift;
+ return zone_number * zone_info->zone_size;
}

/*
@@ -236,8 +236,8 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
if (zinfo->zone_cache) {
unsigned int i;

- ASSERT(IS_ALIGNED(pos, zinfo->zone_size));
- zno = pos >> zinfo->zone_size_shift;
+ ASSERT(btrfs_zoned_is_aligned(pos, zinfo->zone_size));
+ zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
/*
* We cannot report zones beyond the zone end. So, it is OK to
* cap *nr_zones to at the end.
@@ -410,9 +410,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
}

nr_sectors = bdev_nr_sectors(bdev);
- zone_info->zone_size_shift = ilog2(zone_info->zone_size);
- zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
- if (!IS_ALIGNED(nr_sectors, zone_sectors))
+ zone_info->nr_zones = bdev_zone_no(bdev, nr_sectors);
+ if (!btrfs_zoned_is_aligned(nr_sectors, zone_sectors))
zone_info->nr_zones++;

max_active_zones = queue_max_active_zones(queue);
@@ -824,10 +823,8 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
u64 *bytenr_ret)
{
struct blk_zone zones[BTRFS_NR_SB_LOG_ZONES];
- sector_t zone_sectors;
u32 sb_zone;
int ret;
- u8 zone_sectors_shift;
sector_t nr_sectors;
u32 nr_zones;

@@ -838,12 +835,10 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,

ASSERT(rw == READ || rw == WRITE);

- zone_sectors = bdev_zone_sectors(bdev);
- if (!is_power_of_2(zone_sectors))
+ if (!is_power_of_2(bdev_zone_sectors(bdev)))
return -EINVAL;
- zone_sectors_shift = ilog2(zone_sectors);
nr_sectors = bdev_nr_sectors(bdev);
- nr_zones = nr_sectors >> zone_sectors_shift;
+ nr_zones = bdev_zone_no(bdev, nr_sectors);

sb_zone = sb_zone_number(bdev, mirror);
if (sb_zone + 1 >= nr_zones)
@@ -960,14 +955,12 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
{
sector_t zone_sectors;
sector_t nr_sectors;
- u8 zone_sectors_shift;
u32 sb_zone;
u32 nr_zones;

zone_sectors = bdev_zone_sectors(bdev);
- zone_sectors_shift = ilog2(zone_sectors);
nr_sectors = bdev_nr_sectors(bdev);
- nr_zones = nr_sectors >> zone_sectors_shift;
+ nr_zones = bdev_zone_no(bdev, nr_sectors);

sb_zone = sb_zone_number(bdev, mirror);
if (sb_zone + 1 >= nr_zones)
@@ -993,18 +986,17 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
u64 hole_end, u64 num_bytes)
{
struct btrfs_zoned_device_info *zinfo = device->zone_info;
- const u8 shift = zinfo->zone_size_shift;
- u64 nzones = num_bytes >> shift;
+ u64 nzones = bdev_zone_no(device->bdev, num_bytes >> SECTOR_SHIFT);
u64 pos = hole_start;
u64 begin, end;
bool have_sb;
int i;

- ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size));
- ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+ ASSERT(btrfs_zoned_is_aligned(hole_start, zinfo->zone_size));
+ ASSERT(btrfs_zoned_is_aligned(num_bytes, zinfo->zone_size));

while (pos < hole_end) {
- begin = pos >> shift;
+ begin = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
end = begin + nzones;

if (end > zinfo->nr_zones)
@@ -1036,8 +1028,9 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
if (!(pos + num_bytes <= sb_pos ||
sb_pos + BTRFS_SUPER_INFO_SIZE <= pos)) {
have_sb = true;
- pos = ALIGN(sb_pos + BTRFS_SUPER_INFO_SIZE,
- zinfo->zone_size);
+ pos = btrfs_zoned_roundup(
+ sb_pos + BTRFS_SUPER_INFO_SIZE,
+ zinfo->zone_size);
break;
}
}
@@ -1051,7 +1044,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
static bool btrfs_dev_set_active_zone(struct btrfs_device *device, u64 pos)
{
struct btrfs_zoned_device_info *zone_info = device->zone_info;
- unsigned int zno = (pos >> zone_info->zone_size_shift);
+ unsigned int zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);

/* We can use any number of zones */
if (zone_info->max_active_zones == 0)
@@ -1073,7 +1066,7 @@ static bool btrfs_dev_set_active_zone(struct btrfs_device *device, u64 pos)
static void btrfs_dev_clear_active_zone(struct btrfs_device *device, u64 pos)
{
struct btrfs_zoned_device_info *zone_info = device->zone_info;
- unsigned int zno = (pos >> zone_info->zone_size_shift);
+ unsigned int zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);

/* We can use any number of zones */
if (zone_info->max_active_zones == 0)
@@ -1109,14 +1102,14 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
{
struct btrfs_zoned_device_info *zinfo = device->zone_info;
- const u8 shift = zinfo->zone_size_shift;
- unsigned long begin = start >> shift;
- unsigned long end = (start + size) >> shift;
+ unsigned long begin = bdev_zone_no(device->bdev, start >> SECTOR_SHIFT);
+ unsigned long end =
+ bdev_zone_no(device->bdev, (start + size) >> SECTOR_SHIFT);
u64 pos;
int ret;

- ASSERT(IS_ALIGNED(start, zinfo->zone_size));
- ASSERT(IS_ALIGNED(size, zinfo->zone_size));
+ ASSERT(btrfs_zoned_is_aligned(start, zinfo->zone_size));
+ ASSERT(btrfs_zoned_is_aligned(size, zinfo->zone_size));

if (end > zinfo->nr_zones)
return -ERANGE;
@@ -1140,8 +1133,9 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
/* Free regions should be empty */
btrfs_warn_in_rcu(
device->fs_info,
- "zoned: resetting device %s (devid %llu) zone %llu for allocation",
- rcu_str_deref(device->name), device->devid, pos >> shift);
+ "zoned: resetting device %s (devid %llu) zone %u for allocation",
+ rcu_str_deref(device->name), device->devid,
+ bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT));
WARN_ON_ONCE(1);

ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size,
@@ -1238,7 +1232,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
return 0;

/* Sanity check */
- if (!IS_ALIGNED(length, fs_info->zone_size)) {
+ if (!btrfs_zoned_is_aligned(length, fs_info->zone_size)) {
btrfs_err(fs_info,
"zoned: block group %llu len %llu unaligned to zone size %llu",
logical, length, fs_info->zone_size);
@@ -1326,7 +1320,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
* The group is mapped to a sequential zone. Get the zone write
* pointer to determine the allocation offset within the zone.
*/
- WARN_ON(!IS_ALIGNED(physical[i], fs_info->zone_size));
+ WARN_ON(!btrfs_zoned_is_aligned(physical[i], fs_info->zone_size));
nofs_flag = memalloc_nofs_save();
ret = btrfs_get_dev_zone(device, physical[i], &zone);
memalloc_nofs_restore(nofs_flag);
@@ -1352,10 +1346,12 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new)
switch (zone.cond) {
case BLK_ZONE_COND_OFFLINE:
case BLK_ZONE_COND_READONLY:
- btrfs_err(fs_info,
- "zoned: offline/readonly zone %llu on device %s (devid %llu)",
- physical[i] >> device->zone_info->zone_size_shift,
- rcu_str_deref(device->name), device->devid);
+ btrfs_err(
+ fs_info,
+ "zoned: offline/readonly zone %u on device %s (devid %llu)",
+ bdev_zone_no(device->bdev,
+ physical[i] >> SECTOR_SHIFT),
+ rcu_str_deref(device->name), device->devid);
alloc_offsets[i] = WP_MISSING_DEV;
break;
case BLK_ZONE_COND_EMPTY:
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index b9c435961361..c578bdb6bf2f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -23,7 +23,6 @@ struct btrfs_zoned_device_info {
* zoned block device.
*/
u64 zone_size;
- u8 zone_size_shift;
u32 nr_zones;
unsigned int max_active_zones;
atomic_t active_zones_left;
@@ -274,7 +273,8 @@ static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
if (!zone_info)
return false;

- return test_bit(pos >> zone_info->zone_size_shift, zone_info->seq_zones);
+ return test_bit(bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT),
+ zone_info->seq_zones);
}

static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
@@ -284,7 +284,8 @@ static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
if (!zone_info)
return true;

- return test_bit(pos >> zone_info->zone_size_shift, zone_info->empty_zones);
+ return test_bit(bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT),
+ zone_info->empty_zones);
}

static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
@@ -296,7 +297,7 @@ static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
if (!zone_info)
return;

- zno = pos >> zone_info->zone_size_shift;
+ zno = bdev_zone_no(device->bdev, pos >> SECTOR_SHIFT);
if (set)
set_bit(zno, zone_info->empty_zones);
else
@@ -350,7 +351,8 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
return false;

zone_size = device->zone_info->zone_size;
- if (!IS_ALIGNED(physical, zone_size) || !IS_ALIGNED(length, zone_size))
+ if (!btrfs_zoned_is_aligned(physical, zone_size) ||
+ !btrfs_zoned_is_aligned(length, zone_size))
return false;

return true;
--
2.25.1

2022-04-27 16:37:56

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 14/16] f2fs: call bdev_zone_sectors() only once on init_blkz_info()

From: Luis Chamberlain <[email protected]>

Instead of calling bdev_zone_sectors() multiple times, call
it once and cache the value locally. This will make the
subsequent change easier to read.

Signed-off-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/f2fs/super.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index ea939db18f88..f64761a15df7 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -3678,22 +3678,25 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
struct block_device *bdev = FDEV(devi).bdev;
sector_t nr_sectors = bdev_nr_sectors(bdev);
struct f2fs_report_zones_args rep_zone_arg;
+ u64 zone_sectors;
int ret;

if (!f2fs_sb_has_blkzoned(sbi))
return 0;

+ zone_sectors = bdev_zone_sectors(bdev);
+
if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
- SECTOR_TO_BLOCK(bdev_zone_sectors(bdev)))
+ SECTOR_TO_BLOCK(zone_sectors))
return -EINVAL;
- sbi->blocks_per_blkz = SECTOR_TO_BLOCK(bdev_zone_sectors(bdev));
+ sbi->blocks_per_blkz = SECTOR_TO_BLOCK(zone_sectors);
if (sbi->log_blocks_per_blkz && sbi->log_blocks_per_blkz !=
__ilog2_u32(sbi->blocks_per_blkz))
return -EINVAL;
sbi->log_blocks_per_blkz = __ilog2_u32(sbi->blocks_per_blkz);
FDEV(devi).nr_blkz = SECTOR_TO_BLOCK(nr_sectors) >>
sbi->log_blocks_per_blkz;
- if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
+ if (nr_sectors & (zone_sectors - 1))
FDEV(devi).nr_blkz++;

FDEV(devi).blkz_seq = f2fs_kvzalloc(sbi,
--
2.25.1

2022-04-27 16:38:28

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 06/16] nvmet: use blk_queue_zone_no()

From: Luis Chamberlain <[email protected]>

Instead of open coding the number of zones given a sector, use the helper
blk_queue_zone_no(). This let's us make modifications to the math if
needed in one place and adds now support for npo2 zone devices.

Signed-off-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/nvme/target/zns.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
index e34718b09550..e41b6a6ef048 100644
--- a/drivers/nvme/target/zns.c
+++ b/drivers/nvme/target/zns.c
@@ -243,7 +243,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);

return blkdev_nr_zones(req->ns->bdev->bd_disk) -
- (sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
+ bdev_zone_no(req->ns->bdev, sect);
}

static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
--
2.25.1

2022-04-27 16:38:36

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 11/16] btrfs: zoned: relax the alignment constraint for zoned devices

Checks were in place to return error when a non power-of-2 zoned devices
is detected. Remove those checks as non power-of-2 zoned devices are
now supported.

Relax the zone size constraint to align with the BTRFS_STRIPE_LEN(64k)
so that block groups are aligned to the BTRFS_STRIPE_LEN.

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/btrfs/zoned.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 8f3f542e174c..3ed085762f14 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -395,8 +395,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
zone_sectors = bdev_zone_sectors(bdev);
}

- /* Check if it's power of 2 (see is_power_of_2) */
- ASSERT(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
+ ASSERT(zone_sectors != 0 && IS_ALIGNED(zone_sectors, BTRFS_STRIPE_LEN));
zone_info->zone_size = zone_sectors << SECTOR_SHIFT;

/* We reject devices with a zone size larger than 8GB */
@@ -835,9 +834,11 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,

ASSERT(rw == READ || rw == WRITE);

- if (!is_power_of_2(bdev_zone_sectors(bdev)))
- return -EINVAL;
nr_sectors = bdev_nr_sectors(bdev);
+
+ if (!IS_ALIGNED(nr_sectors, BTRFS_STRIPE_LEN))
+ return -EINVAL;
+
nr_zones = bdev_zone_no(bdev, nr_sectors);

sb_zone = sb_zone_number(bdev, mirror);
--
2.25.1

2022-04-27 16:38:36

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 07/16] btrfs: zoned: Cache superblock location in btrfs_zoned_device_info

Instead of calculating the superblock location every time, cache the
superblock zone location in btrfs_zoned_device_info struct and use it to
locate the zone index.

The functions such as btrfs_sb_log_location_bdev() and
btrfs_reset_sb_log_zones() which work directly on block_device shall
continue to use the sb_zone_number because btrfs_zoned_device_info
struct might not have been initialized at that point.

This patch will enable non power-of-2 zoned devices to not perform
division to lookup superblock and its mirror location.

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
fs/btrfs/zoned.c | 13 +++++++++----
fs/btrfs/zoned.h | 1 +
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1b1b310c3c51..6f76942d0ea5 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -512,6 +512,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
max_active_zones - nactive);
}

+ /* Cache the sb zone number */
+ for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; ++i) {
+ zone_info->sb_zone_location[i] =
+ sb_zone_number(zone_info->zone_size_shift, i);
+ }
/* Validate superblock log */
nr_zones = BTRFS_NR_SB_LOG_ZONES;
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
@@ -519,7 +524,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
u64 sb_wp;
int sb_pos = BTRFS_NR_SB_LOG_ZONES * i;

- sb_zone = sb_zone_number(zone_info->zone_size_shift, i);
+ sb_zone = zone_info->sb_zone_location[i];
if (sb_zone + 1 >= zone_info->nr_zones)
continue;

@@ -867,7 +872,7 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw,
return 0;
}

- zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+ zone_num = zinfo->sb_zone_location[mirror];
if (zone_num + 1 >= zinfo->nr_zones)
return -ENOENT;

@@ -884,7 +889,7 @@ static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo,
if (!zinfo)
return false;

- zone_num = sb_zone_number(zinfo->zone_size_shift, mirror);
+ zone_num = zinfo->sb_zone_location[mirror];
if (zone_num + 1 >= zinfo->nr_zones)
return false;

@@ -1012,7 +1017,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
u32 sb_zone;
u64 sb_pos;

- sb_zone = sb_zone_number(shift, i);
+ sb_zone = zinfo->sb_zone_location[i];
if (!(end <= sb_zone ||
sb_zone + BTRFS_NR_SB_LOG_ZONES <= begin)) {
have_sb = true;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index cbf016a7bb5d..49317524e9a6 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -31,6 +31,7 @@ struct btrfs_zoned_device_info {
unsigned long *active_zones;
struct blk_zone *zone_cache;
struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX];
+ u32 sb_zone_location[BTRFS_SUPER_MIRROR_MAX];
};

#ifdef CONFIG_BLK_DEV_ZONED
--
2.25.1

2022-04-27 16:38:37

by Pankaj Raghav

[permalink] [raw]
Subject: [PATCH 13/16] null_blk: allow non power of 2 zoned devices

Convert the power of 2 based calculation with zone size to be generic in
null_zone_no with optimization for power of 2 based zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for power of 2 based zone sizes as
this function is called only during the initialization and will not
invoked in the hot path.

Performance Measurement:

Device:
zone size = 128M, blocksize=4k

FIO cmd:

fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
--io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
--loops=4

The following results are an average of 4 runs on AMD Ryzen 5 5600X with
32GB of RAM:

Sequential Write:

x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 578 | 2257 | 12.80 | 576 | 2248 | 25.78 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 581 | 2268 | 12.74 | 576 | 2248 | 25.85 |
x-----------------x---------------------------------x---------------------------------x

Sequential read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 667 | 2605 | 11.79 | 675 | 2637 | 23.49 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 667 | 2605 | 11.79 | 675 | 2638 | 23.48 |
x-----------------x---------------------------------x---------------------------------x

Random read:

x-----------------x---------------------------------x---------------------------------x
| IOdepth | 8 | 16 |
x-----------------x---------------------------------x---------------------------------x
| | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
x-----------------x---------------------------------x---------------------------------x
| Without patch | 522 | 2038 | 15.05 | 514 | 2006 | 30.87 |
x-----------------x---------------------------------x---------------------------------x
| With patch | 522 | 2039 | 15.04 | 523 | 2042 | 30.33 |
x-----------------x---------------------------------x---------------------------------x

Minor variations are noticed in Sequential write with io depth 8 and
in random read with io depth 16. But overall no noticeable differences
were noticed

Reviewed-by: Luis Chamberlain <[email protected]>
Signed-off-by: Pankaj Raghav <[email protected]>
---
drivers/block/null_blk/main.c | 4 ++--
drivers/block/null_blk/zoned.c | 14 +++++++-------
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index c441a4972064..82a62b543782 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
dev->mbps = 0;

if (dev->zoned &&
- (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
- pr_err("zone_size must be power-of-two\n");
+ (!dev->zone_size)) {
+ pr_err("zone_size must not be zero\n");
return -EINVAL;
}

diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index dae54dd1aeac..00c34e65ef0a 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -13,7 +13,10 @@ static inline sector_t mb_to_sects(unsigned long mb)

static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
{
- return sect >> ilog2(dev->zone_size_sects);
+ if (is_power_of_2(dev->zone_size_sects))
+ return sect >> ilog2(dev->zone_size_sects);
+
+ return div64_u64(sect, dev->zone_size_sects);
}

static inline void null_lock_zone_res(struct nullb_device *dev)
@@ -62,10 +65,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
sector_t sector = 0;
unsigned int i;

- if (!is_power_of_2(dev->zone_size)) {
- pr_err("zone_size must be power-of-two\n");
- return -EINVAL;
- }
if (dev->zone_size > dev->size) {
pr_err("Zone size larger than device capacity\n");
return -EINVAL;
@@ -83,8 +82,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
zone_capacity_sects = mb_to_sects(dev->zone_capacity);
dev_capacity_sects = mb_to_sects(dev->size);
dev->zone_size_sects = mb_to_sects(dev->zone_size);
- dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
- >> ilog2(dev->zone_size_sects);
+ dev->nr_zones =
+ div64_u64(roundup(dev_capacity_sects, dev->zone_size_sects),
+ dev->zone_size_sects);

dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
GFP_KERNEL | __GFP_ZERO);
--
2.25.1

2022-04-28 01:02:20

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size

On 4/28/22 01:02, Pankaj Raghav wrote:
> Convert the calculations on zone size to be generic instead of relying on
> power_of_2 based logic in the block layer using the helpers wherever
> possible.
>
> The only hot path affected by this change for power_of_2 zoned devices
> is in blk_check_zone_append() but the effects should be negligible as the
> helper blk_queue_zone_aligned() optimizes the calculation for those
> devices. Note that the append path cannot be accessed by direct raw access
> to the block device but only through a filesystem abstraction.
>
> Finally, remove the check for power_of_2 zone size requirement in
> blk-zoned.c
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> block/blk-core.c | 3 +--
> block/blk-zoned.c | 12 ++++++------
> 2 files changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 937bb6b86331..850caf311064 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
> return BLK_STS_NOTSUPP;
>
> /* The bio sector must point to the start of a sequential zone */
> - if (pos & (blk_queue_zone_sectors(q) - 1) ||
> - !blk_queue_zone_is_seq(q, pos))
> + if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))

blk_queue_zone_aligned() is a little confusing since "aligned" is also
used for write-pointer aligned. I would rename this helper

blk_queue_is_zone_start()

or something like that.


> return BLK_STS_IOERR;
>
> /*
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 1dff4a8bd51d..f7c7c3bd148d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -288,10 +288,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
> return -EINVAL;
>
> /* Check alignment (handle eventual smaller last zone) */
> - if (sector & (zone_sectors - 1))
> + if (!blk_queue_zone_aligned(q, sector))
> return -EINVAL;
>
> - if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
> + if (!blk_queue_zone_aligned(q, nr_sectors) && end_sector != capacity)
> return -EINVAL;
>
> /*
> @@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
> * smaller last zone.
> */
> if (zone->start == 0) {
> - if (zone->len == 0 || !is_power_of_2(zone->len)) {
> - pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> - disk->disk_name, zone->len);
> + if (zone->len == 0) {
> + pr_warn("%s: Invalid zoned device size",
> + disk->disk_name);

The message is weird now. Please change it to "Invalid zone size".

Also, the entire premise of this patch series is that it is hard for
people to support the unusable sectors between zone capacity and zone end
for drives with a zone capacity smaller than the zone size.

Yet, here you do not check that zone capacity == zone size for drives that
do not have a zone size equal to a power of 2 number of sectors. This
means that we can still have drives with ZC < ZS AND ZS not equal to a
power of 2. So from the point of view of your arguments, no gains at all.
Any thoughts on this ?

> return -ENODEV;
> }
>
> args->zone_sectors = zone->len;
> - args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
> + args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
> } else if (zone->start + args->zone_sectors < capacity) {
> if (zone->len != args->zone_sectors) {
> pr_warn("%s: Invalid zoned device with non constant zone size\n",


--
Damien Le Moal
Western Digital Research

2022-04-28 02:10:50

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper

On 4/27/22 09:02, Pankaj Raghav wrote:
> +static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec)
> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> +
> + if (q)
> + return blk_queue_zone_aligned(q, sec);
> + return false;
> +}

Which patch uses this function? I can't find any patch in this series
that introduces a call to this function.

Thanks,

Bart.


2022-04-28 08:12:22

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed

On 4/28/22 01:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> Today dm-zoned relies on the assumption that you have a zone size
> with a power of 2. Even though the block layer today enforces this
> requirement, these devices do exist and so provide a stop-gap measure
> to ensure these devices cannot be used by mistake
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/md/dm-zone.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 57daa86c19cf..221e0aa0f1a7 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> struct request_queue *q = md->queue;
> unsigned int noio_flag;
> int ret;
> + struct block_device *bdev = md->disk->part0;
> + sector_t zone_sectors;
> + char bname[BDEVNAME_SIZE];
> +
> + zone_sectors = bdev_zone_sectors(bdev);
> +
> + if (!is_power_of_2(zone_sectors)) {
> + DMWARN("%s: %s only power of two zone size supported\n",
> + dm_device_name(md),
> + bdevname(bdev, bname));
> + return 1;
> + }

Why ?

See my previous email about still allowing ZC < ZS for non power of 2 zone
size drives. dm-zoned can easily support non power of 2 zone size as long
as ZC == ZS for all zones.

The problem with dm-zoned is ZC < ZS *AND* potentially variable ZC per
zone. That cannot be supported easily (still not impossible, but
definitely a lot more complex).

>
> /*
> * Check if something changed. If yes, cleanup the current resources


--
Damien Le Moal
Western Digital Research

2022-04-29 00:57:18

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size

On Thu, Apr 28, 2022 at 08:37:27AM +0900, Damien Le Moal wrote:
> Also, the entire premise of this patch series is that it is hard for
> people to support the unusable sectors between zone capacity and zone end
> for drives with a zone capacity smaller than the zone size.
>
> Yet, here you do not check that zone capacity == zone size for drives that
> do not have a zone size equal to a power of 2 number of sectors. This
> means that we can still have drives with ZC < ZS AND ZS not equal to a
> power of 2. So from the point of view of your arguments, no gains at all.
> Any thoughts on this ?

You are right, a check should be added on bringup so that if npo2 is
used, zone cap == zone size. That should be added on the next iteration
of this patch.

Luis

2022-04-29 07:56:02

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size

On 2022-04-28 01:37, Damien Le Moal wrote:
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 937bb6b86331..850caf311064 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>> return BLK_STS_NOTSUPP;
>>
>> /* The bio sector must point to the start of a sequential zone */
>> - if (pos & (blk_queue_zone_sectors(q) - 1) ||
>> - !blk_queue_zone_is_seq(q, pos))
>> + if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))
>
> blk_queue_zone_aligned() is a little confusing since "aligned" is also
> used for write-pointer aligned. I would rename this helper
>
> blk_queue_is_zone_start()
>
> or something like that.
>
That is a good idea and definitely a better name that
blk_queue_zone_aligned. I will fix it.

>> /*
>> @@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>> * smaller last zone.
>> */
>> if (zone->start == 0) {
>> - if (zone->len == 0 || !is_power_of_2(zone->len)) {
>> - pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
>> - disk->disk_name, zone->len);
>> + if (zone->len == 0) {
>> + pr_warn("%s: Invalid zoned device size",
>> + disk->disk_name);
>
> The message is weird now. Please change it to "Invalid zone size".
>
Ok.
> Also, the entire premise of this patch series is that it is hard for
> people to support the unusable sectors between zone capacity and zone end
> for drives with a zone capacity smaller than the zone size.
>
> Yet, here you do not check that zone capacity == zone size for drives that
> do not have a zone size equal to a power of 2 number of sectors. This
> means that we can still have drives with ZC < ZS AND ZS not equal to a
> power of 2. So from the point of view of your arguments, no gains at all.
> Any thoughts on this ?
>
That is a good point. Instead of implicitly assuming npo2 drives to have
ZC == ZS, it is better to be explicit during bringup. Thanks. As Luis
mentioned, I will add this condition in the next revision.

2022-04-29 12:19:13

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper

On 2022-04-28 01:52, Bart Van Assche wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> +static inline bool bdev_zone_aligned(struct block_device *bdev,
>> sector_t sec)
>> +{
>> +    struct request_queue *q = bdev_get_queue(bdev);
>> +
>> +    if (q)
>> +        return blk_queue_zone_aligned(q, sec);
>> +    return false;
>> +}
>
> Which patch uses this function? I can't find any patch in this series
> that introduces a call to this function.
>
Initially I used it but at the end I had to remove that patch but I
forgot to remove this function. Thanks for pointing it out. I will fix
it up in the next rev.
> Thanks,
>
> Bart.
>
>

2022-04-29 19:19:51

by Adam Manzanares

[permalink] [raw]
Subject: Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

On Wed, Apr 27, 2022 at 06:02:40PM +0200, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
>
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
>
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> block/blk-zoned.c | 8 +++++++-
> include/linux/blkdev.h | 8 +++++++-
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 38cd840d8838..1dff4a8bd51d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
> unsigned int blkdev_nr_zones(struct gendisk *disk)
> {
> sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
> + sector_t capacity = get_capacity(disk);
>
> if (!blk_queue_is_zoned(disk->queue))
> return 0;
> - return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
> +
> + if (is_power_of_2(zone_sectors))
> + return (capacity + zone_sectors - 1) >>
> + ilog2(zone_sectors);
> +
> + return div64_u64(capacity + zone_sectors - 1, zone_sectors);
> }
> EXPORT_SYMBOL_GPL(blkdev_nr_zones);
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 60d016138997..c4e4c7071b7b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
> static inline unsigned int blk_queue_zone_no(struct request_queue *q,
> sector_t sector)
> {
> + sector_t zone_sectors = blk_queue_zone_sectors(q);
> +
> if (!blk_queue_is_zoned(q))
> return 0;
> - return sector >> ilog2(q->limits.chunk_sectors);
> +
> + if (is_power_of_2(zone_sectors))
> + return sector >> ilog2(zone_sectors);
> +
> + return div64_u64(sector, zone_sectors);
> }
>
> static inline bool blk_queue_zone_is_seq(struct request_queue *q,
> --
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <[email protected]>

2022-04-29 21:30:23

by Adam Manzanares

[permalink] [raw]
Subject: Re: [PATCH 06/16] nvmet: use blk_queue_zone_no()

On Wed, Apr 27, 2022 at 06:02:45PM +0200, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> Instead of open coding the number of zones given a sector, use the helper
> blk_queue_zone_no(). This let's us make modifications to the math if
> needed in one place and adds now support for npo2 zone devices.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/nvme/target/zns.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/target/zns.c b/drivers/nvme/target/zns.c
> index e34718b09550..e41b6a6ef048 100644
> --- a/drivers/nvme/target/zns.c
> +++ b/drivers/nvme/target/zns.c
> @@ -243,7 +243,7 @@ static unsigned long nvmet_req_nr_zones_from_slba(struct nvmet_req *req)
> unsigned int sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
>
> return blkdev_nr_zones(req->ns->bdev->bd_disk) -
> - (sect >> ilog2(bdev_zone_sectors(req->ns->bdev)));
> + bdev_zone_no(req->ns->bdev, sect);
> }
>
> static unsigned long get_nr_zones_from_buf(struct nvmet_req *req, u32 bufsize)
> --
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <[email protected]>

2022-04-30 13:20:42

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed

On Thu, Apr 28, 2022 at 08:42:41AM +0900, Damien Le Moal wrote:
> On 4/28/22 01:02, Pankaj Raghav wrote:
> > From: Luis Chamberlain <[email protected]>
> >
> > Today dm-zoned relies on the assumption that you have a zone size
> > with a power of 2. Even though the block layer today enforces this
> > requirement, these devices do exist and so provide a stop-gap measure
> > to ensure these devices cannot be used by mistake
> >
> > Signed-off-by: Luis Chamberlain <[email protected]>
> > Signed-off-by: Pankaj Raghav <[email protected]>
> > ---
> > drivers/md/dm-zone.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > index 57daa86c19cf..221e0aa0f1a7 100644
> > --- a/drivers/md/dm-zone.c
> > +++ b/drivers/md/dm-zone.c
> > @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> > struct request_queue *q = md->queue;
> > unsigned int noio_flag;
> > int ret;
> > + struct block_device *bdev = md->disk->part0;
> > + sector_t zone_sectors;
> > + char bname[BDEVNAME_SIZE];
> > +
> > + zone_sectors = bdev_zone_sectors(bdev);
> > +
> > + if (!is_power_of_2(zone_sectors)) {
> > + DMWARN("%s: %s only power of two zone size supported\n",
> > + dm_device_name(md),
> > + bdevname(bdev, bname));
> > + return 1;
> > + }
>
> Why ?
>
> See my previous email about still allowing ZC < ZS for non power of 2 zone
> size drives. dm-zoned can easily support non power of 2 zone size as long
> as ZC == ZS for all zones.

Great, thanks for the heads up.

> The problem with dm-zoned is ZC < ZS *AND* potentially variable ZC per
> zone. That cannot be supported easily (still not impossible, but
> definitely a lot more complex).

I see thanks.

Testing would still be required to ensure this all works well with npo2.
So I'd prefer to do that as a separate effort, even if it is easy. So
for now I think it makes sense to avoid this as this is not yet well
tested.

As with filesystem support, we've even have gotten hints that support
for npo2 should be easy, but without proper testing it would not be
prudent to enable support for users yet.

One step at a time.

Luis

2022-05-02 09:49:03

by Adam Manzanares

[permalink] [raw]
Subject: Re: [PATCH 13/16] null_blk: allow non power of 2 zoned devices

On Wed, Apr 27, 2022 at 06:02:52PM +0200, Pankaj Raghav wrote:
> Convert the power of 2 based calculation with zone size to be generic in
> null_zone_no with optimization for power of 2 based zone sizes.
>
> The nr_zones calculation in null_init_zoned_dev has been replaced with a
> division without special handling for power of 2 based zone sizes as
> this function is called only during the initialization and will not
> invoked in the hot path.
>
> Performance Measurement:
>
> Device:
> zone size = 128M, blocksize=4k
>
> FIO cmd:
>
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
>
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
>
> Sequential Write:
>
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth | 8 | 16 |
> x-----------------x---------------------------------x---------------------------------x
> | | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch | 578 | 2257 | 12.80 | 576 | 2248 | 25.78 |
> x-----------------x---------------------------------x---------------------------------x
> | With patch | 581 | 2268 | 12.74 | 576 | 2248 | 25.85 |
> x-----------------x---------------------------------x---------------------------------x
>
> Sequential read:
>
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth | 8 | 16 |
> x-----------------x---------------------------------x---------------------------------x
> | | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch | 667 | 2605 | 11.79 | 675 | 2637 | 23.49 |
> x-----------------x---------------------------------x---------------------------------x
> | With patch | 667 | 2605 | 11.79 | 675 | 2638 | 23.48 |
> x-----------------x---------------------------------x---------------------------------x
>
> Random read:
>
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth | 8 | 16 |
> x-----------------x---------------------------------x---------------------------------x
> | | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch | 522 | 2038 | 15.05 | 514 | 2006 | 30.87 |
> x-----------------x---------------------------------x---------------------------------x
> | With patch | 522 | 2039 | 15.04 | 523 | 2042 | 30.33 |
> x-----------------x---------------------------------x---------------------------------x
>
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/block/null_blk/main.c | 4 ++--
> drivers/block/null_blk/zoned.c | 14 +++++++-------
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index c441a4972064..82a62b543782 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
> dev->mbps = 0;
>
> if (dev->zoned &&
> - (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> - pr_err("zone_size must be power-of-two\n");
> + (!dev->zone_size)) {
> + pr_err("zone_size must not be zero\n");
> return -EINVAL;
> }
>
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index dae54dd1aeac..00c34e65ef0a 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -13,7 +13,10 @@ static inline sector_t mb_to_sects(unsigned long mb)
>
> static inline unsigned int null_zone_no(struct nullb_device *dev, sector_t sect)
> {
> - return sect >> ilog2(dev->zone_size_sects);
> + if (is_power_of_2(dev->zone_size_sects))
> + return sect >> ilog2(dev->zone_size_sects);
> +
> + return div64_u64(sect, dev->zone_size_sects);
> }
>
> static inline void null_lock_zone_res(struct nullb_device *dev)
> @@ -62,10 +65,6 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
> sector_t sector = 0;
> unsigned int i;
>
> - if (!is_power_of_2(dev->zone_size)) {
> - pr_err("zone_size must be power-of-two\n");
> - return -EINVAL;
> - }
> if (dev->zone_size > dev->size) {
> pr_err("Zone size larger than device capacity\n");
> return -EINVAL;
> @@ -83,8 +82,9 @@ int null_init_zoned_dev(struct nullb_device *dev, struct request_queue *q)
> zone_capacity_sects = mb_to_sects(dev->zone_capacity);
> dev_capacity_sects = mb_to_sects(dev->size);
> dev->zone_size_sects = mb_to_sects(dev->zone_size);
> - dev->nr_zones = round_up(dev_capacity_sects, dev->zone_size_sects)
> - >> ilog2(dev->zone_size_sects);
> + dev->nr_zones =
> + div64_u64(roundup(dev_capacity_sects, dev->zone_size_sects),
> + dev->zone_size_sects);
>
> dev->zones = kvmalloc_array(dev->nr_zones, sizeof(struct nullb_zone),
> GFP_KERNEL | __GFP_ZERO);
> --
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <[email protected]>

2022-05-02 10:42:22

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed

On 4/29/22 02:34, Luis Chamberlain wrote:
> On Thu, Apr 28, 2022 at 08:42:41AM +0900, Damien Le Moal wrote:
>> On 4/28/22 01:02, Pankaj Raghav wrote:
>>> From: Luis Chamberlain <[email protected]>
>>>
>>> Today dm-zoned relies on the assumption that you have a zone size
>>> with a power of 2. Even though the block layer today enforces this
>>> requirement, these devices do exist and so provide a stop-gap measure
>>> to ensure these devices cannot be used by mistake
>>>
>>> Signed-off-by: Luis Chamberlain <[email protected]>
>>> Signed-off-by: Pankaj Raghav <[email protected]>
>>> ---
>>> drivers/md/dm-zone.c | 12 ++++++++++++
>>> 1 file changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
>>> index 57daa86c19cf..221e0aa0f1a7 100644
>>> --- a/drivers/md/dm-zone.c
>>> +++ b/drivers/md/dm-zone.c
>>> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
>>> struct request_queue *q = md->queue;
>>> unsigned int noio_flag;
>>> int ret;
>>> + struct block_device *bdev = md->disk->part0;
>>> + sector_t zone_sectors;
>>> + char bname[BDEVNAME_SIZE];
>>> +
>>> + zone_sectors = bdev_zone_sectors(bdev);
>>> +
>>> + if (!is_power_of_2(zone_sectors)) {
>>> + DMWARN("%s: %s only power of two zone size supported\n",
>>> + dm_device_name(md),
>>> + bdevname(bdev, bname));
>>> + return 1;
>>> + }
>>
>> Why ?
>>
>> See my previous email about still allowing ZC < ZS for non power of 2 zone
>> size drives. dm-zoned can easily support non power of 2 zone size as long
>> as ZC == ZS for all zones.
>
> Great, thanks for the heads up.
>
>> The problem with dm-zoned is ZC < ZS *AND* potentially variable ZC per
>> zone. That cannot be supported easily (still not impossible, but
>> definitely a lot more complex).
>
> I see thanks.
>
> Testing would still be required to ensure this all works well with npo2.
> So I'd prefer to do that as a separate effort, even if it is easy. So
> for now I think it makes sense to avoid this as this is not yet well
> tested.
>
> As with filesystem support, we've even have gotten hints that support
> for npo2 should be easy, but without proper testing it would not be
> prudent to enable support for users yet.
>
> One step at a time.

Yes, in general, I agree. But in this case, that will create kernel
versions that end up having partial support for zoned drives. Not ideal to
say the least. So if the patches are not that big, I would rather like to
see everything go into a single release.

--
Damien Le Moal
Western Digital Research

2022-05-02 13:40:16

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed

On Fri, Apr 29, 2022 at 06:43:58AM +0900, Damien Le Moal wrote:
> On 4/29/22 02:34, Luis Chamberlain wrote:
> > One step at a time.
>
> Yes, in general, I agree. But in this case, that will create kernel
> versions that end up having partial support for zoned drives. Not ideal to
> say the least. So if the patches are not that big, I would rather like to
> see everything go into a single release.

This would have delayed the patches more, and I see no rush for npo2 to
use dm-zoned really. Just as with f2fs, we can take priorities at a
time.

Luis

2022-05-03 00:39:50

by Adam Manzanares

[permalink] [raw]
Subject: Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

On Wed, Apr 27, 2022 at 06:02:44PM +0200, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
>
> The power_of_2 calculation has been replaced directly with generic
> calculation without special handling. Both modified functions are not
> used in hot paths, they are only used during initialization &
> revalidation of the ZNS device.
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/nvme/host/zns.c | 11 ++---------
> 1 file changed, 2 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..2087de0768ee 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
> }
>
> ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> - if (!is_power_of_2(ns->zsze)) {
> - dev_warn(ns->ctrl->device,
> - "invalid zone size:%llu for namespace:%u\n",
> - ns->zsze, ns->head->ns_id);
> - status = -ENODEV;
> - goto free_data;
> - }
>
> blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
> blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
> sizeof(struct nvme_zone_descriptor);
>
> nr_zones = min_t(unsigned int, nr_zones,
> - get_capacity(ns->disk) >> ilog2(ns->zsze));
> + div64_u64(get_capacity(ns->disk), ns->zsze));
>
> bufsize = sizeof(struct nvme_zone_report) +
> nr_zones * sizeof(struct nvme_zone_descriptor);
> @@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
> c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
> c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>
> - sector &= ~(ns->zsze - 1);
> + sector = rounddown(sector, ns->zsze);
> while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
> memset(report, 0, buflen);
>
> --
> 2.25.1
>


Looks good.

Reviewed by: Adam Manzanares <[email protected]>

2022-05-03 01:24:26

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: [PATCH 00/16] support non power of 2 zoned devices

On 27/04/2022 09:03, Pankaj Raghav wrote:
> - Background and Motivation:
>
> The zone storage implementation in Linux, introduced since v4.10, first
> targetted SMR drives which have a power of 2 (po2) zone size alignment
> requirement. The po2 zone size was further imposed implicitly by the
> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
> across chunks beyond the specified size, since v3.16 through commit
> 762380ad9322 ("block: add notion of a chunk size for request merging").
> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
> to be non-power-of-2"). NAND, which is the media used in newer zoned storage
> devices, does not naturally align to po2, and so the po2 requirement
> does not make sense for those type of zone storage devices.
>
> Removing the po2 requirement from zone storage should therefore be possible
> now provided that no userspace regression and no performance regressions are
> introduced. Stop-gap patches have been already merged into f2fs-tools to
> proactively not allow npo2 zone sizes until proper support is added [0].
> Additional kernel stop-gap patches are provided in this series for dm-zoned.
> Support for npo2 zonefs and btrfs support is addressed in this series.
>
> There was an effort previously [1] to add support to non po2 devices via
> device level emulation but that was rejected with a final conclusion
> to add support for non po2 zoned device in the complete stack[2].

Hey Pankaj,

One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise
not fs_info->sectorsize aligned zones) we have to check on every allocation if we still
have at least have fs_info->sectorsize bytes left in a zone. If not we need to
explicitly finish the zone, otherwise we'll run out of max active zones.

This is a problem for zoned btrfs at the moment already but it'll be even worse
with npo2, because we're never implicitly finishing zones.

See also
https://lore.kernel.org/linux-btrfs/42758829d8696a471a27f7aaeab5468f60b1565d.1651157034.git.naohiro.aota@wdc.com

Thanks,
Johannes

2022-05-03 11:16:36

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 00/16] support non power of 2 zoned devices

Hi Johannes,
On 2022-05-03 00:07, Johannes Thumshirn wrote:
>> There was an effort previously [1] to add support to non po2 devices via
>> device level emulation but that was rejected with a final conclusion
>> to add support for non po2 zoned device in the complete stack[2].
>
> Hey Pankaj,
>
> One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise
> not fs_info->sectorsize aligned zones) we have to check on every allocation if we still
> have at least have fs_info->sectorsize bytes left in a zone. If not we need to
> explicitly finish the zone, otherwise we'll run out of max active zones.
>
This commit: `btrfs: zoned: relax the alignment constraint for zoned
devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
which is typically 4k.

This was one of the comment that came from David Sterba:
https://lore.kernel.org/all/[email protected]/
where he suggested to have some sane alignment for the zone sizes.

> This is a problem for zoned btrfs at the moment already but it'll be even worse
> with npo2, because we're never implicitly finishing zones.
>
> See also
> https://lore.kernel.org/linux-btrfs/42758829d8696a471a27f7aaeab5468f60b1565d.1651157034.git.naohiro.aota@wdc.com
>
I did take a look at this few days back and the patch should work fine
also for npo2 zoned device as we allow only zone sizes that are
BTRFS_STRIPE_LEN aligned. So even the max nodesize for METADATA BGs is
only 64k and it should be aligned correctly to implicitly finish the zone.

Let me know your thoughts and if I am missing something.

Regards,
Pankaj

2022-05-03 20:34:42

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

On 4/27/22 09:02, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
>
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
>
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> block/blk-zoned.c | 8 +++++++-
> include/linux/blkdev.h | 8 +++++++-
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 38cd840d8838..1dff4a8bd51d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
> unsigned int blkdev_nr_zones(struct gendisk *disk)
> {
> sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
> + sector_t capacity = get_capacity(disk);
>
> if (!blk_queue_is_zoned(disk->queue))
> return 0;
> - return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
> +
> + if (is_power_of_2(zone_sectors))
> + return (capacity + zone_sectors - 1) >>
> + ilog2(zone_sectors);
> +
> + return div64_u64(capacity + zone_sectors - 1, zone_sectors);
> }
> EXPORT_SYMBOL_GPL(blkdev_nr_zones);

Does anyone need support for more than 4 billion sectors per zone? If
not, do_div() should be sufficient.

> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 60d016138997..c4e4c7071b7b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
> static inline unsigned int blk_queue_zone_no(struct request_queue *q,
> sector_t sector)
> {
> + sector_t zone_sectors = blk_queue_zone_sectors(q);
> +
> if (!blk_queue_is_zoned(q))
> return 0;
> - return sector >> ilog2(q->limits.chunk_sectors);
> +
> + if (is_power_of_2(zone_sectors))
> + return sector >> ilog2(zone_sectors);
> +
> + return div64_u64(sector, zone_sectors);
> }

Same comment here.

Thanks,

Bart.

2022-05-03 20:56:21

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

On 2022/05/04 1:37, Bart Van Assche wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
>> also work for non-power-of-2 zone sizes.
>>
>> As the existing deployments of zoned devices had power-of-2
>> assumption, power-of-2 optimized calculation is kept for those devices.
>>
>> There are no direct hot paths modified and the changes just
>> introduce one new branch per call.
>>
>> Reviewed-by: Luis Chamberlain <[email protected]>
>> Signed-off-by: Pankaj Raghav <[email protected]>
>> ---
>> block/blk-zoned.c | 8 +++++++-
>> include/linux/blkdev.h | 8 +++++++-
>> 2 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>> index 38cd840d8838..1dff4a8bd51d 100644
>> --- a/block/blk-zoned.c
>> +++ b/block/blk-zoned.c
>> @@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
>> unsigned int blkdev_nr_zones(struct gendisk *disk)
>> {
>> sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
>> + sector_t capacity = get_capacity(disk);
>>
>> if (!blk_queue_is_zoned(disk->queue))
>> return 0;
>> - return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
>> +
>> + if (is_power_of_2(zone_sectors))
>> + return (capacity + zone_sectors - 1) >>
>> + ilog2(zone_sectors);
>> +
>> + return div64_u64(capacity + zone_sectors - 1, zone_sectors);
>> }
>> EXPORT_SYMBOL_GPL(blkdev_nr_zones);
>
> Does anyone need support for more than 4 billion sectors per zone? If
> not, do_div() should be sufficient.
>
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 60d016138997..c4e4c7071b7b 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
>> static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>> sector_t sector)
>> {
>> + sector_t zone_sectors = blk_queue_zone_sectors(q);
>> +
>> if (!blk_queue_is_zoned(q))
>> return 0;
>> - return sector >> ilog2(q->limits.chunk_sectors);
>> +
>> + if (is_power_of_2(zone_sectors))
>> + return sector >> ilog2(zone_sectors);
>> +
>> + return div64_u64(sector, zone_sectors);
>> }
>
> Same comment here.

sector_t is 64-bits even on 32-bits arch, no ?
so div64_u64 is needed here I think, which will be a simple regular division for
64-bit arch.

>
> Thanks,
>
> Bart.


--
Damien Le Moal
Western Digital Research

2022-05-03 21:19:00

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

On 4/27/22 09:02, Pankaj Raghav wrote:
> - sector &= ~(ns->zsze - 1);
> + sector = rounddown(sector, ns->zsze);

The above change breaks 32-bit builds since ns->zsze is 64 bits wide and
since rounddown() uses the C division operator instead of div64_u64().

Thanks,

Bart.

2022-05-03 21:49:25

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 13/16] null_blk: allow non power of 2 zoned devices

On 4/27/22 09:02, Pankaj Raghav wrote:
> diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
> index c441a4972064..82a62b543782 100644
> --- a/drivers/block/null_blk/main.c
> +++ b/drivers/block/null_blk/main.c
> @@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
> dev->mbps = 0;
>
> if (dev->zoned &&
> - (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
> - pr_err("zone_size must be power-of-two\n");
> + (!dev->zone_size)) {
> + pr_err("zone_size must not be zero\n");
> return -EINVAL;
> }

Please combine "if (dev->zoned &&" and "(!dev->zone_size)) {" into a
single line and leave out the parentheses that became superfluous.

Thanks,

Bart.

2022-05-04 01:53:36

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed

Applied to f2fs tree. Thanks,

On 04/27, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> F2FS zoned support has power of 2 zone size assumption in many places
> such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
> requirement has been removed from the block layer, explicitly add a
> condition in f2fs to allow only power of 2 zone size devices.
>
> This condition will be relaxed once those calculation based on power of
> 2 is made generic.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> fs/f2fs/super.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index f64761a15df7..db79abf30002 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -3685,6 +3685,10 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
> return 0;
>
> zone_sectors = bdev_zone_sectors(bdev);
> + if (!is_power_of_2(zone_sectors)) {
> + f2fs_err(sbi, "F2FS does not support non power of 2 zone sizes\n");
> + return -EINVAL;
> + }
>
> if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
> SECTOR_TO_BLOCK(zone_sectors))
> --
> 2.25.1

2022-05-04 08:51:37

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

On 2022-05-03 18:37, Bart Van Assche wrote:
>>       sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
>> +    sector_t capacity = get_capacity(disk);
>>         if (!blk_queue_is_zoned(disk->queue))
>>           return 0;
>> -    return (get_capacity(disk) + zone_sectors - 1) >>
>> ilog2(zone_sectors);
>> +
>> +    if (is_power_of_2(zone_sectors))
>> +        return (capacity + zone_sectors - 1) >>
>> +               ilog2(zone_sectors);
>> +
>> +    return div64_u64(capacity + zone_sectors - 1, zone_sectors);
>>   }
>>   EXPORT_SYMBOL_GPL(blkdev_nr_zones);
>
> Does anyone need support for more than 4 billion sectors per zone? If
> not, do_div() should be sufficient.
>
You are absolutely right but blk_queue_zone_sectors explicitly changed
the return type to uint32_t to sector_t in this commit:
'113ab72 block: Fix potential overflow in blk_report_zones()'.

I initially did have a do_div but later changed to div64_u64 to avoid
any implicit down casting even though zone_sectors will be always
limited to an unsigned int.
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 60d016138997..c4e4c7071b7b 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -665,9 +665,15 @@ static inline unsigned int
>> blk_queue_nr_zones(struct request_queue *q)
>>   static inline unsigned int blk_queue_zone_no(struct request_queue *q,
>>                            sector_t sector)
>>   {
>> +    sector_t zone_sectors = blk_queue_zone_sectors(q);
>> +
>>       if (!blk_queue_is_zoned(q))
>>           return 0;
>> -    return sector >> ilog2(q->limits.chunk_sectors);
>> +
>> +    if (is_power_of_2(zone_sectors))
>> +        return sector >> ilog2(zone_sectors);
>> +
>> +    return div64_u64(sector, zone_sectors);
>>   }
>
> Same comment here.
>
> Thanks,
>
> Bart.

2022-05-04 08:53:21

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH 14/16] f2fs: call bdev_zone_sectors() only once on init_blkz_info()

Applied to f2fs tree. Thanks,

On 04/27, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> Instead of calling bdev_zone_sectors() multiple times, call
> it once and cache the value locally. This will make the
> subsequent change easier to read.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> fs/f2fs/super.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index ea939db18f88..f64761a15df7 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -3678,22 +3678,25 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
> struct block_device *bdev = FDEV(devi).bdev;
> sector_t nr_sectors = bdev_nr_sectors(bdev);
> struct f2fs_report_zones_args rep_zone_arg;
> + u64 zone_sectors;
> int ret;
>
> if (!f2fs_sb_has_blkzoned(sbi))
> return 0;
>
> + zone_sectors = bdev_zone_sectors(bdev);
> +
> if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
> - SECTOR_TO_BLOCK(bdev_zone_sectors(bdev)))
> + SECTOR_TO_BLOCK(zone_sectors))
> return -EINVAL;
> - sbi->blocks_per_blkz = SECTOR_TO_BLOCK(bdev_zone_sectors(bdev));
> + sbi->blocks_per_blkz = SECTOR_TO_BLOCK(zone_sectors);
> if (sbi->log_blocks_per_blkz && sbi->log_blocks_per_blkz !=
> __ilog2_u32(sbi->blocks_per_blkz))
> return -EINVAL;
> sbi->log_blocks_per_blkz = __ilog2_u32(sbi->blocks_per_blkz);
> FDEV(devi).nr_blkz = SECTOR_TO_BLOCK(nr_sectors) >>
> sbi->log_blocks_per_blkz;
> - if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
> + if (nr_sectors & (zone_sectors - 1))
> FDEV(devi).nr_blkz++;
>
> FDEV(devi).blkz_seq = f2fs_kvzalloc(sbi,
> --
> 2.25.1

2022-05-04 09:10:29

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH 06/16] nvmet: use blk_queue_zone_no()

On 4/27/22 09:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> Instead of open coding the number of zones given a sector, use the helper
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I can't parse this. Please rephrase this.

> blk_queue_zone_no(). This let's us make modifications to the math if
> needed in one place and adds now support for npo2 zone devices.

But since the code looks fine:

Reviewed-by: Bart Van Assche <[email protected]>

2022-05-04 09:23:09

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size


On 2022-05-03 18:50, Bart Van Assche wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> -    sector &= ~(ns->zsze - 1);
>> +    sector = rounddown(sector, ns->zsze);
>
> The above change breaks 32-bit builds since ns->zsze is 64 bits wide and
> since rounddown() uses the C division operator instead of div64_u64().
>
Good catch. I don't see any generic helper for rounddown that will work
for both 32 bits and 64 bits. Maybe I will open code the rounddown logic
here so that it works for both 32 and 64 bits.
> Thanks,
>
> Bart.

2022-05-04 17:04:54

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 15/16] f2fs: ensure only power of 2 zone sizes are allowed


On 2022-05-03 22:05, Jaegeuk Kim wrote:
> Applied to f2fs tree. Thanks,
>
Thanks Jaegeuk. I will remove the f2fs patches from my next revision

Regards,
Pankaj
> On 04/27, Pankaj Raghav wrote:
>> From: Luis Chamberlain <[email protected]>
>>
>> F2FS zoned support has power of 2 zone size assumption in many places
>> such as in __f2fs_issue_discard_zone, init_blkz_info. As the power of 2
>> requirement has been removed from the block layer, explicitly add a
>> condition in f2fs to allow only power of 2 zone size devices.
>>
>> This condition will be relaxed once those calculation based on power of
>> 2 is made generic.
>>
>> Signed-off-by: Luis Chamberlain <[email protected]>
>> Signed-off-by: Pankaj Raghav <[email protected]>
>> ---
>> fs/f2fs/super.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>> index f64761a15df7..db79abf30002 100644
>> --- a/fs/f2fs/super.c
>> +++ b/fs/f2fs/super.c
>> @@ -3685,6 +3685,10 @@ static int init_blkz_info(struct f2fs_sb_info *sbi, int devi)
>> return 0;
>>
>> zone_sectors = bdev_zone_sectors(bdev);
>> + if (!is_power_of_2(zone_sectors)) {
>> + f2fs_err(sbi, "F2FS does not support non power of 2 zone sizes\n");
>> + return -EINVAL;
>> + }
>>
>> if (sbi->blocks_per_blkz && sbi->blocks_per_blkz !=
>> SECTOR_TO_BLOCK(zone_sectors))
>> --
>> 2.25.1

2022-05-04 21:34:59

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 16/16] dm-zoned: ensure only power of 2 zone sizes are allowed

On 4/27/22 09:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> Today dm-zoned relies on the assumption that you have a zone size
> with a power of 2. Even though the block layer today enforces this
> requirement, these devices do exist and so provide a stop-gap measure
> to ensure these devices cannot be used by mistake
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/md/dm-zone.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 57daa86c19cf..221e0aa0f1a7 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -231,6 +231,18 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> struct request_queue *q = md->queue;
> unsigned int noio_flag;
> int ret;
> + struct block_device *bdev = md->disk->part0;
> + sector_t zone_sectors;
> + char bname[BDEVNAME_SIZE];
> +
> + zone_sectors = bdev_zone_sectors(bdev);
> +
> + if (!is_power_of_2(zone_sectors)) {
> + DMWARN("%s: %s only power of two zone size supported\n",
> + dm_device_name(md),
> + bdevname(bdev, bname));
> + return 1;
> + }
>
> /*
> * Check if something changed. If yes, cleanup the current resources

Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2022-05-04 22:54:10

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH 00/16] support non power of 2 zoned devices

On Tue, May 03, 2022 at 11:12:04AM +0200, Pankaj Raghav wrote:
> Hi Johannes,
> On 2022-05-03 00:07, Johannes Thumshirn wrote:
> >> There was an effort previously [1] to add support to non po2 devices via
> >> device level emulation but that was rejected with a final conclusion
> >> to add support for non po2 zoned device in the complete stack[2].
> >
> > Hey Pankaj,
> >
> > One thing I'm concerned with this patches is, once we have npo2 zones (or to be precise
> > not fs_info->sectorsize aligned zones) we have to check on every allocation if we still
> > have at least have fs_info->sectorsize bytes left in a zone. If not we need to
> > explicitly finish the zone, otherwise we'll run out of max active zones.
> >
> This commit: `btrfs: zoned: relax the alignment constraint for zoned
> devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
> even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
> which is typically 4k.
>
> This was one of the comment that came from David Sterba:
> https://lore.kernel.org/all/[email protected]/
> where he suggested to have some sane alignment for the zone sizes.

My idea of 'sane' value would be 1M, that we have 4K for sectors is
because of the 1:1 mapping to pages, but RAM sizes are on a different
scale than storage devices. The 4K is absolute minimum but if the page
size is taken as a basic constraint, ARM has 64K and there are some 256K
arches.

2022-05-05 00:57:58

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size

Hi Hannes,
Somehow your message did not go through the mailing list. Maybe you hit
reply instead of reply all?

I have added your reviewed-by tag in the other commits based on your
response. Thanks.

Anyway, my response to this email below:

On 2022-05-04 18:59, Hannes Reinecke wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> zone size (%llu)\n",
>> -                disk->disk_name, zone->len);
>> +        if (zone->len == 0) {
>> +            pr_warn("%s: Invalid zoned device size",
>> +                disk->disk_name);
>>               return -ENODEV;
>>           }
>>             args->zone_sectors = zone->len;
>> -        args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
>> +        args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);
>
> This is a different calculation than the one you're using in the first
> patch. Can you please add a helper such that both are using the same
> calculation?
>
So this calculation is actually doing a roundup to the nearest zone
number operation and not just a division that is done in the block layer
helper such as bdev_zone_no(). Note the `zone->len - 1` added to the
capacity. This is done to take into account also the last unequal zone
size, if present.

Another thing to note is that block layer helpers cannot be used here
because at this point we haven't set the chunk sectors and we are still
in the revalidation callback.

Maybe some comments on top of this will help to avoid any confusion?
What do you think? And, I am not aware of any generic helper in math.h
that does this operation for both 32 and 64 bit architecture.

Regards,
Pankaj

2022-05-05 02:28:50

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

On 2022-05-04 19:03, Hannes Reinecke wrote:
> On 4/27/22 09:02, Pankaj Raghav wrote:
>> Remove the condition which disallows non-power_of_2 zone size ZNS drive
>> to be updated and use generic method to calculate number of zones
>> instead of relying on log and shift based calculation on zone size.
>>
>> The power_of_2 calculation has been replaced directly with generic
>> calculation without special handling. Both modified functions are not
>> used in hot paths, they are only used during initialization &
>> revalidation of the ZNS device.
>>
>> Reviewed-by: Luis Chamberlain <[email protected]>
>> Signed-off-by: Pankaj Raghav <[email protected]>
>> ---
     }
>>         ns->zsze = nvme_lba_to_sect(ns,
>> le64_to_cpu(id->lbafe[lbaf].zsze));
>> -    if (!is_power_of_2(ns->zsze)) {
>> -        dev_warn(ns->ctrl->device,
>> -            "invalid zone size:%llu for namespace:%u\n",
>> -            ns->zsze, ns->head->ns_id);
>> -        status = -ENODEV;
>> -        goto free_data;
>> -    }
>>         blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
>>       blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
>> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct
>> nvme_ns *ns,
>>                      sizeof(struct nvme_zone_descriptor);
>>         nr_zones = min_t(unsigned int, nr_zones,
>> -             get_capacity(ns->disk) >> ilog2(ns->zsze));
>> +             div64_u64(get_capacity(ns->disk), ns->zsze));
>>  
> Same here; please add a helper calculating the number of zones for a
> given disk.
>
I am already using the div64_u64 helper and this is not done again
anywhere in the nvme zns driver. I am not sure if having a separate
helper for this will add value. And this is not in the hot path, so no
need for special handling.
>>       bufsize = sizeof(struct nvme_zone_report) +
>>           nr_zones * sizeof(struct nvme_zone_descriptor);
>> @@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns,
>> sector_t sector,
>>       c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>>       c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>>   -    sector &= ~(ns->zsze - 1);
>> +    sector = rounddown(sector, ns->zsze);
>>       while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
>>           memset(report, 0, buflen);
>>  
> Please be a bit more consistent. In the previous patches you always had
> a condition to check if it's a power_of_2 zone size, but here you are
> using the same calculation for each disk.
> So please use the check in all locations, or add a comment why the
> generic calculation is okay to use here.
>
That is a good point. I have mentioned that in my commit log that I am
not having any special handling because these calculations are not in
the hot path.

Maybe adding comments is better for clarity. I will also do it for your
previous comment.

I will queue this up for my next revision. Thanks.
> Cheers,
>
> Hannes

2022-05-05 12:32:31

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper

On 4/27/22 09:02, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a very common
> operation that is performed for zoned devices. Add
> blk_queue_zone_aligned helper to check for this instead of opencoding it
> everywhere.
>
> The helper is made to be generic so that it can also check for alignment
> for non non-power-of-2 zone size devices.
>
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is done for devices with
> power-of-2 zone size
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> include/linux/blkdev.h | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2022-05-05 16:46:25

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 13/16] null_blk: allow non power of 2 zoned devices

On 4/27/22 09:02, Pankaj Raghav wrote:
> Convert the power of 2 based calculation with zone size to be generic in
> null_zone_no with optimization for power of 2 based zone sizes.
>
> The nr_zones calculation in null_init_zoned_dev has been replaced with a
> division without special handling for power of 2 based zone sizes as
> this function is called only during the initialization and will not
> invoked in the hot path.
>
> Performance Measurement:
>
> Device:
> zone size = 128M, blocksize=4k
>
> FIO cmd:
>
> fio --name=zbc --filename=/dev/nullb0 --direct=1 --zonemode=zbd --size=23G
> --io_size=<iosize> --ioengine=io_uring --iodepth=<iod> --rw=<mode> --bs=4k
> --loops=4
>
> The following results are an average of 4 runs on AMD Ryzen 5 5600X with
> 32GB of RAM:
>
> Sequential Write:
>
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth | 8 | 16 |
> x-----------------x---------------------------------x---------------------------------x
> | | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch | 578 | 2257 | 12.80 | 576 | 2248 | 25.78 |
> x-----------------x---------------------------------x---------------------------------x
> | With patch | 581 | 2268 | 12.74 | 576 | 2248 | 25.85 |
> x-----------------x---------------------------------x---------------------------------x
>
> Sequential read:
>
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth | 8 | 16 |
> x-----------------x---------------------------------x---------------------------------x
> | | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch | 667 | 2605 | 11.79 | 675 | 2637 | 23.49 |
> x-----------------x---------------------------------x---------------------------------x
> | With patch | 667 | 2605 | 11.79 | 675 | 2638 | 23.48 |
> x-----------------x---------------------------------x---------------------------------x
>
> Random read:
>
> x-----------------x---------------------------------x---------------------------------x
> | IOdepth | 8 | 16 |
> x-----------------x---------------------------------x---------------------------------x
> | | KIOPS |BW(MiB/s) | Lat(usec) | KIOPS |BW(MiB/s) | Lat(usec) |
> x-----------------x---------------------------------x---------------------------------x
> | Without patch | 522 | 2038 | 15.05 | 514 | 2006 | 30.87 |
> x-----------------x---------------------------------x---------------------------------x
> | With patch | 522 | 2039 | 15.04 | 523 | 2042 | 30.33 |
> x-----------------x---------------------------------x---------------------------------x
>
> Minor variations are noticed in Sequential write with io depth 8 and
> in random read with io depth 16. But overall no noticeable differences
> were noticed
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/block/null_blk/main.c | 4 ++--
> drivers/block/null_blk/zoned.c | 14 +++++++-------
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2022-05-06 07:04:16

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 04/16] block: allow blk-zoned devices to have non-power-of-2 zone size

On 4/27/22 09:02, Pankaj Raghav wrote:
> Convert the calculations on zone size to be generic instead of relying on
> power_of_2 based logic in the block layer using the helpers wherever
> possible.
>
> The only hot path affected by this change for power_of_2 zoned devices
> is in blk_check_zone_append() but the effects should be negligible as the
> helper blk_queue_zone_aligned() optimizes the calculation for those
> devices. Note that the append path cannot be accessed by direct raw access
> to the block device but only through a filesystem abstraction.
>
> Finally, remove the check for power_of_2 zone size requirement in
> blk-zoned.c
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> block/blk-core.c | 3 +--
> block/blk-zoned.c | 12 ++++++------
> 2 files changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 937bb6b86331..850caf311064 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -634,8 +634,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
> return BLK_STS_NOTSUPP;
>
> /* The bio sector must point to the start of a sequential zone */
> - if (pos & (blk_queue_zone_sectors(q) - 1) ||
> - !blk_queue_zone_is_seq(q, pos))
> + if (!blk_queue_zone_aligned(q, pos) || !blk_queue_zone_is_seq(q, pos))
> return BLK_STS_IOERR;
>
> /*
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 1dff4a8bd51d..f7c7c3bd148d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -288,10 +288,10 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_opf op,
> return -EINVAL;
>
> /* Check alignment (handle eventual smaller last zone) */
> - if (sector & (zone_sectors - 1))
> + if (!blk_queue_zone_aligned(q, sector))
> return -EINVAL;
>
> - if ((nr_sectors & (zone_sectors - 1)) && end_sector != capacity)
> + if (!blk_queue_zone_aligned(q, nr_sectors) && end_sector != capacity)
> return -EINVAL;
>
> /*
> @@ -489,14 +489,14 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
> * smaller last zone.
> */
> if (zone->start == 0) {
> - if (zone->len == 0 || !is_power_of_2(zone->len)) {
> - pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
> - disk->disk_name, zone->len);
> + if (zone->len == 0) {
> + pr_warn("%s: Invalid zoned device size",
> + disk->disk_name);
> return -ENODEV;
> }
>
> args->zone_sectors = zone->len;
> - args->nr_zones = (capacity + zone->len - 1) >> ilog2(zone->len);
> + args->nr_zones = div64_u64(capacity + zone->len - 1, zone->len);

This is a different calculation than the one you're using in the first
patch. Can you please add a helper such that both are using the same
calculation?

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2022-05-06 10:11:45

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

On 4/27/22 09:02, Pankaj Raghav wrote:
> Remove the condition which disallows non-power_of_2 zone size ZNS drive
> to be updated and use generic method to calculate number of zones
> instead of relying on log and shift based calculation on zone size.
>
> The power_of_2 calculation has been replaced directly with generic
> calculation without special handling. Both modified functions are not
> used in hot paths, they are only used during initialization &
> revalidation of the ZNS device.
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/nvme/host/zns.c | 11 ++---------
> 1 file changed, 2 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index 9f81beb4df4e..2087de0768ee 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -101,13 +101,6 @@ int nvme_update_zone_info(struct nvme_ns *ns, unsigned lbaf)
> }
>
> ns->zsze = nvme_lba_to_sect(ns, le64_to_cpu(id->lbafe[lbaf].zsze));
> - if (!is_power_of_2(ns->zsze)) {
> - dev_warn(ns->ctrl->device,
> - "invalid zone size:%llu for namespace:%u\n",
> - ns->zsze, ns->head->ns_id);
> - status = -ENODEV;
> - goto free_data;
> - }
>
> blk_queue_set_zoned(ns->disk, BLK_ZONED_HM);
> blk_queue_flag_set(QUEUE_FLAG_ZONE_RESETALL, q);
> @@ -129,7 +122,7 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
> sizeof(struct nvme_zone_descriptor);
>
> nr_zones = min_t(unsigned int, nr_zones,
> - get_capacity(ns->disk) >> ilog2(ns->zsze));
> + div64_u64(get_capacity(ns->disk), ns->zsze));
>
Same here; please add a helper calculating the number of zones for a
given disk.

> bufsize = sizeof(struct nvme_zone_report) +
> nr_zones * sizeof(struct nvme_zone_descriptor);
> @@ -197,7 +190,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
> c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
> c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
>
> - sector &= ~(ns->zsze - 1);
> + sector = rounddown(sector, ns->zsze);
> while (zone_idx < nr_zones && sector < get_capacity(ns->disk)) {
> memset(report, 0, buflen);
>
Please be a bit more consistent. In the previous patches you always had
a condition to check if it's a power_of_2 zone size, but here you are
using the same calculation for each disk.
So please use the check in all locations, or add a comment why the
generic calculation is okay to use here.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2022-05-06 15:31:07

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

On 4/27/22 09:02, Pankaj Raghav wrote:
> Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can
> also work for non-power-of-2 zone sizes.
>
> As the existing deployments of zoned devices had power-of-2
> assumption, power-of-2 optimized calculation is kept for those devices.
>
> There are no direct hot paths modified and the changes just
> introduce one new branch per call.
>
> Reviewed-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> block/blk-zoned.c | 8 +++++++-
> include/linux/blkdev.h | 8 +++++++-
> 2 files changed, 14 insertions(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2022-05-09 03:28:34

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH 06/16] nvmet: use blk_queue_zone_no()

On 4/27/22 09:02, Pankaj Raghav wrote:
> From: Luis Chamberlain <[email protected]>
>
> Instead of open coding the number of zones given a sector, use the helper
> blk_queue_zone_no(). This let's us make modifications to the math if
> needed in one place and adds now support for npo2 zone devices.
>
> Signed-off-by: Luis Chamberlain <[email protected]>
> Signed-off-by: Pankaj Raghav <[email protected]>
> ---
> drivers/nvme/target/zns.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
Reviewed-by: Hannes Reinecke <[email protected]>

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


2022-05-09 04:35:19

by Pankaj Raghav

[permalink] [raw]
Subject: Re: [PATCH 00/16] support non power of 2 zoned devices

On 2022-05-04 23:14, David Sterba wrote:
>> This commit: `btrfs: zoned: relax the alignment constraint for zoned
>> devices` makes sure the zone size is BTRFS_STRIPE_LEN aligned (64K). So
>> even the npo2 zoned device should be aligned to `fs_info->sectorsize`,
>> which is typically 4k.
>>
>> This was one of the comment that came from David Sterba:
>> https://lore.kernel.org/all/[email protected]/
>> where he suggested to have some sane alignment for the zone sizes.
>
> My idea of 'sane' value would be 1M, that we have 4K for sectors is
> because of the 1:1 mapping to pages, but RAM sizes are on a different
> scale than storage devices. The 4K is absolute minimum but if the page
> size is taken as a basic constraint, ARM has 64K and there are some 256K
> arches.

That is a good point. I think it is safe to have 1MB as the minimum
alignment so that it covers all architecture's page sizes. Thanks. I
will queue this up.