2018-08-09 18:07:27

by Naohiro Aota

Subject: [RFC PATCH 00/17] btrfs zoned block device support

This series adds zoned block device support to btrfs.

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring that
writes be issued in LBA order starting from each zone's write pointer
position. This patch series ensures that the sequential write constraint
of sequential zones is respected while fundamentally not changing btrfs
block and I/O management for blocks stored in conventional zones.

To achieve this, the default dev extent size of btrfs is changed on zoned
block devices so that dev extents are always aligned to zones. Allocation
of blocks within a block group is changed so that the allocation is always
sequential from the beginning of the block group. To do so, an allocation
pointer is added to block groups and used as the allocation hint. The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation regardless of
the block group usage.

While the introduction of the allocation pointer ensures that blocks will be
allocated sequentially, I/Os to write out newly allocated blocks may be
issued out of order, causing errors when writing to sequential zones. This
problem is solved by introducing a submit_buffer() function and changes to
the internal I/O scheduler that ensure in-order issuing of write I/Os for
each chunk, matching the block allocation order within the chunk.

The zones of a chunk are reset, to allow reuse of the zones, only when the
block group is being freed, that is, when all the extents of the block group
are unused.

For btrfs volumes composed of multiple zoned disks, restrictions are added
to ensure that all disks have the same zone size. This matches the existing
constraint that all dev extents in a chunk must have the same size.

Testing this patchset requires zoned block devices. If you do not have
such devices, you can use tcmu-runner [1] to emulate zoned block devices
and export them via iSCSI. Please see the README.md of tcmu-runner [2]
for instructions on how to create an emulated zoned block device with
tcmu-runner.

[1] https://github.com/open-iscsi/tcmu-runner
[2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md

Patch 1 introduces the HMZONED incompatible feature flag to indicate that
the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones of
the device (zone type and write pointer position).

Patch 4 restricts the possible locations of super blocks to conventional
zones to preserve the existing in-place update mechanism for the super
blocks.

Patches 5 to 7 disable features which are not compatible with the sequential
write constraints of zoned block devices. This includes fallocate and
direct I/O support. Device replace is also disabled for now.

Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
implement sequential block allocation in block groups and chunks.

Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
write I/O delivery to the device zones.

Patches 13 to 16 modify several parts of btrfs to handle free blocks
without breaking the sequential block allocation and sequential write order
as well as zone reset for unused chunks.

Finally, patch 17 adds the HMZONED feature to the list of supported
features.

Naohiro Aota (17):
btrfs: introduce HMZONED feature flag
btrfs: Get zone information of zoned block devices
btrfs: Check and enable HMZONED mode
btrfs: limit super block locations in HMZONED mode
btrfs: disable fallocate in HMZONED mode
btrfs: disable direct IO in HMZONED mode
btrfs: disable device replace in HMZONED mode
btrfs: align extent allocation to zone boundary
btrfs: do sequential allocation on HMZONED drives
btrfs: split btrfs_map_bio()
btrfs: introduce submit buffer
btrfs: expire submit buffer on timeout
btrfs: avoid sync IO prioritization on checksum in HMZONED mode
btrfs: redirty released extent buffers in sequential BGs
btrfs: reset zones of unused block groups
btrfs: wait existing extents before truncating
btrfs: enable to mount HMZONED incompat flag

fs/btrfs/async-thread.c | 1 +
fs/btrfs/async-thread.h | 1 +
fs/btrfs/ctree.h | 36 ++-
fs/btrfs/dev-replace.c | 10 +
fs/btrfs/disk-io.c | 48 +++-
fs/btrfs/extent-tree.c | 281 +++++++++++++++++-
fs/btrfs/extent_io.c | 1 +
fs/btrfs/extent_io.h | 1 +
fs/btrfs/file.c | 4 +
fs/btrfs/free-space-cache.c | 36 +++
fs/btrfs/free-space-cache.h | 10 +
fs/btrfs/inode.c | 14 +
fs/btrfs/super.c | 32 ++-
fs/btrfs/sysfs.c | 2 +
fs/btrfs/transaction.c | 32 +++
fs/btrfs/transaction.h | 3 +
fs/btrfs/volumes.c | 551 ++++++++++++++++++++++++++++++++++--
fs/btrfs/volumes.h | 37 +++
include/uapi/linux/btrfs.h | 1 +
19 files changed, 1061 insertions(+), 40 deletions(-)

--
2.18.0



2018-08-09 18:07:08

by Naohiro Aota

Subject: [RFC PATCH 01/17] btrfs: introduce HMZONED feature flag

This patch introduces the HMZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by host-managed
zoned block devices.

Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/sysfs.c | 2 ++
include/uapi/linux/btrfs.h | 1 +
2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 3717c864ba23..8065d416fb38 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -191,6 +191,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(extended_iref, EXTENDED_IREF);
BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
+BTRFS_FEAT_ATTR_INCOMPAT(hmzoned, HMZONED);
BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);

static struct attribute *btrfs_supported_feature_attrs[] = {
@@ -204,6 +205,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
BTRFS_FEAT_ATTR_PTR(raid56),
BTRFS_FEAT_ATTR_PTR(skinny_metadata),
BTRFS_FEAT_ATTR_PTR(no_holes),
+ BTRFS_FEAT_ATTR_PTR(hmzoned),
BTRFS_FEAT_ATTR_PTR(free_space_tree),
NULL
};
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 245aace2a400..c37b31a5b29d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -269,6 +269,7 @@ struct btrfs_ioctl_fs_info_args {
#define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7)
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED (1ULL << 10)

struct btrfs_ioctl_feature_flags {
__u64 compat_flags;
--
2.18.0


2018-08-09 18:07:15

by Naohiro Aota

Subject: [RFC PATCH 04/17] btrfs: limit super block locations in HMZONED mode

When in HMZONED mode, make sure that device super blocks are located in
randomly writable zones of zoned block devices. That is, do not write super
blocks in sequential write required zones of host-managed zoned block
devices, as in-place updates would not be possible there.

Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/disk-io.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 14f284382ba7..6a014632ca1e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3435,6 +3435,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
return latest;
}

+static int check_super_location(struct btrfs_device *device, u64 pos)
+{
+ /* Any address is good on a regular (zone_size == 0) device. */
+ /* Non-SEQUENTIAL WRITE REQUIRED zones are usable on a zoned device. */
+ return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
+}
+
/*
* Write superblock @sb to the @device. Do not wait for completion, all the
* buffer heads we write are pinned.
@@ -3464,6 +3471,8 @@ static int write_dev_supers(struct btrfs_device *device,
if (bytenr + BTRFS_SUPER_INFO_SIZE >=
device->commit_total_bytes)
break;
+ if (!check_super_location(device, bytenr))
+ continue;

btrfs_set_super_bytenr(sb, bytenr);

--
2.18.0


2018-08-09 18:07:26

by Naohiro Aota

Subject: [RFC PATCH 07/17] btrfs: disable device replace in HMZONED mode

To support the device replace feature in HMZONED mode, writes replicated
from the replace source device to the replace target device must not land
beyond the position of a zone's write pointer on the target. In addition,
the scrub process should be modified to dispatch the write I/Os
sequentially and to fill holes between the extents. Finally, in a RAID
configuration, the write pointers of the zones should be synchronized to
match their RAID siblings.

Solving all of these issues requires more work, so disable the device
replace feature in HMZONED mode for now.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/dev-replace.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 839a35008fd8..cde61fb217db 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -416,6 +416,9 @@ int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;

+ if (btrfs_fs_incompat(fs_info, HMZONED))
+ return -EOPNOTSUPP;
+
ret = btrfs_find_device_by_devspec(fs_info, srcdevid,
srcdev_name, &src_device);
if (ret)
--
2.18.0


2018-08-09 18:07:32

by Naohiro Aota

Subject: [RFC PATCH 08/17] btrfs: align extent allocation to zone boundary

In HMZONED mode, align the device extents to zone boundaries so that write
I/Os can begin at the start of a zone, as mandated for host-managed zoned
block devices. Also, check that an allocated region always covers only
empty zones.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/extent-tree.c | 3 ++
fs/btrfs/volumes.c | 69 ++++++++++++++++++++++++++++++++++++++----
2 files changed, 66 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f77226d8020a..fc3daf0e5b92 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9527,6 +9527,9 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
min_free = div64_u64(min_free, dev_min);
}

+ /* We cannot allocate size less than zone_size anyway */
+ min_free = max_t(u64, min_free, fs_info->zone_size);
+
/* We need to do this so that we can look at pending chunks */
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ba7ebb80de4d..ada13120c2cd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1521,6 +1521,31 @@ static int contains_pending_extent(struct btrfs_transaction *transaction,
return ret;
}

+static u64 dev_zone_align(struct btrfs_device *device, u64 pos)
+{
+ if (device->zone_size)
+ return ALIGN(pos, device->zone_size);
+ return pos;
+}
+
+static int is_empty_zone_region(struct btrfs_device *device,
+ u64 pos, u64 num_bytes)
+{
+ if (device->zone_size == 0)
+ return 1;
+
+ WARN_ON(!IS_ALIGNED(pos, device->zone_size));
+ WARN_ON(!IS_ALIGNED(num_bytes, device->zone_size));
+
+ while (num_bytes > 0) {
+ if (!btrfs_dev_is_empty_zone(device, pos))
+ return 0;
+ pos += device->zone_size;
+ num_bytes -= device->zone_size;
+ }
+
+ return 1;
+}

/*
* find_free_dev_extent_start - find free space in the specified device
@@ -1564,9 +1589,14 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
/*
* We don't want to overwrite the superblock on the drive nor any area
* used by the boot loader (grub for example), so we make sure to start
- * at an offset of at least 1MB.
+ * at an offset of at least 1MB on a regular disk. For a zoned block
+ * device, skip the first zone of the device entirely.
*/
- search_start = max_t(u64, search_start, SZ_1M);
+ if (device->zone_size)
+ search_start = max_t(u64, dev_zone_align(device, search_start),
+ device->zone_size);
+ else
+ search_start = max_t(u64, search_start, SZ_1M);

path = btrfs_alloc_path();
if (!path)
@@ -1632,6 +1662,8 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
if (contains_pending_extent(transaction, device,
&search_start,
hole_size)) {
+ search_start = dev_zone_align(device,
+ search_start);
if (key.offset >= search_start) {
hole_size = key.offset - search_start;
} else {
@@ -1640,6 +1672,14 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
}
}

+ if (!is_empty_zone_region(device, search_start,
+ num_bytes)) {
+ search_start = dev_zone_align(device,
+ search_start+1);
+ btrfs_release_path(path);
+ goto again;
+ }
+
if (hole_size > max_hole_size) {
max_hole_start = search_start;
max_hole_size = hole_size;
@@ -1664,7 +1704,7 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,
extent_end = key.offset + btrfs_dev_extent_length(l,
dev_extent);
if (extent_end > search_start)
- search_start = extent_end;
+ search_start = dev_zone_align(device, extent_end);
next:
path->slots[0]++;
cond_resched();
@@ -1680,6 +1720,14 @@ int find_free_dev_extent_start(struct btrfs_transaction *transaction,

if (contains_pending_extent(transaction, device, &search_start,
hole_size)) {
+ search_start = dev_zone_align(device,
+ search_start);
+ btrfs_release_path(path);
+ goto again;
+ }
+
+ if (!is_empty_zone_region(device, search_start, num_bytes)) {
+ search_start = dev_zone_align(device, search_start+1);
btrfs_release_path(path);
goto again;
}
@@ -4832,6 +4880,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
int i;
int j;
int index;
+ int hmzoned = btrfs_fs_incompat(info, HMZONED);

BUG_ON(!alloc_profile_is_valid(type, 0));

@@ -4851,13 +4900,18 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
ncopies = btrfs_raid_array[index].ncopies;

if (type & BTRFS_BLOCK_GROUP_DATA) {
- max_stripe_size = SZ_1G;
+ if (hmzoned)
+ max_stripe_size = info->zone_size;
+ else
+ max_stripe_size = SZ_1G;
max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE;
if (!devs_max)
devs_max = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
/* for larger filesystems, use larger metadata chunks */
- if (fs_devices->total_rw_bytes > 50ULL * SZ_1G)
+ if (hmzoned)
+ max_stripe_size = info->zone_size;
+ else if (fs_devices->total_rw_bytes > 50ULL * SZ_1G)
max_stripe_size = SZ_1G;
else
max_stripe_size = SZ_256M;
@@ -4865,7 +4919,10 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
if (!devs_max)
devs_max = BTRFS_MAX_DEVS(info);
} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
- max_stripe_size = SZ_32M;
+ if (hmzoned)
+ max_stripe_size = info->zone_size;
+ else
+ max_stripe_size = SZ_32M;
max_chunk_size = 2 * max_stripe_size;
if (!devs_max)
devs_max = BTRFS_MAX_DEVS_SYS_CHUNK;
--
2.18.0


2018-08-09 18:07:34

by Naohiro Aota

Subject: [RFC PATCH 09/17] btrfs: do sequential allocation on HMZONED drives

On HMZONED drives, writes must always be sequential and directed at a block
group zone write pointer position. Thus, block allocation in a block group
must also be done sequentially using an allocation pointer equal to the
block group zone write pointer plus the number of blocks allocated but not
yet written.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/ctree.h | 22 ++++
fs/btrfs/extent-tree.c | 231 ++++++++++++++++++++++++++++++++++++
fs/btrfs/free-space-cache.c | 36 ++++++
fs/btrfs/free-space-cache.h | 10 ++
4 files changed, 299 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 14f880126532..5060bcdcb72b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -562,6 +562,20 @@ struct btrfs_full_stripe_locks_tree {
struct mutex lock;
};

+/* Block group allocation types */
+enum btrfs_alloc_type {
+
+ /* Regular first fit allocation */
+ BTRFS_ALLOC_FIT = 0,
+
+ /*
+ * Sequential allocation: this is for HMZONED mode and
+ * will result in ignoring free space before a block
+ * group allocation offset.
+ */
+ BTRFS_ALLOC_SEQ = 1,
+};
+
struct btrfs_block_group_cache {
struct btrfs_key key;
struct btrfs_block_group_item item;
@@ -674,6 +688,14 @@ struct btrfs_block_group_cache {

/* Record locked full stripes for RAID5/6 block group */
struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
+
+ /*
+ * Allocation offset for the block group to implement sequential
+ * allocation. This is used only with HMZONED mode enabled and if
+ * the block group resides on a sequential zone.
+ */
+ enum btrfs_alloc_type alloc_type;
+ u64 alloc_offset;
};

/* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc3daf0e5b92..d4355b9b494e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7412,6 +7412,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
}

have_block_group:
+ if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
+ offset = btrfs_find_space_for_alloc_seq(block_group,
+ num_bytes,
+ &max_extent_size);
+ if (!offset)
+ goto loop;
+ goto checks;
+ }
+
cached = block_group_cache_done(block_group);
if (unlikely(!cached)) {
have_caching_bg = true;
@@ -9847,11 +9856,223 @@ static void link_block_group(struct btrfs_block_group_cache *cache)
}
}

+static int
+btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
+{
+ struct btrfs_fs_info *fs_info = cache->fs_info;
+ struct extent_map_tree *em_tree = &fs_info->mapping_tree.map_tree;
+ struct extent_map *em;
+ struct map_lookup *map;
+ struct btrfs_device *device;
+ u64 logical = cache->key.objectid;
+ u64 length = cache->key.offset;
+ u64 physical = 0;
+ int ret, alloc_type;
+ int i, j;
+ u64 *alloc_offsets = NULL;
+
+#define WP_MISSING_DEV ((u64)-1)
+
+ /* Sanity check */
+ if (!IS_ALIGNED(length, fs_info->zone_size)) {
+ btrfs_err(fs_info, "unaligned block group at %llu + %llu",
+ logical, length);
+ return -EIO;
+ }
+
+ /* Get the chunk mapping */
+ em_tree = &fs_info->mapping_tree.map_tree;
+ read_lock(&em_tree->lock);
+ em = lookup_extent_mapping(em_tree, logical, length);
+ read_unlock(&em_tree->lock);
+
+ if (!em)
+ return -EINVAL;
+
+ map = em->map_lookup;
+
+ /*
+ * Get the zone type: if the group is mapped to a non-sequential zone,
+ * there is no need for the allocation offset (fit allocation is OK).
+ */
+ alloc_type = -1;
+ alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
+ GFP_NOFS);
+ if (!alloc_offsets) {
+ free_extent_map(em);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < map->num_stripes; i++) {
+ int is_sequential;
+ struct blk_zone zone;
+
+ device = map->stripes[i].dev;
+ physical = map->stripes[i].physical;
+
+ if (device->bdev == NULL) {
+ alloc_offsets[i] = WP_MISSING_DEV;
+ continue;
+ }
+
+ is_sequential = btrfs_dev_is_sequential(device, physical);
+ if (alloc_type == -1)
+ alloc_type = is_sequential ?
+ BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT;
+
+ if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) ||
+ (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) {
+ btrfs_err(fs_info, "found block group of mixed zone types");
+ ret = -EIO;
+ goto out;
+ }
+
+ if (!is_sequential)
+ continue;
+
+ /*
+ * This zone will be used for allocation, so mark this zone non-empty.
+ */
+ clear_bit(physical >> device->zone_size_shift,
+ device->empty_zones);
+
+ /*
+ * The group is mapped to a sequential zone. Get the zone write
+ * pointer to determine the allocation offset within the zone.
+ */
+ WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
+ ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS);
+ if (ret == -EIO || ret == -EOPNOTSUPP) {
+ ret = 0;
+ alloc_offsets[i] = WP_MISSING_DEV;
+ continue;
+ } else if (ret) {
+ goto out;
+ }
+
+
+ switch (zone.cond) {
+ case BLK_ZONE_COND_OFFLINE:
+ case BLK_ZONE_COND_READONLY:
+ btrfs_err(fs_info, "Offline/readonly zone %llu",
+ physical >> device->zone_size_shift);
+ alloc_offsets[i] = WP_MISSING_DEV;
+ break;
+ case BLK_ZONE_COND_EMPTY:
+ alloc_offsets[i] = 0;
+ break;
+ case BLK_ZONE_COND_FULL:
+ alloc_offsets[i] = fs_info->zone_size;
+ break;
+ default:
+ /* Partially used zone */
+ alloc_offsets[i] = ((zone.wp - zone.start) << 9);
+ break;
+ }
+ }
+
+ if (alloc_type == BTRFS_ALLOC_FIT)
+ goto out;
+
+ switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+ case 0: /* single */
+ case BTRFS_BLOCK_GROUP_DUP:
+ case BTRFS_BLOCK_GROUP_RAID1:
+ cache->alloc_offset = WP_MISSING_DEV;
+ for (i = 0; i < map->num_stripes; i++) {
+ if (alloc_offsets[i] == WP_MISSING_DEV)
+ continue;
+ if (cache->alloc_offset == WP_MISSING_DEV)
+ cache->alloc_offset = alloc_offsets[i];
+ if (alloc_offsets[i] != cache->alloc_offset) {
+ btrfs_err(fs_info, "zones' write pointer mismatch");
+ ret = -EIO;
+ goto out;
+ }
+ }
+ break;
+ case BTRFS_BLOCK_GROUP_RAID0:
+ cache->alloc_offset = 0;
+ for (i = 0; i < map->num_stripes; i++) {
+ if (alloc_offsets[i] == WP_MISSING_DEV) {
+ btrfs_err(fs_info, "cannot recover write pointer");
+ ret = -EIO;
+ goto out;
+ }
+ cache->alloc_offset += alloc_offsets[i];
+ if (alloc_offsets[0] < alloc_offsets[i]) {
+ btrfs_err(fs_info, "zones' write pointer mismatch");
+ ret = -EIO;
+ goto out;
+ }
+ }
+ break;
+ case BTRFS_BLOCK_GROUP_RAID10:
+ /*
+ * Pass1: check write pointer of RAID1 level: each pointer
+ * should be equal
+ */
+ for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+ int base = i*map->sub_stripes;
+ u64 offset = WP_MISSING_DEV;
+
+ for (j = 0; j < map->sub_stripes; j++) {
+ if (alloc_offsets[base+j] == WP_MISSING_DEV)
+ continue;
+ if (offset == WP_MISSING_DEV)
+ offset = alloc_offsets[base+j];
+ if (alloc_offsets[base+j] != offset) {
+ btrfs_err(fs_info, "zones' write pointer mismatch");
+ ret = -EIO;
+ goto out;
+ }
+ }
+ for (j = 0; j < map->sub_stripes; j++)
+ alloc_offsets[base+j] = offset;
+ }
+
+ /* Pass2: check write pointer at RAID0 level: sum the offsets */
+ cache->alloc_offset = 0;
+ for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+ int base = i*map->sub_stripes;
+
+ if (alloc_offsets[base] == WP_MISSING_DEV) {
+ btrfs_err(fs_info, "cannot recover write pointer");
+ ret = -EIO;
+ goto out;
+ }
+ if (alloc_offsets[0] < alloc_offsets[base]) {
+ btrfs_err(fs_info, "zones' write pointer mismatch");
+ ret = -EIO;
+ goto out;
+ }
+ cache->alloc_offset += alloc_offsets[base];
+ }
+ break;
+ case BTRFS_BLOCK_GROUP_RAID5:
+ case BTRFS_BLOCK_GROUP_RAID6:
+ /* RAID5/6 is not supported yet */
+ default:
+ btrfs_err(fs_info, "Unsupported profile %llu",
+ map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ cache->alloc_type = alloc_type;
+ kfree(alloc_offsets);
+ free_extent_map(em);
+
+ return ret;
+}
+
static struct btrfs_block_group_cache *
btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
u64 start, u64 size)
{
struct btrfs_block_group_cache *cache;
+ int ret;

cache = kzalloc(sizeof(*cache), GFP_NOFS);
if (!cache)
@@ -9885,6 +10106,16 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
atomic_set(&cache->trimming, 0);
mutex_init(&cache->free_space_lock);
btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
+ cache->alloc_type = BTRFS_ALLOC_FIT;
+ cache->alloc_offset = 0;
+
+ if (btrfs_fs_incompat(fs_info, HMZONED)) {
+ ret = btrfs_get_block_group_alloc_offset(cache);
+ if (ret) {
+ kfree(cache);
+ return NULL;
+ }
+ }

return cache;
}
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index c3888c113d81..b3ff9809d1e4 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -2582,6 +2582,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
u64 align_gap = 0;
u64 align_gap_len = 0;

+ WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
+
spin_lock(&ctl->tree_lock);
entry = find_free_space(ctl, &offset, &bytes_search,
block_group->full_stripe_len, max_extent_size);
@@ -2616,6 +2618,38 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
return ret;
}

+/*
+ * Simple allocator for sequential only block group. It only allows sequential
+ * allocation. No need to play with trees.
+ */
+
+u64 btrfs_find_space_for_alloc_seq(struct btrfs_block_group_cache *block_group,
+ u64 bytes, u64 *max_extent_size)
+{
+ struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+ u64 start = block_group->key.objectid;
+ u64 avail;
+ u64 ret = 0;
+
+ /* Sanity check */
+ if (block_group->alloc_type != BTRFS_ALLOC_SEQ)
+ return 0;
+
+ spin_lock(&ctl->tree_lock);
+ avail = block_group->key.offset - block_group->alloc_offset;
+ if (avail < bytes) {
+ *max_extent_size = avail;
+ goto out;
+ }
+
+ ret = start + block_group->alloc_offset;
+ block_group->alloc_offset += bytes;
+ ctl->free_space -= bytes;
+out:
+ spin_unlock(&ctl->tree_lock);
+ return ret;
+}
+
/*
* given a cluster, put all of its extents back into the free space
* cache. If a block group is passed, this function will only free
@@ -2701,6 +2735,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group,
struct rb_node *node;
u64 ret = 0;

+ WARN_ON(block_group->alloc_type == BTRFS_ALLOC_SEQ);
+
spin_lock(&cluster->lock);
if (bytes > cluster->max_size)
goto out;
diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h
index 794a444c3f73..79b4fa31bc8f 100644
--- a/fs/btrfs/free-space-cache.h
+++ b/fs/btrfs/free-space-cache.h
@@ -80,6 +80,14 @@ static inline int
btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
u64 bytenr, u64 size)
{
+ if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
+ struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
+
+ spin_lock(&ctl->tree_lock);
+ ctl->free_space += size;
+ spin_unlock(&ctl->tree_lock);
+ return 0;
+ }
return __btrfs_add_free_space(block_group->fs_info,
block_group->free_space_ctl,
bytenr, size);
@@ -92,6 +100,8 @@ void btrfs_remove_free_space_cache(struct btrfs_block_group_cache
u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group,
u64 offset, u64 bytes, u64 empty_size,
u64 *max_extent_size);
+u64 btrfs_find_space_for_alloc_seq(struct btrfs_block_group_cache *block_group,
+ u64 bytes, u64 *max_extent_size);
u64 btrfs_find_ino_for_alloc(struct btrfs_root *fs_root);
void btrfs_dump_free_space(struct btrfs_block_group_cache *block_group,
u64 bytes);
--
2.18.0


2018-08-09 18:07:41

by Naohiro Aota

Subject: [RFC PATCH 13/17] btrfs: avoid sync IO prioritization on checksum in HMZONED mode

By prioritizing sync I/Os, btrfs may call btrfs_map_block() for blocks
allocated later before calling it for blocks allocated earlier. Because of
this reordering, sync I/Os to larger LBAs can end up waiting for I/Os to
smaller LBAs.

Since the number of active checksum workers is limited, it is possible that
checksumming of I/Os to smaller LBAs never starts. In that situation,
transactions get stuck waiting for the I/Os on smaller LBAs to finish,
which never happens.

This situation can be reproduced by e.g. fstests btrfs/073.

To avoid such reordering, disable sync I/O prioritization for now. In the
future, this will be reworked to finish checksumming of I/Os to smaller
LBAs when committing a transaction.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/disk-io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 00fa6aca9bb5..f79abd5e6b3a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -807,7 +807,7 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,

async->status = 0;

- if (op_is_sync(bio->bi_opf))
+ if (op_is_sync(bio->bi_opf) && !btrfs_fs_incompat(fs_info, HMZONED))
btrfs_set_work_high_priority(&async->work);

btrfs_queue_work(fs_info->workers, &async->work);
--
2.18.0


2018-08-09 18:07:42

by Naohiro Aota

Subject: [RFC PATCH 15/17] btrfs: reset zones of unused block groups

For an HMZONED volume, a block group maps to zones of the device. For
deleted unused block groups, the zones of the block group can be reset to
rewind the zone write pointer to the start of the zone.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/extent-tree.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a5f5935315c8..26989f6fe591 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2025,6 +2025,25 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
ASSERT(btrfs_test_opt(fs_info, DEGRADED));
continue;
}
+
+ if (clear == BTRFS_CLEAR_OP_DISCARD &&
+ btrfs_dev_is_sequential(stripe->dev,
+ stripe->physical) &&
+ stripe->length == stripe->dev->zone_size) {
+ ret = blkdev_reset_zones(stripe->dev->bdev,
+ stripe->physical >> 9,
+ stripe->length >> 9,
+ GFP_NOFS);
+ if (!ret)
+ discarded_bytes += stripe->length;
+ else
+ break;
+ set_bit(stripe->physical >>
+ stripe->dev->zone_size_shift,
+ stripe->dev->empty_zones);
+ continue;
+ }
+
req_q = bdev_get_queue(stripe->dev->bdev);
if (clear == BTRFS_CLEAR_OP_DISCARD &&
!blk_queue_discard(req_q))
@@ -10958,7 +10977,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
spin_unlock(&space_info->lock);

/* DISCARD can flip during remount */
- trimming = btrfs_test_opt(fs_info, DISCARD);
+ trimming = btrfs_test_opt(fs_info, DISCARD) ||
+ btrfs_fs_incompat(fs_info, HMZONED);

/* Implicit trim during transaction commit. */
if (trimming)
--
2.18.0


2018-08-09 18:07:41

by Naohiro Aota

Subject: [RFC PATCH 14/17] btrfs: redirty released extent buffers in sequential BGs

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
nodes are not uselessly written out. On HMZONED drives, however, this
optimization blocks subsequent I/Os, as cancelling the write-out of the
freed blocks breaks the sequential write sequence expected by the device.

This patch introduces a list of clean extent buffers that have been
released in a transaction. Btrfs consults the list before writing out and
waiting for the I/Os, and redirties a buffer if 1) it is in a sequential
BG, 2) it is in the un-submitted range, and 3) it is not under I/O. Such
buffers are then marked for I/O in btrfs_write_and_wait_transaction() so
that proper bios are sent to the disk.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/disk-io.c | 23 +++++++++++++++++++++--
fs/btrfs/extent_io.c | 1 +
fs/btrfs/extent_io.h | 1 +
fs/btrfs/transaction.c | 32 ++++++++++++++++++++++++++++++++
fs/btrfs/transaction.h | 3 +++
5 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f79abd5e6b3a..aa69c167fd57 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1098,10 +1098,20 @@ struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 bytenr,
void clean_tree_block(struct btrfs_fs_info *fs_info,
struct extent_buffer *buf)
{
- if (btrfs_header_generation(buf) ==
- fs_info->running_transaction->transid) {
+ struct btrfs_transaction *cur_trans = fs_info->running_transaction;
+
+ if (btrfs_header_generation(buf) == cur_trans->transid) {
btrfs_assert_tree_locked(buf);

+ if (btrfs_fs_incompat(fs_info, HMZONED) &&
+ list_empty(&buf->release_list)) {
+ atomic_inc(&buf->refs);
+ spin_lock(&cur_trans->releasing_ebs_lock);
+ list_add_tail(&buf->release_list,
+ &cur_trans->releasing_ebs);
+ spin_unlock(&cur_trans->releasing_ebs_lock);
+ }
+
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
-buf->len,
@@ -4484,6 +4494,15 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
btrfs_destroy_pinned_extent(fs_info,
fs_info->pinned_extents);

+ while (!list_empty(&cur_trans->releasing_ebs)) {
+ struct extent_buffer *eb;
+
+ eb = list_first_entry(&cur_trans->releasing_ebs,
+ struct extent_buffer, release_list);
+ list_del_init(&eb->release_list);
+ free_extent_buffer(eb);
+ }
+
cur_trans->state = TRANS_STATE_COMPLETED;
wake_up(&cur_trans->commit_wait);
}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 736d097d2851..31996c6a5d46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4825,6 +4825,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
init_waitqueue_head(&eb->read_lock_wq);

btrfs_leak_debug_add(&eb->leak_list, &buffers);
+ INIT_LIST_HEAD(&eb->release_list);

spin_lock_init(&eb->refs_lock);
atomic_set(&eb->refs, 1);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index b4d03e677e1d..bcd9a068ed3b 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -192,6 +192,7 @@ struct extent_buffer {
*/
wait_queue_head_t read_lock_wq;
struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+ struct list_head release_list;
#ifdef CONFIG_BTRFS_DEBUG
struct list_head leak_list;
#endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3b84f5015029..5146e287917a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -273,6 +273,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
spin_lock_init(&cur_trans->dirty_bgs_lock);
INIT_LIST_HEAD(&cur_trans->deleted_bgs);
spin_lock_init(&cur_trans->dropped_roots_lock);
+ INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+ spin_lock_init(&cur_trans->releasing_ebs_lock);
list_add_tail(&cur_trans->list, &fs_info->trans_list);
extent_io_tree_init(&cur_trans->dirty_pages,
fs_info->btree_inode);
@@ -2230,7 +2232,28 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)

wake_up(&fs_info->transaction_wait);

+ if (btrfs_fs_incompat(fs_info, HMZONED)) {
+ struct extent_buffer *eb;
+
+ list_for_each_entry(eb, &cur_trans->releasing_ebs,
+ release_list) {
+ struct btrfs_block_group_cache *cache;
+
+ cache = btrfs_lookup_block_group(fs_info, eb->start);
+ if (!cache)
+ continue;
+ spin_lock(&cache->submit_lock);
+ if (cache->alloc_type == BTRFS_ALLOC_SEQ &&
+ cache->submit_offset <= eb->start &&
+ !extent_buffer_under_io(eb))
+ set_extent_buffer_dirty(eb);
+ spin_unlock(&cache->submit_lock);
+ btrfs_put_block_group(cache);
+ }
+ }
+
ret = btrfs_write_and_wait_transaction(trans);
+
if (ret) {
btrfs_handle_fs_error(fs_info, ret,
"Error while writing out transaction");
@@ -2238,6 +2261,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
goto scrub_continue;
}

+ while (!list_empty(&cur_trans->releasing_ebs)) {
+ struct extent_buffer *eb;
+
+ eb = list_first_entry(&cur_trans->releasing_ebs,
+ struct extent_buffer, release_list);
+ list_del_init(&eb->release_list);
+ free_extent_buffer(eb);
+ }
+
ret = write_all_supers(fs_info, 0);
/*
* the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 4cbb1b55387d..d88c335dd78c 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -88,6 +88,9 @@ struct btrfs_transaction {
spinlock_t dropped_roots_lock;
struct btrfs_delayed_ref_root delayed_refs;
struct btrfs_fs_info *fs_info;
+
+ spinlock_t releasing_ebs_lock;
+ struct list_head releasing_ebs;
};

#define __TRANS_FREEZABLE (1U << 0)
--
2.18.0


2018-08-09 18:07:44

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 16/17] btrfs: wait existing extents before truncating

When truncating a file, file buffers that have already been allocated but
not yet written out may be truncated. Truncating these buffers can break
the sequential write pattern in a block group if the truncated blocks
are, for example, followed by blocks allocated to another file. To avoid
this problem, always wait for write out of all unwritten buffers before
proceeding with the truncate execution.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/inode.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 05f5e05ccf37..d3f35f81834f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5193,6 +5193,17 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
btrfs_end_write_no_snapshotting(root);
btrfs_end_transaction(trans);
} else {
+ struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+ if (btrfs_fs_incompat(fs_info, HMZONED)) {
+ u64 sectormask = fs_info->sectorsize - 1;
+
+ ret = btrfs_wait_ordered_range(inode,
+ newsize & (~sectormask),
+ (u64)-1);
+ if (ret)
+ return ret;
+ }

/*
* We're truncating a file that used to have good data down to
--
2.18.0


2018-08-09 18:07:45

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 17/17] btrfs: enable to mount HMZONED incompat flag

This final patch adds the HMZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount a file system with
the HMZONED flag set.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/ctree.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8f85c96cd262..46a243b2f111 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -271,7 +271,8 @@ struct btrfs_super_block {
BTRFS_FEATURE_INCOMPAT_RAID56 | \
BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA | \
- BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+ BTRFS_FEATURE_INCOMPAT_NO_HOLES | \
+ BTRFS_FEATURE_INCOMPAT_HMZONED)

#define BTRFS_FEATURE_INCOMPAT_SAFE_SET \
(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
--
2.18.0


2018-08-09 18:07:45

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 03/17] btrfs: Check and enable HMZONED mode

HMZONED mode cannot be used together with the RAID5/6 profile. Introduce
the function btrfs_check_hmzoned_mode() to check this. This function will
also check if the HMZONED flag is enabled on the file system and if the
file system consists of zoned devices with equal zone size.

Additionally, as updates to the space cache are done in place, the space
cache cannot be located over sequential zones, and there is no guarantee
that the device will have enough conventional zones to store this cache.
Resolve this problem by completely disabling the space cache. Doing so
does not introduce any problem with sequential block groups: all the free
space is located after the allocation pointer and there is no free space
before it, so no such cache is needed.

Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/ctree.h | 3 ++
fs/btrfs/dev-replace.c | 7 ++++
fs/btrfs/disk-io.c | 7 ++++
fs/btrfs/super.c | 12 +++---
fs/btrfs/volumes.c | 87 ++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.h | 1 +
6 files changed, 112 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 66f1d3895bca..14f880126532 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -763,6 +763,9 @@ struct btrfs_fs_info {
struct btrfs_root *uuid_root;
struct btrfs_root *free_space_root;

+ /* Zone size when in HMZONED mode */
+ u64 zone_size;
+
/* the log root tree is a directory of all the other log roots */
struct btrfs_root *log_root_tree;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index dec01970d8c5..839a35008fd8 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -202,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
return PTR_ERR(bdev);
}

+ if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+ !btrfs_fs_incompat(fs_info, HMZONED)) ||
+ (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
+ ret = -EINVAL;
+ goto error;
+ }
+
filemap_write_and_wait(bdev->bd_inode->i_mapping);

devices = &fs_info->fs_devices->devices;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5124c15705ce..14f284382ba7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3057,6 +3057,13 @@ int open_ctree(struct super_block *sb,

btrfs_free_extra_devids(fs_devices, 1);

+ ret = btrfs_check_hmzoned_mode(fs_info);
+ if (ret) {
+ btrfs_err(fs_info, "failed to init hmzoned mode: %d",
+ ret);
+ goto fail_block_groups;
+ }
+
ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
if (ret) {
btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5fdd95e3de05..cc812e459197 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
bool saved_compress_force;
int no_compress = 0;

- cache_gen = btrfs_super_cache_generation(info->super_copy);
- if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
- btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
- else if (cache_gen)
- btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+ if (!btrfs_fs_incompat(info, HMZONED)) {
+ cache_gen = btrfs_super_cache_generation(info->super_copy);
+ if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
+ btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
+ else if (cache_gen)
+ btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+ }

/*
* Even the options are empty, we still need to do extra check
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 35b3a2187653..ba7ebb80de4d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1293,6 +1293,80 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
return ret;
}

+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+ struct btrfs_device *device;
+ u64 hmzoned_devices = 0;
+ u64 nr_devices = 0;
+ u64 zone_size = 0;
+ int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+ int ret = 0;
+
+ /* Count zoned devices */
+ list_for_each_entry(device, &fs_devices->devices, dev_list) {
+ if (!device->bdev)
+ continue;
+ if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
+ (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
+ incompat_hmzoned)) {
+ hmzoned_devices++;
+ if (!zone_size) {
+ zone_size = device->zone_size;
+ } else if (device->zone_size != zone_size) {
+ btrfs_err(fs_info,
+ "Zoned block devices must have equal zone sizes");
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+ nr_devices++;
+ }
+
+ if (!hmzoned_devices && incompat_hmzoned) {
+ /* No zoned block device, disable HMZONED */
+ btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ fs_info->zone_size = zone_size;
+
+ if (hmzoned_devices != nr_devices) {
+ btrfs_err(fs_info,
+ "zoned devices mixed with regular devices");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* RAID56 is not allowed */
+ if (btrfs_fs_incompat(fs_info, RAID56)) {
+ btrfs_err(fs_info, "HMZONED mode does not support RAID56");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * SPACE CACHE writing is not cowed. Disable that to avoid
+ * write errors in sequential zones.
+ */
+ if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
+ btrfs_info(fs_info,
+ "disabling disk space caching with HMZONED mode");
+ btrfs_clear_opt(fs_info->mount_opt, SPACE_CACHE);
+ }
+
+ btrfs_set_and_info(fs_info, NOTREELOG,
+ "disabling tree log with HMZONED mode");
+
+ btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
+ fs_info->zone_size);
+
+out:
+
+ return ret;
+}
+
static void btrfs_release_disk_super(struct page *page)
{
kunmap(page);
@@ -2471,6 +2545,13 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
if (IS_ERR(bdev))
return PTR_ERR(bdev);

+ if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+ !btrfs_fs_incompat(fs_info, HMZONED)) ||
+ (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
+ ret = -EINVAL;
+ goto error;
+ }
+
if (fs_devices->seeding) {
seeding_dev = 1;
down_write(&sb->s_umount);
@@ -2584,6 +2665,12 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
}
}

+ ret = btrfs_check_hmzoned_mode(fs_info);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto error_sysfs;
+ }
+
if (seeding_dev) {
mutex_lock(&fs_info->chunk_mutex);
ret = init_first_rw_device(trans, fs_info);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 13d59bff204f..58053d2e24aa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -416,6 +416,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
fmode_t flags, void *holder);
int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
struct blk_zone *zone, gfp_t gfp_mask);
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
struct btrfs_device *btrfs_scan_one_device(const char *path,
fmode_t flags, void *holder);
int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
--
2.18.0


2018-08-09 18:08:38

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 10/17] btrfs: split btrfs_map_bio()

This patch splits btrfs_map_bio() into two functions so that the
following patches can make use of the latter part. The first part of
btrfs_map_bio() maps a bio to a btrfs_bio, and the second part submits
the mapped bios in the btrfs_bio to the actual devices.

By splitting the function, we can now reuse the latter part to send
buffered btrfs_bios.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/volumes.c | 53 +++++++++++++++++++++++++++-------------------
1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ada13120c2cd..08d13da2553f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6435,17 +6435,44 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
}
}

+static void __btrfs_map_bio(struct btrfs_fs_info *fs_info, u64 logical,
+ struct btrfs_bio *bbio, int async_submit)
+{
+ struct btrfs_device *dev;
+ int dev_nr;
+ int total_devs;
+ struct bio *first_bio = bbio->orig_bio;
+ struct bio *bio = first_bio;
+
+ total_devs = bbio->num_stripes;
+ for (dev_nr = 0; dev_nr < total_devs; dev_nr++) {
+ dev = bbio->stripes[dev_nr].dev;
+ if (!dev || !dev->bdev ||
+ (bio_op(first_bio) == REQ_OP_WRITE &&
+ !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))) {
+ bbio_error(bbio, first_bio, logical);
+ continue;
+ }
+
+ if (dev_nr < total_devs - 1)
+ bio = btrfs_bio_clone(first_bio);
+ else
+ bio = first_bio;
+
+ submit_stripe_bio(bbio, bio, bbio->stripes[dev_nr].physical,
+ dev_nr, async_submit);
+ }
+ btrfs_bio_counter_dec(fs_info);
+}
+
blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
int mirror_num, int async_submit)
{
- struct btrfs_device *dev;
struct bio *first_bio = bio;
u64 logical = (u64)bio->bi_iter.bi_sector << 9;
u64 length = 0;
u64 map_length;
int ret;
- int dev_nr;
- int total_devs;
struct btrfs_bio *bbio = NULL;

length = bio->bi_iter.bi_size;
@@ -6459,7 +6486,6 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
return errno_to_blk_status(ret);
}

- total_devs = bbio->num_stripes;
bbio->orig_bio = first_bio;
bbio->private = first_bio->bi_private;
bbio->end_io = first_bio->bi_end_io;
@@ -6489,24 +6515,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
BUG();
}

- for (dev_nr = 0; dev_nr < total_devs; dev_nr++) {
- dev = bbio->stripes[dev_nr].dev;
- if (!dev || !dev->bdev ||
- (bio_op(first_bio) == REQ_OP_WRITE &&
- !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))) {
- bbio_error(bbio, first_bio, logical);
- continue;
- }
-
- if (dev_nr < total_devs - 1)
- bio = btrfs_bio_clone(first_bio);
- else
- bio = first_bio;
-
- submit_stripe_bio(bbio, bio, bbio->stripes[dev_nr].physical,
- dev_nr, async_submit);
- }
- btrfs_bio_counter_dec(fs_info);
+ __btrfs_map_bio(fs_info, logical, bbio, async_submit);
return BLK_STS_OK;
}

--
2.18.0


2018-08-09 18:08:51

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 12/17] btrfs: expire submit buffer on timeout

It is possible for bios to stall in the submit buffer due to a bug or a
device problem. In such a situation, btrfs stops working, waiting for the
completion of the buffered bios. To avoid such a hang, add a worker that
cancels the stalled bios after a timeout expires.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/async-thread.c | 1 +
fs/btrfs/async-thread.h | 1 +
fs/btrfs/ctree.h | 5 +++
fs/btrfs/disk-io.c | 7 +++-
fs/btrfs/extent-tree.c | 20 ++++++++++
fs/btrfs/super.c | 20 ++++++++++
fs/btrfs/volumes.c | 83 ++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 1 +
8 files changed, 136 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index d522494698fa..86735dfbabcc 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -109,6 +109,7 @@ BTRFS_WORK_HELPER(scrub_helper);
BTRFS_WORK_HELPER(scrubwrc_helper);
BTRFS_WORK_HELPER(scrubnc_helper);
BTRFS_WORK_HELPER(scrubparity_helper);
+BTRFS_WORK_HELPER(bio_expire_helper);

static struct __btrfs_workqueue *
__btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info, const char *name,
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index 7861c9feba5f..2c041f0668d4 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -54,6 +54,7 @@ BTRFS_WORK_HELPER_PROTO(scrub_helper);
BTRFS_WORK_HELPER_PROTO(scrubwrc_helper);
BTRFS_WORK_HELPER_PROTO(scrubnc_helper);
BTRFS_WORK_HELPER_PROTO(scrubparity_helper);
+BTRFS_WORK_HELPER_PROTO(bio_expire_helper);


struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ebbbf46aa540..8f85c96cd262 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -699,6 +699,10 @@ struct btrfs_block_group_cache {
spinlock_t submit_lock;
u64 submit_offset;
struct list_head submit_buffer;
+ struct btrfs_work work;
+ unsigned long last_submit;
+ int expired:1;
+ struct task_struct *expire_thread;
};

/* delayed seq elem */
@@ -974,6 +978,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *submit_workers;
struct btrfs_workqueue *caching_workers;
struct btrfs_workqueue *readahead_workers;
+ struct btrfs_workqueue *bio_expire_workers;

/*
* fixup workers take dirty pages that didn't properly go through
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6a014632ca1e..00fa6aca9bb5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2040,6 +2040,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
*/
btrfs_destroy_workqueue(fs_info->endio_meta_workers);
btrfs_destroy_workqueue(fs_info->endio_meta_write_workers);
+ btrfs_destroy_workqueue(fs_info->bio_expire_workers);
}

static void free_root_extent_buffers(struct btrfs_root *root)
@@ -2245,6 +2246,9 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
btrfs_alloc_workqueue(fs_info, "extent-refs", flags,
min_t(u64, fs_devices->num_devices,
max_active), 8);
+ fs_info->bio_expire_workers =
+ btrfs_alloc_workqueue(fs_info, "bio-expire", flags,
+ max_active, 0);

if (!(fs_info->workers && fs_info->delalloc_workers &&
fs_info->submit_workers && fs_info->flush_workers &&
@@ -2256,7 +2260,8 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
fs_info->caching_workers && fs_info->readahead_workers &&
fs_info->fixup_workers && fs_info->delayed_workers &&
fs_info->extent_workers &&
- fs_info->qgroup_rescan_workers)) {
+ fs_info->qgroup_rescan_workers &&
+ fs_info->bio_expire_workers)) {
return -ENOMEM;
}

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6b7b632b0791..a5f5935315c8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9745,6 +9745,14 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
block_group->cached == BTRFS_CACHE_ERROR)
free_excluded_extents(block_group);

+ if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
+ spin_lock(&block_group->submit_lock);
+ if (block_group->expire_thread)
+ wake_up_process(block_group->expire_thread);
+ spin_unlock(&block_group->submit_lock);
+ flush_work(&block_group->work.normal_work);
+ }
+
btrfs_remove_free_space_cache(block_group);
ASSERT(block_group->cached != BTRFS_CACHE_STARTED);
ASSERT(list_empty(&block_group->dirty_list));
@@ -10061,6 +10069,10 @@ btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
}

cache->submit_offset = logical + cache->alloc_offset;
+ btrfs_init_work(&cache->work, btrfs_bio_expire_helper,
+ expire_bios_fn, NULL, NULL);
+ cache->last_submit = 0;
+ cache->expired = 0;

out:
cache->alloc_type = alloc_type;
@@ -10847,6 +10859,14 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
}
spin_unlock(&fs_info->unused_bgs_lock);

+ if (block_group->alloc_type == BTRFS_ALLOC_SEQ) {
+ spin_lock(&block_group->submit_lock);
+ if (block_group->expire_thread)
+ wake_up_process(block_group->expire_thread);
+ spin_unlock(&block_group->submit_lock);
+ flush_work(&block_group->work.normal_work);
+ }
+
mutex_lock(&fs_info->delete_unused_bgs_mutex);

/* Don't want to race with allocators so take the groups_sem */
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index cc812e459197..4d1d6cc7cd59 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -154,6 +154,25 @@ void __btrfs_handle_fs_error(struct btrfs_fs_info *fs_info, const char *function
* completes. The next time when the filesystem is mounted writeable
* again, the device replace operation continues.
*/
+
+ /* expire pending bios in submit buffer */
+ if (btrfs_fs_incompat(fs_info, HMZONED)) {
+ struct btrfs_block_group_cache *block_group;
+ struct rb_node *node;
+
+ spin_lock(&fs_info->block_group_cache_lock);
+ for (node = rb_first(&fs_info->block_group_cache_tree); node;
+ node = rb_next(node)) {
+ block_group = rb_entry(node,
+ struct btrfs_block_group_cache,
+ cache_node);
+ spin_lock(&block_group->submit_lock);
+ if (block_group->expire_thread)
+ wake_up_process(block_group->expire_thread);
+ spin_unlock(&block_group->submit_lock);
+ }
+ spin_unlock(&fs_info->block_group_cache_lock);
+ }
}

#ifdef CONFIG_PRINTK
@@ -1730,6 +1749,7 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
btrfs_workqueue_set_max(fs_info->readahead_workers, new_pool_size);
btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers,
new_pool_size);
+ btrfs_workqueue_set_max(fs_info->bio_expire_workers, new_pool_size);
}

static inline void btrfs_remount_prepare(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ca03b7136892..0e68003a429d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6498,6 +6498,7 @@ static void __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info, u64 logical,
struct btrfs_block_group_cache *cache = NULL;
int sent;
LIST_HEAD(submit_list);
+ int should_queue = 1;

WARN_ON(bio_op(bbio->orig_bio) != REQ_OP_WRITE);

@@ -6512,7 +6513,21 @@ static void __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info, u64 logical,
bbio->need_seqwrite = 1;

spin_lock(&cache->submit_lock);
- if (cache->submit_offset == logical)
+
+ if (cache->expired) {
+ int i, total_devs = bbio->num_stripes;
+
+ spin_unlock(&cache->submit_lock);
+ btrfs_err(cache->fs_info,
+ "IO in expired block group %llu+%llu",
+ logical, length);
+ for (i = 0; i < total_devs; i++)
+ bbio_error(bbio, bbio->orig_bio, logical);
+ btrfs_put_block_group(cache);
+ return;
+ }
+
+ if (cache->submit_offset == logical || cache->expired)
goto send_bios;

if (cache->submit_offset > logical) {
@@ -6527,7 +6542,11 @@ static void __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info, u64 logical,

/* buffer the unaligned bio */
list_add_tail(&bbio->list, &cache->submit_buffer);
+ should_queue = !cache->last_submit;
+ cache->last_submit = jiffies;
spin_unlock(&cache->submit_lock);
+ if (should_queue)
+ btrfs_queue_work(fs_info->bio_expire_workers, &cache->work);
btrfs_put_block_group(cache);

return;
@@ -6561,6 +6580,14 @@ static void __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info, u64 logical,
}
}
} while (sent);
+
+ if (list_empty(&cache->submit_buffer)) {
+ should_queue = 0;
+ cache->last_submit = 0;
+ } else {
+ should_queue = !cache->last_submit;
+ cache->last_submit = jiffies;
+ }
spin_unlock(&cache->submit_lock);

/* send the collected bios */
@@ -6572,6 +6599,8 @@ static void __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info, u64 logical,

if (length)
goto loop;
+ if (should_queue)
+ btrfs_queue_work(fs_info->bio_expire_workers, &cache->work);
btrfs_put_block_group(cache);
}

@@ -6632,6 +6661,58 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
return BLK_STS_OK;
}

+void expire_bios_fn(struct btrfs_work *work)
+{
+ struct btrfs_block_group_cache *cache;
+ struct btrfs_bio *bbio, *next;
+ unsigned long expire_time, cur;
+ unsigned long expire = 90 * HZ;
+ LIST_HEAD(submit_list);
+
+ cache = container_of(work, struct btrfs_block_group_cache, work);
+ btrfs_get_block_group(cache);
+loop:
+ spin_lock(&cache->submit_lock);
+ cache->expire_thread = current;
+ if (list_empty(&cache->submit_buffer)) {
+ cache->last_submit = 0;
+ cache->expire_thread = NULL;
+ spin_unlock(&cache->submit_lock);
+ btrfs_put_block_group(cache);
+ return;
+ }
+ cur = jiffies;
+ expire_time = cache->last_submit + expire;
+ if (time_before(cur, expire_time) && !sb_rdonly(cache->fs_info->sb)) {
+ spin_unlock(&cache->submit_lock);
+ schedule_timeout_interruptible(expire_time - cur);
+ goto loop;
+ }
+
+ list_splice_init(&cache->submit_buffer, &submit_list);
+ cache->expired = 1;
+ cache->expire_thread = NULL;
+ spin_unlock(&cache->submit_lock);
+
+ btrfs_handle_fs_error(cache->fs_info, -EIO,
+ "bio submit buffer expired");
+ btrfs_err(cache->fs_info, "block group %llu submit pos %llu",
+ cache->key.objectid, cache->submit_offset);
+
+ list_for_each_entry_safe(bbio, next, &submit_list, list) {
+ u64 logical = (u64)bbio->orig_bio->bi_iter.bi_sector << 9;
+ int i, total_devs = bbio->num_stripes;
+
+ btrfs_err(cache->fs_info, "expiring %llu", logical);
+ list_del_init(&bbio->list);
+ for (i = 0; i < total_devs; i++)
+ bbio_error(bbio, bbio->orig_bio, logical);
+ }
+
+ cache->last_submit = 0;
+ btrfs_put_block_group(cache);
+}
+
struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 devid,
u8 *uuid, u8 *fsid)
{
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 3db90f5395cd..2a3c046fa31b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -415,6 +415,7 @@ void btrfs_mapping_init(struct btrfs_mapping_tree *tree);
void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree);
blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
int mirror_num, int async_submit);
+void expire_bios_fn(struct btrfs_work *work);
int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
fmode_t flags, void *holder);
int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
--
2.18.0


2018-08-09 18:09:00

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 11/17] btrfs: introduce submit buffer

Sequential allocation is not enough to maintain sequential delivery of
write IOs to the device. Various btrfs features (async compress, async
checksum, ...) affect the ordering of the IOs. This patch introduces a
submit buffer that collects the WRITE bios belonging to a block group and
submits them in increasing block address order, to achieve sequential
write sequences with submit_stripe_bio().

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/ctree.h | 3 +
fs/btrfs/extent-tree.c | 5 ++
fs/btrfs/volumes.c | 121 +++++++++++++++++++++++++++++++++++++++--
fs/btrfs/volumes.h | 3 +
4 files changed, 128 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5060bcdcb72b..ebbbf46aa540 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -696,6 +696,9 @@ struct btrfs_block_group_cache {
*/
enum btrfs_alloc_type alloc_type;
u64 alloc_offset;
+ spinlock_t submit_lock;
+ u64 submit_offset;
+ struct list_head submit_buffer;
};

/* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d4355b9b494e..6b7b632b0791 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -105,6 +105,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
if (atomic_dec_and_test(&cache->count)) {
WARN_ON(cache->pinned > 0);
WARN_ON(cache->reserved > 0);
+ WARN_ON(!list_empty(&cache->submit_buffer));

/*
* If not empty, someone is still holding mutex of
@@ -10059,6 +10060,8 @@ btrfs_get_block_group_alloc_offset(struct btrfs_block_group_cache *cache)
goto out;
}

+ cache->submit_offset = logical + cache->alloc_offset;
+
out:
cache->alloc_type = alloc_type;
kfree(alloc_offsets);
@@ -10095,6 +10098,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,

atomic_set(&cache->count, 1);
spin_lock_init(&cache->lock);
+ spin_lock_init(&cache->submit_lock);
init_rwsem(&cache->data_rwsem);
INIT_LIST_HEAD(&cache->list);
INIT_LIST_HEAD(&cache->cluster_list);
@@ -10102,6 +10106,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&cache->ro_list);
INIT_LIST_HEAD(&cache->dirty_list);
INIT_LIST_HEAD(&cache->io_list);
+ INIT_LIST_HEAD(&cache->submit_buffer);
btrfs_init_free_space_ctl(cache);
atomic_set(&cache->trimming, 0);
mutex_init(&cache->free_space_lock);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 08d13da2553f..ca03b7136892 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -513,6 +513,8 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
spin_unlock(&device->io_lock);

while (pending) {
+ struct btrfs_bio *bbio;
+ struct completion *sent = NULL;

rmb();
/* we want to work on both lists, but do more bios on the
@@ -550,7 +552,12 @@ static noinline void run_scheduled_bios(struct btrfs_device *device)
sync_pending = 0;
}

+ bbio = cur->bi_private;
+ if (bbio)
+ sent = bbio->sent;
btrfsic_submit_bio(cur);
+ if (sent)
+ complete(sent);
num_run++;
batch_run++;

@@ -5542,6 +5549,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)

atomic_set(&bbio->error, 0);
refcount_set(&bbio->refs, 1);
+ INIT_LIST_HEAD(&bbio->list);

return bbio;
}
@@ -6351,7 +6359,7 @@ static void btrfs_end_bio(struct bio *bio)
* the work struct is scheduled.
*/
static noinline void btrfs_schedule_bio(struct btrfs_device *device,
- struct bio *bio)
+ struct bio *bio, int need_seqwrite)
{
struct btrfs_fs_info *fs_info = device->fs_info;
int should_queue = 1;
@@ -6365,7 +6373,12 @@ static noinline void btrfs_schedule_bio(struct btrfs_device *device,

/* don't bother with additional async steps for reads, right now */
if (bio_op(bio) == REQ_OP_READ) {
+ struct btrfs_bio *bbio = bio->bi_private;
+ struct completion *sent = bbio->sent;
+
btrfsic_submit_bio(bio);
+ if (sent)
+ complete(sent);
return;
}

@@ -6373,7 +6386,7 @@ static noinline void btrfs_schedule_bio(struct btrfs_device *device,
bio->bi_next = NULL;

spin_lock(&device->io_lock);
- if (op_is_sync(bio->bi_opf))
+ if (op_is_sync(bio->bi_opf) && need_seqwrite == 0)
pending_bios = &device->pending_sync_bios;
else
pending_bios = &device->pending_bios;
@@ -6412,8 +6425,21 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,

btrfs_bio_counter_inc_noblocked(fs_info);

+ /* queue all bios into scheduler if sequential write is required */
+ if (bbio->need_seqwrite) {
+ if (!async) {
+ DECLARE_COMPLETION_ONSTACK(sent);
+
+ bbio->sent = &sent;
+ btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
+ wait_for_completion_io(&sent);
+ } else {
+ btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
+ }
+ return;
+ }
if (async)
- btrfs_schedule_bio(dev, bio);
+ btrfs_schedule_bio(dev, bio, bbio->need_seqwrite);
else
btrfsic_submit_bio(bio);
}
@@ -6465,6 +6491,90 @@ static void __btrfs_map_bio(struct btrfs_fs_info *fs_info, u64 logical,
btrfs_bio_counter_dec(fs_info);
}

+static void __btrfs_map_bio_zoned(struct btrfs_fs_info *fs_info, u64 logical,
+ struct btrfs_bio *bbio, int async_submit)
+{
+ u64 length = bbio->orig_bio->bi_iter.bi_size;
+ struct btrfs_block_group_cache *cache = NULL;
+ int sent;
+ LIST_HEAD(submit_list);
+
+ WARN_ON(bio_op(bbio->orig_bio) != REQ_OP_WRITE);
+
+ cache = btrfs_lookup_block_group(fs_info, logical);
+ if (!cache || cache->alloc_type != BTRFS_ALLOC_SEQ) {
+ if (cache)
+ btrfs_put_block_group(cache);
+ __btrfs_map_bio(fs_info, logical, bbio, async_submit);
+ return;
+ }
+
+ bbio->need_seqwrite = 1;
+
+ spin_lock(&cache->submit_lock);
+ if (cache->submit_offset == logical)
+ goto send_bios;
+
+ if (cache->submit_offset > logical) {
+ btrfs_info(fs_info, "sending unaligned bio... %llu+%llu %llu\n",
+ logical, length, cache->submit_offset);
+ spin_unlock(&cache->submit_lock);
+ WARN_ON(1);
+ btrfs_put_block_group(cache);
+ __btrfs_map_bio(fs_info, logical, bbio, async_submit);
+ return;
+ }
+
+ /* buffer the unaligned bio */
+ list_add_tail(&bbio->list, &cache->submit_buffer);
+ spin_unlock(&cache->submit_lock);
+ btrfs_put_block_group(cache);
+
+ return;
+
+send_bios:
+ spin_unlock(&cache->submit_lock);
+ /* send this bio */
+ __btrfs_map_bio(fs_info, logical, bbio, 1);
+
+loop:
+ /* and send previously buffered following bios */
+ spin_lock(&cache->submit_lock);
+ cache->submit_offset += length;
+ length = 0;
+ INIT_LIST_HEAD(&submit_list);
+
+ /* collect sequential bios into submit_list */
+ do {
+ struct btrfs_bio *next;
+
+ sent = 0;
+ list_for_each_entry_safe(bbio, next,
+ &cache->submit_buffer, list) {
+ struct bio *orig_bio = bbio->orig_bio;
+ u64 logical = (u64)orig_bio->bi_iter.bi_sector << 9;
+
+ if (logical == cache->submit_offset + length) {
+ sent = 1;
+ length += orig_bio->bi_iter.bi_size;
+ list_move_tail(&bbio->list, &submit_list);
+ }
+ }
+ } while (sent);
+ spin_unlock(&cache->submit_lock);
+
+ /* send the collected bios */
+ list_for_each_entry(bbio, &submit_list, list) {
+ __btrfs_map_bio(bbio->fs_info,
+ (u64)bbio->orig_bio->bi_iter.bi_sector << 9,
+ bbio, 1);
+ }
+
+ if (length)
+ goto loop;
+ btrfs_put_block_group(cache);
+}
+
blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
int mirror_num, int async_submit)
{
@@ -6515,7 +6625,10 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
BUG();
}

- __btrfs_map_bio(fs_info, logical, bbio, async_submit);
+ if (btrfs_fs_incompat(fs_info, HMZONED) && bio_op(bio) == REQ_OP_WRITE)
+ __btrfs_map_bio_zoned(fs_info, logical, bbio, async_submit);
+ else
+ __btrfs_map_bio(fs_info, logical, bbio, async_submit);
return BLK_STS_OK;
}

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 58053d2e24aa..3db90f5395cd 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -317,6 +317,9 @@ struct btrfs_bio {
int mirror_num;
int num_tgtdevs;
int *tgtdev_map;
+ int need_seqwrite;
+ struct list_head list;
+ struct completion *sent;
/*
* logical block numbers for the start of each stripe
* The last one or two are p/q. These are sorted,
--
2.18.0
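The ordering scheme in __btrfs_map_bio_zoned() above, issue a write immediately if it starts at the block group's submit pointer, otherwise park it on submit_buffer and drain buffered bios once their predecessors have been sent, can be modeled in userspace. The sketch below is illustrative only; the struct and function names are invented for the example and `issue()` stands in for __btrfs_map_bio().

```c
#include <assert.h>

#define MAX_BUF 16

struct pending { unsigned long long start, len; int used; };

struct seq_submit {
	unsigned long long next;             /* current write pointer (bytes) */
	struct pending buf[MAX_BUF];         /* buffered out-of-order writes */
	unsigned long long emitted[MAX_BUF]; /* offsets issued, in issue order */
	int nemitted;
};

static void issue(struct seq_submit *s, unsigned long long start,
		  unsigned long long len)
{
	s->emitted[s->nemitted++] = start;   /* stand-in for __btrfs_map_bio() */
	s->next = start + len;
}

/*
 * Issue a write if it lands exactly on the write pointer, else buffer it;
 * after each issue, drain any buffered writes that became contiguous.
 */
static void submit(struct seq_submit *s, unsigned long long start,
		   unsigned long long len)
{
	int i, progress;

	if (start != s->next) {
		for (i = 0; i < MAX_BUF; i++) {
			if (!s->buf[i].used) {
				s->buf[i].start = start;
				s->buf[i].len = len;
				s->buf[i].used = 1;
				return;
			}
		}
		return; /* buffer full: dropped in this toy model */
	}
	issue(s, start, len);
	do {
		progress = 0;
		for (i = 0; i < MAX_BUF; i++) {
			if (s->buf[i].used && s->buf[i].start == s->next) {
				unsigned long long st = s->buf[i].start;
				unsigned long long ln = s->buf[i].len;

				s->buf[i].used = 0;
				issue(s, st, ln);
				progress = 1;
			}
		}
	} while (progress);
}
```

Writes arriving out of LBA order are thus reordered before reaching the device, which is what a sequential write required zone demands.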


2018-08-09 18:09:13

by Naohiro Aota

Subject: [RFC PATCH 02/17] btrfs: Get zone information of zoned block devices

If a zoned block device is found, get its zone information (number of zones
and zone size) using the new helper function btrfs_get_dev_zone(). To
avoid costly run-time zone report commands to test the device zones type
during block allocation, attach the seq_zones bitmap to the device structure
to indicate whether each zone is sequential or accepts random writes.

This patch also introduces the helper function btrfs_dev_is_sequential() to
test whether the zone storing a block is a sequential write required zone.
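The per-zone bitmap lookup can be sketched in userspace as follows. Names and sizes are illustrative rather than the kernel's; the real code uses the kernel's set_bit()/test_bit() on a kcalloc'd bitmap, but the arithmetic is the same: one bit per zone, indexed by the byte offset shifted by log2 of the zone size.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define MAX_ZONES 256

struct zone_map {
	unsigned long seq_zones[BITS_TO_LONGS(MAX_ZONES)]; /* 1 bit per zone */
	unsigned int zone_size_shift;                      /* log2(zone size) */
};

static void mark_sequential(struct zone_map *m, unsigned int zno)
{
	m->seq_zones[zno / BITS_PER_LONG] |= 1UL << (zno % BITS_PER_LONG);
}

/* Equivalent of btrfs_dev_is_sequential(): O(1), no zone report command. */
static int is_sequential(const struct zone_map *m, unsigned long long pos)
{
	unsigned int zno = pos >> m->zone_size_shift;

	return !!(m->seq_zones[zno / BITS_PER_LONG] &
		  (1UL << (zno % BITS_PER_LONG)));
}
```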

Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/volumes.c | 146 +++++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.h | 32 ++++++++++
2 files changed, 178 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index da86706123ff..35b3a2187653 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -677,6 +677,134 @@ static void btrfs_free_stale_devices(const char *path,
}
}

+static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+ struct blk_zone **zones,
+ unsigned int *nr_zones, gfp_t gfp_mask)
+{
+ struct blk_zone *z = *zones;
+ int ret;
+
+ if (!z) {
+ z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
+ if (!z)
+ return -ENOMEM;
+ }
+
+ ret = blkdev_report_zones(device->bdev, pos >> 9,
+ z, nr_zones, gfp_mask);
+ if (ret != 0) {
+ pr_err("BTRFS: Get zone at %llu failed %d\n",
+ pos, ret);
+ return ret;
+ }
+
+ *zones = z;
+
+ return 0;
+}
+
+static void btrfs_drop_dev_zonetypes(struct btrfs_device *device)
+{
+ kfree(device->seq_zones);
+ kfree(device->empty_zones);
+ device->seq_zones = NULL;
+ device->empty_zones = NULL;
+ device->nr_zones = 0;
+ device->zone_size = 0;
+ device->zone_size_shift = 0;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+ struct blk_zone *zone, gfp_t gfp_mask)
+{
+ unsigned int nr_zones = 1;
+ int ret;
+
+ ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
+ if (ret != 0 || !nr_zones)
+ return ret ? ret : -EIO;
+
+ return 0;
+}
+
+static int btrfs_get_dev_zonetypes(struct btrfs_device *device)
+{
+ struct block_device *bdev = device->bdev;
+ sector_t nr_sectors = bdev->bd_part->nr_sects;
+ sector_t sector = 0;
+ struct blk_zone *zones = NULL;
+ unsigned int i, n = 0, nr_zones;
+ int ret;
+
+ device->zone_size = 0;
+ device->zone_size_shift = 0;
+ device->nr_zones = 0;
+ device->seq_zones = NULL;
+ device->empty_zones = NULL;
+
+ if (!bdev_is_zoned(bdev))
+ return 0;
+
+ device->zone_size = (u64)bdev_zone_sectors(bdev) << 9;
+ device->zone_size_shift = ilog2(device->zone_size);
+ device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+ if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
+ device->nr_zones++;
+
+ device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
+ sizeof(*device->seq_zones), GFP_KERNEL);
+ if (!device->seq_zones)
+ return -ENOMEM;
+
+ device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
+ sizeof(*device->empty_zones), GFP_KERNEL);
+ if (!device->empty_zones)
+ return -ENOMEM;
+
+#define BTRFS_REPORT_NR_ZONES 4096
+
+ /* Get zones type */
+ while (sector < nr_sectors) {
+ nr_zones = BTRFS_REPORT_NR_ZONES;
+ ret = __btrfs_get_dev_zones(device, sector << 9,
+ &zones, &nr_zones, GFP_KERNEL);
+ if (ret != 0 || !nr_zones) {
+ if (!ret)
+ ret = -EIO;
+ goto out;
+ }
+
+ for (i = 0; i < nr_zones; i++) {
+ if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+ set_bit(n, device->seq_zones);
+ if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+ set_bit(n, device->empty_zones);
+ sector = zones[i].start + zones[i].len;
+ n++;
+ }
+ }
+
+ if (n != device->nr_zones) {
+ pr_err("BTRFS: Inconsistent number of zones (%u / %u)\n",
+ n, device->nr_zones);
+ ret = -EIO;
+ goto out;
+ }
+
+ pr_info("BTRFS: host-%s zoned block device, %u zones of %llu sectors\n",
+ bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+ device->nr_zones, device->zone_size >> 9);
+
+out:
+ kfree(zones);
+
+ if (ret)
+ btrfs_drop_dev_zonetypes(device);
+
+ return ret;
+}
+
+
static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
struct btrfs_device *device, fmode_t flags,
void *holder)
@@ -726,6 +854,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
device->mode = flags;

+ /* Get zone type information of zoned block devices */
+ if (bdev_is_zoned(bdev)) {
+ ret = btrfs_get_dev_zonetypes(device);
+ if (ret != 0)
+ goto error_brelse;
+ }
+
fs_devices->open_devices++;
if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -1012,6 +1147,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
}

blkdev_put(device->bdev, device->mode);
+ btrfs_drop_dev_zonetypes(device);
}

static void btrfs_close_one_device(struct btrfs_device *device)
@@ -2439,6 +2575,15 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
mutex_unlock(&fs_info->chunk_mutex);
mutex_unlock(&fs_devices->device_list_mutex);

+ /* Get zone type information of zoned block devices */
+ if (bdev_is_zoned(bdev)) {
+ ret = btrfs_get_dev_zonetypes(device);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto error_sysfs;
+ }
+ }
+
if (seeding_dev) {
mutex_lock(&fs_info->chunk_mutex);
ret = init_first_rw_device(trans, fs_info);
@@ -2504,6 +2649,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;

error_sysfs:
+ btrfs_drop_dev_zonetypes(device);
btrfs_sysfs_rm_device_link(fs_devices, device);
mutex_lock(&fs_info->fs_devices->device_list_mutex);
mutex_lock(&fs_info->chunk_mutex);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 23e9285d88de..13d59bff204f 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -61,6 +61,16 @@ struct btrfs_device {

struct block_device *bdev;

+ /*
+ * Number of zones, zone size and types of zones if bdev is a
+ * zoned block device.
+ */
+ u64 zone_size;
+ u8 zone_size_shift;
+ u32 nr_zones;
+ unsigned long *seq_zones;
+ unsigned long *empty_zones;
+
/* the mode sent to blkdev_get */
fmode_t mode;

@@ -404,6 +414,8 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
int mirror_num, int async_submit);
int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
fmode_t flags, void *holder);
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+ struct blk_zone *zone, gfp_t gfp_mask);
struct btrfs_device *btrfs_scan_one_device(const char *path,
fmode_t flags, void *holder);
int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
@@ -466,6 +478,26 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
u64 chunk_offset, u64 chunk_size);
int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);

+static inline int btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+ unsigned int zno = pos >> device->zone_size_shift;
+
+ if (!device->seq_zones)
+ return 1;
+
+ return test_bit(zno, device->seq_zones);
+}
+
+static inline int btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+ unsigned int zno = pos >> device->zone_size_shift;
+
+ if (!device->empty_zones)
+ return 0;
+
+ return test_bit(zno, device->empty_zones);
+}
+
static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
int index)
{
--
2.18.0


2018-08-09 18:09:30

by Naohiro Aota

Subject: [RFC PATCH 06/17] btrfs: disable direct IO in HMZONED mode

Direct write I/Os can be directed at existing extents that have already
been written. Such write requests are prohibited on host-managed zoned
block devices. So disable direct IO support for a volume with HMZONED mode
enabled.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/inode.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 212fa71317d6..05f5e05ccf37 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8523,6 +8523,9 @@ static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
unsigned int blocksize_mask = fs_info->sectorsize - 1;
ssize_t retval = -EINVAL;

+ if (btrfs_fs_incompat(fs_info, HMZONED))
+ goto out;
+
if (offset & blocksize_mask)
goto out;

--
2.18.0


2018-08-09 18:09:34

by Naohiro Aota

Subject: [RFC PATCH 05/17] btrfs: disable fallocate in HMZONED mode

fallocate() is implemented by allocating actual extents instead of making
reservations. This can expose the sequential write constraint of
host-managed zoned block devices to the application, which would break the
POSIX semantics for the fallocated file. To avoid this, report fallocate()
as not supported when in HMZONED mode.

Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/file.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 095f0bb86bb7..6f4546ccb57d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2993,6 +2993,10 @@ static long btrfs_fallocate(struct file *file, int mode,
alloc_end = round_up(offset + len, blocksize);
cur_offset = alloc_start;

+ /* Do not allow fallocate in HMZONED mode */
+ if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+ return -EOPNOTSUPP;
+
/* Make sure we aren't being give some crap mode */
if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_ZERO_RANGE))
--
2.18.0


2018-08-09 18:12:24

by Naohiro Aota

Subject: [RFC PATCH 01/12] btrfs-progs: build: Check zoned block device support

If the kernel supports zoned block devices, the file
/usr/include/linux/blkzoned.h will be present. Check this and define
BTRFS_ZONED if the file is present.

If the header is present, the HMZONED feature is enabled; if not, it is
disabled.

Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
---
configure.ac | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/configure.ac b/configure.ac
index df02f206..616d62a1 100644
--- a/configure.ac
+++ b/configure.ac
@@ -207,6 +207,18 @@ else
AC_DEFINE([HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE], [0], [We did not define FIEMAP_EXTENT_SHARED])
fi

+AC_CHECK_HEADER(linux/blkzoned.h, [blkzoned_found=yes], [blkzoned_found=no])
+AC_ARG_ENABLE([zoned],
+ AS_HELP_STRING([--disable-zoned], [disable zoned block device support]),
+ [], [enable_zoned=$blkzoned_found]
+)
+
+AS_IF([test "x$enable_zoned" = xyes], [
+ AC_CHECK_HEADER(linux/blkzoned.h, [],
+ [AC_MSG_ERROR([Couldn't find linux/blkzoned.h])])
+ AC_DEFINE([BTRFS_ZONED], [1], [enable zoned block device support])
+])
+
dnl Define <NAME>_LIBS= and <NAME>_CFLAGS= by pkg-config
dnl
dnl The default PKG_CHECK_MODULES() action-if-not-found is end the
@@ -308,6 +320,7 @@ AC_MSG_RESULT([
btrfs-restore zstd: ${enable_zstd}
Python bindings: ${enable_python}
Python interpreter: ${PYTHON}
+ zoned device: ${enable_zoned}

Type 'make' to compile.
])
--
2.18.0


2018-08-09 18:12:35

by Naohiro Aota

Subject: [RFC PATCH 03/12] btrfs-progs: add new HMZONED feature flag

With this feature enabled, a zoned block device aware btrfs allocates block
groups aligned to the device zones and always writes in sequential zones at
the zone write pointer position.

Enabling this feature also force-disables conversion from ext4 volumes.

Note: this flag can be moved to COMPAT_RO, so that older kernels can read
but not write zoned block devices formatted with btrfs.
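The flag value matches the patch below (bit 10 of the incompat flags). As a minimal sketch of why an incompat bit is the right choice here: a kernel must refuse to mount a filesystem whose superblock carries any incompat bit it does not know. The helper name is invented for the example.

```c
#include <assert.h>

#define BTRFS_FEATURE_INCOMPAT_HMZONED (1ULL << 10)

/*
 * A kernel that supports only the bits in `supp` must refuse to mount a
 * filesystem whose superblock carries any other incompat bit.
 */
static int can_mount(unsigned long long sb_incompat, unsigned long long supp)
{
	return (sb_incompat & ~supp) == 0;
}
```

A COMPAT_RO bit would instead let such kernels mount read-only, which is the alternative the note above mentions.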

Signed-off-by: Naohiro Aota <[email protected]>
---
cmds-inspect-dump-super.c | 3 ++-
ctree.h | 4 +++-
fsfeatures.c | 8 ++++++++
fsfeatures.h | 2 +-
libbtrfsutil/btrfs.h | 1 +
5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index e965267c..4e365ead 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -228,7 +228,8 @@ static struct readable_flag_entry incompat_flags_array[] = {
DEF_INCOMPAT_FLAG_ENTRY(EXTENDED_IREF),
DEF_INCOMPAT_FLAG_ENTRY(RAID56),
DEF_INCOMPAT_FLAG_ENTRY(SKINNY_METADATA),
- DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES)
+ DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES),
+ DEF_INCOMPAT_FLAG_ENTRY(HMZONED)
};
static const int incompat_flags_num = sizeof(incompat_flags_array) /
sizeof(struct readable_flag_entry);
diff --git a/ctree.h b/ctree.h
index 4719962d..6d805ecd 100644
--- a/ctree.h
+++ b/ctree.h
@@ -489,6 +489,7 @@ struct btrfs_super_block {
#define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7)
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED (1ULL << 10)

#define BTRFS_FEATURE_COMPAT_SUPP 0ULL

@@ -509,7 +510,8 @@ struct btrfs_super_block {
BTRFS_FEATURE_INCOMPAT_RAID56 | \
BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS | \
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA | \
- BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+ BTRFS_FEATURE_INCOMPAT_NO_HOLES | \
+ BTRFS_FEATURE_INCOMPAT_HMZONED)

/*
* A leaf is full of items. offset and size tell us where to find
diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f..53396dd4 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -86,6 +86,14 @@ static const struct btrfs_fs_feature {
VERSION_TO_STRING2(4,0),
NULL, 0,
"no explicit hole extents for files" },
+#ifdef BTRFS_ZONED
+ { "hmzoned", BTRFS_FEATURE_INCOMPAT_HMZONED,
+ "hmzoned",
+ NULL, 0,
+ NULL, 0,
+ NULL, 0,
+ "support Host-Managed Zoned devices" },
+#endif
/* Keep this one last */
{ "list-all", BTRFS_FEATURE_LIST_ALL, NULL }
};
diff --git a/fsfeatures.h b/fsfeatures.h
index 3cc9452a..0918ee1a 100644
--- a/fsfeatures.h
+++ b/fsfeatures.h
@@ -25,7 +25,7 @@
| BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)

/*
- * Avoid multi-device features (RAID56) and mixed block groups
+ * Avoid multi-device features (RAID56), mixed block groups, and hmzoned device
*/
#define BTRFS_CONVERT_ALLOWED_FEATURES \
(BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF \
diff --git a/libbtrfsutil/btrfs.h b/libbtrfsutil/btrfs.h
index c293f6bf..c6a60fbc 100644
--- a/libbtrfsutil/btrfs.h
+++ b/libbtrfsutil/btrfs.h
@@ -268,6 +268,7 @@ struct btrfs_ioctl_fs_info_args {
#define BTRFS_FEATURE_INCOMPAT_RAID56 (1ULL << 7)
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA (1ULL << 8)
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES (1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED (1ULL << 10)

struct btrfs_ioctl_feature_flags {
__u64 compat_flags;
--
2.18.0


2018-08-09 18:12:37

by Naohiro Aota

Subject: [RFC PATCH 02/12] btrfs-progs: utils: Introduce queue_param

Introduce the queue_param function to get a device request queue
parameter and this function to test if the device is an SSD in
is_ssd().
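queue_param() resolves the whole-disk name for the device (via libblkid, since the path may name a partition) and then reads /sys/block/<disk>/queue/<param>. The path construction and the is_ssd() interpretation can be sketched as below; this is an illustration of the sysfs layout only, the partition-to-wholedisk resolution is omitted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path that queue_param() reads for a whole-disk name. */
static void queue_sysfs_path(char *buf, size_t len, const char *wholedisk,
			     const char *param)
{
	snprintf(buf, len, "/sys/block/%s/queue/%s", wholedisk, param);
}

/* is_ssd(): the "rotational" attribute reads '0' for non-rotational media. */
static int rotational_means_ssd(char first_byte)
{
	return first_byte == '0';
}
```

The refactoring pays off in the next patches, which reuse queue_param() for the "zoned" and "chunk_sectors" attributes.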

Signed-off-by: Damien Le Moal <[email protected]>
[Naohiro] fixed error return value
Signed-off-by: Naohiro Aota <[email protected]>
---
mkfs/main.c | 40 ++--------------------------------------
utils.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
utils.h | 1 +
3 files changed, 49 insertions(+), 38 deletions(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index b76462a7..83969b4b 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -435,49 +435,13 @@ static int zero_output_file(int out_fd, u64 size)

static int is_ssd(const char *file)
{
- blkid_probe probe;
- char wholedisk[PATH_MAX];
- char sysfs_path[PATH_MAX];
- dev_t devno;
- int fd;
char rotational;
int ret;

- probe = blkid_new_probe_from_filename(file);
- if (!probe)
+ ret = queue_param(file, "rotational", &rotational, 1);
+ if (ret < 1)
return 0;

- /* Device number of this disk (possibly a partition) */
- devno = blkid_probe_get_devno(probe);
- if (!devno) {
- blkid_free_probe(probe);
- return 0;
- }
-
- /* Get whole disk name (not full path) for this devno */
- ret = blkid_devno_to_wholedisk(devno,
- wholedisk, sizeof(wholedisk), NULL);
- if (ret) {
- blkid_free_probe(probe);
- return 0;
- }
-
- snprintf(sysfs_path, PATH_MAX, "/sys/block/%s/queue/rotational",
- wholedisk);
-
- blkid_free_probe(probe);
-
- fd = open(sysfs_path, O_RDONLY);
- if (fd < 0) {
- return 0;
- }
-
- if (read(fd, &rotational, 1) < 1) {
- close(fd);
- return 0;
- }
- close(fd);
-
return rotational == '0';
}

diff --git a/utils.c b/utils.c
index d4395b1f..2212692c 100644
--- a/utils.c
+++ b/utils.c
@@ -65,6 +65,52 @@ static unsigned short rand_seed[3];

struct btrfs_config bconf;

+/*
+ * Get a device request queue parameter.
+ */
+int queue_param(const char *file, const char *param, char *buf, size_t len)
+{
+ blkid_probe probe;
+ char wholedisk[PATH_MAX];
+ char sysfs_path[PATH_MAX];
+ dev_t devno;
+ int fd;
+ int ret;
+
+ probe = blkid_new_probe_from_filename(file);
+ if (!probe)
+ return 0;
+
+ /* Device number of this disk (possibly a partition) */
+ devno = blkid_probe_get_devno(probe);
+ if (!devno) {
+ blkid_free_probe(probe);
+ return 0;
+ }
+
+ /* Get whole disk name (not full path) for this devno */
+ ret = blkid_devno_to_wholedisk(devno,
+ wholedisk, sizeof(wholedisk), NULL);
+ if (ret) {
+ blkid_free_probe(probe);
+ return 0;
+ }
+
+ snprintf(sysfs_path, PATH_MAX, "/sys/block/%s/queue/%s",
+ wholedisk, param);
+
+ blkid_free_probe(probe);
+
+ fd = open(sysfs_path, O_RDONLY);
+ if (fd < 0)
+ return 0;
+
+ len = read(fd, buf, len);
+ close(fd);
+
+ return len;
+}
+
/*
* Discard the given range in one go
*/
diff --git a/utils.h b/utils.h
index b6c00cfa..ac333095 100644
--- a/utils.h
+++ b/utils.h
@@ -120,6 +120,7 @@ int get_label(const char *btrfs_dev, char *label);
int set_label(const char *btrfs_dev, const char *label);

char *__strncpy_null(char *dest, const char *src, size_t n);
+int queue_param(const char *file, const char *param, char *buf, size_t len);
int is_block_device(const char *file);
int is_mount_point(const char *file);
int is_path_exist(const char *file);
--
2.18.0


2018-08-09 18:12:38

by Naohiro Aota

Subject: [RFC PATCH 04/12] btrfs-progs: Introduce zone block device helper functions

This patch introduces several zone related functions: btrfs_get_zones() to
get zone information from the specified device and store it in zinfo, and
zone_is_random_write() to check whether a zone accepts random writes.
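The zoned-model detection builds on the sysfs "zoned" queue attribute, mirroring zoned_model() in the patch below. A self-contained sketch of the parsing:

```c
#include <assert.h>
#include <string.h>

enum zoned_model { ZONED_NONE, ZONED_HOST_AWARE, ZONED_HOST_MANAGED };

/*
 * Map the sysfs "zoned" queue attribute to a model, as zoned_model() does.
 * Anything unrecognized (including "none") is treated as a regular disk.
 */
static enum zoned_model parse_zoned(const char *s)
{
	if (strncmp(s, "host-aware", 10) == 0)
		return ZONED_HOST_AWARE;
	if (strncmp(s, "host-managed", 12) == 0)
		return ZONED_HOST_MANAGED;
	return ZONED_NONE;
}
```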

Signed-off-by: Naohiro Aota <[email protected]>
---
utils.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
utils.h | 16 +++++
volumes.h | 28 ++++++++
3 files changed, 238 insertions(+)

diff --git a/utils.c b/utils.c
index 2212692c..71fc044a 100644
--- a/utils.c
+++ b/utils.c
@@ -359,6 +359,200 @@ out:
return ret;
}

+enum btrfs_zoned_model zoned_model(const char *file)
+{
+ char model[32];
+ int ret;
+
+ ret = queue_param(file, "zoned", model, sizeof(model));
+ if (ret <= 0)
+ return ZONED_NONE;
+
+ if (strncmp(model, "host-aware", 10) == 0)
+ return ZONED_HOST_AWARE;
+ if (strncmp(model, "host-managed", 12) == 0)
+ return ZONED_HOST_MANAGED;
+
+ return ZONED_NONE;
+}
+
+size_t zone_size(const char *file)
+{
+ char chunk[32];
+ int ret;
+
+ ret = queue_param(file, "chunk_sectors", chunk, sizeof(chunk));
+ if (ret <= 0)
+ return 0;
+
+ return strtoul((const char *)chunk, NULL, 10) << 9;
+}
+
+#ifdef BTRFS_ZONED
+int zone_is_random_write(struct btrfs_zone_info *zinfo, u64 bytenr)
+{
+ unsigned int zno;
+
+ if (zinfo->model == ZONED_NONE)
+ return 1;
+
+ zno = bytenr / zinfo->zone_size;
+
+ /*
+ * Only sequential write required zones on host-managed
+ * devices cannot be written randomly.
+ */
+ return zinfo->zones[zno].type != BLK_ZONE_TYPE_SEQWRITE_REQ;
+}
+
+#define BTRFS_REPORT_NR_ZONES 8192
+
+static int btrfs_get_zones(int fd, const char *file, u64 block_count,
+ struct btrfs_zone_info *zinfo)
+{
+ size_t zone_bytes = zone_size(file);
+ size_t rep_size;
+ u64 sector = 0;
+ struct blk_zone_report *rep;
+ struct blk_zone *zone;
+ unsigned int i, n = 0;
+ int ret;
+
+ /*
+ * Zones are guaranteed (by the kernel) to be a power of 2 number of
+ * sectors. Check this here and make sure that zones are not too
+ * small.
+ */
+ if (!zone_bytes || (zone_bytes & (zone_bytes - 1))) {
+ error("ERROR: Illegal zone size %zu (not a power of 2)\n",
+ zone_bytes);
+ exit(1);
+ }
+ if (zone_bytes < BTRFS_MKFS_SYSTEM_GROUP_SIZE) {
+ error("ERROR: Illegal zone size %zu (smaller than %d)\n",
+ zone_bytes,
+ BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+ exit(1);
+ }
+
+ /* Allocate the zone information array */
+ zinfo->zone_size = zone_bytes;
+ zinfo->nr_zones = block_count / zone_bytes;
+ if (block_count & (zone_bytes - 1))
+ zinfo->nr_zones++;
+ zinfo->zones = calloc(zinfo->nr_zones, sizeof(struct blk_zone));
+ if (!zinfo->zones) {
+ error("No memory for zone information\n");
+ exit(1);
+ }
+
+ /* Allocate a zone report */
+ rep_size = sizeof(struct blk_zone_report) +
+ sizeof(struct blk_zone) * BTRFS_REPORT_NR_ZONES;
+ rep = malloc(rep_size);
+ if (!rep) {
+ error("No memory for zones report\n");
+ exit(1);
+ }
+
+ /* Get zone information */
+ zone = (struct blk_zone *)(rep + 1);
+ while (n < zinfo->nr_zones) {
+
+ memset(rep, 0, rep_size);
+ rep->sector = sector;
+ rep->nr_zones = BTRFS_REPORT_NR_ZONES;
+
+ ret = ioctl(fd, BLKREPORTZONE, rep);
+ if (ret != 0) {
+ error("ioctl BLKREPORTZONE failed (%s)\n",
+ strerror(errno));
+ exit(1);
+ }
+
+ if (!rep->nr_zones)
+ break;
+
+ for (i = 0; i < rep->nr_zones; i++) {
+ if (n >= zinfo->nr_zones)
+ break;
+ memcpy(&zinfo->zones[n], &zone[i],
+ sizeof(struct blk_zone));
+ sector = zone[i].start + zone[i].len;
+ n++;
+ }
+
+ }
+
+ /*
+ * We need at least one random write zone (a conventional zone or
+ * a sequential write preferred zone on a host-aware device).
+ */
+ if (!zone_is_random_write(zinfo, 0)) {
+ error("ERROR: No conventional zone at block 0\n");
+ exit(1);
+ }
+
+ zinfo->nr_zones = n;
+
+ free(rep);
+
+ return 0;
+}
+
+#endif
+
+int btrfs_get_zone_info(int fd, const char *file, int hmzoned,
+ struct btrfs_zone_info *zinfo)
+{
+ struct stat st;
+ int ret;
+
+ memset(zinfo, 0, sizeof(struct btrfs_zone_info));
+
+ ret = fstat(fd, &st);
+ if (ret < 0) {
+ error("unable to stat %s\n", file);
+ return 1;
+ }
+
+ if (!S_ISBLK(st.st_mode))
+ return 0;
+
+ /* Check zone model */
+ zinfo->model = zoned_model(file);
+ if (zinfo->model == ZONED_NONE)
+ return 0;
+
+ if (zinfo->model == ZONED_HOST_MANAGED && !hmzoned) {
+ error("%s: host-managed zoned block device (enable zone block device support with -O hmzoned)\n",
+ file);
+ return -1;
+ }
+
+ if (!hmzoned) {
+ /* Treat host-aware devices as regular devices */
+ zinfo->model = ZONED_NONE;
+ return 0;
+ }
+
+#ifdef BTRFS_ZONED
+ /* Get zone information */
+ ret = btrfs_get_zones(fd, file, btrfs_device_size(fd, &st), zinfo);
+ if (ret != 0)
+ return ret;
+#else
+ error("%s: Unsupported host-%s zoned block device\n",
+ file, zinfo->model == ZONED_HOST_MANAGED ? "managed" : "aware");
+ if (zinfo->model == ZONED_HOST_MANAGED)
+ return -1;
+
+ printf("%s: handling host-aware block device as a regular disk\n",
+ file);
+#endif
+ return 0;
+}
+
int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
u64 max_block_count, unsigned opflags)
{
diff --git a/utils.h b/utils.h
index ac333095..47f6b101 100644
--- a/utils.h
+++ b/utils.h
@@ -68,6 +68,7 @@ void units_set_base(unsigned *units, unsigned base);
#define PREP_DEVICE_ZERO_END (1U << 0)
#define PREP_DEVICE_DISCARD (1U << 1)
#define PREP_DEVICE_VERBOSE (1U << 2)
+#define PREP_DEVICE_HMZONED (1U << 3)

#define SEEN_FSID_HASH_SIZE 256
struct seen_fsid {
@@ -77,10 +78,25 @@ struct seen_fsid {
int fd;
};

+struct btrfs_zone_info;
+
+enum btrfs_zoned_model zoned_model(const char *file);
+size_t zone_size(const char *file);
int btrfs_make_root_dir(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u64 objectid);
int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
u64 max_block_count, unsigned opflags);
+int btrfs_get_zone_info(int fd, const char *file, int hmzoned,
+ struct btrfs_zone_info *zinfo);
+#ifdef BTRFS_ZONED
+int zone_is_random_write(struct btrfs_zone_info *zinfo, u64 bytenr);
+#else
+static inline int zone_is_random_write(struct btrfs_zone_info *zinfo,
+ u64 bytenr)
+{
+ return 1;
+}
+#endif
int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
struct btrfs_root *root, int fd, const char *path,
u64 block_count, u32 io_width, u32 io_align,
diff --git a/volumes.h b/volumes.h
index b4ea93f0..bad688e5 100644
--- a/volumes.h
+++ b/volumes.h
@@ -22,12 +22,40 @@
#include "kerncompat.h"
#include "ctree.h"

+#ifdef BTRFS_ZONED
+#include <linux/blkzoned.h>
+#else
+struct blk_zone {
+ int dummy;
+};
+#endif
+
+/*
+ * Zoned block device models.
+ */
+enum btrfs_zoned_model {
+ ZONED_NONE = 0,
+ ZONED_HOST_AWARE,
+ ZONED_HOST_MANAGED,
+};
+
+/*
+ * Zone information for a zoned block device.
+ */
+struct btrfs_zone_info {
+ enum btrfs_zoned_model model;
+ size_t zone_size;
+ struct blk_zone *zones;
+ unsigned int nr_zones;
+};
+
#define BTRFS_STRIPE_LEN SZ_64K

struct btrfs_device {
struct list_head dev_list;
struct btrfs_root *dev_root;
struct btrfs_fs_devices *fs_devices;
+ struct btrfs_zone_info zinfo;

u64 total_ios;

--
2.18.0


2018-08-09 18:12:40

by Naohiro Aota

Subject: [RFC PATCH 05/12] btrfs-progs: load and check zone information

This patch checks if a device added to btrfs is a zoned block device. If it
is, load zones information and the zone size for the device.

For a btrfs volume composed of multiple zoned block devices, all devices
must have the same zone size.
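The consistency rule enforced in btrfs_open_devices() and btrfs_add_to_fsid() reduces to: the first zoned device seen fixes the filesystem-wide zone size, and every later device must match it. A minimal sketch of that check (function name invented for the example):

```c
#include <assert.h>
#include <errno.h>

/* First device seen fixes the fs-wide zone size; later ones must match. */
static int record_zone_size(unsigned long long *fs_zone_size,
			    unsigned long long dev_zone_size)
{
	if (*fs_zone_size == 0) {
		*fs_zone_size = dev_zone_size;
		return 0;
	}
	return dev_zone_size == *fs_zone_size ? 0 : -EINVAL;
}
```

This matches the existing btrfs constraint that all dev extents in a chunk have the same size, since dev extents are zone-aligned in HMZONED mode.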

Signed-off-by: Naohiro Aota <[email protected]>
---
utils.c | 10 ++++++++++
volumes.c | 18 ++++++++++++++++++
volumes.h | 3 +++
3 files changed, 31 insertions(+)

diff --git a/utils.c b/utils.c
index 71fc044a..a2172a82 100644
--- a/utils.c
+++ b/utils.c
@@ -250,6 +250,16 @@ int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
goto out;
}

+ ret = btrfs_get_zone_info(fd, path, fs_info->fs_devices->hmzoned,
+ &device->zinfo);
+ if (ret)
+ goto out;
+ if (device->zinfo.zone_size != fs_info->fs_devices->zone_size) {
+ error("Device zone sizes differ\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
disk_super = (struct btrfs_super_block *)buf;
dev_item = &disk_super->dev_item;

diff --git a/volumes.c b/volumes.c
index d81b348e..2ec27cd7 100644
--- a/volumes.c
+++ b/volumes.c
@@ -160,6 +160,8 @@ static int device_list_add(const char *path,
struct btrfs_device *device;
struct btrfs_fs_devices *fs_devices;
u64 found_transid = btrfs_super_generation(disk_super);
+ int hmzoned = btrfs_super_incompat_flags(disk_super) &
+ BTRFS_FEATURE_INCOMPAT_HMZONED;

fs_devices = find_fsid(disk_super->fsid);
if (!fs_devices) {
@@ -237,6 +239,8 @@ static int device_list_add(const char *path,
if (fs_devices->lowest_devid > devid) {
fs_devices->lowest_devid = devid;
}
+ if (hmzoned)
+ fs_devices->hmzoned = 1;
*fs_devices_ret = fs_devices;
return 0;
}
@@ -307,6 +311,8 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
struct btrfs_device *device;
int ret;

+ fs_devices->zone_size = 0;
+
list_for_each_entry(device, &fs_devices->devices, dev_list) {
if (!device->name) {
printk("no name for device %llu, skip it now\n", device->devid);
@@ -330,6 +336,18 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, int flags)
device->fd = fd;
if (flags & O_RDWR)
device->writeable = 1;
+
+ ret = btrfs_get_zone_info(fd, device->name, fs_devices->hmzoned,
+ &device->zinfo);
+ if (ret != 0)
+ goto fail;
+ if (!fs_devices->zone_size) {
+ fs_devices->zone_size = device->zinfo.zone_size;
+ } else if (device->zinfo.zone_size != fs_devices->zone_size) {
+ fprintf(stderr, "Device zone sizes differ\n");
+ ret = -EINVAL;
+ goto fail;
+ }
}
return 0;
fail:
diff --git a/volumes.h b/volumes.h
index bad688e5..36a6f44b 100644
--- a/volumes.h
+++ b/volumes.h
@@ -111,6 +111,9 @@ struct btrfs_fs_devices {

int seeding;
struct btrfs_fs_devices *seed;
+
+ u64 zone_size;
+ unsigned int hmzoned:1;
};

struct btrfs_bio_stripe {
--
2.18.0


2018-08-09 18:12:46

by Naohiro Aota

Subject: [RFC PATCH 06/12] btrfs-progs: avoid writing super block to sequential zones

Super block copies cannot be placed in sequential write required zones, as
such zones do not allow the in-place updates that super blocks need. This
patch limits the possible super block locations to zones accepting random
writes. In particular, the zone containing the first block of the device or
partition being formatted must accept random writes.
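btrfs keeps its primary super block at 64 KiB, with mirrors at 64 MiB and 256 GiB. The patch skips any copy whose zone does not accept random writes. The toy filter below counts the surviving copies; the zone layout (a run of conventional zones at the start of the device) and the function name are invented for the example.

```c
#include <assert.h>

/* btrfs super block copy offsets: primary plus two mirrors */
static const unsigned long long sb_offsets[3] = {
	64ULL * 1024,                 /* 64 KiB  */
	64ULL * 1024 * 1024,          /* 64 MiB  */
	256ULL * 1024 * 1024 * 1024,  /* 256 GiB */
};

/*
 * Toy layout: with `conv_zones` conventional zones of `zone_size` bytes at
 * the start of the device, only offsets inside them are randomly writable.
 */
static int writable_sb_copies(unsigned long long zone_size,
			      unsigned int conv_zones,
			      unsigned long long dev_size)
{
	int i, n = 0;

	for (i = 0; i < 3; i++) {
		if (sb_offsets[i] >= dev_size)
			break;
		if (sb_offsets[i] / zone_size < conv_zones)
			n++;
	}
	return n;
}
```

With typical 256 MiB zones, the primary copy and the first mirror land in zone 0, which is why that zone must accept random writes for the format to succeed at all.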

Signed-off-by: Naohiro Aota <[email protected]>
---
disk-io.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/disk-io.c b/disk-io.c
index 26e4f6e9..127d8cf4 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1523,6 +1523,7 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
struct btrfs_super_block *sb,
struct btrfs_device *device)
{
+ struct btrfs_zone_info *zinfo = &device->zinfo;
u64 bytenr;
u32 crc;
int i, ret;
@@ -1534,6 +1535,11 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
btrfs_csum_final(crc, &sb->csum[0]);

+ if (!zone_is_random_write(zinfo, fs_info->super_bytenr)) {
+ ret = -EIO;
+ goto write_err;
+ }
+
/*
* super_copy is BTRFS_SUPER_INFO_SIZE bytes and is
* zero filled, we can use it directly
@@ -1550,6 +1556,8 @@ static int write_dev_supers(struct btrfs_fs_info *fs_info,
bytenr = btrfs_sb_offset(i);
if (bytenr + BTRFS_SUPER_INFO_SIZE > device->total_bytes)
break;
+ if (!zone_is_random_write(zinfo, bytenr))
+ continue;

btrfs_set_super_bytenr(sb, bytenr);

--
2.18.0


2018-08-09 18:13:05

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 12/12] btrfs-progs: do sequential allocation

Ensure that block allocation in sequential write required zones is always
done sequentially, using an allocation pointer that is the zone write
pointer plus the number of blocks already allocated but not yet written.
For conventional zones, the legacy first-fit behavior is used.
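
The sequential path added to find_search_start() in the hunk below can be
sketched in isolation. This is a minimal model (struct and function names
are illustrative): free space below the allocation pointer is ignored, and
every allocation simply advances the pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical block group: 'size' is the group length and 'alloc_offset'
 * the sequential allocation pointer (zone write pointer plus blocks
 * allocated but not yet written), relative to 'objectid'. */
struct bg_sketch {
	uint64_t objectid;     /* logical start of the group */
	uint64_t size;         /* group length */
	uint64_t alloc_offset; /* next free byte, relative to objectid */
};

/* Sequential allocation: ignore any free space below alloc_offset and
 * hand out 'num' bytes at the pointer, or fail with -1 when the group
 * is exhausted (the caller then moves on to the next block group). */
static int seq_alloc(struct bg_sketch *bg, uint64_t num, uint64_t *start)
{
	if (bg->size - bg->alloc_offset < num)
		return -1;
	*start = bg->objectid + bg->alloc_offset;
	bg->alloc_offset += num;
	return 0;
}
```

Note how freeing a block below the pointer has no effect here: the space
is only reclaimed when the whole group is released and its zones reset.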

Signed-off-by: Naohiro Aota <[email protected]>
---
ctree.h | 17 +++++
extent-tree.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++
transaction.c | 16 +++++
3 files changed, 219 insertions(+)

diff --git a/ctree.h b/ctree.h
index 6d805ecd..5324f7b9 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1062,15 +1062,32 @@ struct btrfs_space_info {
struct list_head list;
};

+/* Block group allocation types */
+enum btrfs_alloc_type {
+
+ /* Regular first fit allocation */
+ BTRFS_ALLOC_FIT = 0,
+
+ /*
+ * Sequential allocation: this is for HMZONED mode and
+ * will result in ignoring free space before a block
+ * group allocation offset.
+ */
+ BTRFS_ALLOC_SEQ = 1,
+};
+
struct btrfs_block_group_cache {
struct cache_extent cache;
struct btrfs_key key;
struct btrfs_block_group_item item;
struct btrfs_space_info *space_info;
struct btrfs_free_space_ctl *free_space_ctl;
+ enum btrfs_alloc_type alloc_type;
u64 bytes_super;
u64 pinned;
u64 flags;
+ u64 alloc_offset;
+ u64 write_offset;
int cached;
int ro;
};
diff --git a/extent-tree.c b/extent-tree.c
index 5d49af5a..01660864 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -256,6 +256,14 @@ again:
if (cache->ro || !block_group_bits(cache, data))
goto new_group;

+ if (cache->alloc_type == BTRFS_ALLOC_SEQ) {
+ if (cache->key.offset - cache->alloc_offset < num)
+ goto new_group;
+ *start_ret = cache->key.objectid + cache->alloc_offset;
+ cache->alloc_offset += num;
+ return 0;
+ }
+
while(1) {
ret = find_first_extent_bit(&root->fs_info->free_space_cache,
last, &start, &end, EXTENT_DIRTY);
@@ -282,6 +290,7 @@ out:
(unsigned long long)search_start);
return -ENOENT;
}
+ printf("nospace\n");
return -ENOSPC;

new_group:
@@ -3143,6 +3152,176 @@ error:
return ret;
}

+#ifdef BTRFS_ZONED
+static int
+btrfs_get_block_group_alloc_offset(struct btrfs_fs_info *fs_info,
+ struct btrfs_block_group_cache *cache)
+{
+ struct btrfs_device *device;
+ struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+ struct cache_extent *ce;
+ struct map_lookup *map;
+ u64 logical = cache->key.objectid;
+ u64 length = cache->key.offset;
+ u64 physical = 0;
+ int ret = 0;
+ int i;
+ u64 zone_size = fs_info->fs_devices->zone_size;
+ u64 *alloc_offsets = NULL;
+
+ if (!btrfs_fs_incompat(fs_info, HMZONED))
+ return 0;
+
+ /* Sanity check */
+ if (!IS_ALIGNED(length, zone_size)) {
+ fprintf(stderr, "unaligned block group at %llu", logical);
+ return -EIO;
+ }
+
+ /* Get the chunk mapping */
+ ce = search_cache_extent(&map_tree->cache_tree, logical);
+ if (!ce) {
+ fprintf(stderr, "failed to find block group at %llu", logical);
+ return -ENOENT;
+ }
+ map = container_of(ce, struct map_lookup, ce);
+
+ /*
+ * Get the zone type: if the group is mapped to a non-sequential zone,
+ * there is no need for the allocation offset (fit allocation is OK).
+ */
+ device = map->stripes[0].dev;
+ physical = map->stripes[0].physical;
+ if (!zone_is_random_write(&device->zinfo, physical))
+ cache->alloc_type = BTRFS_ALLOC_SEQ;
+
+ /* check block group mapping */
+ alloc_offsets = calloc(map->num_stripes, sizeof(*alloc_offsets));
+ for (i = 0; i < map->num_stripes; i++) {
+ int is_sequential;
+ struct blk_zone zone;
+
+ device = map->stripes[i].dev;
+ physical = map->stripes[i].physical;
+
+ is_sequential = !zone_is_random_write(&device->zinfo, physical);
+ if ((is_sequential && cache->alloc_type != BTRFS_ALLOC_SEQ) ||
+ (!is_sequential && cache->alloc_type == BTRFS_ALLOC_SEQ)) {
+ fprintf(stderr,
+ "found block group of mixed zone types");
+ ret = -EIO;
+ goto out;
+ }
+
+ if (!is_sequential)
+ continue;
+
+ WARN_ON(!IS_ALIGNED(physical, zone_size));
+ zone = device->zinfo.zones[physical / zone_size];
+
+ /*
+ * The group is mapped to a sequential zone. Get the zone write
+ * pointer to determine the allocation offset within the zone.
+ */
+ switch (zone.cond) {
+ case BLK_ZONE_COND_OFFLINE:
+ case BLK_ZONE_COND_READONLY:
+ fprintf(stderr, "Offline/readonly zone %llu",
+ physical / fs_info->fs_devices->zone_size);
+ ret = -EIO;
+ goto out;
+ case BLK_ZONE_COND_EMPTY:
+ alloc_offsets[i] = 0;
+ break;
+ case BLK_ZONE_COND_FULL:
+ alloc_offsets[i] = zone_size;
+ break;
+ default:
+ /* Partially used zone */
+ alloc_offsets[i] = ((zone.wp - zone.start) << 9);
+ break;
+ }
+ }
+
+ if (cache->alloc_type != BTRFS_ALLOC_SEQ)
+ goto out;
+
+ switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
+ case 0: /* single */
+ case BTRFS_BLOCK_GROUP_DUP:
+ case BTRFS_BLOCK_GROUP_RAID1:
+ for (i = 1; i < map->num_stripes; i++) {
+ if (alloc_offsets[i] != alloc_offsets[0]) {
+ fprintf(stderr,
+ "zones' write pointers mismatch\n");
+ ret = -EIO;
+ goto out;
+ }
+ }
+ cache->alloc_offset = alloc_offsets[0];
+ break;
+ case BTRFS_BLOCK_GROUP_RAID0:
+ cache->alloc_offset = alloc_offsets[0];
+ for (i = 1; i < map->num_stripes; i++) {
+ cache->alloc_offset += alloc_offsets[i];
+ if (alloc_offsets[0] < alloc_offsets[i]) {
+ fprintf(stderr,
+ "zones' write pointers mismatch\n");
+ ret = -EIO;
+ goto out;
+ }
+ }
+ break;
+ case BTRFS_BLOCK_GROUP_RAID10:
+ cache->alloc_offset = 0;
+ for (i = 0; i < map->num_stripes / map->sub_stripes; i++) {
+ int j;
+ int base;
+
+ base = i*map->sub_stripes;
+ for (j = 1; j < map->sub_stripes; j++) {
+ if (alloc_offsets[base] !=
+ alloc_offsets[base+j]) {
+ fprintf(stderr,
+ "zones' write pointer mismatch\n");
+ ret = -EIO;
+ goto out;
+ }
+ }
+
+ if (alloc_offsets[0] < alloc_offsets[base]) {
+ fprintf(stderr,
+ "zones' write pointer mismatch\n");
+ ret = -EIO;
+ goto out;
+ }
+ cache->alloc_offset += alloc_offsets[base];
+ }
+ break;
+ case BTRFS_BLOCK_GROUP_RAID5:
+ case BTRFS_BLOCK_GROUP_RAID6:
+ /* RAID5/6 is not supported yet */
+ default:
+ fprintf(stderr, "Unsupported profile %llu\n",
+ map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK);
+ ret = -EINVAL;
+ goto out;
+ }
+
+out:
+ cache->write_offset = cache->alloc_offset;
+ free(alloc_offsets);
+ return ret;
+}
+#else
+static int
+btrfs_get_block_group_alloc_offset(struct btrfs_fs_info *fs_info,
+ struct btrfs_block_group_cache *cache)
+{
+ return 0;
+}
+#endif
+
int btrfs_read_block_groups(struct btrfs_root *root)
{
struct btrfs_path *path;
@@ -3226,6 +3405,10 @@ int btrfs_read_block_groups(struct btrfs_root *root)
BUG_ON(ret);
cache->space_info = space_info;

+ ret = btrfs_get_block_group_alloc_offset(info, cache);
+ if (ret)
+ goto error;
+
/* use EXTENT_LOCKED to prevent merging */
set_extent_bits(block_group_cache, found_key.objectid,
found_key.objectid + found_key.offset - 1,
@@ -3255,6 +3438,9 @@ btrfs_add_block_group(struct btrfs_fs_info *fs_info, u64 bytes_used, u64 type,
cache->key.objectid = chunk_offset;
cache->key.offset = size;

+ ret = btrfs_get_block_group_alloc_offset(fs_info, cache);
+ BUG_ON(ret);
+
cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
btrfs_set_block_group_used(&cache->item, bytes_used);
btrfs_set_block_group_chunk_objectid(&cache->item,
diff --git a/transaction.c b/transaction.c
index ecafbb15..0e49b8b7 100644
--- a/transaction.c
+++ b/transaction.c
@@ -115,16 +115,32 @@ int __commit_transaction(struct btrfs_trans_handle *trans,
{
u64 start;
u64 end;
+ u64 next = 0;
struct btrfs_fs_info *fs_info = root->fs_info;
struct extent_buffer *eb;
struct extent_io_tree *tree = &fs_info->extent_cache;
+ struct btrfs_block_group_cache *bg = NULL;
int ret;

while(1) {
+again:
ret = find_first_extent_bit(tree, 0, &start, &end,
EXTENT_DIRTY);
if (ret)
break;
+ bg = btrfs_lookup_first_block_group(fs_info, start);
+ BUG_ON(!bg);
+ if (bg->alloc_type == BTRFS_ALLOC_SEQ &&
+ bg->key.objectid + bg->write_offset < start) {
+ next = bg->key.objectid + bg->write_offset;
+ BUG_ON(next + fs_info->nodesize > start);
+ eb = btrfs_find_create_tree_block(fs_info, next);
+ btrfs_mark_buffer_dirty(eb);
+ free_extent_buffer(eb);
+ goto again;
+ }
+ if (bg->alloc_type == BTRFS_ALLOC_SEQ)
+ bg->write_offset += (end + 1 - start);
while(start <= end) {
eb = find_first_extent_buffer(tree, start);
BUG_ON(!eb || eb->start != start);
--
2.18.0


2018-08-09 18:13:26

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 08/12] btrfs-progs: volume: align chunk allocation to zones

To facilitate zoned block device support in the extent buffer allocation,
chunks on a zoned block device are always aligned to a zone of the device.
With this, the zone write pointer location simply becomes a hint for
allocating new buffers.
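
The alignment helper added by this patch relies on the zone size being a
power of two, so the round-up can be done with a mask instead of a
division. A standalone sketch of the same arithmetic (the function name
here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Round 'val' up to the next zone boundary. zone_size is guaranteed by
 * the zone info code to be a power of two, so masking works. A
 * zone_size of 0 (non-zoned device) leaves the value untouched. */
static uint64_t zone_align_sketch(uint64_t zone_size, uint64_t val)
{
	if (zone_size)
		return (val + zone_size - 1) & ~(zone_size - 1);
	return val;
}
```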

Signed-off-by: Naohiro Aota <[email protected]>
---
volumes.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/volumes.c b/volumes.c
index 2ec27cd7..ba3b45d2 100644
--- a/volumes.c
+++ b/volumes.c
@@ -379,6 +379,14 @@ int btrfs_scan_one_device(int fd, const char *path,
return ret;
}

+/* zone size is ensured to be power of 2 */
+static u64 btrfs_zone_align(struct btrfs_zone_info *zinfo, u64 val)
+{
+ if (zinfo && zinfo->zone_size)
+ return (val + zinfo->zone_size - 1) & ~(zinfo->zone_size - 1);
+ return val;
+}
+
/*
* find_free_dev_extent_start - find free space in the specified device
* @device: the device which we search the free space in
@@ -425,6 +433,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
*/
min_search_start = max(root->fs_info->alloc_start, (u64)SZ_1M);
search_start = max(search_start, min_search_start);
+ search_start = btrfs_zone_align(&device->zinfo, search_start);

path = btrfs_alloc_path();
if (!path)
@@ -507,7 +516,8 @@ static int find_free_dev_extent_start(struct btrfs_device *device,
extent_end = key.offset + btrfs_dev_extent_length(l,
dev_extent);
if (extent_end > search_start)
- search_start = extent_end;
+ search_start = btrfs_zone_align(&device->zinfo,
+ extent_end);
next:
path->slots[0]++;
cond_resched();
@@ -560,6 +570,9 @@ static int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
struct extent_buffer *leaf;
struct btrfs_key key;

+ /* Align to zone for a zoned block device */
+ *start = btrfs_zone_align(&device->zinfo, *start);
+
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -1030,9 +1043,15 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
btrfs_super_stripesize(info->super_copy));
}

- /* we don't want a chunk larger than 10% of the FS */
- percent_max = div_factor(btrfs_super_total_bytes(info->super_copy), 1);
- max_chunk_size = min(percent_max, max_chunk_size);
+ if (info->fs_devices->hmzoned) {
+ /* Zoned mode uses zone aligned chunks */
+ calc_size = info->fs_devices->zone_size;
+ max_chunk_size = calc_size * num_stripes;
+ } else {
+ /* we don't want a chunk larger than 10% of the FS */
+ percent_max = div_factor(btrfs_super_total_bytes(info->super_copy), 1);
+ max_chunk_size = min(percent_max, max_chunk_size);
+ }

again:
if (chunk_bytes_by_type(type, calc_size, num_stripes, sub_stripes) >
@@ -1112,7 +1131,9 @@ again:
*num_bytes = chunk_bytes_by_type(type, calc_size,
num_stripes, sub_stripes);
index = 0;
+ dev_offset = 0;
while(index < num_stripes) {
+ size_t zone_size = device->zinfo.zone_size;
struct btrfs_stripe *stripe;
BUG_ON(list_empty(&private_devs));
cur = private_devs.next;
@@ -1123,11 +1144,16 @@ again:
(index == num_stripes - 1))
list_move_tail(&device->dev_list, dev_list);

+ if (device->zinfo.zone_size)
+ calc_size = device->zinfo.zone_size;
+
ret = btrfs_alloc_dev_extent(trans, device, key.offset,
calc_size, &dev_offset, 0);
if (ret < 0)
goto out_chunk_map;

+ WARN_ON(zone_size && !IS_ALIGNED(dev_offset, zone_size));
+
device->bytes_used += calc_size;
ret = btrfs_update_device(trans, device);
if (ret < 0)
--
2.18.0


2018-08-09 18:13:53

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 07/12] btrfs-progs: support discarding zoned device

All zones of a zoned block device should be reset before writing. Support
this by treating zone reset as a special case of block discard and block
zeroing. Note that only zones accepting random writes can be zeroed.
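
The per-zone decision made by discard_zones() below can be summarized as a
small dispatch function. This is an illustrative sketch, not code from the
patch: conventional zones get a regular block discard, non-empty
sequential zones get a zone reset (the BLKRESETZONE ioctl), and sequential
zones that are already empty need nothing.

```c
#include <assert.h>
#include <stdbool.h>

/* Which operation a zone needs before the device is written. */
enum zone_op { OP_DISCARD, OP_RESET, OP_NONE };

static enum zone_op zone_prepare_op(bool conventional, bool empty)
{
	if (conventional)
		return OP_DISCARD; /* random-write zone: plain discard */
	if (!empty)
		return OP_RESET;   /* sequential zone with data: reset */
	return OP_NONE;            /* sequential and already empty */
}
```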

Signed-off-by: Naohiro Aota <[email protected]>
---
utils.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 88 insertions(+), 6 deletions(-)

diff --git a/utils.c b/utils.c
index a2172a82..79a45d92 100644
--- a/utils.c
+++ b/utils.c
@@ -123,6 +123,37 @@ static int discard_range(int fd, u64 start, u64 len)
return 0;
}

+/*
+ * Discard blocks in the zones of a zoned block device.
+ * Process this with zone size granularity so that blocks in
+ * conventional zones are discarded using discard_range and
+ * blocks in sequential zones are discarded though a zone reset.
+ */
+static int discard_zones(int fd, struct btrfs_zone_info *zinfo)
+{
+#ifdef BTRFS_ZONED
+ unsigned int i;
+
+ /* Zone size granularity */
+ for (i = 0; i < zinfo->nr_zones; i++) {
+ if (zinfo->zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
+ discard_range(fd, zinfo->zones[i].start << 9,
+ zinfo->zone_size);
+ } else if (zinfo->zones[i].cond != BLK_ZONE_COND_EMPTY) {
+ struct blk_zone_range range = {
+ zinfo->zones[i].start,
+ zinfo->zone_size >> 9 };
+ if (ioctl(fd, BLKRESETZONE, &range) < 0)
+ return errno;
+ }
+ }
+
+ return 0;
+#else
+ return -EIO;
+#endif
+}
+
/*
* Discard blocks in the given range in 1G chunks, the process is interruptible
*/
@@ -205,8 +236,38 @@ static int zero_blocks(int fd, off_t start, size_t len)

#define ZERO_DEV_BYTES SZ_2M

+static int zero_zone_blocks(int fd, struct btrfs_zone_info *zinfo,
+ off_t start, size_t len)
+{
+ size_t zone_len = zinfo->zone_size;
+ off_t ofst = start;
+ size_t count;
+ int ret;
+
+ /* Make sure that zero_blocks does not write sequential zones */
+ while (len > 0) {
+
+ /* Limit zero_blocks to a single zone */
+ count = min_t(size_t, len, zone_len);
+ if (count > zone_len - (ofst & (zone_len - 1)))
+ count = zone_len - (ofst & (zone_len - 1));
+
+ if (zone_is_random_write(zinfo, ofst)) {
+ ret = zero_blocks(fd, ofst, count);
+ if (ret != 0)
+ return ret;
+ }
+
+ len -= count;
+ ofst += count;
+ }
+
+ return 0;
+}
+
/* don't write outside the device by clamping the region to the device size */
-static int zero_dev_clamped(int fd, off_t start, ssize_t len, u64 dev_size)
+static int zero_dev_clamped(int fd, struct btrfs_zone_info *zinfo,
+ off_t start, ssize_t len, u64 dev_size)
{
off_t end = max(start, start + len);

@@ -219,6 +280,9 @@ static int zero_dev_clamped(int fd, off_t start, ssize_t len, u64 dev_size)
start = min_t(u64, start, dev_size);
end = min_t(u64, end, dev_size);

+ if (zinfo->model != ZONED_NONE)
+ return zero_zone_blocks(fd, zinfo, start, end - start);
+
return zero_blocks(fd, start, end - start);
}

@@ -566,6 +630,7 @@ int btrfs_get_zone_info(int fd, const char *file, int hmzoned,
int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
u64 max_block_count, unsigned opflags)
{
+ struct btrfs_zone_info zinfo;
u64 block_count;
struct stat st;
int i, ret;
@@ -584,13 +649,30 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
if (max_block_count)
block_count = min(block_count, max_block_count);

+ ret = btrfs_get_zone_info(fd, file, opflags & PREP_DEVICE_HMZONED,
+ &zinfo);
+ if (ret < 0)
+ return 1;
+
if (opflags & PREP_DEVICE_DISCARD) {
/*
* We intentionally ignore errors from the discard ioctl. It
* is not necessary for the mkfs functionality but just an
- * optimization.
+ * optimization. However, we cannot ignore zone discard (reset)
+ * errors for a zoned block device as this could result in the
+ * inability to write to non-empty sequential zones of the
+ * device.
*/
- if (discard_range(fd, 0, 0) == 0) {
+ if (zinfo.model != ZONED_NONE) {
+ printf("Resetting device zones %s (%u zones) ...\n",
+ file, zinfo.nr_zones);
+ if (discard_zones(fd, &zinfo)) {
+ fprintf(stderr,
+ "ERROR: failed to reset device '%s' zones\n",
+ file);
+ return 1;
+ }
+ } else if (discard_range(fd, 0, 0) == 0) {
if (opflags & PREP_DEVICE_VERBOSE)
printf("Performing full device TRIM %s (%s) ...\n",
file, pretty_size(block_count));
@@ -598,12 +680,12 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
}
}

- ret = zero_dev_clamped(fd, 0, ZERO_DEV_BYTES, block_count);
+ ret = zero_dev_clamped(fd, &zinfo, 0, ZERO_DEV_BYTES, block_count);
for (i = 0 ; !ret && i < BTRFS_SUPER_MIRROR_MAX; i++)
- ret = zero_dev_clamped(fd, btrfs_sb_offset(i),
+ ret = zero_dev_clamped(fd, &zinfo, btrfs_sb_offset(i),
BTRFS_SUPER_INFO_SIZE, block_count);
if (!ret && (opflags & PREP_DEVICE_ZERO_END))
- ret = zero_dev_clamped(fd, block_count - ZERO_DEV_BYTES,
+ ret = zero_dev_clamped(fd, &zinfo, block_count - ZERO_DEV_BYTES,
ZERO_DEV_BYTES, block_count);

if (ret < 0) {
--
2.18.0


2018-08-09 18:13:53

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 11/12] btrfs-progs: replace: disable in HMZONED device

As shown in the kernel patches, the device replace feature needs more work
to be complete. Disable the feature for now.

Signed-off-by: Naohiro Aota <[email protected]>
---
cmds-replace.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/cmds-replace.c b/cmds-replace.c
index 1fa80284..642fbd4b 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -116,6 +116,7 @@ static const char *const cmd_replace_start_usage[] = {

static int cmd_replace_start(int argc, char **argv)
{
+ struct btrfs_ioctl_feature_flags feature_flags;
struct btrfs_ioctl_dev_replace_args start_args = {0};
struct btrfs_ioctl_dev_replace_args status_args = {0};
int ret;
@@ -123,6 +124,7 @@ static int cmd_replace_start(int argc, char **argv)
int c;
int fdmnt = -1;
int fddstdev = -1;
+ int hmzoned;
char *path;
char *srcdev;
char *dstdev = NULL;
@@ -200,6 +202,13 @@ static int cmd_replace_start(int argc, char **argv)
goto leave_with_error;
}

+ ret = ioctl(fdmnt, BTRFS_IOC_GET_FEATURES, &feature_flags);
+ if (ret) {
+ error("error getting feature flags '%s': %m", path);
+ return 1;
+ }
+ hmzoned = feature_flags.incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
if (string_is_numerical(srcdev)) {
struct btrfs_ioctl_fs_info_args fi_args;
struct btrfs_ioctl_dev_info_args *di_args = NULL;
@@ -238,6 +247,12 @@ static int cmd_replace_start(int argc, char **argv)
goto leave_with_error;
}

+ if (hmzoned) {
+ error("cannot replace device on HMZONED file system '%s'",
+ dstdev);
+ goto leave_with_error;
+ }
+
ret = test_dev_for_mkfs(dstdev, force_using_targetdev);
if (ret)
goto leave_with_error;
--
2.18.0


2018-08-09 18:14:07

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 10/12] btrfs-progs: device-add: support HMZONED device

This patch checks if the target file system is flagged as HMZONED. If it
is, the device to be added is prepared with the PREP_DEVICE_HMZONED flag.
Checks are also added to prevent mixing non-zoned devices and zoned
devices.
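
The mixing rules enforced below reduce to a small predicate. This sketch
uses placeholder enum values (the real code uses the zoned_model() helper
and ZONED_* constants): an HMZONED file system only accepts zoned devices,
and a regular file system rejects host-managed devices, whose sequential
write constraint it cannot honor.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder zoned-model values, mirroring the ZONED_* constants. */
enum zmodel_sketch { ZM_NONE, ZM_HOST_AWARE, ZM_HOST_MANAGED };

/* Returns true when a device of the given model may be added to a file
 * system that is (or is not) flagged HMZONED. */
static bool device_add_allowed(bool fs_hmzoned, enum zmodel_sketch dev)
{
	if (fs_hmzoned && dev == ZM_NONE)
		return false; /* no non-zoned devices in an HMZONED fs */
	if (!fs_hmzoned && dev == ZM_HOST_MANAGED)
		return false; /* host-managed devices require HMZONED */
	return true;
}
```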

Signed-off-by: Naohiro Aota <[email protected]>
---
cmds-device.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/cmds-device.c b/cmds-device.c
index 2a05f70a..10696bf7 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -56,6 +56,9 @@ static int cmd_device_add(int argc, char **argv)
int discard = 1;
int force = 0;
int last_dev;
+ int res;
+ int hmzoned;
+ struct btrfs_ioctl_feature_flags feature_flags;

optind = 0;
while (1) {
@@ -91,12 +94,33 @@ static int cmd_device_add(int argc, char **argv)
if (fdmnt < 0)
return 1;

+ res = ioctl(fdmnt, BTRFS_IOC_GET_FEATURES, &feature_flags);
+ if (res) {
+ error("error getting feature flags '%s': %m", mntpnt);
+ return 1;
+ }
+ hmzoned = feature_flags.incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
for (i = optind; i < last_dev; i++){
struct btrfs_ioctl_vol_args ioctl_args;
- int devfd, res;
+ int devfd;
u64 dev_block_count = 0;
char *path;

+ if (hmzoned && zoned_model(argv[i]) == ZONED_NONE) {
+ error("cannot add non-zoned device to HMZONED file system '%s'",
+ argv[i]);
+ ret++;
+ continue;
+ }
+
+ if (!hmzoned && zoned_model(argv[i]) == ZONED_HOST_MANAGED) {
+ error("cannot add host managed zoned device to non-HMZONED file system '%s'",
+ argv[i]);
+ ret++;
+ continue;
+ }
+
res = test_dev_for_mkfs(argv[i], force);
if (res) {
ret++;
@@ -112,7 +136,8 @@ static int cmd_device_add(int argc, char **argv)

res = btrfs_prepare_device(devfd, argv[i], &dev_block_count, 0,
PREP_DEVICE_ZERO_END | PREP_DEVICE_VERBOSE |
- (discard ? PREP_DEVICE_DISCARD : 0));
+ (discard ? PREP_DEVICE_DISCARD : 0) |
+ (hmzoned ? PREP_DEVICE_HMZONED : 0));
close(devfd);
if (res) {
ret++;
--
2.18.0


2018-08-09 18:14:15

by Naohiro Aota

[permalink] [raw]
Subject: [RFC PATCH 09/12] btrfs-progs: mkfs: Zoned block device support

This patch makes the size of the temporary system group chunk equal to the
device zone size. It also enables PREP_DEVICE_HMZONED if the user enables
the HMZONED feature.

Enabling the HMZONED feature is done using the option "-O hmzoned". For
now, this feature is incompatible with the source directory setup (the -r
option).

Signed-off-by: Naohiro Aota <[email protected]>
---
mkfs/common.c | 12 +++++++-----
mkfs/common.h | 1 +
mkfs/main.c | 45 +++++++++++++++++++++++++++++++++++++++------
3 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/mkfs/common.c b/mkfs/common.c
index 0ace262b..d01402c8 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -152,6 +152,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
int skinny_metadata = !!(cfg->features &
BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA);
u64 num_bytes;
+ u64 system_group_size;

buf = malloc(sizeof(*buf) + max(cfg->sectorsize, cfg->nodesize));
if (!buf)
@@ -312,12 +313,14 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
btrfs_set_item_offset(buf, btrfs_item_nr(nritems), itemoff);
btrfs_set_item_size(buf, btrfs_item_nr(nritems), item_size);

+ system_group_size = (cfg->features & BTRFS_FEATURE_INCOMPAT_HMZONED) ?
+ cfg->zone_size : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+
dev_item = btrfs_item_ptr(buf, nritems, struct btrfs_dev_item);
btrfs_set_device_id(buf, dev_item, 1);
btrfs_set_device_generation(buf, dev_item, 0);
btrfs_set_device_total_bytes(buf, dev_item, num_bytes);
- btrfs_set_device_bytes_used(buf, dev_item,
- BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+ btrfs_set_device_bytes_used(buf, dev_item, system_group_size);
btrfs_set_device_io_align(buf, dev_item, cfg->sectorsize);
btrfs_set_device_io_width(buf, dev_item, cfg->sectorsize);
btrfs_set_device_sector_size(buf, dev_item, cfg->sectorsize);
@@ -345,7 +348,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
btrfs_set_item_size(buf, btrfs_item_nr(nritems), item_size);

chunk = btrfs_item_ptr(buf, nritems, struct btrfs_chunk);
- btrfs_set_chunk_length(buf, chunk, BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+ btrfs_set_chunk_length(buf, chunk, system_group_size);
btrfs_set_chunk_owner(buf, chunk, BTRFS_EXTENT_TREE_OBJECTID);
btrfs_set_chunk_stripe_len(buf, chunk, BTRFS_STRIPE_LEN);
btrfs_set_chunk_type(buf, chunk, BTRFS_BLOCK_GROUP_SYSTEM);
@@ -411,8 +414,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
(unsigned long)btrfs_dev_extent_chunk_tree_uuid(dev_extent),
BTRFS_UUID_SIZE);

- btrfs_set_dev_extent_length(buf, dev_extent,
- BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+ btrfs_set_dev_extent_length(buf, dev_extent, system_group_size);
nritems++;

btrfs_set_header_bytenr(buf, cfg->blocks[MKFS_DEV_TREE]);
diff --git a/mkfs/common.h b/mkfs/common.h
index 28912906..d0e4c7b2 100644
--- a/mkfs/common.h
+++ b/mkfs/common.h
@@ -53,6 +53,7 @@ struct btrfs_mkfs_config {
u64 features;
/* Size of the filesystem in bytes */
u64 num_bytes;
+ u64 zone_size;

/* Output fields, set during creation */

diff --git a/mkfs/main.c b/mkfs/main.c
index 83969b4b..f940eba1 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -60,8 +60,12 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
u64 bytes_used;
u64 chunk_start = 0;
u64 chunk_size = 0;
+ u64 system_group_size = 0;
int ret;

+ system_group_size = fs_info->fs_devices->hmzoned ?
+ fs_info->fs_devices->zone_size : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+
trans = btrfs_start_transaction(root, 1);
BUG_ON(IS_ERR(trans));
bytes_used = btrfs_super_bytes_used(fs_info->super_copy);
@@ -74,8 +78,8 @@ static int create_metadata_block_groups(struct btrfs_root *root, int mixed,
ret = btrfs_make_block_group(trans, fs_info, bytes_used,
BTRFS_BLOCK_GROUP_SYSTEM,
BTRFS_BLOCK_RESERVED_1M_FOR_SUPER,
- BTRFS_MKFS_SYSTEM_GROUP_SIZE);
- allocation->system += BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+ system_group_size);
+ allocation->system += system_group_size;
if (ret)
return ret;

@@ -700,6 +704,7 @@ int main(int argc, char **argv)
int metadata_profile_opt = 0;
int discard = 1;
int ssd = 0;
+ int hmzoned = 0;
int force_overwrite = 0;
char *source_dir = NULL;
bool source_dir_set = false;
@@ -713,6 +718,7 @@ int main(int argc, char **argv)
u64 features = BTRFS_MKFS_DEFAULT_FEATURES;
struct mkfs_allocation allocation = { 0 };
struct btrfs_mkfs_config mkfs_cfg;
+ u64 system_group_size;

while(1) {
int c;
@@ -835,6 +841,8 @@ int main(int argc, char **argv)
if (dev_cnt == 0)
print_usage(1);

+ hmzoned = features & BTRFS_FEATURE_INCOMPAT_HMZONED;
+
if (source_dir_set && dev_cnt > 1) {
error("the option -r is limited to a single device");
goto error;
@@ -844,6 +852,11 @@ int main(int argc, char **argv)
goto error;
}

+ if (source_dir_set && hmzoned) {
+ error("The -r and hmzoned feature are incompatible\n");
+ exit(1);
+ }
+
if (*fs_uuid) {
uuid_t dummy_uuid;

@@ -875,6 +888,16 @@ int main(int argc, char **argv)

file = argv[optind++];
ssd = is_ssd(file);
+ if (hmzoned) {
+ if (zoned_model(file) == ZONED_NONE) {
+ error("%s: not a zoned block device\n", file);
+ exit(1);
+ }
+ if (!zone_size(file)) {
+ error("%s: zone size undefined\n", file);
+ exit(1);
+ }
+ }

/*
* Set default profiles according to number of added devices.
@@ -1026,7 +1049,8 @@ int main(int argc, char **argv)
ret = btrfs_prepare_device(fd, file, &dev_block_count, block_count,
(zero_end ? PREP_DEVICE_ZERO_END : 0) |
(discard ? PREP_DEVICE_DISCARD : 0) |
- (verbose ? PREP_DEVICE_VERBOSE : 0));
+ (verbose ? PREP_DEVICE_VERBOSE : 0) |
+ (hmzoned ? PREP_DEVICE_HMZONED : 0));
if (ret)
goto error;
if (block_count && block_count > dev_block_count) {
@@ -1037,9 +1061,11 @@ int main(int argc, char **argv)
}

/* To create the first block group and chunk 0 in make_btrfs */
- if (dev_block_count < BTRFS_MKFS_SYSTEM_GROUP_SIZE) {
+ system_group_size = hmzoned ?
+ zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE;
+ if (dev_block_count < system_group_size) {
error("device is too small to make filesystem, must be at least %llu",
- (unsigned long long)BTRFS_MKFS_SYSTEM_GROUP_SIZE);
+ (unsigned long long)system_group_size);
goto error;
}

@@ -1055,6 +1081,7 @@ int main(int argc, char **argv)
mkfs_cfg.sectorsize = sectorsize;
mkfs_cfg.stripesize = stripesize;
mkfs_cfg.features = features;
+ mkfs_cfg.zone_size = zone_size(file);

ret = make_btrfs(fd, &mkfs_cfg);
if (ret) {
@@ -1064,6 +1091,7 @@ int main(int argc, char **argv)

fs_info = open_ctree_fs_info(file, 0, 0, 0,
OPEN_CTREE_WRITES | OPEN_CTREE_TEMPORARY_SUPER);
+
if (!fs_info) {
error("open ctree failed");
goto error;
@@ -1137,7 +1165,8 @@ int main(int argc, char **argv)
block_count,
(verbose ? PREP_DEVICE_VERBOSE : 0) |
(zero_end ? PREP_DEVICE_ZERO_END : 0) |
- (discard ? PREP_DEVICE_DISCARD : 0));
+ (discard ? PREP_DEVICE_DISCARD : 0) |
+ (hmzoned ? PREP_DEVICE_HMZONED : 0));
if (ret) {
goto error;
}
@@ -1234,6 +1263,10 @@ raid_groups:
btrfs_group_profile_str(metadata_profile),
pretty_size(allocation.system));
printf("SSD detected: %s\n", ssd ? "yes" : "no");
+ printf("Zoned device: %s\n", hmzoned ? "yes" : "no");
+ if (hmzoned)
+ printf("Zone size: %s\n",
+ pretty_size(fs_info->fs_devices->zone_size));
btrfs_parse_features_to_string(features_buf, features);
printf("Incompat features: %s", features_buf);
printf("\n");
--
2.18.0


2018-08-10 07:07:03

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> A zoned block device consists of a number of zones. Zones are either
> conventional and accepting random writes or sequential and requiring that
> writes be issued in LBA order from each zone write pointer position. This
> patch series ensures that the sequential write constraint of sequential
> zones is respected while fundamentally not changing BtrFS block and I/O
> management for block stored in conventional zones.
>
> To achieve this, the default dev extent size of btrfs is changed on zoned
> block devices so that dev extents are always aligned to a zone. Allocation
> of blocks within a block group is changed so that the allocation is always
> sequential from the beginning of the block groups. To do so, an allocation
> pointer is added to block groups and used as the allocation hint. The
> allocation changes also ensures that block freed below the allocation
> pointer are ignored, resulting in sequential block allocation regardless of
> the block group usage.
>
> While the introduction of the allocation pointer ensure that blocks will be
> allocated sequentially, I/Os to write out newly allocated blocks may be
> issued out of order, causing errors when writing to sequential zones. This
> problem s solved by introducing a submit_buffer() function and changes to
> the internal I/O scheduler to ensure in-order issuing of write I/Os for
> each chunk and corresponding to the block allocation order in the chunk.
>
> The zones of a chunk are reset to allow reusing of the zone only when the
> block group is being freed, that is, when all the extents of the block group
> are unused.
>
> For btrfs volumes composed of multiple zoned disks, restrictions are added
> to ensure that all disks have the same zone size. This matches the existing
> constraint that all dev extents in a chunk must have the same size.
>
> It requires zoned block devices to test the patchset. Even if you don't
> have zone devices, you can use tcmu-runner [1] to emulate zoned block
> devices. It can export emulated zoned block devices via iSCSI. Please see
> the README.md of tcmu-runner [2] for howtos to generate a zoned block
> device on tcmu-runner.
>
> [1] https://github.com/open-iscsi/tcmu-runner
> [2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md
>
> Patch 1 introduces the HMZONED incompatible feature flag to indicate that
> the btrfs volume was formatted for use on zoned block devices.
>
> Patches 2 and 3 implement functions to gather information on the zones of
> the device (zones type and write pointer position).
>
> Patch 4 restrict the possible locations of super blocks to conventional
> zones to preserve the existing update in-place mechanism for the super
> blocks.
>
> Patches 5 to 7 disable features which are not compatible with the sequential
> write constraints of zoned block devices. This includes fallocate and
> direct I/O support. Device replace is also disabled for now.
>
> Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
> implement sequential block allocation in block groups and chunks.
>
> Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
> write I/O delivery to the device zones.
>
> Patches 13 to 16 modify several parts of btrfs to handle free blocks
> without breaking the sequential block allocation and sequential write order
> as well as zone reset for unused chunks.
>
> Finally, patch 17 adds the HMZONED feature to the list of supported
> features.
>
Thanks for doing all the work.
However, the patches don't apply cleanly to current master branch.
Can you please rebase them?

Thanks.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-10 07:27:32

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> A zoned block device consists of a number of zones. Zones are either
> conventional and accepting random writes or sequential and requiring that
> writes be issued in LBA order from each zone write pointer position. This
> patch series ensures that the sequential write constraint of sequential
> zones is respected while fundamentally not changing btrfs block and I/O
> management for blocks stored in conventional zones.
>
> To achieve this, the default dev extent size of btrfs is changed on zoned
> block devices so that dev extents are always aligned to a zone. Allocation
> of blocks within a block group is changed so that the allocation is always
> sequential from the beginning of the block groups. To do so, an allocation
> pointer is added to block groups and used as the allocation hint. The
> allocation changes also ensure that blocks freed below the allocation
> pointer are ignored, resulting in sequential block allocation regardless of
> the block group usage.
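[Editorial note: the allocation-pointer scheme described in the quoted paragraph can be sketched in userspace C. All names here are illustrative, not the actual btrfs implementation; the real allocator works on extents inside block groups, but the forward-only pointer idea is the same.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the allocation-pointer scheme: blocks are
 * handed out sequentially from the start of a block group, and frees
 * below the pointer are ignored, so the write position only ever moves
 * forward regardless of block group usage.
 */
struct block_group {
	uint64_t start;     /* first byte of the group */
	uint64_t length;    /* total size of the group */
	uint64_t alloc_ptr; /* next sequential allocation offset */
};

/* Returns the start of the allocated range, or UINT64_MAX when full. */
static uint64_t bg_alloc(struct block_group *bg, uint64_t size)
{
	if (bg->alloc_ptr + size > bg->length)
		return UINT64_MAX;
	uint64_t ret = bg->start + bg->alloc_ptr;
	bg->alloc_ptr += size;
	return ret;
}

/*
 * Frees below the pointer are ignored; space is reclaimed only by a
 * whole-group zone reset once every extent in the group is unused.
 */
static void bg_free(struct block_group *bg, uint64_t offset, uint64_t size)
{
	(void)bg; (void)offset; (void)size;
}
```

A freed range does not become allocatable again, which is exactly what keeps the on-device write pattern sequential.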
>
> While the introduction of the allocation pointer ensures that blocks will be
> allocated sequentially, I/Os to write out newly allocated blocks may be
> issued out of order, causing errors when writing to sequential zones. This
> problem is solved by introducing a submit_buffer() function and changes to
> the internal I/O scheduler to ensure in-order issuing of write I/Os for
> each chunk, matching the block allocation order in the chunk.
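[Editorial note: a minimal sketch of the in-order submission idea from the quoted paragraph. This is not the actual submit_buffer() from the series; names and the fixed-size parking array are assumptions made purely for illustration.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes ahead of the zone write pointer are parked; a write at the
 * pointer is issued and then drains any parked writes that became
 * in-order, so the device only ever sees sequential positions. */
#define MAX_PENDING 16

struct pending { uint64_t pos; int valid; };

struct submit_buffer {
	uint64_t write_ptr;               /* next position the zone accepts */
	struct pending slots[MAX_PENDING];
};

/* Returns how many writes were actually issued by this call. */
static int submit(struct submit_buffer *sb, uint64_t pos)
{
	if (pos != sb->write_ptr) {
		/* Out of order: park it instead of issuing it. */
		for (size_t i = 0; i < MAX_PENDING; i++)
			if (!sb->slots[i].valid) {
				sb->slots[i] = (struct pending){ pos, 1 };
				break;
			}
		return 0;
	}
	int issued = 1;
	sb->write_ptr++;
	/* Drain parked writes that are now contiguous with the pointer. */
	for (int again = 1; again; ) {
		again = 0;
		for (size_t i = 0; i < MAX_PENDING; i++)
			if (sb->slots[i].valid &&
			    sb->slots[i].pos == sb->write_ptr) {
				sb->slots[i].valid = 0;
				sb->write_ptr++;
				issued++;
				again = 1;
			}
	}
	return issued;
}
```

Submitting positions 2, 1, 0 in that order issues nothing until 0 arrives, at which point all three go out in order.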
>
> The zones of a chunk are reset to allow reusing of the zone only when the
> block group is being freed, that is, when all the extents of the block group
> are unused.
>
> For btrfs volumes composed of multiple zoned disks, restrictions are added
> to ensure that all disks have the same zone size. This matches the existing
> constraint that all dev extents in a chunk must have the same size.
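[Editorial note: the multi-device restriction above amounts to a simple uniformity check at mount time. A sketch under assumed names (`check_zone_sizes` is not a function from the series):]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Every zoned disk in the volume must report the same zone size; a
 * zone size of 0 stands for a non-zoned device here, which imposes no
 * constraint in this simplified model. Returns 0 on success. */
static int check_zone_sizes(const uint64_t *zone_sizes, size_t ndevs)
{
	uint64_t expected = 0;
	for (size_t i = 0; i < ndevs; i++) {
		if (!zone_sizes[i])
			continue;
		if (!expected)
			expected = zone_sizes[i];
		else if (zone_sizes[i] != expected)
			return -1;  /* mixed zone sizes: reject the mount */
	}
	return 0;
}
```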
>
> Zoned block devices are required to test the patchset. Even if you don't
> have zone devices, you can use tcmu-runner [1] to emulate zoned block
> devices. It can export emulated zoned block devices via iSCSI. Please see
> the README.md of tcmu-runner [2] for howtos to generate a zoned block
> device on tcmu-runner.
>
> [1] https://github.com/open-iscsi/tcmu-runner
> [2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md
>
> Patch 1 introduces the HMZONED incompatible feature flag to indicate that
> the btrfs volume was formatted for use on zoned block devices.
>
> Patches 2 and 3 implement functions to gather information on the zones of
> the device (zone type and write pointer position).
>
> Patch 4 restricts the possible locations of super blocks to conventional
> zones to preserve the existing update in-place mechanism for the super
> blocks.
>
> Patches 5 to 7 disable features which are not compatible with the sequential
> write constraints of zoned block devices. This includes fallocate and
> direct I/O support. Device replace is also disabled for now.
>
> Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
> implement sequential block allocation in block groups and chunks.
>
> Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
> write I/O delivery to the device zones.
>
> Patches 13 to 16 modify several parts of btrfs to handle free blocks
> without breaking the sequential block allocation and sequential write order
> as well as zone reset for unused chunks.
>
> Finally, patch 17 adds the HMZONED feature to the list of supported
> features.
>
> Naohiro Aota (17):
> btrfs: introduce HMZONED feature flag
> btrfs: Get zone information of zoned block devices
> btrfs: Check and enable HMZONED mode
> btrfs: limit super block locations in HMZONED mode
> btrfs: disable fallocate in HMZONED mode
> btrfs: disable direct IO in HMZONED mode
> btrfs: disable device replace in HMZONED mode
> btrfs: align extent allocation to zone boundary
> btrfs: do sequential allocation on HMZONED drives
> btrfs: split btrfs_map_bio()
> btrfs: introduce submit buffer
> btrfs: expire submit buffer on timeout
> btrfs: avoid sync IO prioritization on checksum in HMZONED mode
> btrfs: redirty released extent buffers in sequential BGs
> btrfs: reset zones of unused block groups
> btrfs: wait existing extents before truncating
> btrfs: enable to mount HMZONED incompat flag
>
And unfortunately this series fails to boot for me:

BTRFS error (device nvme0n1p5): zoned devices mixed with regular devices
BTRFS error (device nvme0n1p5): failed to init hmzoned mode: -22
BTRFS error (device nvme0n1p5): open_ctree failed

Needless to say, /dev/nvme0n1p5 is _not_ a zoned device.
Nor does the zoned device have a btrfs superblock ATM.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-10 07:30:24

by Qu Wenruo

Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support



On 8/10/18 2:04 AM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> A zoned block device consists of a number of zones. Zones are either
> conventional and accepting random writes or sequential and requiring that
> writes be issued in LBA order from each zone write pointer position.

Not familiar with zoned block devices, especially the sequential case.

Is the sequential case tape-like?

> This
> patch series ensures that the sequential write constraint of sequential
> zones is respected while fundamentally not changing btrfs block and I/O
> management for blocks stored in conventional zones.
>
> To achieve this, the default dev extent size of btrfs is changed on zoned
> block devices so that dev extents are always aligned to a zone. Allocation
> of blocks within a block group is changed so that the allocation is always
> sequential from the beginning of the block groups. To do so, an allocation
> pointer is added to block groups and used as the allocation hint. The
> allocation changes also ensure that blocks freed below the allocation
> pointer are ignored, resulting in sequential block allocation regardless of
> the block group usage.

This looks like it would cause a lot of holes for metadata block groups.
It would be better to avoid metadata block allocation in such sequential
zone.
(And that would need the infrastructure to make extent allocator
priority-aware)

>
> While the introduction of the allocation pointer ensures that blocks will be
> allocated sequentially, I/Os to write out newly allocated blocks may be
> issued out of order, causing errors when writing to sequential zones. This
> problem is solved by introducing a submit_buffer() function and changes to
> the internal I/O scheduler to ensure in-order issuing of write I/Os for
> each chunk, matching the block allocation order in the chunk.
>
> The zones of a chunk are reset to allow reusing of the zone only when the
> block group is being freed, that is, when all the extents of the block group
> are unused.
>
> For btrfs volumes composed of multiple zoned disks, restrictions are added
> to ensure that all disks have the same zone size. This matches the existing
> constraint that all dev extents in a chunk must have the same size.
>
> Zoned block devices are required to test the patchset. Even if you don't
> have zone devices, you can use tcmu-runner [1] to emulate zoned block
> devices. It can export emulated zoned block devices via iSCSI. Please see
> the README.md of tcmu-runner [2] for howtos to generate a zoned block
> device on tcmu-runner.
>
> [1] https://github.com/open-iscsi/tcmu-runner
> [2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md
>
> Patch 1 introduces the HMZONED incompatible feature flag to indicate that
> the btrfs volume was formatted for use on zoned block devices.
>
> Patches 2 and 3 implement functions to gather information on the zones of
> the device (zone type and write pointer position).
>
> Patch 4 restricts the possible locations of super blocks to conventional
> zones to preserve the existing update in-place mechanism for the super
> blocks.
>
> Patches 5 to 7 disable features which are not compatible with the sequential
> write constraints of zoned block devices. This includes fallocate and
> direct I/O support. Device replace is also disabled for now.
>
> Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
> implement sequential block allocation in block groups and chunks.
>
> Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
> write I/O delivery to the device zones.
>
> Patches 13 to 16 modify several parts of btrfs to handle free blocks
> without breaking the sequential block allocation and sequential write order
> as well as zone reset for unused chunks.
>
> Finally, patch 17 adds the HMZONED feature to the list of supported
> features.
>
> Naohiro Aota (17):
> btrfs: introduce HMZONED feature flag
> btrfs: Get zone information of zoned block devices
> btrfs: Check and enable HMZONED mode
> btrfs: limit super block locations in HMZONED mode
> btrfs: disable fallocate in HMZONED mode
> btrfs: disable direct IO in HMZONED mode
> btrfs: disable device replace in HMZONED mode
> btrfs: align extent allocation to zone boundary

According to the patch name, I thought it was about extent allocation, but
in fact it's about dev extent allocation.
Renaming the patch would make more sense.

> btrfs: do sequential allocation on HMZONED drives

And this is the patch modifying extent allocator.

Despite that, the zoned storage support looks pretty interesting and
has something in common with the planned priority-aware extent allocator.

Thanks,
Qu

> btrfs: split btrfs_map_bio()
> btrfs: introduce submit buffer
> btrfs: expire submit buffer on timeout
> btrfs: avoid sync IO prioritization on checksum in HMZONED mode
> btrfs: redirty released extent buffers in sequential BGs
> btrfs: reset zones of unused block groups
> btrfs: wait existing extents before truncating
> btrfs: enable to mount HMZONED incompat flag
>
> fs/btrfs/async-thread.c | 1 +
> fs/btrfs/async-thread.h | 1 +
> fs/btrfs/ctree.h | 36 ++-
> fs/btrfs/dev-replace.c | 10 +
> fs/btrfs/disk-io.c | 48 +++-
> fs/btrfs/extent-tree.c | 281 +++++++++++++++++-
> fs/btrfs/extent_io.c | 1 +
> fs/btrfs/extent_io.h | 1 +
> fs/btrfs/file.c | 4 +
> fs/btrfs/free-space-cache.c | 36 +++
> fs/btrfs/free-space-cache.h | 10 +
> fs/btrfs/inode.c | 14 +
> fs/btrfs/super.c | 32 ++-
> fs/btrfs/sysfs.c | 2 +
> fs/btrfs/transaction.c | 32 +++
> fs/btrfs/transaction.h | 3 +
> fs/btrfs/volumes.c | 551 ++++++++++++++++++++++++++++++++++--
> fs/btrfs/volumes.h | 37 +++
> include/uapi/linux/btrfs.h | 1 +
> 19 files changed, 1061 insertions(+), 40 deletions(-)
>



2018-08-10 08:29:54

by Nikolay Borisov

Subject: Re: [RFC PATCH 02/17] btrfs: Get zone information of zoned block devices



On 9.08.2018 21:04, Naohiro Aota wrote:
> If a zoned block device is found, get its zone information (number of zones
> and zone size) using the new helper function btrfs_get_dev_zone(). To
> avoid costly run-time zone report commands to test the device zone type
> during block allocation, attach the seq_zones bitmap to the device structure
> to indicate if a zone is sequential or accept random writes.
>
> This patch also introduces the helper function btrfs_dev_is_sequential() to
> test if the zone storing a block is a sequential write required zone.
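[Editorial note: the per-zone bitmap described in this commit message can be illustrated in userspace C. The struct and helper names below mirror the patch loosely but are simplified assumptions, using plain bit arithmetic instead of the kernel's set_bit()/test_bit().]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One bit per zone, set when the zone is sequential-write-required, so
 * the allocator can test a zone's type without a runtime report command. */
#define NR_ZONES 64
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct dev_zoneinfo {
	uint64_t zone_size_shift;  /* log2(zone size in bytes) */
	unsigned long seq_zones[NR_ZONES / BITS_PER_LONG];
};

static void set_seq(struct dev_zoneinfo *d, unsigned int zno)
{
	d->seq_zones[zno / BITS_PER_LONG] |= 1UL << (zno % BITS_PER_LONG);
}

/* Loosely mirrors btrfs_dev_is_sequential(): map a byte position to a
 * zone number by shifting, then test the corresponding bit. */
static int is_sequential(const struct dev_zoneinfo *d, uint64_t pos)
{
	unsigned int zno = pos >> d->zone_size_shift;

	return !!(d->seq_zones[zno / BITS_PER_LONG] &
		  (1UL << (zno % BITS_PER_LONG)));
}
```

The point of caching the bitmap is that the type test becomes a shift and a bit test, cheap enough to sit on the block allocation path.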
>
> Signed-off-by: Damien Le Moal <[email protected]>
> Signed-off-by: Naohiro Aota <[email protected]>
> ---
> fs/btrfs/volumes.c | 146 +++++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/volumes.h | 32 ++++++++++
> 2 files changed, 178 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index da86706123ff..35b3a2187653 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -677,6 +677,134 @@ static void btrfs_free_stale_devices(const char *path,
> }
> }
>
> +static int __btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
> + struct blk_zone **zones,
> + unsigned int *nr_zones, gfp_t gfp_mask)
> +{
> + struct blk_zone *z = *zones;
> + int ret;
> +
> + if (!z) {
> + z = kcalloc(*nr_zones, sizeof(struct blk_zone), GFP_KERNEL);
> + if (!z)
> + return -ENOMEM;
> + }
> +
> + ret = blkdev_report_zones(device->bdev, pos >> 9,
> + z, nr_zones, gfp_mask);
> + if (ret != 0) {
> + pr_err("BTRFS: Get zone at %llu failed %d\n",
> + pos, ret);

For errors please use btrfs_err, you have fs_info instance from the
passed btrfs_device struct.
> + return ret;
> + }
> +
> + *zones = z;
> +
> + return 0;
> +}
> +
> +static void btrfs_drop_dev_zonetypes(struct btrfs_device *device)

nit: I'd rather have this function named btrfs_destroy_dev_zonetypes but
have no strong preference either way. It just seems to be the wider
convention in the code.

> +{
> + kfree(device->seq_zones);
> + kfree(device->empty_zones);
> + device->seq_zones = NULL;
> + device->empty_zones = NULL;
> + device->nr_zones = 0;
> + device->zone_size = 0;
> + device->zone_size_shift = 0;
> +}
> +
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> + struct blk_zone *zone, gfp_t gfp_mask)
> +{
> + unsigned int nr_zones = 1;
> + int ret;
> +
> + ret = __btrfs_get_dev_zones(device, pos, &zone, &nr_zones, gfp_mask);
> + if (ret != 0 || !nr_zones)
> + return ret ? ret : -EIO;
> +
> + return 0;
> +}

This helper seems unused, why not just merge it with __btrfs_get_dev_zones?

> +
> +static int btrfs_get_dev_zonetypes(struct btrfs_device *device)
> +{
> + struct block_device *bdev = device->bdev;
> + sector_t nr_sectors = bdev->bd_part->nr_sects;
> + sector_t sector = 0;
> + struct blk_zone *zones = NULL;
> + unsigned int i, n = 0, nr_zones;
> + int ret;
> +
> + device->zone_size = 0;
> + device->zone_size_shift = 0;
> + device->nr_zones = 0;
> + device->seq_zones = NULL;
> + device->empty_zones = NULL;
> +
> + if (!bdev_is_zoned(bdev))
> + return 0;
> +

Calling this function is already predicated on the above check being
true so this seems a bit redundant. So either leave this check here and
remove it from the 2 call sites (in btrfs_init_new_device and
btrfs_open_one_device) or do the opposite.

> + device->zone_size = (u64)bdev_zone_sectors(bdev) << 9;
> + device->zone_size_shift = ilog2(device->zone_size);
> + device->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
> + if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
> + device->nr_zones++;
> +
> + device->seq_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> + sizeof(*device->seq_zones), GFP_KERNEL);
> + if (!device->seq_zones)
> + return -ENOMEM;
> +
> + device->empty_zones = kcalloc(BITS_TO_LONGS(device->nr_zones),
> + sizeof(*device->empty_zones), GFP_KERNEL);
> + if (!device->empty_zones)
> + return -ENOMEM;
> +
> +#define BTRFS_REPORT_NR_ZONES 4096
> +
> + /* Get zones type */
> + while (sector < nr_sectors) {
> + nr_zones = BTRFS_REPORT_NR_ZONES;
> + ret = __btrfs_get_dev_zones(device, sector << 9,

blkdev_report_zones' (which is called from __btrfs_get_dev_zones) second
argument is a sector number, yet you first convert the sector to a byte
offset and then do the opposite shift again to prepare the argument for
the function. Just pass the sector straight through, and if you need the
byte position for printing the error, do the necessary shift in the
btrfs_err statement.


Furthermore, wouldn't the code be more obvious if you factored out
the allocation of the zones buffer from __btrfs_get_dev_zones above this
loop? Afterwards __btrfs_get_dev_zones can be open coded as a single
call to blkdev_report_zones and everything will be obvious just from the
body of this loop.

> + &zones, &nr_zones, GFP_KERNEL);
> + if (ret != 0 || !nr_zones) {
> + if (!ret)
> + ret = -EIO;
> + goto out;
> + }
> +
> + for (i = 0; i < nr_zones; i++) {
> + if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> + set_bit(n, device->seq_zones);
> + if (zones[i].cond == BLK_ZONE_COND_EMPTY)
> + set_bit(n, device->empty_zones);
> + sector = zones[i].start + zones[i].len;
> + n++;
> + }
> + }
> +
> + if (n != device->nr_zones) {
> + pr_err("BTRFS: Inconsistent number of zones (%u / %u)\n",
> + n, device->nr_zones);

btrfs_err
> + ret = -EIO;
> + goto out;
> + }
> +
> + pr_info("BTRFS: host-%s zoned block device, %u zones of %llu sectors\n",
> + bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
> + device->nr_zones, device->zone_size >> 9);
> +
btrfs_info

> +out:
> + kfree(zones);
> +
> + if (ret)
> + btrfs_drop_dev_zonetypes(device);
> +
> + return ret;
> +}
> +
> +
> static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> struct btrfs_device *device, fmode_t flags,
> void *holder)
> @@ -726,6 +854,13 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
> clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
> device->mode = flags;
>
> + /* Get zone type information of zoned block devices */
> + if (bdev_is_zoned(bdev)) {
> + ret = btrfs_get_dev_zonetypes(device);
> + if (ret != 0)
> + goto error_brelse;
> + }
> +
> fs_devices->open_devices++;
> if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
> device->devid != BTRFS_DEV_REPLACE_DEVID) {
> @@ -1012,6 +1147,7 @@ static void btrfs_close_bdev(struct btrfs_device *device)
> }
>
> blkdev_put(device->bdev, device->mode);
> + btrfs_drop_dev_zonetypes(device);
> }
>
> static void btrfs_close_one_device(struct btrfs_device *device)
> @@ -2439,6 +2575,15 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> mutex_unlock(&fs_info->chunk_mutex);
> mutex_unlock(&fs_devices->device_list_mutex);
>
> + /* Get zone type information of zoned block devices */
> + if (bdev_is_zoned(bdev)) {
> + ret = btrfs_get_dev_zonetypes(device);
> + if (ret) {
> + btrfs_abort_transaction(trans, ret);
> + goto error_sysfs;
> + }
> + }
> +
> if (seeding_dev) {
> mutex_lock(&fs_info->chunk_mutex);
> ret = init_first_rw_device(trans, fs_info);
> @@ -2504,6 +2649,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
> return ret;
>
> error_sysfs:
> + btrfs_drop_dev_zonetypes(device);
> btrfs_sysfs_rm_device_link(fs_devices, device);
> mutex_lock(&fs_info->fs_devices->device_list_mutex);
> mutex_lock(&fs_info->chunk_mutex);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 23e9285d88de..13d59bff204f 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -61,6 +61,16 @@ struct btrfs_device {
>
> struct block_device *bdev;
>
> + /*
> + * Number of zones, zone size and types of zones if bdev is a
> + * zoned block device.
> + */
> + u64 zone_size;
> + u8 zone_size_shift;
> + u32 nr_zones;
> + unsigned long *seq_zones;
> + unsigned long *empty_zones;
> +
> /* the mode sent to blkdev_get */
> fmode_t mode;
>
> @@ -404,6 +414,8 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
> int mirror_num, int async_submit);
> int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
> fmode_t flags, void *holder);
> +int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
> + struct blk_zone *zone, gfp_t gfp_mask);
> struct btrfs_device *btrfs_scan_one_device(const char *path,
> fmode_t flags, void *holder);
> int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
> @@ -466,6 +478,26 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
> u64 chunk_offset, u64 chunk_size);
> int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
>
> +static inline int btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
> +{
> + unsigned int zno = pos >> device->zone_size_shift;
> +
> + if (!device->seq_zones)
> + return 1;
> +
> + return test_bit(zno, device->seq_zones);
> +}
> +
> +static inline int btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
> +{
> + unsigned int zno = pos >> device->zone_size_shift;
> +
> + if (!device->empty_zones)
> + return 0;
> +
> + return test_bit(zno, device->empty_zones);
> +}
> +
> static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
> int index)
> {
>

2018-08-10 08:30:37

by Nikolay Borisov

Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support



On 9.08.2018 21:04, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> A zoned block device consists of a number of zones. Zones are either
> conventional and accepting random writes or sequential and requiring that
> writes be issued in LBA order from each zone write pointer position. This
> patch series ensures that the sequential write constraint of sequential
> zones is respected while fundamentally not changing btrfs block and I/O
> management for blocks stored in conventional zones.
>
> To achieve this, the default dev extent size of btrfs is changed on zoned
> block devices so that dev extents are always aligned to a zone. Allocation
> of blocks within a block group is changed so that the allocation is always
> sequential from the beginning of the block groups. To do so, an allocation
> pointer is added to block groups and used as the allocation hint. The
> allocation changes also ensure that blocks freed below the allocation
> pointer are ignored, resulting in sequential block allocation regardless of
> the block group usage.
>
> While the introduction of the allocation pointer ensures that blocks will be
> allocated sequentially, I/Os to write out newly allocated blocks may be
> issued out of order, causing errors when writing to sequential zones. This
> problem is solved by introducing a submit_buffer() function and changes to
> the internal I/O scheduler to ensure in-order issuing of write I/Os for
> each chunk, matching the block allocation order in the chunk.
>
> The zones of a chunk are reset to allow reusing of the zone only when the
> block group is being freed, that is, when all the extents of the block group
> are unused.
>
> For btrfs volumes composed of multiple zoned disks, restrictions are added
> to ensure that all disks have the same zone size. This matches the existing
> constraint that all dev extents in a chunk must have the same size.
>
> Zoned block devices are required to test the patchset. Even if you don't
> have zone devices, you can use tcmu-runner [1] to emulate zoned block
> devices. It can export emulated zoned block devices via iSCSI. Please see
> the README.md of tcmu-runner [2] for howtos to generate a zoned block
> device on tcmu-runner.
>
> [1] https://github.com/open-iscsi/tcmu-runner
> [2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md
>
> Patch 1 introduces the HMZONED incompatible feature flag to indicate that
> the btrfs volume was formatted for use on zoned block devices.
>
> Patches 2 and 3 implement functions to gather information on the zones of
> the device (zone type and write pointer position).
>
> Patch 4 restricts the possible locations of super blocks to conventional
> zones to preserve the existing update in-place mechanism for the super
> blocks.
>
> Patches 5 to 7 disable features which are not compatible with the sequential
> write constraints of zoned block devices. This includes fallocate and
> direct I/O support. Device replace is also disabled for now.
>
> Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
> implement sequential block allocation in block groups and chunks.
>
> Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
> write I/O delivery to the device zones.
>
> Patches 13 to 16 modify several parts of btrfs to handle free blocks
> without breaking the sequential block allocation and sequential write order
> as well as zone reset for unused chunks.
>
> Finally, patch 17 adds the HMZONED feature to the list of supported
> features.
>
> Naohiro Aota (17):
> btrfs: introduce HMZONED feature flag
> btrfs: Get zone information of zoned block devices
> btrfs: Check and enable HMZONED mode
> btrfs: limit super block locations in HMZONED mode
> btrfs: disable fallocate in HMZONED mode
> btrfs: disable direct IO in HMZONED mode
> btrfs: disable device replace in HMZONED mode
> btrfs: align extent allocation to zone boundary
> btrfs: do sequential allocation on HMZONED drives
> btrfs: split btrfs_map_bio()
> btrfs: introduce submit buffer
> btrfs: expire submit buffer on timeout
> btrfs: avoid sync IO prioritization on checksum in HMZONED mode
> btrfs: redirty released extent buffers in sequential BGs
> btrfs: reset zones of unused block groups
> btrfs: wait existing extents before truncating
> btrfs: enable to mount HMZONED incompat flag
>
> fs/btrfs/async-thread.c | 1 +
> fs/btrfs/async-thread.h | 1 +
> fs/btrfs/ctree.h | 36 ++-
> fs/btrfs/dev-replace.c | 10 +
> fs/btrfs/disk-io.c | 48 +++-
> fs/btrfs/extent-tree.c | 281 +++++++++++++++++-
> fs/btrfs/extent_io.c | 1 +
> fs/btrfs/extent_io.h | 1 +
> fs/btrfs/file.c | 4 +
> fs/btrfs/free-space-cache.c | 36 +++
> fs/btrfs/free-space-cache.h | 10 +
> fs/btrfs/inode.c | 14 +
> fs/btrfs/super.c | 32 ++-
> fs/btrfs/sysfs.c | 2 +
> fs/btrfs/transaction.c | 32 +++
> fs/btrfs/transaction.h | 3 +
> fs/btrfs/volumes.c | 551 ++++++++++++++++++++++++++++++++++--
> fs/btrfs/volumes.h | 37 +++
> include/uapi/linux/btrfs.h | 1 +
> 19 files changed, 1061 insertions(+), 40 deletions(-)
>

There are multiple places where you do naked shifts by
ilog2(sectorsize). There is a perfectly well named define: SECTOR_SHIFT
which is a lot more informative for someone who doesn't necessarily have
experience with linux storage/fs layers. Please fix such occurrences of
magic values shifting.
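[Editorial note: a small illustration of the suggestion above. In-kernel, SECTOR_SHIFT comes from <linux/blkdev.h>; it is defined locally here so the sketch stands alone.]

```c
#include <assert.h>
#include <stdint.h>

/* The named constant replaces the bare ">> 9" when converting between
 * byte offsets and 512-byte sector numbers, making the intent obvious. */
#define SECTOR_SHIFT 9
#define SECTOR_SIZE  (1 << SECTOR_SHIFT)

static uint64_t bytes_to_sector(uint64_t pos)
{
	return pos >> SECTOR_SHIFT;   /* readable, not a magic 9 */
}

static uint64_t sector_to_bytes(uint64_t sector)
{
	return sector << SECTOR_SHIFT;
}
```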

2018-08-10 08:31:08

by Nikolay Borisov

Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support



On 10.08.2018 10:53, Nikolay Borisov wrote:
>
>
> On 9.08.2018 21:04, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>>
>> A zoned block device consists of a number of zones. Zones are either
>> conventional and accepting random writes or sequential and requiring that
>> writes be issued in LBA order from each zone write pointer position. This
>> patch series ensures that the sequential write constraint of sequential
>> zones is respected while fundamentally not changing btrfs block and I/O
>> management for blocks stored in conventional zones.
>>
>> To achieve this, the default dev extent size of btrfs is changed on zoned
>> block devices so that dev extents are always aligned to a zone. Allocation
>> of blocks within a block group is changed so that the allocation is always
>> sequential from the beginning of the block groups. To do so, an allocation
>> pointer is added to block groups and used as the allocation hint. The
>> allocation changes also ensure that blocks freed below the allocation
>> pointer are ignored, resulting in sequential block allocation regardless of
>> the block group usage.
>>
>> While the introduction of the allocation pointer ensures that blocks will be
>> allocated sequentially, I/Os to write out newly allocated blocks may be
>> issued out of order, causing errors when writing to sequential zones. This
>> problem is solved by introducing a submit_buffer() function and changes to
>> the internal I/O scheduler to ensure in-order issuing of write I/Os for
>> each chunk, matching the block allocation order in the chunk.
>>
>> The zones of a chunk are reset to allow reusing of the zone only when the
>> block group is being freed, that is, when all the extents of the block group
>> are unused.
>>
>> For btrfs volumes composed of multiple zoned disks, restrictions are added
>> to ensure that all disks have the same zone size. This matches the existing
>> constraint that all dev extents in a chunk must have the same size.
>>
>> Zoned block devices are required to test the patchset. Even if you don't
>> have zone devices, you can use tcmu-runner [1] to emulate zoned block
>> devices. It can export emulated zoned block devices via iSCSI. Please see
>> the README.md of tcmu-runner [2] for howtos to generate a zoned block
>> device on tcmu-runner.
>>
>> [1] https://github.com/open-iscsi/tcmu-runner
>> [2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md
>>
>> Patch 1 introduces the HMZONED incompatible feature flag to indicate that
>> the btrfs volume was formatted for use on zoned block devices.
>>
>> Patches 2 and 3 implement functions to gather information on the zones of
>> the device (zone type and write pointer position).
>>
>> Patch 4 restricts the possible locations of super blocks to conventional
>> zones to preserve the existing update in-place mechanism for the super
>> blocks.
>>
>> Patches 5 to 7 disable features which are not compatible with the sequential
>> write constraints of zoned block devices. This includes fallocate and
>> direct I/O support. Device replace is also disabled for now.
>>
>> Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
>> implement sequential block allocation in block groups and chunks.
>>
>> Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
>> write I/O delivery to the device zones.
>>
>> Patches 13 to 16 modify several parts of btrfs to handle free blocks
>> without breaking the sequential block allocation and sequential write order
>> as well as zone reset for unused chunks.
>>
>> Finally, patch 17 adds the HMZONED feature to the list of supported
>> features.
>>
>> Naohiro Aota (17):
>> btrfs: introduce HMZONED feature flag
>> btrfs: Get zone information of zoned block devices
>> btrfs: Check and enable HMZONED mode
>> btrfs: limit super block locations in HMZONED mode
>> btrfs: disable fallocate in HMZONED mode
>> btrfs: disable direct IO in HMZONED mode
>> btrfs: disable device replace in HMZONED mode
>> btrfs: align extent allocation to zone boundary
>> btrfs: do sequential allocation on HMZONED drives
>> btrfs: split btrfs_map_bio()
>> btrfs: introduce submit buffer
>> btrfs: expire submit buffer on timeout
>> btrfs: avoid sync IO prioritization on checksum in HMZONED mode
>> btrfs: redirty released extent buffers in sequential BGs
>> btrfs: reset zones of unused block groups
>> btrfs: wait existing extents before truncating
>> btrfs: enable to mount HMZONED incompat flag
>>
>> fs/btrfs/async-thread.c | 1 +
>> fs/btrfs/async-thread.h | 1 +
>> fs/btrfs/ctree.h | 36 ++-
>> fs/btrfs/dev-replace.c | 10 +
>> fs/btrfs/disk-io.c | 48 +++-
>> fs/btrfs/extent-tree.c | 281 +++++++++++++++++-
>> fs/btrfs/extent_io.c | 1 +
>> fs/btrfs/extent_io.h | 1 +
>> fs/btrfs/file.c | 4 +
>> fs/btrfs/free-space-cache.c | 36 +++
>> fs/btrfs/free-space-cache.h | 10 +
>> fs/btrfs/inode.c | 14 +
>> fs/btrfs/super.c | 32 ++-
>> fs/btrfs/sysfs.c | 2 +
>> fs/btrfs/transaction.c | 32 +++
>> fs/btrfs/transaction.h | 3 +
>> fs/btrfs/volumes.c | 551 ++++++++++++++++++++++++++++++++++--
>> fs/btrfs/volumes.h | 37 +++
>> include/uapi/linux/btrfs.h | 1 +
>> 19 files changed, 1061 insertions(+), 40 deletions(-)
>>
>
> There are multiple places where you do naked shifts by
> ilog2(sectorsize). There is a perfectly well named define: SECTOR_SHIFT
> which is a lot more informative for someone who doesn't necessarily have
> experience with linux storage/fs layers. Please fix such occurrences of
> magic values shifting.
>

And Hannes just reminded me that this landed in commit
233bde21aa43 ("block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
<linux/blkdev.h>").

That was this March, so it might be fairly recent depending on the tree
you've based your work on.




2018-08-10 12:27:09

by Hannes Reinecke

Subject: Re: [RFC PATCH 03/17] btrfs: Check and enable HMZONED mode

On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> HMZONED mode cannot be used together with the RAID5/6 profile. Introduce
> the function btrfs_check_hmzoned_mode() to check this. This function will
> also check if HMZONED flag is enabled on the file system and if the file
> system consists of zoned devices with equal zone size.
>
> Additionally, as updates to the space cache are in-place, the space cache
> cannot be located over sequential zones and there are no guarantees that the
> device will have enough conventional zones to store this cache. Resolve
> this problem by completely disabling the space cache. This does not
> introduce any problems with sequential block groups: all the free space is
> located after the allocation pointer and there is no free space before the
> pointer. There is no need to have such a cache.
>
> Signed-off-by: Damien Le Moal <[email protected]>
> Signed-off-by: Naohiro Aota <[email protected]>
> ---
> fs/btrfs/ctree.h | 3 ++
> fs/btrfs/dev-replace.c | 7 ++++
> fs/btrfs/disk-io.c | 7 ++++
> fs/btrfs/super.c | 12 +++---
> fs/btrfs/volumes.c | 87 ++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/volumes.h | 1 +
> 6 files changed, 112 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 66f1d3895bca..14f880126532 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -763,6 +763,9 @@ struct btrfs_fs_info {
> struct btrfs_root *uuid_root;
> struct btrfs_root *free_space_root;
>
> + /* Zone size when in HMZONED mode */
> + u64 zone_size;
> +
> /* the log root tree is a directory of all the other log roots */
> struct btrfs_root *log_root_tree;
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index dec01970d8c5..839a35008fd8 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -202,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
> return PTR_ERR(bdev);
> }
>
> + if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
> + !btrfs_fs_incompat(fs_info, HMZONED)) ||
> + (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
> + ret = -EINVAL;
> + goto error;
> + }
> +
> filemap_write_and_wait(bdev->bd_inode->i_mapping);
>
> devices = &fs_info->fs_devices->devices;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5124c15705ce..14f284382ba7 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3057,6 +3057,13 @@ int open_ctree(struct super_block *sb,
>
> btrfs_free_extra_devids(fs_devices, 1);
>
> + ret = btrfs_check_hmzoned_mode(fs_info);
> + if (ret) {
> + btrfs_err(fs_info, "failed to init hmzoned mode: %d",
> + ret);
> + goto fail_block_groups;
> + }
> +
> ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
> if (ret) {
> btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 5fdd95e3de05..cc812e459197 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> bool saved_compress_force;
> int no_compress = 0;
>
> - cache_gen = btrfs_super_cache_generation(info->super_copy);
> - if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> - btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> - else if (cache_gen)
> - btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> + if (!btrfs_fs_incompat(info, HMZONED)) {
> + cache_gen = btrfs_super_cache_generation(info->super_copy);
> + if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> + btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> + else if (cache_gen)
> + btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> + }
>
> /*
> * Even the options are empty, we still need to do extra check
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 35b3a2187653..ba7ebb80de4d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1293,6 +1293,80 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
> return ret;
> }
>
> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
> +{
> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> + struct btrfs_device *device;
> + u64 hmzoned_devices = 0;
> + u64 nr_devices = 0;
> + u64 zone_size = 0;
> + int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
> + int ret = 0;
> +
> + /* Count zoned devices */
> + list_for_each_entry(device, &fs_devices->devices, dev_list) {
> + if (!device->bdev)
> + continue;
> + if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
> + (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
> + incompat_hmzoned)) {
> + hmzoned_devices++;
> + if (!zone_size) {
> + zone_size = device->zone_size;
> + } else if (device->zone_size != zone_size) {
> + btrfs_err(fs_info,
> + "Zoned block devices must have equal zone sizes");
> + ret = -EINVAL;
> + goto out;
> + }
> + }
> + nr_devices++;
> + }
> +
> + if (!hmzoned_devices && incompat_hmzoned) {
> + /* No zoned block device, disable HMZONED */
> + btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
> + ret = -EINVAL;
> + goto out;
> + }
> +
> + fs_info->zone_size = zone_size;
> +
> + if (hmzoned_devices != nr_devices) {
> + btrfs_err(fs_info,
> + "zoned devices mixed with regular devices");
> + ret = -EINVAL;
> + goto out;
> + }
> +
This breaks existing setups: since we're not checking whether the device
specified by fs_info is a zoned device, we'll fail here for normal devices.

You need this patch to fix it:

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 43eaf0142062..8609776c9a9e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1314,6 +1314,9 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info
*fs_info)
int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
int ret = 0;

+ if (!incompat_hmzoned)
+ return 0;
+
/* Count zoned devices */
list_for_each_entry(device, &fs_devices->devices, dev_list) {
if (!device->bdev)


Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-10 13:17:42

by Naohiro Aota

[permalink] [raw]
Subject: Re: [RFC PATCH 03/17] btrfs: Check and enable HMZONED mode

On Fri, Aug 10, 2018 at 02:25:33PM +0200, Hannes Reinecke wrote:
> On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> > HMZONED mode cannot be used together with the RAID5/6 profile. Introduce
> > the function btrfs_check_hmzoned_mode() to check this. This function will
> > also check if the HMZONED flag is enabled on the file system and if the file
> > system consists of zoned devices with equal zone size.
> >
> > Additionally, as updates to the space cache are in-place, the space cache
> > cannot be located over sequential zones and there is no guarantee that the
> > device will have enough conventional zones to store this cache. Resolve
> > this problem by completely disabling the space cache. This does not
> > introduce any problems with sequential block groups: all the free space is
> > located after the allocation pointer and there is no free space before the
> > pointer, so there is no need for such a cache.
> >
> > Signed-off-by: Damien Le Moal <[email protected]>
> > Signed-off-by: Naohiro Aota <[email protected]>
> > ---
> > fs/btrfs/ctree.h | 3 ++
> > fs/btrfs/dev-replace.c | 7 ++++
> > fs/btrfs/disk-io.c | 7 ++++
> > fs/btrfs/super.c | 12 +++---
> > fs/btrfs/volumes.c | 87 ++++++++++++++++++++++++++++++++++++++++++
> > fs/btrfs/volumes.h | 1 +
> > 6 files changed, 112 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 66f1d3895bca..14f880126532 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -763,6 +763,9 @@ struct btrfs_fs_info {
> > struct btrfs_root *uuid_root;
> > struct btrfs_root *free_space_root;
> >
> > + /* Zone size when in HMZONED mode */
> > + u64 zone_size;
> > +
> > /* the log root tree is a directory of all the other log roots */
> > struct btrfs_root *log_root_tree;
> >
> > diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> > index dec01970d8c5..839a35008fd8 100644
> > --- a/fs/btrfs/dev-replace.c
> > +++ b/fs/btrfs/dev-replace.c
> > @@ -202,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
> > return PTR_ERR(bdev);
> > }
> >
> > + if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
> > + !btrfs_fs_incompat(fs_info, HMZONED)) ||
> > + (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
> > + ret = -EINVAL;
> > + goto error;
> > + }
> > +
> > filemap_write_and_wait(bdev->bd_inode->i_mapping);
> >
> > devices = &fs_info->fs_devices->devices;
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 5124c15705ce..14f284382ba7 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3057,6 +3057,13 @@ int open_ctree(struct super_block *sb,
> >
> > btrfs_free_extra_devids(fs_devices, 1);
> >
> > + ret = btrfs_check_hmzoned_mode(fs_info);
> > + if (ret) {
> > + btrfs_err(fs_info, "failed to init hmzoned mode: %d",
> > + ret);
> > + goto fail_block_groups;
> > + }
> > +
> > ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
> > if (ret) {
> > btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 5fdd95e3de05..cc812e459197 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> > bool saved_compress_force;
> > int no_compress = 0;
> >
> > - cache_gen = btrfs_super_cache_generation(info->super_copy);
> > - if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> > - btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> > - else if (cache_gen)
> > - btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> > + if (!btrfs_fs_incompat(info, HMZONED)) {
> > + cache_gen = btrfs_super_cache_generation(info->super_copy);
> > + if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> > + btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> > + else if (cache_gen)
> > + btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> > + }
> >
> > /*
> > * Even the options are empty, we still need to do extra check
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 35b3a2187653..ba7ebb80de4d 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -1293,6 +1293,80 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
> > return ret;
> > }
> >
> > +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
> > +{
> > + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
> > + struct btrfs_device *device;
> > + u64 hmzoned_devices = 0;
> > + u64 nr_devices = 0;
> > + u64 zone_size = 0;
> > + int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
> > + int ret = 0;
> > +
> > + /* Count zoned devices */
> > + list_for_each_entry(device, &fs_devices->devices, dev_list) {
> > + if (!device->bdev)
> > + continue;
> > + if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
> > + (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
> > + incompat_hmzoned)) {
> > + hmzoned_devices++;
> > + if (!zone_size) {
> > + zone_size = device->zone_size;
> > + } else if (device->zone_size != zone_size) {
> > + btrfs_err(fs_info,
> > + "Zoned block devices must have equal zone sizes");
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > + }
> > + nr_devices++;
> > + }
> > +
> > + if (!hmzoned_devices && incompat_hmzoned) {
> > + /* No zoned block device, disable HMZONED */
> > + btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> > + fs_info->zone_size = zone_size;
> > +
> > + if (hmzoned_devices != nr_devices) {
> > + btrfs_err(fs_info,
> > + "zoned devices mixed with regular devices");
> > + ret = -EINVAL;
> > + goto out;
> > + }
> > +
> This breaks existing setups: since we're not checking whether the device
> specified by fs_info is a zoned device, we'll fail here for normal devices.

Ah, I forgot to deal with normal devices when I converted the HMZONED
mount flag to an incompat flag.

> You need this patch to fix it:

Thank you for fixing this. It's exactly what I wanted to do. I'll fix it
in the next version.

Regards,
Naohiro

> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 43eaf0142062..8609776c9a9e 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1314,6 +1314,9 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info
> *fs_info)
> int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
> int ret = 0;
>
> + if (!incompat_hmzoned)
> + return 0;
> +
> /* Count zoned devices */
> list_for_each_entry(device, &fs_devices->devices, dev_list) {
> if (!device->bdev)
>
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke zSeries & Storage
> [email protected] +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
> GF: F. Imend?rffer, J. Smithard, D. Upmanyu, G. Norton
> HRB 21284 (AG N?rnberg)

2018-08-10 13:42:24

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [RFC PATCH 03/17] btrfs: Check and enable HMZONED mode

On 08/10/2018 03:15 PM, Naohiro Aota wrote:
> On Fri, Aug 10, 2018 at 02:25:33PM +0200, Hannes Reinecke wrote:
>> On 08/09/2018 08:04 PM, Naohiro Aota wrote:
>>> HMZONED mode cannot be used together with the RAID5/6 profile. Introduce
>>> the function btrfs_check_hmzoned_mode() to check this. This function will
>>> also check if the HMZONED flag is enabled on the file system and if the file
>>> system consists of zoned devices with equal zone size.
>>>
>>> Additionally, as updates to the space cache are in-place, the space cache
>>> cannot be located over sequential zones and there is no guarantee that the
>>> device will have enough conventional zones to store this cache. Resolve
>>> this problem by completely disabling the space cache. This does not
>>> introduce any problems with sequential block groups: all the free space is
>>> located after the allocation pointer and there is no free space before the
>>> pointer, so there is no need for such a cache.
>>>
>>> Signed-off-by: Damien Le Moal <[email protected]>
>>> Signed-off-by: Naohiro Aota <[email protected]>
>>> ---
>>> fs/btrfs/ctree.h | 3 ++
>>> fs/btrfs/dev-replace.c | 7 ++++
>>> fs/btrfs/disk-io.c | 7 ++++
>>> fs/btrfs/super.c | 12 +++---
>>> fs/btrfs/volumes.c | 87 ++++++++++++++++++++++++++++++++++++++++++
>>> fs/btrfs/volumes.h | 1 +
>>> 6 files changed, 112 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 66f1d3895bca..14f880126532 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -763,6 +763,9 @@ struct btrfs_fs_info {
>>> struct btrfs_root *uuid_root;
>>> struct btrfs_root *free_space_root;
>>>
>>> + /* Zone size when in HMZONED mode */
>>> + u64 zone_size;
>>> +
>>> /* the log root tree is a directory of all the other log roots */
>>> struct btrfs_root *log_root_tree;
>>>
>>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>>> index dec01970d8c5..839a35008fd8 100644
>>> --- a/fs/btrfs/dev-replace.c
>>> +++ b/fs/btrfs/dev-replace.c
>>> @@ -202,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>>> return PTR_ERR(bdev);
>>> }
>>>
>>> + if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
>>> + !btrfs_fs_incompat(fs_info, HMZONED)) ||
>>> + (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
>>> + ret = -EINVAL;
>>> + goto error;
>>> + }
>>> +
>>> filemap_write_and_wait(bdev->bd_inode->i_mapping);
>>>
>>> devices = &fs_info->fs_devices->devices;
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 5124c15705ce..14f284382ba7 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3057,6 +3057,13 @@ int open_ctree(struct super_block *sb,
>>>
>>> btrfs_free_extra_devids(fs_devices, 1);
>>>
>>> + ret = btrfs_check_hmzoned_mode(fs_info);
>>> + if (ret) {
>>> + btrfs_err(fs_info, "failed to init hmzoned mode: %d",
>>> + ret);
>>> + goto fail_block_groups;
>>> + }
>>> +
>>> ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
>>> if (ret) {
>>> btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index 5fdd95e3de05..cc812e459197 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>>> bool saved_compress_force;
>>> int no_compress = 0;
>>>
>>> - cache_gen = btrfs_super_cache_generation(info->super_copy);
>>> - if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>>> - btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
>>> - else if (cache_gen)
>>> - btrfs_set_opt(info->mount_opt, SPACE_CACHE);
>>> + if (!btrfs_fs_incompat(info, HMZONED)) {
>>> + cache_gen = btrfs_super_cache_generation(info->super_copy);
>>> + if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>>> + btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
>>> + else if (cache_gen)
>>> + btrfs_set_opt(info->mount_opt, SPACE_CACHE);
>>> + }
>>>
>>> /*
>>> * Even the options are empty, we still need to do extra check
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 35b3a2187653..ba7ebb80de4d 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -1293,6 +1293,80 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
>>> return ret;
>>> }
>>>
>>> +int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
>>> +{
>>> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> + struct btrfs_device *device;
>>> + u64 hmzoned_devices = 0;
>>> + u64 nr_devices = 0;
>>> + u64 zone_size = 0;
>>> + int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
>>> + int ret = 0;
>>> +
>>> + /* Count zoned devices */
>>> + list_for_each_entry(device, &fs_devices->devices, dev_list) {
>>> + if (!device->bdev)
>>> + continue;
>>> + if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
>>> + (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
>>> + incompat_hmzoned)) {
>>> + hmzoned_devices++;
>>> + if (!zone_size) {
>>> + zone_size = device->zone_size;
>>> + } else if (device->zone_size != zone_size) {
>>> + btrfs_err(fs_info,
>>> + "Zoned block devices must have equal zone sizes");
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> + }
>>> + nr_devices++;
>>> + }
>>> +
>>> + if (!hmzoned_devices && incompat_hmzoned) {
>>> + /* No zoned block device, disable HMZONED */
>>> + btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>>> + fs_info->zone_size = zone_size;
>>> +
>>> + if (hmzoned_devices != nr_devices) {
>>> + btrfs_err(fs_info,
>>> + "zoned devices mixed with regular devices");
>>> + ret = -EINVAL;
>>> + goto out;
>>> + }
>>> +
>> This breaks existing setups: since we're not checking whether the device
>> specified by fs_info is a zoned device, we'll fail here for normal devices.
>
> Ah, I forgot to deal with normal devices when I converted the HMZONED
> mount flag to an incompat flag.
>
>> You need this patch to fix it:
>
> Thank you for fixing this. It's exactly what I wanted to do. I'll fix it
> in the next version.
>
Thanks.

Other than that it seems to be holding up quite well; I did a full 'git
clone && make oldconfig && make -j 16' on the upstream linux kernel with
no problems at all.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-10 14:25:53

by Naohiro Aota

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On Fri, Aug 10, 2018 at 09:04:59AM +0200, Hannes Reinecke wrote:
> On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
> >
> > A zoned block device consists of a number of zones. Zones are either
> > conventional and accepting random writes or sequential and requiring that
> > writes be issued in LBA order from each zone write pointer position. This
> > patch series ensures that the sequential write constraint of sequential
> > zones is respected while fundamentally not changing btrfs block and I/O
> > management for blocks stored in conventional zones.
> >
> > To achieve this, the default dev extent size of btrfs is changed on zoned
> > block devices so that dev extents are always aligned to a zone. Allocation
> > of blocks within a block group is changed so that the allocation is always
> > sequential from the beginning of the block groups. To do so, an allocation
> > pointer is added to block groups and used as the allocation hint. The
> > allocation changes also ensure that blocks freed below the allocation
> > pointer are ignored, resulting in sequential block allocation regardless of
> > the block group usage.
> >
> > While the introduction of the allocation pointer ensures that blocks will be
> > allocated sequentially, I/Os to write out newly allocated blocks may be
> > issued out of order, causing errors when writing to sequential zones. This
> > problem is solved by introducing a submit_buffer() function and changes to
> > the internal I/O scheduler to ensure in-order issuing of write I/Os for
> > each chunk and corresponding to the block allocation order in the chunk.
> >
> > The zones of a chunk are reset to allow reusing of the zone only when the
> > block group is being freed, that is, when all the extents of the block group
> > are unused.
> >
> > For btrfs volumes composed of multiple zoned disks, restrictions are added
> > to ensure that all disks have the same zone size. This matches the existing
> > constraint that all dev extents in a chunk must have the same size.
> >
> > Zoned block devices are required to test the patchset. Even if you don't
> > have zone devices, you can use tcmu-runner [1] to emulate zoned block
> > devices. It can export emulated zoned block devices via iSCSI. Please see
> > the README.md of tcmu-runner [2] for howtos to generate a zoned block
> > device on tcmu-runner.
> >
> > [1] https://github.com/open-iscsi/tcmu-runner
> > [2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md
> >
> > Patch 1 introduces the HMZONED incompatible feature flag to indicate that
> > the btrfs volume was formatted for use on zoned block devices.
> >
> > Patches 2 and 3 implement functions to gather information on the zones of
> > the device (zone types and write pointer positions).
> >
> > Patch 4 restricts the possible locations of super blocks to conventional
> > zones to preserve the existing in-place update mechanism for the super
> > blocks.
> >
> > Patches 5 to 7 disable features which are not compatible with the sequential
> > write constraints of zoned block devices. This includes fallocate and
> > direct I/O support. Device replace is also disabled for now.
> >
> > Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
> > implement sequential block allocation in block groups and chunks.
> >
> > Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
> > write I/O delivery to the device zones.
> >
> > Patches 13 to 16 modify several parts of btrfs to handle free blocks
> > without breaking the sequential block allocation and sequential write order
> > as well as zone reset for unused chunks.
> >
> > Finally, patch 17 adds the HMZONED feature to the list of supported
> > features.
> >
> Thanks for doing all the work.
> However, the patches don't apply cleanly to current master branch.
> Can you please rebase them?

I'm currently basing on the
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
for-next branch, since my previous bug-fix patch 266e010932ce ("btrfs:
revert fs_devices state on error of btrfs_init_new_device") is
necessary to avoid a use-after-free bug in the error handling path of
btrfs_init_new_device() in patch 2. I'm sorry for not mentioning it.

I'll rebase on the master branch when the patch reaches master.

Regards,
Naohiro

> Thanks.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke zSeries & Storage
> [email protected] +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)

2018-08-13 18:44:15

by David Sterba

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.

Yay, thanks!

As this is an RFC, I'll give you some comments. The code looks ok for what
it claims to do; I'll skip style and unimportant implementation details for
now as there are bigger questions.

Zoned devices bring some constraints, so not all filesystem features can
be expected to work. This rules out any form of in-place updates like
NODATACOW.

Then there's the list of 'how will zoned devices work with feature X?'

You disable fallocate and DIO. I haven't looked closer at the fallocate
case, but DIO could work in the sense that open() will open the file but
any write will fall back to buffered writes. This is already implemented,
so it would only need to be wired together.

Mixed device types are not allowed, and I tend to agree with that,
though this could work in principle. Just that the chunk allocator
would have to be aware of the device types and tweaked to allocate from
the same group. The btrfs code is not ready for that in terms of the
allocator capabilities and configuration options.

Device replace is disabled, but the changelog suggests there's a way to
make it work, so it's a matter of implementation. And this should be
implemented at the time of merge.

RAID5/6 + zoned support is highly desired and lack of it could be
considered a NAK for the whole series. The drive sizes are expected to
be several terabytes, so it sounds too risky to lack the redundancy
options (RAID1 is not sufficient here).

The changelog does not explain why this does not or cannot work, so I
cannot reason about that or possibly suggest workarounds or solutions.
But I think it should work in principle.

As this is first post and RFC I don't expect that everything is
implemented, but at least the known missing points should be documented.
You've implemented lots of the low-level zoned support and extent
allocation, so even if the raid56 might be difficult, it should be the
smaller part.

2018-08-13 19:22:15

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On 08/13/2018 08:42 PM, David Sterba wrote:
> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>
> Yay, thanks!
>
> As this is an RFC, I'll give you some comments. The code looks ok for what
> it claims to do; I'll skip style and unimportant implementation details for
> now as there are bigger questions.
>
> Zoned devices bring some constraints, so not all filesystem features can
> be expected to work. This rules out any form of in-place updates like
> NODATACOW.
>
> Then there's the list of 'how will zoned devices work with feature X?'
>
> You disable fallocate and DIO. I haven't looked closer at the fallocate
> case, but DIO could work in the sense that open() will open the file but
> any write will fall back to buffered writes. This is already implemented,
> so it would only need to be wired together.
>
> Mixed device types are not allowed, and I tend to agree with that,
> though this could work in principle. Just that the chunk allocator
> would have to be aware of the device types and tweaked to allocate from
> the same group. The btrfs code is not ready for that in terms of the
> allocator capabilities and configuration options.
>
> Device replace is disabled, but the changelog suggests there's a way to
> make it work, so it's a matter of implementation. And this should be
> implemented at the time of merge.
>
How would a device replace work in general?
While I do understand that device replace is possible with RAID
thingies, I somewhat fail to see how could do a device replacement
without RAID functionality.
Is it even possible?
If so, how would it be different from a simple umount?

> RAID5/6 + zoned support is highly desired and lack of it could be
> considered a NAK for the whole series. The drive sizes are expected to
> be several terabytes, so it sounds too risky to lack the redundancy
> options (RAID1 is not sufficient here).
>
That really depends on the allocator.
If we can make the RAID code work with zone-sized stripes it should
be pretty trivial. I can have a look at that; RAID support was on my
agenda anyway (albeit for MD, not for btrfs).

> The changelog does not explain why this does not or cannot work, so I
> cannot reason about that or possibly suggest workarounds or solutions.
> But I think it should work in principle.
>
As mentioned, it really should work for zone-sized stripes. I'm not sure
we can make it work with stripes smaller than the zone size.

> As this is first post and RFC I don't expect that everything is
> implemented, but at least the known missing points should be documented.
> You've implemented lots of the low-level zoned support and extent
> allocation, so even if the raid56 might be difficult, it should be the
> smaller part.
>
FYI, I've run a simple stress-test on a zoned device (git clone linus &&
make) and haven't found any issues; compilation ran without a
problem, and at quite decent speed.
Good job!

Cheers,

Hannes

2018-08-13 19:30:15

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On 2018-08-13 15:20, Hannes Reinecke wrote:
> On 08/13/2018 08:42 PM, David Sterba wrote:
>> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>>> This series adds zoned block device support to btrfs.
>>
>> Yay, thanks!
>>
>> As this is an RFC, I'll give you some comments. The code looks ok for what
>> it claims to do; I'll skip style and unimportant implementation details for
>> now as there are bigger questions.
>>
>> Zoned devices bring some constraints, so not all filesystem features can
>> be expected to work. This rules out any form of in-place updates like
>> NODATACOW.
>>
>> Then there's the list of 'how will zoned devices work with feature X?'
>>
>> You disable fallocate and DIO. I haven't looked closer at the fallocate
>> case, but DIO could work in the sense that open() will open the file but
>> any write will fall back to buffered writes. This is already implemented,
>> so it would only need to be wired together.
>>
>> Mixed device types are not allowed, and I tend to agree with that,
>> though this could work in principle.  Just that the chunk allocator
>> would have to be aware of the device types and tweaked to allocate from
>> the same group. The btrfs code is not ready for that in terms of the
>> allocator capabilities and configuration options.
>>
>> Device replace is disabled, but the changelog suggests there's a way to
>> make it work, so it's a matter of implementation. And this should be
>> implemented at the time of merge.
>>
> How would a device replace work in general?
> While I do understand that device replace is possible with RAID
> thingies, I somewhat fail to see how one could do a device replacement
> without RAID functionality.
> Is it even possible?
> If so, how would it be different from a simple umount?
Device replace is implemented in largely the same manner as most other
live data migration tools (for example, LVM2's pvmove command).

In short, when you issue a replace command for a given device, all
writes that would go to that device are instead sent to the new device.
While this is happening, old data is copied over from the old device to
the new one. Once all the data is copied, the old device is released
(and its BTRFS signature wiped), and the new device has its device ID
updated to that of the old device.

This is possible largely because of the COW infrastructure, but it's
implemented in a way that doesn't entirely depend on it (otherwise it
wouldn't work for NOCOW files).

Handling this on zoned devices is not likely to be easy, though: you
would functionally have to freeze I/O that would hit the device being
replaced so that you don't accidentally write to a sequential zone out
of order.
>
>> RAID5/6 + zoned support is highly desired and lack of it could be
>> considered a NAK for the whole series. The drive sizes are expected to
>> be several terabytes, that sounds be too risky to lack the redundancy
>> options (RAID1 is not sufficient here).
>>
> That really depends on the allocator.
> If we can make the RAID code work with zone-sized stripes it should
> be pretty trivial. I can have a look at that; RAID support was on my
> agenda anyway (albeit for MD, not for btrfs).
>
>> The changelog does not explain why this does not or cannot work, so I
>> cannot reason about that or possibly suggest workarounds or solutions.
>> But I think it should work in principle.
>>
> As mentioned, it really should work for zone-sized stripes. I'm not sure
> we can make it work with stripes smaller than the zone size.
>
>> As this is first post and RFC I don't expect that everything is
>> implemented, but at least the known missing points should be documented.
>> You've implemented lots of the low-level zoned support and extent
>> allocation, so even if the raid56 might be difficult, it should be the
>> smaller part.
>>
> FYI, I've run a simple stress-test on a zoned device (git clone linus &&
> make) and haven't found any issues with them; compilation ran without a
> problem, and with quite decent speed.
> Good job!
>
> Cheers,
>
> Hannes


2018-08-14 07:43:35

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On 08/13/2018 09:29 PM, Austin S. Hemmelgarn wrote:
> On 2018-08-13 15:20, Hannes Reinecke wrote:
>> On 08/13/2018 08:42 PM, David Sterba wrote:
>>> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>>>> This series adds zoned block device support to btrfs.
>>>
>>> Yay, thanks!
>>>
[ .. ]
>>> Device replace is disabled, but the changelog suggests there's a way to
>>> make it work, so it's a matter of implementation. And this should be
>>> implemented at the time of merge.
>>>
>> How would a device replace work in general?
>> While I do understand that device replace is possible with RAID
>> thingies, I somewhat fail to see how one could do a device replacement
>> without RAID functionality.
>> Is it even possible?
>> If so, how would it be different from a simple umount?
> Device replace is implemented in largely the same manner as most other
> live data migration tools (for example, LVM2's pvmove command).
>
> In short, when you issue a replace command for a given device, all
> writes that would go to that device are instead sent to the new device.
> While this is happening, old data is copied over from the old device to
> the new one.  Once all the data is copied, the old device is released
> (and its BTRFS signature wiped), and the new device has its device ID
> updated to that of the old device.
>
> This is possible largely because of the COW infrastructure, but it's
> implemented in a way that doesn't entirely depend on it (otherwise it
> wouldn't work for NOCOW files).
>
> Handling this on zoned devices is not likely to be easy though, you
> would functionally have to freeze I/O that would hit the device being
> replaced so that you don't accidentally write to a sequential zone out
> of order.

Ah. Oh. Hmm.

It would be possible in principle if we freeze accesses to any partially
filled zones on the original device. Then all new writes will be going
into new/empty zones on the new disks, and we can copy over the old data
with no issue at all.
We end up with some partially filled zones on the new disk, but they
really should be cleaned up eventually either by the allocator filling
up the partially filled zones or once garbage collection clears out
stale zones.
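That freeze idea can be sketched in a few lines. The `zone` struct,
`freeze_partial_zones`, and `alloc_zone` are hypothetical names for
illustration, not btrfs code:

```c
#include <assert.h>

#define NZONES 4
#define ZONE_LEN 8

struct zone {
	int wp;      /* write pointer: blocks written so far */
	int frozen;  /* no new allocations while the replace runs */
};

static struct zone zones[NZONES];

/* Freeze every partially filled zone so that copying it cannot race
 * with new writes; empty and full zones need no protection. */
static void freeze_partial_zones(void)
{
	int i;

	for (i = 0; i < NZONES; i++)
		if (zones[i].wp > 0 && zones[i].wp < ZONE_LEN)
			zones[i].frozen = 1;
}

/* Allocator: during the replace, new writes land only in unfrozen,
 * non-full zones, i.e. empty zones that map cleanly to the new disk. */
static int alloc_zone(void)
{
	int i;

	for (i = 0; i < NZONES; i++)
		if (!zones[i].frozen && zones[i].wp < ZONE_LEN)
			return i;
	return -1;
}
```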

However, I fear the required changes to the btrfs allocator are beyond
my btrfs knowledge :-(

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

2018-08-15 11:26:39

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On 2018-08-14 03:41, Hannes Reinecke wrote:
> On 08/13/2018 09:29 PM, Austin S. Hemmelgarn wrote:
>> On 2018-08-13 15:20, Hannes Reinecke wrote:
>>> On 08/13/2018 08:42 PM, David Sterba wrote:
>>>> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>>>>> This series adds zoned block device support to btrfs.
>>>>
>>>> Yay, thanks!
>>>>
> [ .. ]
>>>> Device replace is disabled, but the changelog suggests there's a way to
>>>> make it work, so it's a matter of implementation. And this should be
>>>> implemented at the time of merge.
>>>>
>>> How would a device replace work in general?
>>> While I do understand that device replace is possible with RAID
>>> thingies, I somewhat fail to see how one could do a device replacement
>>> without RAID functionality.
>>> Is it even possible?
>>> If so, how would it be different from a simple umount?
>> Device replace is implemented in largely the same manner as most other
>> live data migration tools (for example, LVM2's pvmove command).
>>
>> In short, when you issue a replace command for a given device, all
>> writes that would go to that device are instead sent to the new device.
>> While this is happening, old data is copied over from the old device to
>> the new one.  Once all the data is copied, the old device is released
>> (and its BTRFS signature wiped), and the new device has its device ID
>> updated to that of the old device.
>>
>> This is possible largely because of the COW infrastructure, but it's
>> implemented in a way that doesn't entirely depend on it (otherwise it
>> wouldn't work for NOCOW files).
>>
>> Handling this on zoned devices is not likely to be easy though, you
>> would functionally have to freeze I/O that would hit the device being
>> replaced so that you don't accidentally write to a sequential zone out
>> of order.
>
> Ah. Oh. Hmm.
>
> It would be possible in principle if we freeze accesses to any partially
> filled zones on the original device. Then all new writes will be going
> into new/empty zones on the new disks, and we can copy over the old data
> with no issue at all.
> We end up with some partially filled zones on the new disk, but they
> really should be cleaned up eventually either by the allocator filling
> up the partially filled zones or once garbage collection clears out
> stale zones.
>
> However, I fear the required changes to the btrfs allocator are beyond
> my btrfs knowledge :-(
The easy short term solution is to just disallow the replace command
(with the intent of getting it working in the future), but ensure that
the older style add/remove method works. That uses the balance code
internally, so it should honor any restrictions on block placement for
the new device, and therefore should be pretty easy to get working.


2018-08-16 16:37:20

by Naohiro Aota

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

On Fri, Aug 10, 2018 at 03:28:21PM +0800, Qu Wenruo wrote:
>
>
> On 8/10/18 2:04 AM, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
> >
> > A zoned block device consists of a number of zones. Zones are either
> > conventional and accepting random writes or sequential and requiring that
> > writes be issued in LBA order from each zone write pointer position.
>
> Not familiar with zoned block device, especially for the sequential case.
>
> Is that sequential case tape like?

It's somewhat similar to, but not the same as, tape drives. On tape
drives, you still *can* write in random access patterns, though it is
much slower. In sequential write required zones, writes must always be
issued sequentially within a zone; violating the sequential write rule
results in an I/O error.

One user of sequential write required zones is Host-Managed "Shingled
Magnetic Recording" (SMR) HDDs [1]. They increase volume capacity by
overlapping tracks; as a result, writing to one track overwrites the
adjacent tracks. This physical constraint forces the sequential write
pattern.

[1] https://en.wikipedia.org/wiki/Shingled_magnetic_recording
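The write rule can be modeled in a few lines; `seq_zone` and
`zone_write` are illustrative names, with -1 standing in for the
drive's I/O error:

```c
#include <assert.h>

#define ZONE_LEN 16

struct seq_zone {
	int wp;  /* next writable block offset within the zone */
};

/* A sequential-required zone accepts a write only at its write
 * pointer; any other offset (including rewriting already-written
 * data) fails, mirroring the device's I/O error. */
static int zone_write(struct seq_zone *z, int offset)
{
	if (offset != z->wp || offset >= ZONE_LEN)
		return -1;  /* out of order or past the zone end */
	z->wp++;            /* an in-order write advances the pointer */
	return 0;
}
```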

> > This
> > patch series ensures that the sequential write constraint of sequential
> > zones is respected while fundamentally not changing BtrFS block and I/O
> > management for block stored in conventional zones.
> >
> > To achieve this, the default dev extent size of btrfs is changed on zoned
> > block devices so that dev extents are always aligned to a zone. Allocation
> > of blocks within a block group is changed so that the allocation is always
> > sequential from the beginning of the block groups. To do so, an allocation
> > pointer is added to block groups and used as the allocation hint. The
> > allocation changes also ensures that block freed below the allocation
> > pointer are ignored, resulting in sequential block allocation regardless of
> > the block group usage.
>
> This looks like it would cause a lot of holes for metadata block groups.
> It would be better to avoid metadata block allocation in such sequential
> zone.
> (And that would need the infrastructure to make extent allocator
> priority-aware)

Yes, it would introduce holes in metadata block groups. I agree it is
desirable to allocate metadata blocks from conventional
(non-sequential) zones.

However, it is sometimes impossible to allocate metadata blocks from
conventional zones, since on some zoned block devices such as SMR HDDs
the number of conventional zones is generally much smaller than the
number of sequential zones (to achieve higher volume capacity).

While this patch series ensures that metadata/data can be allocated
in, and work with, any type of zone, we will be able to improve
metadata allocation in the future by making the extent allocator
priority/zone-type aware.
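One possible shape for such a zone-type-aware allocator, sketched with
invented types (`CONV`/`SEQ`, `alloc_metadata`), would try block
groups backed by conventional zones first and fall back to sequential
ones only when no conventional space remains:

```c
#include <assert.h>

enum zone_type { CONV, SEQ };

struct block_group {
	enum zone_type type;
	int free_blocks;
};

/* Two passes: prefer conventional zones for metadata, but still
 * succeed from a sequential zone when conventional space runs out. */
static int alloc_metadata(struct block_group *bg, int n)
{
	int pass, i;

	for (pass = 0; pass < 2; pass++) {
		enum zone_type want = pass == 0 ? CONV : SEQ;

		for (i = 0; i < n; i++)
			if (bg[i].type == want && bg[i].free_blocks > 0) {
				bg[i].free_blocks--;
				return i;
			}
	}
	return -1;  /* no space anywhere */
}
```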

> > [...]
> > Naohiro Aota (17):
> > btrfs: introduce HMZONED feature flag
> > btrfs: Get zone information of zoned block devices
> > btrfs: Check and enable HMZONED mode
> > btrfs: limit super block locations in HMZONED mode
> > btrfs: disable fallocate in HMZONED mode
> > btrfs: disable direct IO in HMZONED mode
> > btrfs: disable device replace in HMZONED mode
> > btrfs: align extent allocation to zone boundary
>
> According to the patch name, I thought it's about extent allocation, but
> in fact it's about dev extent allocation.
> Renaming the patch would make more sense.
>
> > btrfs: do sequential allocation on HMZONED drives
>
> And this is the patch modifying extent allocator.

Thanks. I will fix the names of the patches in the next version.

> Despite that, the support zoned storage looks pretty interesting and
> have something in common with planned priority-aware extent allocator.
>
> Thanks,
> Qu

Regards,
Naohiro

2018-08-28 10:35:24

by Naohiro Aota

[permalink] [raw]
Subject: Re: [RFC PATCH 00/17] btrfs zoned block device support

Thank you for your review!

On Mon, Aug 13, 2018 at 08:42:52PM +0200, David Sterba wrote:
> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
>
> Yay, thanks!
>
> As this a RFC, I'll give you some. The code looks ok for what it claims
> to do, I'll skip style and unimportant implementation details for now as
> there are bigger questions.
>
> The zoned devices bring some constraints, so not all filesystem features
> can be expected to work; this rules out any form of in-place
> updates like NODATACOW.
>
> Then there's list of 'how will zoned device work with feature X'?

Here is the current HMZONED status list based on https://btrfs.wiki.kernel.org/index.php/Status

Performance
Trim | OK
Autodefrag | OK
Defrag | OK
fallocate | Disabled. cannot reserve region in sequential zones
direct IO | Disabled. falling back to buffered IO

Compression | OK

Reliability
Auto-repair | not working. need to rewrite the corrupted extent
Scrub | not working. need to rewrite the corrupted extent
Scrub + RAID56 | not working (RAID56)
nodatacow | should be disabled. (noticed it's not disabled now)
Device replace | disabled for now (need to handle write pointer issues, WIP patch)
Degraded mount | OK

Block group profile
Single | OK
DUP | OK
RAID0 | OK
RAID1 | OK
RAID10 | OK
RAID56 | Disabled for now. need to avoid partial parity write.
Mixed BG | OK

Administration | OK

Misc
Free space tree | Disabled. not necessary for sequential allocator
no-holes | OK
skinny-metadata | OK
extended-refs | OK

> You disable fallocate and DIO. I haven't looked closer at the fallocate
> case, but DIO could work in the sense that open() will open the file but
> any write will fallback to buffered writes. This is implemented so it
> would need to be wired together.

Actually, it already works like that. When check_direct_IO() returns
-EINVAL, btrfs_direct_IO() still returns 0. As a result, the callers
fall back to buffered IO.

I will reword the commit subject and log to reflect the actual
behavior. I will also relax the condition so that only direct write
IOs are disabled.
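A toy model of that fallback contract (the function names echo the
btrfs ones, but the bodies are invented for illustration): direct_IO
returning 0 means "nothing written directly", so the caller retries
through the page cache rather than reporting an error.

```c
#include <assert.h>
#include <errno.h>

static int check_direct_IO(int zoned)
{
	return zoned ? -EINVAL : 0;
}

/* Returning 0 (not an error) tells the caller that no bytes were
 * written directly and it should fall back to buffered I/O. */
static int direct_IO(int zoned, int nbytes)
{
	if (check_direct_IO(zoned))
		return 0;
	return nbytes;  /* pretend the direct write fully succeeded */
}

static int file_write(int zoned, int nbytes, int *used_buffered)
{
	int done = direct_IO(zoned, nbytes);

	*used_buffered = 0;
	if (done == 0) {
		*used_buffered = 1;  /* buffered path handles the data */
		done = nbytes;
	}
	return done;
}
```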

> Mixed device types are not allowed, and I tend to agree with that,
> though this could work in principle. Just that the chunk allocator
> would have to be aware of the device types and tweaked to allocate from
> the same group. The btrfs code is not ready for that in terms of the
> allocator capabilities and configuration options.

Yes, it will work if the allocator is improved to take the device
type, zone type, and zone size into account.

> Device replace is disabled, but the changelog suggests there's a way to
> make it work, so it's a matter of implementation. And this should be
> implemented at the time of merge.

I have a WIP patch to support device replace, but it currently fails
after replacing a device due to a write pointer mismatch. I'm
debugging the code, so the next version may enable the feature.

> RAID5/6 + zoned support is highly desired and lack of it could be
> considered a NAK for the whole series. The drive sizes are expected to
> be several terabytes, that sounds too risky to lack the redundancy
> options (RAID1 is not sufficient here).
>
> The changelog does not explain why this does not or cannot work, so I
> cannot reason about that or possibly suggest workarounds or solutions.
> But I think it should work in principle.
>
> As this is first post and RFC I don't expect that everything is
> implemented, but at least the known missing points should be documented.
> You've implemented lots of the low-level zoned support and extent
> allocation, so even if the raid56 might be difficult, it should be the
> smaller part.

I was leaving RAID56 for the future, since I'm not yet familiar with
the raid56 code, and its write path (raid56_parity_write) is separate
from the others' (submit_stripe_bio).

I quickly checked whether RAID5 works on the current HMZONED patch.
But even with a simple sequential workload using dd, it caused I/O
failures, because partial parity writes introduced overwrite IOs,
which violate the sequential write rule. At a quick glance at the
raid56 code, I'm currently not sure how we can avoid partial parity
writes while dispatching the necessary IOs on transaction commit.

Regards,
Naohiro