btrfs zoned block device support
This series adds zoned block device support to btrfs.
A zoned block device consists of a number of zones. Zones are either
conventional and accepting random writes or sequential and requiring that
writes be issued in LBA order from each zone write pointer position. This
patch series ensures that the sequential write constraint of sequential
zones is respected while fundamentally not changing BtrFS block and I/O
management for block stored in conventional zones.
To achieve this, the default chunk size of btrfs is changed on zoned block
devices so that chunks are always aligned to a zone. Allocation of blocks
within a chunk is changed so that the allocation is always sequential from
the beginning of the chunks. To do so, an allocation pointer is added to
block groups and used as the allocation hint. The allocation changes also
ensures that block freed below the allocation pointer are ignored,
resulting in sequential block allocation regardless of the chunk usage.
While the introduction of the allocation pointer ensure that blocks will be
allocated sequentially, I/Os to write out newly allocated blocks may be
issued out of order, causing errors when writing to sequential zones. This
problem s solved by introducing a submit_buffer() function and changes to
the internal I/O scheduler to ensure in-order issuing of write I/Os for
each chunk and corresponding to the block allocation order in the chunk.
The zone of a chunk is reset to allow reuse of the zone only when the block
group is being freed, that is, when all the chunks of the block group are
unused.
For btrfs volumes composed of multiple zoned disks, restrictions are added
to ensure that all disks have the same zone size. This matches the existing
constraint that all chunks in a block group must have the same size.
As discussed with Chris Mason in LSFMM, we enabled device replacing in
HMZONED mode. But still drop fallocate for now.
Patch 1 introduces the HMZONED incompatible feature flag to indicate that
the btrfs volume was formatted for use on zoned block devices.
Patches 2 and 3 implement functions to gather information on the zones of
the device (zones type and write pointer position).
Patches 4 and 5 disable features which are not compatible with the
sequential write constraints of zoned block devices. This includes
fallocate and direct I/O support.
Patches 6 and 7 tweak the extent buffer allocation for HMZONED mode to
implement sequential block allocation in block groups and chunks.
Patch 8 mark block group read only when write pointers of devices which
compose e.g. RAID1 block group devices are mismatch.
Patch 9 restrict the possible locations of super blocks to conventional
zones to preserve the existing update in-place mechanism for the super
blocks.
Patches 10 to 12 implement the new submit buffer I/O path to ensure
sequential write I/O delivery to the device zones.
Patches 13 to 17 modify several parts of btrfs to handle free blocks
without breaking the sequential block allocation and sequential write order
as well as zone reset for unused chunks.
Patch 18 add support for device replacing.
Finally, patch 19 adds the HMZONED feature to the list of supported
features.
This series applies on kdave/for-5.2-rc2.
Changelog
v2:
- Add support for dev-replace
-- To support dev-replace, moved submit_buffer one layer up. It now
handles bio instead of btrfs_bio.
-- Mark unmirrored Block Group readonly only when there is writable
mirrored BGs. Necessary to handle degraded RAID.
- Expire worker use vanilla delayed_work instead of btrfs's async-thread
- Device extent allocator now ensure that region is on the same zone type.
- Add delayed allocation shrinking.
- Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes
- Fix
-- Use SECTOR_SHIFT (Nikolay)
-- Use btrfs_err (Nikolay)
Naohiro Aota (19):
btrfs: introduce HMZONED feature flag
btrfs: Get zone information of zoned block devices
btrfs: Check and enable HMZONED mode
btrfs: disable fallocate in HMZONED mode
btrfs: disable direct IO in HMZONED mode
btrfs: align dev extent allocation to zone boundary
btrfs: do sequential extent allocation in HMZONED mode
btrfs: make unmirroed BGs readonly only if we have at least one
writable BG
btrfs: limit super block locations in HMZONED mode
btrfs: rename btrfs_map_bio()
btrfs: introduce submit buffer
btrfs: expire submit buffer on timeout
btrfs: avoid sync IO prioritization on checksum in HMZONED mode
btrfs: redirty released extent buffers in sequential BGs
btrfs: reset zones of unused block groups
btrfs: wait existing extents before truncating
btrfs: shrink delayed allocation size in HMZONED mode
btrfs: support dev-replace in HMZONED mode
btrfs: enable to mount HMZONED incompat flag
fs/btrfs/ctree.h | 47 ++-
fs/btrfs/dev-replace.c | 103 ++++++
fs/btrfs/disk-io.c | 49 ++-
fs/btrfs/disk-io.h | 1 +
fs/btrfs/extent-tree.c | 479 +++++++++++++++++++++++-
fs/btrfs/extent_io.c | 28 ++
fs/btrfs/extent_io.h | 2 +
fs/btrfs/file.c | 4 +
fs/btrfs/free-space-cache.c | 33 ++
fs/btrfs/free-space-cache.h | 5 +
fs/btrfs/inode.c | 14 +
fs/btrfs/scrub.c | 171 +++++++++
fs/btrfs/super.c | 30 +-
fs/btrfs/sysfs.c | 2 +
fs/btrfs/transaction.c | 35 ++
fs/btrfs/transaction.h | 3 +
fs/btrfs/volumes.c | 684 ++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 37 ++
include/trace/events/btrfs.h | 43 +++
include/uapi/linux/btrfs.h | 1 +
20 files changed, 1734 insertions(+), 37 deletions(-)
--
2.21.0
When in HMZONED mode, make sure that device super blocks are located in
randomly writable zones of zoned block devices. That is, do not write super
blocks in sequential write required zones of host-managed zoned block
devices as update would not be possible.
Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/disk-io.c | 11 +++++++++++
fs/btrfs/disk-io.h | 1 +
fs/btrfs/extent-tree.c | 4 ++++
fs/btrfs/scrub.c | 2 ++
4 files changed, 18 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7c1404c76768..ddbb02906042 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
return latest;
}
+int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
+{
+ /* any address is good on a regular (zone_size == 0) device */
+ /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
+ return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
+}
+
/*
* Write superblock @sb to the @device. Do not wait for completion, all the
* buffer heads we write are pinned.
@@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
if (bytenr + BTRFS_SUPER_INFO_SIZE >=
device->commit_total_bytes)
break;
+ if (!btrfs_check_super_location(device, bytenr))
+ continue;
btrfs_set_super_bytenr(sb, bytenr);
@@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
if (bytenr + BTRFS_SUPER_INFO_SIZE >=
device->commit_total_bytes)
break;
+ if (!btrfs_check_super_location(device, bytenr))
+ continue;
bh = __find_get_block(device->bdev,
bytenr / BTRFS_BDEV_BLOCKSIZE,
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index a0161aa1ea0b..70e97cd6fa76 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
struct page *page, size_t pg_offset, u64 start, u64 len,
int create);
int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
+int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
int __init btrfs_end_io_wq_init(void);
void __cold btrfs_end_io_wq_exit(void);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d41d840fe5c..ae2c895d08c4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
return ret;
}
+ /* we won't have super stripes in sequential zones */
+ if (cache->alloc_type == BTRFS_ALLOC_SEQ)
+ return 0;
+
for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
bytenr = btrfs_sb_offset(i);
ret = btrfs_rmap_block(fs_info, cache->key.objectid,
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f7b29f9db5e2..36ad4fad7eaf 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
if (bytenr + BTRFS_SUPER_INFO_SIZE >
scrub_dev->commit_total_bytes)
break;
+ if (!btrfs_check_super_location(scrub_dev, bytenr))
+ continue;
ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
--
2.21.0
In a write heavy workload, the following scenario can occur:
1. mark page #0 to page #2 (and their corresponding extent region) as dirty
and candidate for delayed allocation
pages 0 1 2 3 4
dirty o o o - -
towrite - - - - -
delayed o o o - -
alloc
2. extent_write_cache_pages() mark dirty pages as TOWRITE
pages 0 1 2 3 4
dirty o o o - -
towrite o o o - -
delayed o o o - -
alloc
3. Meanwhile, another write dirties page #3 and page #4
pages 0 1 2 3 4
dirty o o o o o
towrite o o o - -
delayed o o o o o
alloc
4. find_lock_delalloc_range() decide to allocate a region to write page #0
to page #4
5. but, extent_write_cache_pages() only initiate write to TOWRITE tagged
pages (#0 to #2)
So the above process leaves page #3 and page #4 behind. Usually, the
periodic dirty flush kicks write IOs for page #3 and #4. However, if we try
to mount a subvolume at this timing, mount process takes s_umount write
lock to block the periodic flush to come in.
To deal with the problem, shrink the delayed allocation region to have only
expected to be written pages.
Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c73c69e2bef4..ea582ff85c73 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
delalloc_start = delalloc_end + 1;
continue;
}
+
+ if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
+ (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
+ ((delalloc_start >> PAGE_SHIFT) <
+ (delalloc_end >> PAGE_SHIFT))) {
+ unsigned long i;
+ unsigned long end_index = delalloc_end >> PAGE_SHIFT;
+
+ for (i = delalloc_start >> PAGE_SHIFT;
+ i <= end_index; i++)
+ if (!xa_get_mark(&inode->i_mapping->i_pages, i,
+ PAGECACHE_TAG_TOWRITE))
+ break;
+
+ if (i <= end_index) {
+ u64 unlock_start = (u64)i << PAGE_SHIFT;
+
+ if (i == delalloc_start >> PAGE_SHIFT)
+ unlock_start += PAGE_SIZE;
+
+ unlock_extent(tree, unlock_start, delalloc_end);
+ __unlock_for_delalloc(inode, page, unlock_start,
+ delalloc_end);
+ delalloc_end = unlock_start - 1;
+ }
+ }
+
ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
delalloc_end, &page_started, nr_written, wbc);
/* File system has been set read-only */
--
2.21.0
HMZONED mode cannot be used together with the RAID5/6 profile for now.
Introduce the function btrfs_check_hmzoned_mode() to check this. This
function will also check if HMZONED flag is enabled on the file system and
if the file system consists of zoned devices with equal zone size.
Additionally, as updates to the space cache are in-place, the space cache
cannot be located over sequential zones and there is no guarantees that the
device will have enough conventional zones to store this cache. Resolve
this problem by disabling completely the space cache. This does not
introduces any problems with sequential block groups: all the free space is
located after the allocation pointer and no free space before the pointer.
There is no need to have such cache.
Signed-off-by: Damien Le Moal <[email protected]>
Signed-off-by: Naohiro Aota <[email protected]>
---
fs/btrfs/ctree.h | 3 ++
fs/btrfs/dev-replace.c | 7 +++
fs/btrfs/disk-io.c | 7 +++
fs/btrfs/super.c | 12 ++---
fs/btrfs/volumes.c | 99 ++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/volumes.h | 1 +
6 files changed, 124 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b81c331b28fa..6c00101407e4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -806,6 +806,9 @@ struct btrfs_fs_info {
struct btrfs_root *uuid_root;
struct btrfs_root *free_space_root;
+ /* Zone size when in HMZONED mode */
+ u64 zone_size;
+
/* the log root tree is a directory of all the other log roots */
struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index ee0989c7e3a9..fbe5ea2a04ed 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -201,6 +201,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
return PTR_ERR(bdev);
}
+ if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+ !btrfs_fs_incompat(fs_info, HMZONED)) ||
+ (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
+ ret = -EINVAL;
+ goto error;
+ }
+
filemap_write_and_wait(bdev->bd_inode->i_mapping);
devices = &fs_info->fs_devices->devices;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 663efce22d98..7c1404c76768 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3086,6 +3086,13 @@ int open_ctree(struct super_block *sb,
btrfs_free_extra_devids(fs_devices, 1);
+ ret = btrfs_check_hmzoned_mode(fs_info);
+ if (ret) {
+ btrfs_err(fs_info, "failed to init hmzoned mode: %d",
+ ret);
+ goto fail_block_groups;
+ }
+
ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
if (ret) {
btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2c66d9ea6a3b..740a701f16c5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
bool saved_compress_force;
int no_compress = 0;
- cache_gen = btrfs_super_cache_generation(info->super_copy);
- if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
- btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
- else if (cache_gen)
- btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+ if (!btrfs_fs_incompat(info, HMZONED)) {
+ cache_gen = btrfs_super_cache_generation(info->super_copy);
+ if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
+ btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
+ else if (cache_gen)
+ btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+ }
/*
* Even the options are empty, we still need to do extra check
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b673178718e3..b6f367d19dc9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1524,6 +1524,83 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
return ret;
}
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+ struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+ struct btrfs_device *device;
+ u64 hmzoned_devices = 0;
+ u64 nr_devices = 0;
+ u64 zone_size = 0;
+ int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+ int ret = 0;
+
+ /* Count zoned devices */
+ list_for_each_entry(device, &fs_devices->devices, dev_list) {
+ if (!device->bdev)
+ continue;
+ if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
+ (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
+ incompat_hmzoned)) {
+ hmzoned_devices++;
+ if (!zone_size) {
+ zone_size = device->zone_size;
+ } else if (device->zone_size != zone_size) {
+ btrfs_err(fs_info,
+ "Zoned block devices must have equal zone sizes");
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+ nr_devices++;
+ }
+
+ if (!hmzoned_devices && incompat_hmzoned) {
+ /* No zoned block device, disable HMZONED */
+ btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (!hmzoned_devices && !incompat_hmzoned)
+ goto out;
+
+ fs_info->zone_size = zone_size;
+
+ if (hmzoned_devices != nr_devices) {
+ btrfs_err(fs_info,
+ "zoned devices mixed with regular devices");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* RAID56 is not allowed */
+ if (btrfs_fs_incompat(fs_info, RAID56)) {
+ btrfs_err(fs_info, "HMZONED mode does not support RAID56");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * SPACE CACHE writing is not cowed. Disable that to avoid
+ * write errors in sequential zones.
+ */
+ if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
+ btrfs_info(fs_info,
+ "disabling disk space caching with HMZONED mode");
+ btrfs_clear_opt(fs_info->mount_opt, SPACE_CACHE);
+ }
+
+ btrfs_set_and_info(fs_info, NOTREELOG,
+ "disabling tree log with HMZONED mode");
+
+ btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
+ fs_info->zone_size);
+
+out:
+
+ return ret;
+}
+
static void btrfs_release_disk_super(struct page *page)
{
kunmap(page);
@@ -2695,6 +2772,13 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
if (IS_ERR(bdev))
return PTR_ERR(bdev);
+ if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+ !btrfs_fs_incompat(fs_info, HMZONED)) ||
+ (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
+ ret = -EINVAL;
+ goto error;
+ }
+
if (fs_devices->seeding) {
seeding_dev = 1;
down_write(&sb->s_umount);
@@ -2816,6 +2900,21 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
}
}
+ /* Get zone type information of zoned block devices */
+ if (bdev_is_zoned(bdev)) {
+ ret = btrfs_get_dev_zonetypes(device);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto error_sysfs;
+ }
+ }
+
+ ret = btrfs_check_hmzoned_mode(fs_info);
+ if (ret) {
+ btrfs_abort_transaction(trans, ret);
+ goto error_sysfs;
+ }
+
ret = btrfs_add_dev_item(trans, device);
if (ret) {
btrfs_abort_transaction(trans, ret);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1599641e216c..f66755e43669 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -432,6 +432,7 @@ struct btrfs_device *btrfs_scan_one_device(const char *path,
fmode_t flags, void *holder);
int btrfs_forget_devices(const char *path);
int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
void btrfs_assign_next_active_device(struct btrfs_device *device,
struct btrfs_device *this_dev);
--
2.21.0
On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
> btrfs zoned block device support
>
> This series adds zoned block device support to btrfs.
The overall design sounds ok.
I skimmed through the patches and the biggest task I see is how to make
the hmzoned adjustments and branches less visible, ie. there are too
many if (hmzoned) { do something } standing out. But that's merely a
matter of wrappers and maybe an abstraction here and there.
How can I test the zoned devices backed by files (or regular disks)? I
searched for some concrete example eg. for qemu or dm-zoned, but closest
match was a text description in libzbc README that it's possible to
implement. All other howtos expect a real zoned device.
Merge target is 5.3 or later, we'll see how things will go. I'm
expecting that we might need some time to get feedback about the
usability as there's no previous work widely used that we can build on
top of.
On Fri, Jun 07, 2019 at 10:10:23PM +0900, Naohiro Aota wrote:
> In a write heavy workload, the following scenario can occur:
>
> 1. mark page #0 to page #2 (and their corresponding extent region) as dirty
> and candidate for delayed allocation
>
> pages 0 1 2 3 4
> dirty o o o - -
> towrite - - - - -
> delayed o o o - -
> alloc
>
> 2. extent_write_cache_pages() mark dirty pages as TOWRITE
>
> pages 0 1 2 3 4
> dirty o o o - -
> towrite o o o - -
> delayed o o o - -
> alloc
>
> 3. Meanwhile, another write dirties page #3 and page #4
>
> pages 0 1 2 3 4
> dirty o o o o o
> towrite o o o - -
> delayed o o o o o
> alloc
>
> 4. find_lock_delalloc_range() decide to allocate a region to write page #0
> to page #4
> 5. but, extent_write_cache_pages() only initiate write to TOWRITE tagged
> pages (#0 to #2)
>
> So the above process leaves page #3 and page #4 behind. Usually, the
> periodic dirty flush kicks write IOs for page #3 and #4. However, if we try
> to mount a subvolume at this timing, mount process takes s_umount write
> lock to block the periodic flush to come in.
>
> To deal with the problem, shrink the delayed allocation region to have only
> expected to be written pages.
>
> Signed-off-by: Naohiro Aota <[email protected]>
> ---
> fs/btrfs/extent_io.c | 27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c73c69e2bef4..ea582ff85c73 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3310,6 +3310,33 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
> delalloc_start = delalloc_end + 1;
> continue;
> }
> +
> + if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED) &&
> + (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages) &&
> + ((delalloc_start >> PAGE_SHIFT) <
> + (delalloc_end >> PAGE_SHIFT))) {
> + unsigned long i;
> + unsigned long end_index = delalloc_end >> PAGE_SHIFT;
> +
> + for (i = delalloc_start >> PAGE_SHIFT;
> + i <= end_index; i++)
> + if (!xa_get_mark(&inode->i_mapping->i_pages, i,
> + PAGECACHE_TAG_TOWRITE))
> + break;
> +
> + if (i <= end_index) {
> + u64 unlock_start = (u64)i << PAGE_SHIFT;
> +
> + if (i == delalloc_start >> PAGE_SHIFT)
> + unlock_start += PAGE_SIZE;
> +
> + unlock_extent(tree, unlock_start, delalloc_end);
> + __unlock_for_delalloc(inode, page, unlock_start,
> + delalloc_end);
> + delalloc_end = unlock_start - 1;
> + }
> + }
> +
Helper please. Really for all this hmzoned stuff I want it segregated as much
as possible so when I'm debugging or cleaning other stuff up I want to easily be
able to say "oh this is for zoned devices, it doesn't matter." Thanks,
Josef
On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
> When in HMZONED mode, make sure that device super blocks are located in
> randomly writable zones of zoned block devices. That is, do not write super
> blocks in sequential write required zones of host-managed zoned block
> devices as update would not be possible.
>
> Signed-off-by: Damien Le Moal <[email protected]>
> Signed-off-by: Naohiro Aota <[email protected]>
> ---
> fs/btrfs/disk-io.c | 11 +++++++++++
> fs/btrfs/disk-io.h | 1 +
> fs/btrfs/extent-tree.c | 4 ++++
> fs/btrfs/scrub.c | 2 ++
> 4 files changed, 18 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7c1404c76768..ddbb02906042 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
> return latest;
> }
>
> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
> +{
> + /* any address is good on a regular (zone_size == 0) device */
> + /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
This is not how you do multi-line comments in the kernel. Thanks,
Josef
On Fri, Jun 07, 2019 at 10:10:09PM +0900, Naohiro Aota wrote:
> HMZONED mode cannot be used together with the RAID5/6 profile for now.
> Introduce the function btrfs_check_hmzoned_mode() to check this. This
> function will also check if HMZONED flag is enabled on the file system and
> if the file system consists of zoned devices with equal zone size.
>
> Additionally, as updates to the space cache are in-place, the space cache
> cannot be located over sequential zones and there is no guarantees that the
> device will have enough conventional zones to store this cache. Resolve
> this problem by disabling completely the space cache. This does not
> introduces any problems with sequential block groups: all the free space is
> located after the allocation pointer and no free space before the pointer.
> There is no need to have such cache.
>
> Signed-off-by: Damien Le Moal <[email protected]>
> Signed-off-by: Naohiro Aota <[email protected]>
> ---
> fs/btrfs/ctree.h | 3 ++
> fs/btrfs/dev-replace.c | 7 +++
> fs/btrfs/disk-io.c | 7 +++
> fs/btrfs/super.c | 12 ++---
> fs/btrfs/volumes.c | 99 ++++++++++++++++++++++++++++++++++++++++++
> fs/btrfs/volumes.h | 1 +
> 6 files changed, 124 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b81c331b28fa..6c00101407e4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -806,6 +806,9 @@ struct btrfs_fs_info {
> struct btrfs_root *uuid_root;
> struct btrfs_root *free_space_root;
>
> + /* Zone size when in HMZONED mode */
> + u64 zone_size;
> +
> /* the log root tree is a directory of all the other log roots */
> struct btrfs_root *log_root_tree;
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index ee0989c7e3a9..fbe5ea2a04ed 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -201,6 +201,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
> return PTR_ERR(bdev);
> }
>
> + if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
> + !btrfs_fs_incompat(fs_info, HMZONED)) ||
> + (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
You do this in a few places, turn this into a helper please.
> + ret = -EINVAL;
> + goto error;
> + }
> +
> filemap_write_and_wait(bdev->bd_inode->i_mapping);
>
> devices = &fs_info->fs_devices->devices;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 663efce22d98..7c1404c76768 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3086,6 +3086,13 @@ int open_ctree(struct super_block *sb,
>
> btrfs_free_extra_devids(fs_devices, 1);
>
> + ret = btrfs_check_hmzoned_mode(fs_info);
> + if (ret) {
> + btrfs_err(fs_info, "failed to init hmzoned mode: %d",
> + ret);
> + goto fail_block_groups;
> + }
> +
> ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
> if (ret) {
> btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 2c66d9ea6a3b..740a701f16c5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> bool saved_compress_force;
> int no_compress = 0;
>
> - cache_gen = btrfs_super_cache_generation(info->super_copy);
> - if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> - btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> - else if (cache_gen)
> - btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> + if (!btrfs_fs_incompat(info, HMZONED)) {
> + cache_gen = btrfs_super_cache_generation(info->super_copy);
> + if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
> + btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
> + else if (cache_gen)
> + btrfs_set_opt(info->mount_opt, SPACE_CACHE);
> + }
>
This disables the free space tree as well as the cache, sounds like you only
need to disable the free space cache? Thanks,
Josef
On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
> On 2019/06/13 2:50, David Sterba wrote:
> > On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
> >> btrfs zoned block device support
> >>
> >> This series adds zoned block device support to btrfs.
> >
> > The overall design sounds ok.
> >
> > I skimmed through the patches and the biggest task I see is how to make
> > the hmzoned adjustments and branches less visible, ie. there are too
> > many if (hmzoned) { do something } standing out. But that's merely a
> > matter of wrappers and maybe an abstraction here and there.
>
> Sure. I'll add some more abstractions in the next version.
Ok, I'll reply to the patches with specific things.
> > How can I test the zoned devices backed by files (or regular disks)? I
> > searched for some concrete example eg. for qemu or dm-zoned, but closest
> > match was a text description in libzbc README that it's possible to
> > implement. All other howtos expect a real zoned device.
>
> You can use tcmu-runer [1] to create an emulated zoned device backed by
> a regular file. Here is a setup how-to:
> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation
That looks great, thanks. I wonder why there's no way to find that, all
I got were dead links to linux-iscsi.org or tutorials of targetcli that
were years old and not working.
Feeding the textual commands to targetcli is not exactly what I'd
expect for scripting, but at least it seems to work.
I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
device) but lsscsi does not recognize that it as a zonde device (just a
QEMU harddisk). So this seems the emulation must be done inside the VM.
On 2019/06/13 2:50, David Sterba wrote:
> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>> btrfs zoned block device support
>>
>> This series adds zoned block device support to btrfs.
>
> The overall design sounds ok.
>
> I skimmed through the patches and the biggest task I see is how to make
> the hmzoned adjustments and branches less visible, ie. there are too
> many if (hmzoned) { do something } standing out. But that's merely a
> matter of wrappers and maybe an abstraction here and there.
Sure. I'll add some more abstractions in the next version.
> How can I test the zoned devices backed by files (or regular disks)? I
> searched for some concrete example eg. for qemu or dm-zoned, but closest
> match was a text description in libzbc README that it's possible to
> implement. All other howtos expect a real zoned device.
You can use tcmu-runer [1] to create an emulated zoned device backed by
a regular file. Here is a setup how-to:
http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation
[1] https://github.com/open-iscsi/tcmu-runner
> Merge target is 5.3 or later, we'll see how things will go. I'm
> expecting that we might need some time to get feedback about the
> usability as there's no previous work widely used that we can build on
> top of.
>
On 2019/06/13 22:45, David Sterba wrote:> On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
>> On 2019/06/13 2:50, David Sterba wrote:
>>> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>>> How can I test the zoned devices backed by files (or regular disks)? I
>>> searched for some concrete example eg. for qemu or dm-zoned, but closest
>>> match was a text description in libzbc README that it's possible to
>>> implement. All other howtos expect a real zoned device.
>>
>> You can use tcmu-runer [1] to create an emulated zoned device backed by
>> a regular file. Here is a setup how-to:
>> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation>> That looks great, thanks. I wonder why there's no way to find that, all
> I got were dead links to linux-iscsi.org or tutorials of targetcli that
> were years old and not working.
Actually, this is quite new site. ;-)
> Feeding the textual commands to targetcli is not exactly what I'd
> expect for scripting, but at least it seems to work.
You can use "targetcli <directory> <command> [<args> ...]" format, so
you can call e.g.
targetcli /backstores/user:zbc create name=foo size=10G cfgstring=model-HM/zsize-256/conv-1@/mnt/nvme/disk0.raw
> I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
> device) but lsscsi does not recognize that it as a zonde device (just a
> QEMU harddisk). So this seems the emulation must be done inside the VM.
Oops, QEMU hide the detail.
In this case, you can try exposing the ZBC device via iSCSI.
On the host:
(after creating the ZBC backstores)
# sudo targetcli /iscsi create
Created target iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
# TARGET="iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c"
(WARN: Allow any node to connect without any auth)
# targetcli /iscsi/${TARGET}/tpg1 set attribute generate_node_acls=1
Parameter generate_node_acls is now '1'.
( or you can explicitly allow an initiator)
# TCMU_INITIATOR=iqn.2018-07....
# targecli /iscsi/${TARGET}/tpg1/acls create ${TCMU_INITIATOR}
(for each backend)
# targetcli /iscsi/${TARGET}/tpg1/luns create /backstores/user:zbc/foo
Created LUN 0.
Then, you can login to the iSCSI on the KVM guest like:
# iscsiadm -m discovery -t st -p $HOST_IP
127.0.0.1:3260,1 iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c
# iscsiadm -m node -l -T ${TARGET}
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c, portal: 127.0.0.1,3260]
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c, portal: 127.0.0.1,3260] successful.
David,
On 2019/06/13 22:45, David Sterba wrote:
> On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
>> On 2019/06/13 2:50, David Sterba wrote:
>>> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>>>> btrfs zoned block device support
>>>>
>>>> This series adds zoned block device support to btrfs.
>>>
>>> The overall design sounds ok.
>>>
>>> I skimmed through the patches and the biggest task I see is how to make
>>> the hmzoned adjustments and branches less visible, ie. there are too
>>> many if (hmzoned) { do something } standing out. But that's merely a
>>> matter of wrappers and maybe an abstraction here and there.
>>
>> Sure. I'll add some more abstractions in the next version.
>
> Ok, I'll reply to the patches with specific things.
>
>>> How can I test the zoned devices backed by files (or regular disks)? I
>>> searched for some concrete example eg. for qemu or dm-zoned, but closest
>>> match was a text description in libzbc README that it's possible to
>>> implement. All other howtos expect a real zoned device.
>>
>> You can use tcmu-runer [1] to create an emulated zoned device backed by
>> a regular file. Here is a setup how-to:
>> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation
>
> That looks great, thanks. I wonder why there's no way to find that, all
> I got were dead links to linux-iscsi.org or tutorials of targetcli that
> were years old and not working.
The site went online 4 days ago :) We will advertise it whenever we can. This is
intended to document all things "zoned block device" including Btrfs support,
when we get it finished :)
>
> Feeding the textual commands to targetcli is not exactly what I'd
> expect for scripting, but at least it seems to work.
Yes, this is not exactly obvious, but that is how most automation with linux
iscsi is done.
>
> I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
> device) but lsscsi does not recognize that it as a zonde device (just a
> QEMU harddisk). So this seems the emulation must be done inside the VM.
>
What driver did you use for the drive ? virtio block ? I have not touch that
driver nor qemu side, so zoned block dev support is likely missing. I will add
it. That would be especially useful for testing with a real drive. In the case
of tcmu runner, the initiator can be started in the guest directly and the
target emulation done either in the guest if loopback is used, or on the host
using iscsi connection. The former is what we use all the time and so is well
tested. I have to admit that testing with iscsi is lacking... Will add that to
the todo list.
Best regards.
--
Damien Le Moal
Western Digital Research
On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
> When in HMZONED mode, make sure that device super blocks are located in
> randomly writable zones of zoned block devices. That is, do not write super
> blocks in sequential write required zones of host-managed zoned block
> devices as update would not be possible.
This could be explained in more detail. My understanding is that the 1st
and 2nd copy superblocks is skipped at write time but the zone
containing the superblocks is not excluded from allocations. Ie. regular
data can appear in place where the superblocks would exist on
non-hmzoned filesystem. Is that correct?
The other option is to completely exclude the zone that contains the
superblock copies.
primary sb 64K
1st copy 64M
2nd copy 256G
Depends on the drives, but I think the size of the random write zone
will very often cover primary and 1st copy. So there's at least some
backup copy.
The 2nd copy will be in the sequential-only zone, so the whole zone
needs to be excluded in exclude_super_stripes. But it's not, so this
means data can go there. I think the zone should be left empty.
On 2019/06/13 22:57, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:09PM +0900, Naohiro Aota wrote:
>> HMZONED mode cannot be used together with the RAID5/6 profile for now.
>> Introduce the function btrfs_check_hmzoned_mode() to check this. This
>> function will also check if HMZONED flag is enabled on the file system and
>> if the file system consists of zoned devices with equal zone size.
>>
>> Additionally, as updates to the space cache are in-place, the space cache
>> cannot be located over sequential zones and there is no guarantees that the
>> device will have enough conventional zones to store this cache. Resolve
>> this problem by disabling completely the space cache. This does not
>> introduces any problems with sequential block groups: all the free space is
>> located after the allocation pointer and no free space before the pointer.
>> There is no need to have such cache.
>>
>> Signed-off-by: Damien Le Moal <[email protected]>
>> Signed-off-by: Naohiro Aota <[email protected]>
>> ---
>> fs/btrfs/ctree.h | 3 ++
>> fs/btrfs/dev-replace.c | 7 +++
>> fs/btrfs/disk-io.c | 7 +++
>> fs/btrfs/super.c | 12 ++---
>> fs/btrfs/volumes.c | 99 ++++++++++++++++++++++++++++++++++++++++++
>> fs/btrfs/volumes.h | 1 +
>> 6 files changed, 124 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index b81c331b28fa..6c00101407e4 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -806,6 +806,9 @@ struct btrfs_fs_info {
>> struct btrfs_root *uuid_root;
>> struct btrfs_root *free_space_root;
>>
>> + /* Zone size when in HMZONED mode */
>> + u64 zone_size;
>> +
>> /* the log root tree is a directory of all the other log roots */
>> struct btrfs_root *log_root_tree;
>>
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index ee0989c7e3a9..fbe5ea2a04ed 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -201,6 +201,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
>> return PTR_ERR(bdev);
>> }
>>
>> + if ((bdev_zoned_model(bdev) == BLK_ZONED_HM &&
>> + !btrfs_fs_incompat(fs_info, HMZONED)) ||
>> + (!bdev_is_zoned(bdev) && btrfs_fs_incompat(fs_info, HMZONED))) {
>
> You do this in a few places, turn this into a helper please.
>
>> + ret = -EINVAL;
>> + goto error;
>> + }
>> +
>> filemap_write_and_wait(bdev->bd_inode->i_mapping);
>>
>> devices = &fs_info->fs_devices->devices;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 663efce22d98..7c1404c76768 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3086,6 +3086,13 @@ int open_ctree(struct super_block *sb,
>>
>> btrfs_free_extra_devids(fs_devices, 1);
>>
>> + ret = btrfs_check_hmzoned_mode(fs_info);
>> + if (ret) {
>> + btrfs_err(fs_info, "failed to init hmzoned mode: %d",
>> + ret);
>> + goto fail_block_groups;
>> + }
>> +
>> ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
>> if (ret) {
>> btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index 2c66d9ea6a3b..740a701f16c5 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -435,11 +435,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
>> bool saved_compress_force;
>> int no_compress = 0;
>>
>> - cache_gen = btrfs_super_cache_generation(info->super_copy);
>> - if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>> - btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
>> - else if (cache_gen)
>> - btrfs_set_opt(info->mount_opt, SPACE_CACHE);
>> + if (!btrfs_fs_incompat(info, HMZONED)) {
>> + cache_gen = btrfs_super_cache_generation(info->super_copy);
>> + if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
>> + btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
>> + else if (cache_gen)
>> + btrfs_set_opt(info->mount_opt, SPACE_CACHE);
>> + }
>>
>
> This disables the free space tree as well as the cache, sounds like you only
> need to disable the free space cache? Thanks,
Right. We can still use the free space tree on HMZONED. I'll fix in the next version.
Thanks
On 2019/06/13 23:13, Josef Bacik wrote:
> On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
>> When in HMZONED mode, make sure that device super blocks are located in
>> randomly writable zones of zoned block devices. That is, do not write super
>> blocks in sequential write required zones of host-managed zoned block
>> devices as update would not be possible.
>>
>> Signed-off-by: Damien Le Moal <[email protected]>
>> Signed-off-by: Naohiro Aota <[email protected]>
>> ---
>> fs/btrfs/disk-io.c | 11 +++++++++++
>> fs/btrfs/disk-io.h | 1 +
>> fs/btrfs/extent-tree.c | 4 ++++
>> fs/btrfs/scrub.c | 2 ++
>> 4 files changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7c1404c76768..ddbb02906042 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>> return latest;
>> }
>>
>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
>> +{
>> + /* any address is good on a regular (zone_size == 0) device */
>> + /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
>
> This is not how you do multi-line comments in the kernel. Thanks,
>
> Josef
>
Thanks. I'll fix the style.
# I thought the checkpatch was catching this ...
On 2019/06/18 7:53, David Sterba wrote:
> On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
>> When in HMZONED mode, make sure that device super blocks are located in
>> randomly writable zones of zoned block devices. That is, do not write super
>> blocks in sequential write required zones of host-managed zoned block
>> devices as update would not be possible.
>
> This could be explained in more detail. My understanding is that the 1st
> and 2nd copy superblocks is skipped at write time but the zone
> containing the superblocks is not excluded from allocations. Ie. regular
> data can appear in place where the superblocks would exist on
> non-hmzoned filesystem. Is that correct?
Correct. You can see regular data stored at usually SB location on HMZONED fs.
> The other option is to completely exclude the zone that contains the
> superblock copies.
>
> primary sb 64K
> 1st copy 64M
> 2nd copy 256G
>
> Depends on the drives, but I think the size of the random write zone
> will very often cover primary and 1st copy. So there's at least some
> backup copy.
>
> The 2nd copy will be in the sequential-only zone, so the whole zone
> needs to be excluded in exclude_super_stripes. But it's not, so this
> means data can go there. I think the zone should be left empty.
>
I see. That's more safe for the older kernel/userland, right? By keeping that zone empty,
we can avoid old ones to mis-interpret data to be SB.
Alright, I will change the code to do so.
On Tue, Jun 18, 2019 at 09:01:35AM +0000, Naohiro Aota wrote:
> On 2019/06/18 7:53, David Sterba wrote:
> > On Fri, Jun 07, 2019 at 10:10:15PM +0900, Naohiro Aota wrote:
> >> When in HMZONED mode, make sure that device super blocks are located in
> >> randomly writable zones of zoned block devices. That is, do not write super
> >> blocks in sequential write required zones of host-managed zoned block
> >> devices as update would not be possible.
> >
> > This could be explained in more detail. My understanding is that the 1st
> > and 2nd copy superblocks is skipped at write time but the zone
> > containing the superblocks is not excluded from allocations. Ie. regular
> > data can appear in place where the superblocks would exist on
> > non-hmzoned filesystem. Is that correct?
>
> Correct. You can see regular data stored at usually SB location on HMZONED fs.
>
> > The other option is to completely exclude the zone that contains the
> > superblock copies.
> >
> > primary sb 64K
> > 1st copy 64M
> > 2nd copy 256G
> >
> > Depends on the drives, but I think the size of the random write zone
> > will very often cover primary and 1st copy. So there's at least some
> > backup copy.
> >
> > The 2nd copy will be in the sequential-only zone, so the whole zone
> > needs to be excluded in exclude_super_stripes. But it's not, so this
> > means data can go there. I think the zone should be left empty.
> >
>
> I see. That's more safe for the older kernel/userland, right? By keeping that zone empty,
> we can avoid old ones to mis-interpret data to be SB.
That's not only for older kernels, the superblock locations are known
and the contents should not depend on the type of device on which it was
created. This can be considered part of the on-disk format.
On 7/6/19 9:10 PM, Naohiro Aota wrote:
> When in HMZONED mode, make sure that device super blocks are located in
> randomly writable zones of zoned block devices. That is, do not write super
> blocks in sequential write required zones of host-managed zoned block
> devices as update would not be possible.
By design all copies of SB must be updated at each transaction,
as they are redundant copies they must match at the end of
each transaction.
Instead of skipping the sb updates, why not alter number of
copies at the time of mkfs.btrfs?
Thanks, Anand
> Signed-off-by: Damien Le Moal <[email protected]>
> Signed-off-by: Naohiro Aota <[email protected]>
> ---
> fs/btrfs/disk-io.c | 11 +++++++++++
> fs/btrfs/disk-io.h | 1 +
> fs/btrfs/extent-tree.c | 4 ++++
> fs/btrfs/scrub.c | 2 ++
> 4 files changed, 18 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 7c1404c76768..ddbb02906042 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
> return latest;
> }
>
> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
> +{
> + /* any address is good on a regular (zone_size == 0) device */
> + /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
> + return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
> +}
> +
> /*
> * Write superblock @sb to the @device. Do not wait for completion, all the
> * buffer heads we write are pinned.
> @@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
> if (bytenr + BTRFS_SUPER_INFO_SIZE >=
> device->commit_total_bytes)
> break;
> + if (!btrfs_check_super_location(device, bytenr))
> + continue;
>
> btrfs_set_super_bytenr(sb, bytenr);
>
> @@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
> if (bytenr + BTRFS_SUPER_INFO_SIZE >=
> device->commit_total_bytes)
> break;
> + if (!btrfs_check_super_location(device, bytenr))
> + continue;
>
> bh = __find_get_block(device->bdev,
> bytenr / BTRFS_BDEV_BLOCKSIZE,
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index a0161aa1ea0b..70e97cd6fa76 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
> struct page *page, size_t pg_offset, u64 start, u64 len,
> int create);
> int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
> int __init btrfs_end_io_wq_init(void);
> void __cold btrfs_end_io_wq_exit(void);
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3d41d840fe5c..ae2c895d08c4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
> return ret;
> }
>
> + /* we won't have super stripes in sequential zones */
> + if (cache->alloc_type == BTRFS_ALLOC_SEQ)
> + return 0;
> +
> for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
> bytenr = btrfs_sb_offset(i);
> ret = btrfs_rmap_block(fs_info, cache->key.objectid,
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f7b29f9db5e2..36ad4fad7eaf 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
> if (bytenr + BTRFS_SUPER_INFO_SIZE >
> scrub_dev->commit_total_bytes)
> break;
> + if (!btrfs_check_super_location(scrub_dev, bytenr))
> + continue;
>
> ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
> scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
>
On 2019/06/28 12:56, Anand Jain wrote:
> On 7/6/19 9:10 PM, Naohiro Aota wrote:
>> When in HMZONED mode, make sure that device super blocks are located in
>> randomly writable zones of zoned block devices. That is, do not write super
>> blocks in sequential write required zones of host-managed zoned block
>> devices as update would not be possible.
>
> By design all copies of SB must be updated at each transaction,
> as they are redundant copies they must match at the end of
> each transaction.
>
> Instead of skipping the sb updates, why not alter number of
> copies at the time of mkfs.btrfs?
>
> Thanks, Anand
That is exactly what the patched code does. It updates all the SB
copies, but it just avoids writing a copy to sequential writing
required zones. Mkfs.btrfs do the same. So, all the available SB
copies always match after a transaction. At the SB location in a
sequential write required zone, you will see zeroed region (in the
next version of the patch series), but that is easy to ignore: it
lacks even BTRFS_MAGIC.
The number of SB copy available on HMZONED device will vary
by its zone size and its zone layout.
Thanks,
>
>> Signed-off-by: Damien Le Moal <[email protected]>
>> Signed-off-by: Naohiro Aota <[email protected]>
>> ---
>> fs/btrfs/disk-io.c | 11 +++++++++++
>> fs/btrfs/disk-io.h | 1 +
>> fs/btrfs/extent-tree.c | 4 ++++
>> fs/btrfs/scrub.c | 2 ++
>> 4 files changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 7c1404c76768..ddbb02906042 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>> return latest;
>> }
>>
>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
>> +{
>> + /* any address is good on a regular (zone_size == 0) device */
>> + /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
>> + return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
>> +}
>> +
>> /*
>> * Write superblock @sb to the @device. Do not wait for completion, all the
>> * buffer heads we write are pinned.
>> @@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
>> if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>> device->commit_total_bytes)
>> break;
>> + if (!btrfs_check_super_location(device, bytenr))
>> + continue;
>>
>> btrfs_set_super_bytenr(sb, bytenr);
>>
>> @@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
>> if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>> device->commit_total_bytes)
>> break;
>> + if (!btrfs_check_super_location(device, bytenr))
>> + continue;
>>
>> bh = __find_get_block(device->bdev,
>> bytenr / BTRFS_BDEV_BLOCKSIZE,
>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>> index a0161aa1ea0b..70e97cd6fa76 100644
>> --- a/fs/btrfs/disk-io.h
>> +++ b/fs/btrfs/disk-io.h
>> @@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
>> struct page *page, size_t pg_offset, u64 start, u64 len,
>> int create);
>> int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
>> int __init btrfs_end_io_wq_init(void);
>> void __cold btrfs_end_io_wq_exit(void);
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 3d41d840fe5c..ae2c895d08c4 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
>> return ret;
>> }
>>
>> + /* we won't have super stripes in sequential zones */
>> + if (cache->alloc_type == BTRFS_ALLOC_SEQ)
>> + return 0;
>> +
>> for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>> bytenr = btrfs_sb_offset(i);
>> ret = btrfs_rmap_block(fs_info, cache->key.objectid,
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index f7b29f9db5e2..36ad4fad7eaf 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
>> if (bytenr + BTRFS_SUPER_INFO_SIZE >
>> scrub_dev->commit_total_bytes)
>> break;
>> + if (!btrfs_check_super_location(scrub_dev, bytenr))
>> + continue;
>>
>> ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
>> scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
>>
>
>
> On 28 Jun 2019, at 2:39 PM, Naohiro Aota <[email protected]> wrote:
>
> On 2019/06/28 12:56, Anand Jain wrote:
>> On 7/6/19 9:10 PM, Naohiro Aota wrote:
>>> When in HMZONED mode, make sure that device super blocks are located in
>>> randomly writable zones of zoned block devices. That is, do not write super
>>> blocks in sequential write required zones of host-managed zoned block
>>> devices as update would not be possible.
>>
>> By design all copies of SB must be updated at each transaction,
>> as they are redundant copies they must match at the end of
>> each transaction.
>>
>> Instead of skipping the sb updates, why not alter number of
>> copies at the time of mkfs.btrfs?
>>
>> Thanks, Anand
>
> That is exactly what the patched code does. It updates all the SB
> copies, but it just avoids writing a copy to sequential writing
> required zones. Mkfs.btrfs do the same. So, all the available SB
> copies always match after a transaction. At the SB location in a
> sequential write required zone, you will see zeroed region (in the
> next version of the patch series), but that is easy to ignore: it
> lacks even BTRFS_MAGIC.
>
Right, I saw the related Btrfs-progs patches at a later time,
there are piles of emails after a vacation.;-)
> The number of SB copy available on HMZONED device will vary
> by its zone size and its zone layout.
Thanks, Anand
> Thanks,
>
>>
>>> Signed-off-by: Damien Le Moal <[email protected]>
>>> Signed-off-by: Naohiro Aota <[email protected]>
>>> ---
>>> fs/btrfs/disk-io.c | 11 +++++++++++
>>> fs/btrfs/disk-io.h | 1 +
>>> fs/btrfs/extent-tree.c | 4 ++++
>>> fs/btrfs/scrub.c | 2 ++
>>> 4 files changed, 18 insertions(+)
>>>
>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>> index 7c1404c76768..ddbb02906042 100644
>>> --- a/fs/btrfs/disk-io.c
>>> +++ b/fs/btrfs/disk-io.c
>>> @@ -3466,6 +3466,13 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
>>> return latest;
>>> }
>>>
>>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos)
>>> +{
>>> + /* any address is good on a regular (zone_size == 0) device */
>>> + /* non-SEQUENTIAL WRITE REQUIRED zones are capable on a zoned device */
>>> + return device->zone_size == 0 || !btrfs_dev_is_sequential(device, pos);
>>> +}
>>> +
>>> /*
>>> * Write superblock @sb to the @device. Do not wait for completion, all the
>>> * buffer heads we write are pinned.
>>> @@ -3495,6 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device,
>>> if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>>> device->commit_total_bytes)
>>> break;
>>> + if (!btrfs_check_super_location(device, bytenr))
>>> + continue;
>>>
>>> btrfs_set_super_bytenr(sb, bytenr);
>>>
>>> @@ -3561,6 +3570,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors)
>>> if (bytenr + BTRFS_SUPER_INFO_SIZE >=
>>> device->commit_total_bytes)
>>> break;
>>> + if (!btrfs_check_super_location(device, bytenr))
>>> + continue;
>>>
>>> bh = __find_get_block(device->bdev,
>>> bytenr / BTRFS_BDEV_BLOCKSIZE,
>>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>>> index a0161aa1ea0b..70e97cd6fa76 100644
>>> --- a/fs/btrfs/disk-io.h
>>> +++ b/fs/btrfs/disk-io.h
>>> @@ -141,6 +141,7 @@ struct extent_map *btree_get_extent(struct btrfs_inode *inode,
>>> struct page *page, size_t pg_offset, u64 start, u64 len,
>>> int create);
>>> int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
>>> +int btrfs_check_super_location(struct btrfs_device *device, u64 pos);
>>> int __init btrfs_end_io_wq_init(void);
>>> void __cold btrfs_end_io_wq_exit(void);
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 3d41d840fe5c..ae2c895d08c4 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -267,6 +267,10 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache)
>>> return ret;
>>> }
>>>
>>> + /* we won't have super stripes in sequential zones */
>>> + if (cache->alloc_type == BTRFS_ALLOC_SEQ)
>>> + return 0;
>>> +
>>> for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
>>> bytenr = btrfs_sb_offset(i);
>>> ret = btrfs_rmap_block(fs_info, cache->key.objectid,
>>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>>> index f7b29f9db5e2..36ad4fad7eaf 100644
>>> --- a/fs/btrfs/scrub.c
>>> +++ b/fs/btrfs/scrub.c
>>> @@ -3720,6 +3720,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
>>> if (bytenr + BTRFS_SUPER_INFO_SIZE >
>>> scrub_dev->commit_total_bytes)
>>> break;
>>> + if (!btrfs_check_super_location(scrub_dev, bytenr))
>>> + continue;
>>>
>>> ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr,
>>> scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i,
>>>
>>
>>
>