cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files.
This can speed up the process by reducing the amount of data read and it
preserves sparseness when writing to the output file.
This patch series is an initial attempt at implementing
llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this
approach and suggestions for resolving the open issues.
In the block device world there are similar concepts to holes:
- SCSI has Logical Block Provisioning where the "mapped" state would be
considered data and other states would be considered holes.
- NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.
- Linux loop block devices and dm-linear targets can pass through queries to
the backing file.
- dm-thin targets can query metadata to find holes.
- ...and you may be able to think of more examples.
Therefore it is possible to offer this functionality in block drivers.
In my use case a QEMU process in userspace copies the contents of a dm-thin
target. QEMU already uses SEEK_HOLE but that doesn't work on dm-thin targets
without this patch series.
Others have also wished for block device support for SEEK_HOLE. Here is an open
issue from the BorgBackup project:
https://github.com/borgbackup/borg/issues/5609
With these patches userspace can identify holes in loop, dm-linear, and dm-thin
devices. This is done by adding a seek_hole_data() callback to struct
block_device_operations. When the callback is NULL the entire device is
considered data. Device-mapper is extended along the same lines so that targets
can provide seek_hole_data() callbacks.
I'm unfamiliar with much of this code and have probably missed locking
requirements. Since llseek() executes synchronously like ioctl() and is not an
asynchronous I/O request it's possible that my changes to the loop block driver
and dm-thin are broken (e.g. what if the loop device fd is changed during
llseek()?).
To run the tests:
# make TARGETS=block_seek_hole -C tools/testing/selftests run_tests
The code is also available here:
https://gitlab.com/stefanha/linux/-/tree/block-seek-hole
Please take a look and let me know your thoughts. Thanks!
Stefan Hajnoczi (9):
block: add llseek(SEEK_HOLE/SEEK_DATA) support
loop: add llseek(SEEK_HOLE/SEEK_DATA) support
selftests: block_seek_hole: add loop block driver tests
dm: add llseek(SEEK_HOLE/SEEK_DATA) support
selftests: block_seek_hole: add dm-zero test
dm-linear: add llseek(SEEK_HOLE/SEEK_DATA) support
selftests: block_seek_hole: add dm-linear test
dm thin: add llseek(SEEK_HOLE/SEEK_DATA) support
selftests: block_seek_hole: add dm-thin test
tools/testing/selftests/Makefile | 1 +
.../selftests/block_seek_hole/Makefile | 17 +++
include/linux/blkdev.h | 7 ++
include/linux/device-mapper.h | 5 +
block/fops.c | 43 ++++++-
drivers/block/loop.c | 36 +++++-
drivers/md/dm-linear.c | 25 ++++
drivers/md/dm-thin.c | 77 ++++++++++++
drivers/md/dm.c | 68 ++++++++++
.../testing/selftests/block_seek_hole/config | 3 +
.../selftests/block_seek_hole/dm_thin.sh | 80 ++++++++++++
.../selftests/block_seek_hole/dm_zero.sh | 31 +++++
.../selftests/block_seek_hole/map_holes.py | 37 ++++++
.../testing/selftests/block_seek_hole/test.py | 117 ++++++++++++++++++
14 files changed, 540 insertions(+), 7 deletions(-)
create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
create mode 100644 tools/testing/selftests/block_seek_hole/config
create mode 100755 tools/testing/selftests/block_seek_hole/dm_thin.sh
create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
create mode 100755 tools/testing/selftests/block_seek_hole/test.py
--
2.44.0
The SEEK_HOLE/SEEK_DATA interface is used by userspace applications to
detect sparseness. This makes copying and backup applications faster and
reduces space consumption because only ranges that do not contain data
can be skipped.
Handle SEEK_HOLE/SEEK_DATA for block devices. No block drivers implement
the new callback yet so the entire block device will appear to contain
data. Later patches will add support to drivers so this actually becomes
useful.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/blkdev.h | 7 +++++++
block/fops.c | 43 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c3e8f7cf96be9..eecfbf9c27fc4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -332,6 +332,9 @@ int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
int blk_revalidate_disk_zones(struct gendisk *disk,
void (*update_driver_data)(struct gendisk *disk));
+loff_t blkdev_seek_hole_data(struct block_device *bdev, loff_t offset,
+ int whence);
+
/*
* Independent access ranges: struct blk_independent_access_range describes
* a range of contiguous sectors that can be accessed using device command
@@ -1432,6 +1435,10 @@ struct block_device_operations {
* driver.
*/
int (*alternative_gpt_sector)(struct gendisk *disk, sector_t *sector);
+
+ /* Like llseek(SEEK_HOLE/SEEK_DATA). This callback may be NULL. */
+ loff_t (*seek_hole_data)(struct block_device *bdev, loff_t offset,
+ int whence);
};
#ifdef CONFIG_COMPAT
diff --git a/block/fops.c b/block/fops.c
index 679d9b752fe82..8ffbfec6b4c25 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -523,6 +523,43 @@ const struct address_space_operations def_blk_aops = {
};
#endif /* CONFIG_BUFFER_HEAD */
+/* Like llseek(SEEK_HOLE/SEEK_DATA) */
+loff_t blkdev_seek_hole_data(struct block_device *bdev, loff_t offset,
+ int whence)
+{
+ const struct block_device_operations *fops = bdev->bd_disk->fops;
+ loff_t size;
+
+ if (fops->seek_hole_data)
+ return fops->seek_hole_data(bdev, offset, whence);
+
+ size = bdev_nr_bytes(bdev);
+
+ switch (whence) {
+ case SEEK_DATA:
+ if ((unsigned long long)offset >= size)
+ return -ENXIO;
+ return offset;
+ case SEEK_HOLE:
+ if ((unsigned long long)offset >= size)
+ return -ENXIO;
+ return size;
+ default:
+ return -EINVAL;
+ }
+}
+
+static loff_t blkdev_llseek_hole_data(struct file *file, loff_t offset,
+ int whence)
+{
+ struct block_device *bdev = file_bdev(file);
+
+ offset = blkdev_seek_hole_data(bdev, offset, whence);
+ if (offset >= 0)
+ offset = vfs_setpos(file, offset, bdev_nr_bytes(bdev));
+ return offset;
+}
+
/*
* for a block special file file_inode(file)->i_size is zero
* so we compute the size by hand (just as in block_read/write above)
@@ -533,7 +570,11 @@ static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
loff_t retval;
inode_lock(bd_inode);
- retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
+ if (whence == SEEK_HOLE || whence == SEEK_DATA)
+ retval = blkdev_llseek_hole_data(file, offset, whence);
+ else
+ retval = fixed_size_llseek(file, offset, whence,
+ i_size_read(bd_inode));
inode_unlock(bd_inode);
return retval;
}
--
2.44.0
Run the tests with:
$ make TARGETS=block_seek_hole -C tools/selftests run_tests
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
.../selftests/block_seek_hole/Makefile | 17 +++
.../testing/selftests/block_seek_hole/config | 1 +
.../selftests/block_seek_hole/map_holes.py | 37 +++++++
.../testing/selftests/block_seek_hole/test.py | 103 ++++++++++++++++++
5 files changed, 159 insertions(+)
create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
create mode 100644 tools/testing/selftests/block_seek_hole/config
create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
create mode 100755 tools/testing/selftests/block_seek_hole/test.py
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index e1504833654db..8a21d6031b940 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -2,6 +2,7 @@
TARGETS += alsa
TARGETS += amd-pstate
TARGETS += arm64
+TARGETS += block_seek_hole
TARGETS += bpf
TARGETS += breakpoints
TARGETS += cachestat
diff --git a/tools/testing/selftests/block_seek_hole/Makefile b/tools/testing/selftests/block_seek_hole/Makefile
new file mode 100644
index 0000000000000..3f4bbd52db29f
--- /dev/null
+++ b/tools/testing/selftests/block_seek_hole/Makefile
@@ -0,0 +1,17 @@
+# SPDX-License-Identifier: GPL-2.0-only
+PY3 = $(shell which python3 2>/dev/null)
+
+ifneq ($(PY3),)
+
+TEST_PROGS := test.py
+
+include ../lib.mk
+
+else
+
+all: no_py3_warning
+
+no_py3_warning:
+ @echo "Missing python3. This test will be skipped."
+
+endif
diff --git a/tools/testing/selftests/block_seek_hole/config b/tools/testing/selftests/block_seek_hole/config
new file mode 100644
index 0000000000000..72437e0c0fc1c
--- /dev/null
+++ b/tools/testing/selftests/block_seek_hole/config
@@ -0,0 +1 @@
+CONFIG_BLK_DEV_LOOP=m
diff --git a/tools/testing/selftests/block_seek_hole/map_holes.py b/tools/testing/selftests/block_seek_hole/map_holes.py
new file mode 100755
index 0000000000000..9477ec5d69d3a
--- /dev/null
+++ b/tools/testing/selftests/block_seek_hole/map_holes.py
@@ -0,0 +1,37 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# map_holes.py <filename>
+#
+# Print the holes and data ranges in a file.
+
+import errno
+import os
+import sys
+
+def map_holes(fd):
+ end = os.lseek(fd, 0, os.SEEK_END)
+ offset = 0
+
+ print('TYPE START END SIZE')
+
+ while offset < end:
+ contents = 'DATA'
+ new_offset = os.lseek(fd, offset, os.SEEK_HOLE)
+ if new_offset == offset:
+ contents = 'HOLE'
+ try:
+ new_offset = os.lseek(fd, offset, os.SEEK_DATA)
+ except OSError as err:
+ if err.errno == errno.ENXIO:
+ new_offset = end
+ else:
+ raise err
+ assert new_offset != offset
+ print(f'{contents} {offset} {new_offset} {new_offset - offset}')
+ offset = new_offset
+
+if __name__ == '__main__':
+ with open(sys.argv[1], 'rb') as f:
+ fd = f.fileno()
+ map_holes(fd)
diff --git a/tools/testing/selftests/block_seek_hole/test.py b/tools/testing/selftests/block_seek_hole/test.py
new file mode 100755
index 0000000000000..4f7c2d01ab3d3
--- /dev/null
+++ b/tools/testing/selftests/block_seek_hole/test.py
@@ -0,0 +1,103 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# test.py
+#
+# Test SEEK_HOLE/SEEK_DATA support in block drivers
+
+import os
+import subprocess
+import sys
+from contextlib import contextmanager
+
+KB = 1024
+MB = 1024 * KB
+
+def run(args):
+ try:
+ cmd = subprocess.run(args, check=True, capture_output=True)
+ except subprocess.CalledProcessError as e:
+ print(e)
+ print(e.stderr.decode('utf-8').strip())
+ sys.exit(1)
+ return cmd
+
+@contextmanager
+def test_file(layout_fn, prefix='test'):
+ '''A context manager that creates a test file and produces its path'''
+ path = f'{prefix}-{os.getpid()}'
+ with open(path, 'w+b') as f:
+ layout_fn(f)
+
+ try:
+ yield path
+ finally:
+ os.unlink(path)
+
+@contextmanager
+def loop_device(file_path):
+ '''A context manager that attaches a loop device for a given file and produces the path of the loop device'''
+ cmd = run(['losetup', '--show', '-f', file_path])
+ loop_path = os.fsdecode(cmd.stdout.strip())
+
+ try:
+ yield loop_path
+ finally:
+ run(['losetup', '-d', loop_path])
+
+def test(layout, dev_context_manager):
+ with test_file(layout) as file_path, dev_context_manager(file_path) as dev_path:
+ cmd = run(['./map_holes.py', file_path])
+ file_output = cmd.stdout.decode('utf-8').strip()
+
+ cmd = run(['./map_holes.py', dev_path])
+ dev_output = cmd.stdout.decode('utf-8').strip()
+
+ if file_output != dev_output:
+ print(f'FAIL {dev_context_manager.__name__} {layout.__name__}')
+ print('File output:')
+ print(file_output)
+ print('Does not match device output:')
+ print(dev_output)
+ sys.exit(1)
+
+def test_all(layouts, dev_context_managers):
+ for dev_context_manager in dev_context_managers:
+ for layout in layouts:
+ test(layout, dev_context_manager)
+
+# Different data layouts to test
+
+def data_at_beginning_and_end(f):
+ f.write(b'A' * 4 * KB)
+ f.seek(256 * MB)
+
+ f.write(b'B' * 64 * KB)
+
+ f.seek(1024 * MB - KB)
+ f.write(b'C' * KB)
+
+def holes_at_beginning_and_end(f):
+ f.seek(128 * MB)
+ f.write(b'A' * 4 * KB)
+
+ f.seek(512 * MB)
+ f.write(b'B' * 64 * KB)
+
+ f.truncate(1024 * MB)
+
+def no_holes(f):
+ # Just 1 MB so test file generation is quick
+ mb = b'A' * MB
+ f.write(mb)
+
+def empty_file(f):
+ f.truncate(1024 * MB)
+
+if __name__ == '__main__':
+ layouts = [data_at_beginning_and_end,
+ holes_at_beginning_and_end,
+ no_holes,
+ empty_file]
+ dev_context_managers = [loop_device]
+ test_all(layouts, dev_context_managers)
--
2.44.0
Delegate SEEK_HOLE/SEEK_DATA to device-mapper targets. The new
dm_seek_hole_data() callback allows target types to customize behavior.
The default implementation treats the target as all data with no holes.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
include/linux/device-mapper.h | 5 +++
drivers/md/dm.c | 68 +++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+)
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 82b2195efaca7..e89ebaab6507a 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -161,6 +161,10 @@ typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff,
typedef size_t (*dm_dax_recovery_write_fn)(struct dm_target *ti, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i);
+/* Like llseek(SEEK_HOLE/SEEK_DATA) */
+typedef loff_t (*dm_seek_hole_data)(struct dm_target *ti, loff_t offset,
+ int whence);
+
void dm_error(const char *message);
struct dm_dev {
@@ -210,6 +214,7 @@ struct target_type {
dm_dax_direct_access_fn direct_access;
dm_dax_zero_page_range_fn dax_zero_page_range;
dm_dax_recovery_write_fn dax_recovery_write;
+ dm_seek_hole_data seek_hole_data;
/* For internal device-mapper use. */
struct list_head list;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 56aa2a8b9d715..3c921bdbd17fc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -3167,6 +3167,72 @@ void dm_free_md_mempools(struct dm_md_mempools *pools)
kfree(pools);
}
+/* Default implementation for targets that do not implement the callback */
+static loff_t dm_blk_seek_hole_data_default(loff_t offset, int whence,
+ loff_t size)
+{
+ switch (whence) {
+ case SEEK_DATA:
+ if ((unsigned long long)offset >= size)
+ return -ENXIO;
+ return offset;
+ case SEEK_HOLE:
+ if ((unsigned long long)offset >= size)
+ return -ENXIO;
+ return size;
+ default:
+ return -EINVAL;
+ }
+}
+
+static loff_t dm_blk_do_seek_hole_data(struct dm_table *table, loff_t offset,
+ int whence)
+{
+ struct dm_target *ti;
+ loff_t end;
+
+ /* Loop when the end of a target is reached */
+ do {
+ ti = dm_table_find_target(table, offset >> SECTOR_SHIFT);
+ if (!ti)
+ return whence == SEEK_DATA ? -ENXIO : offset;
+
+ end = (ti->begin + ti->len) << SECTOR_SHIFT;
+
+ if (ti->type->seek_hole_data)
+ offset = ti->type->seek_hole_data(ti, offset, whence);
+ else
+ offset = dm_blk_seek_hole_data_default(offset, whence, end);
+
+ if (whence == SEEK_DATA && offset == -ENXIO)
+ offset = end;
+ } while (offset == end);
+
+ return offset;
+}
+
+static loff_t dm_blk_seek_hole_data(struct block_device *bdev, loff_t offset,
+ int whence)
+{
+ struct mapped_device *md = bdev->bd_disk->private_data;
+ struct dm_table *table;
+ int srcu_idx;
+ loff_t ret;
+
+ if (dm_suspended_md(md))
+ return -EAGAIN;
+
+ table = dm_get_live_table(md, &srcu_idx);
+ if (!table)
+ return -EIO;
+
+ ret = dm_blk_do_seek_hole_data(table, offset, whence);
+
+ dm_put_live_table(md, srcu_idx);
+
+ return ret;
+}
+
struct dm_pr {
u64 old_key;
u64 new_key;
@@ -3493,6 +3559,7 @@ static const struct block_device_operations dm_blk_dops = {
.getgeo = dm_blk_getgeo,
.report_zones = dm_blk_report_zones,
.pr_ops = &dm_pr_ops,
+ .seek_hole_data = dm_blk_seek_hole_data,
.owner = THIS_MODULE
};
@@ -3502,6 +3569,7 @@ static const struct block_device_operations dm_rq_blk_dops = {
.ioctl = dm_blk_ioctl,
.getgeo = dm_blk_getgeo,
.pr_ops = &dm_pr_ops,
+ .seek_hole_data = dm_blk_seek_hole_data,
.owner = THIS_MODULE
};
--
2.44.0
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
.../selftests/block_seek_hole/Makefile | 2 +-
.../testing/selftests/block_seek_hole/config | 2 ++
.../selftests/block_seek_hole/dm_zero.sh | 31 +++++++++++++++++++
3 files changed, 34 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
diff --git a/tools/testing/selftests/block_seek_hole/Makefile b/tools/testing/selftests/block_seek_hole/Makefile
index 3f4bbd52db29f..1bd9e748b2acc 100644
--- a/tools/testing/selftests/block_seek_hole/Makefile
+++ b/tools/testing/selftests/block_seek_hole/Makefile
@@ -3,7 +3,7 @@ PY3 = $(shell which python3 2>/dev/null)
ifneq ($(PY3),)
-TEST_PROGS := test.py
+TEST_PROGS := test.py dm_zero.sh
include ../lib.mk
diff --git a/tools/testing/selftests/block_seek_hole/config b/tools/testing/selftests/block_seek_hole/config
index 72437e0c0fc1c..bfd59f1d98769 100644
--- a/tools/testing/selftests/block_seek_hole/config
+++ b/tools/testing/selftests/block_seek_hole/config
@@ -1 +1,3 @@
CONFIG_BLK_DEV_LOOP=m
+CONFIG_BLK_DEV_DM=m
+CONFIG_DM_ZERO=m
diff --git a/tools/testing/selftests/block_seek_hole/dm_zero.sh b/tools/testing/selftests/block_seek_hole/dm_zero.sh
new file mode 100755
index 0000000000000..20836a566fcc8
--- /dev/null
+++ b/tools/testing/selftests/block_seek_hole/dm_zero.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# dm_zero.sh
+#
+# Test that dm-zero reports data because it does not have a custom
+# SEEK_HOLE/SEEK_DATA implementation.
+
+set -e
+
+dev_name=test-$$
+size=$((1024 * 1024 * 1024 / 512)) # 1 GB
+
+cleanup() {
+ dmsetup remove $dev_name
+}
+trap cleanup EXIT
+
+dmsetup create $dev_name --table "0 $size zero"
+
+output=$(./map_holes.py /dev/mapper/$dev_name)
+expected='TYPE START END SIZE
+DATA 0 1073741824 1073741824'
+
+if [ "$output" != "$expected" ]; then
+ echo 'FAIL expected:'
+ echo "$expected"
+ echo 'Does not match device output:'
+ echo "$output"
+ exit 1
+fi
--
2.44.0
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
Open issues:
- The file offset is updated on both the blkdev file and the backing
file. Is there a way to avoid updating the backing file offset so the
file opened by userspace is not affected?
- Should this run in the worker or use the cgroups?
---
drivers/block/loop.c | 36 ++++++++++++++++++++++++++++++------
1 file changed, 30 insertions(+), 6 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 28a95fd366fea..6a89375de82e8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -750,6 +750,29 @@ static void loop_sysfs_exit(struct loop_device *lo)
&loop_attribute_group);
}
+static loff_t lo_seek_hole_data(struct block_device *bdev, loff_t offset,
+ int whence)
+{
+ /* TODO need to activate cgroups or use worker? */
+ /* TODO locking? */
+ struct loop_device *lo = bdev->bd_disk->private_data;
+ struct file *file = lo->lo_backing_file;
+
+ if (lo->lo_offset > 0)
+ offset += lo->lo_offset; /* TODO underflow/overflow? */
+
+ /* TODO backing file offset is modified! */
+ offset = vfs_llseek(file, offset, whence);
+ if (offset < 0)
+ return offset;
+
+ if (lo->lo_offset > 0)
+ offset -= lo->lo_offset; /* TODO underflow/overflow? */
+ if (lo->lo_sizelimit > 0 && offset > lo->lo_sizelimit)
+ offset = lo->lo_sizelimit;
+ return offset;
+}
+
static void loop_config_discard(struct loop_device *lo,
struct queue_limits *lim)
{
@@ -1751,13 +1774,14 @@ static void lo_free_disk(struct gendisk *disk)
}
static const struct block_device_operations lo_fops = {
- .owner = THIS_MODULE,
- .release = lo_release,
- .ioctl = lo_ioctl,
+ .owner = THIS_MODULE,
+ .release = lo_release,
+ .ioctl = lo_ioctl,
#ifdef CONFIG_COMPAT
- .compat_ioctl = lo_compat_ioctl,
+ .compat_ioctl = lo_compat_ioctl,
#endif
- .free_disk = lo_free_disk,
+ .free_disk = lo_free_disk,
+ .seek_hole_data = lo_seek_hole_data,
};
/*
@@ -2140,7 +2164,7 @@ static int loop_control_remove(int idx)
pr_warn_once("deleting an unspecified loop device is not supported.\n");
return -EINVAL;
}
-
+
/* Hide this loop device for serialization. */
ret = mutex_lock_killable(&loop_ctl_mutex);
if (ret)
--
2.44.0
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
drivers/md/dm-linear.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)
diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 2d3e186ca87e3..9b6cdfa4f951d 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -147,6 +147,30 @@ static int linear_report_zones(struct dm_target *ti,
#define linear_report_zones NULL
#endif
+static loff_t linear_seek_hole_data(struct dm_target *ti, loff_t offset,
+ int whence)
+{
+ struct linear_c *lc = ti->private;
+ loff_t ti_begin = ti->begin << SECTOR_SHIFT;
+ loff_t ti_len = ti->len << SECTOR_SHIFT;
+ loff_t bdev_start = lc->start << SECTOR_SHIFT;
+ loff_t bdev_offset;
+
+ /* TODO underflow/overflow? */
+ bdev_offset = offset - ti_begin + bdev_start;
+
+ bdev_offset = blkdev_seek_hole_data(lc->dev->bdev, bdev_offset,
+ whence);
+ if (bdev_offset < 0)
+ return bdev_offset;
+
+ offset = bdev_offset - bdev_start;
+ if (offset >= ti_len)
+ return whence == SEEK_DATA ? -ENXIO : ti_begin + ti_len;
+
+ return offset + ti_begin;
+}
+
static int linear_iterate_devices(struct dm_target *ti,
iterate_devices_callout_fn fn, void *data)
{
@@ -212,6 +236,7 @@ static struct target_type linear_target = {
.direct_access = linear_dax_direct_access,
.dax_zero_page_range = linear_dax_zero_page_range,
.dax_recovery_write = linear_dax_recovery_write,
+ .seek_hole_data = linear_seek_hole_data,
};
int __init dm_linear_init(void)
--
2.44.0
The dm-linear linear target passes through SEEK_HOLE/SEEK_DATA. Extend
the test case to check that the same holes/data are reported as for the
underlying file.
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
tools/testing/selftests/block_seek_hole/test.py | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/block_seek_hole/test.py b/tools/testing/selftests/block_seek_hole/test.py
index 4f7c2d01ab3d3..6360b72aee338 100755
--- a/tools/testing/selftests/block_seek_hole/test.py
+++ b/tools/testing/selftests/block_seek_hole/test.py
@@ -45,6 +45,20 @@ def loop_device(file_path):
finally:
run(['losetup', '-d', loop_path])
+@contextmanager
+def dm_linear(file_path):
+ file_size = os.path.getsize(file_path)
+
+ with loop_device(file_path) as loop_path:
+ dm_name = f'test-{os.getpid()}'
+ run(['dmsetup', 'create', dm_name, '--table',
+ f'0 {file_size // 512} linear {loop_path} 0'])
+
+ try:
+ yield f'/dev/mapper/{dm_name}'
+ finally:
+ run(['dmsetup', 'remove', dm_name])
+
def test(layout, dev_context_manager):
with test_file(layout) as file_path, dev_context_manager(file_path) as dev_path:
cmd = run(['./map_holes.py', file_path])
@@ -99,5 +113,5 @@ if __name__ == '__main__':
holes_at_beginning_and_end,
no_holes,
empty_file]
- dev_context_managers = [loop_device]
+ dev_context_managers = [loop_device, dm_linear]
test_all(layouts, dev_context_managers)
--
2.44.0
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
Open issues:
- Locking?
- thin_seek_hole_data() does not run as a bio or request. This patch
assumes dm_thin_find_mapped_range() synchronously performs I/O if
metadata needs to be loaded from disk. Is that a valid assumption?
---
drivers/md/dm-thin.c | 77 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 77 insertions(+)
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 4793ad2aa1f7e..3c5dc4f0fe8a3 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -4501,6 +4501,82 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
}
}
+static dm_block_t loff_to_block(struct pool *pool, loff_t offset)
+{
+ sector_t offset_sectors = offset >> SECTOR_SHIFT;
+ dm_block_t ret;
+
+ if (block_size_is_power_of_two(pool))
+ ret = offset_sectors >> pool->sectors_per_block_shift;
+ else {
+ ret = offset_sectors;
+ (void) sector_div(ret, pool->sectors_per_block);
+ }
+
+ return ret;
+}
+
+static loff_t block_to_loff(struct pool *pool, dm_block_t block)
+{
+ return block_to_sectors(pool, block) << SECTOR_SHIFT;
+}
+
+static loff_t thin_seek_hole_data(struct dm_target *ti, loff_t offset,
+ int whence)
+{
+ struct thin_c *tc = ti->private;
+ struct dm_thin_device *td = tc->td;
+ struct pool *pool = tc->pool;
+ dm_block_t begin;
+ dm_block_t end;
+ dm_block_t mapped_begin;
+ dm_block_t mapped_end;
+ dm_block_t pool_begin;
+ bool maybe_shared;
+ int ret;
+
+ /* TODO locking? */
+
+ if (block_size_is_power_of_two(pool))
+ end = ti->len >> pool->sectors_per_block_shift;
+ else {
+ end = ti->len;
+ (void) sector_div(end, pool->sectors_per_block);
+ }
+
+ offset -= ti->begin << SECTOR_SHIFT;
+
+ while (true) {
+ begin = loff_to_block(pool, offset);
+ ret = dm_thin_find_mapped_range(td, begin, end,
+ &mapped_begin, &mapped_end,
+ &pool_begin, &maybe_shared);
+ if (ret == -ENODATA) {
+ if (whence == SEEK_DATA)
+ return -ENXIO;
+ break;
+ } else if (ret < 0) {
+ /* TODO handle EWOULDBLOCK? */
+ return -ENXIO;
+ }
+
+ /* SEEK_DATA finishes here... */
+ if (whence == SEEK_DATA) {
+ if (mapped_begin != begin)
+ offset = block_to_loff(pool, mapped_begin);
+ break;
+ }
+
+ /* ...while SEEK_HOLE may need to look further */
+ if (mapped_begin != begin)
+ break; /* offset is in a hole */
+
+ offset = block_to_loff(pool, mapped_end);
+ }
+
+ return offset + (ti->begin << SECTOR_SHIFT);
+}
+
static struct target_type thin_target = {
.name = "thin",
.version = {1, 23, 0},
@@ -4515,6 +4591,7 @@ static struct target_type thin_target = {
.status = thin_status,
.iterate_devices = thin_iterate_devices,
.io_hints = thin_io_hints,
+ .seek_hole_data = thin_seek_hole_data,
};
/*----------------------------------------------------------------*/
--
2.44.0
Signed-off-by: Stefan Hajnoczi <[email protected]>
---
.../selftests/block_seek_hole/Makefile | 2 +-
.../selftests/block_seek_hole/dm_thin.sh | 80 +++++++++++++++++++
2 files changed, 81 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/block_seek_hole/dm_thin.sh
diff --git a/tools/testing/selftests/block_seek_hole/Makefile b/tools/testing/selftests/block_seek_hole/Makefile
index 1bd9e748b2acc..3b4ee1b1fb6e7 100644
--- a/tools/testing/selftests/block_seek_hole/Makefile
+++ b/tools/testing/selftests/block_seek_hole/Makefile
@@ -3,7 +3,7 @@ PY3 = $(shell which python3 2>/dev/null)
ifneq ($(PY3),)
-TEST_PROGS := test.py dm_zero.sh
+TEST_PROGS := test.py dm_zero.sh dm_thin.sh
include ../lib.mk
diff --git a/tools/testing/selftests/block_seek_hole/dm_thin.sh b/tools/testing/selftests/block_seek_hole/dm_thin.sh
new file mode 100755
index 0000000000000..a379b7c875f28
--- /dev/null
+++ b/tools/testing/selftests/block_seek_hole/dm_thin.sh
@@ -0,0 +1,80 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# dm_thin.sh
+#
+# Test that dm-thin supports SEEK_HOLE/SEEK_DATA.
+
+set -e
+
+# check <actual> <expected>
+# Check that the actual output matches the expected output.
+check() {
+ if [ "$1" != "$2" ]; then
+ echo 'FAIL expected:'
+ echo "$2"
+ echo 'Does not match device output:'
+ echo "$1"
+ exit 1
+ fi
+}
+
+cleanup() {
+ if [ -n "$thin_name" ]; then
+ dmsetup remove $thin_name
+ fi
+ if [ -n "$pool_name" ]; then
+ dmsetup remove $pool_name
+ fi
+ if [ -n "$metadata_path" ]; then
+ losetup --detach "$metadata_path"
+ fi
+ if [ -n "$data_path" ]; then
+ losetup --detach "$data_path"
+ fi
+ rm -f pool-metadata pool-data
+}
+trap cleanup EXIT
+
+rm -f pool-metadata pool-data
+truncate -s 256M pool-metadata
+truncate -s 1G pool-data
+
+size_sectors=$((1024 * 1024 * 1024 / 512)) # 1 GB
+metadata_path=$(losetup --show --find pool-metadata)
+data_path=$(losetup --show --find pool-data)
+pool_name=pool-$$
+thin_name=thin-$$
+
+dmsetup create $pool_name \
+ --table "0 $size_sectors thin-pool $metadata_path $data_path 128 $size_sectors"
+dmsetup message /dev/mapper/$pool_name 0 'create_thin 0'
+dmsetup create $thin_name --table "0 $size_sectors thin /dev/mapper/$pool_name 0"
+
+# Verify that the device is empty
+check "$(./map_holes.py /dev/mapper/$thin_name)" 'TYPE START END SIZE
+HOLE 0 1073741824 1073741824'
+
+# Write 4k at offset 128M but dm-thin will actually map an entire 64k block
+dd if=/dev/urandom of=/dev/mapper/$thin_name bs=4k count=1 seek=32768 status=none
+check "$(./map_holes.py /dev/mapper/$thin_name)" 'TYPE START END SIZE
+HOLE 0 134217728 134217728
+DATA 134217728 134283264 65536
+HOLE 134283264 1073741824 939458560'
+
+# Write at the beginning of the device
+dd if=/dev/urandom of=/dev/mapper/$thin_name bs=4k count=1 status=none
+check "$(./map_holes.py /dev/mapper/$thin_name)" 'TYPE START END SIZE
+DATA 0 65536 65536
+HOLE 65536 134217728 134152192
+DATA 134217728 134283264 65536
+HOLE 134283264 1073741824 939458560'
+
+# Write at the end of the device
+dd if=/dev/urandom of=/dev/mapper/$thin_name bs=4k count=1 seek=262143 status=none
+check "$(./map_holes.py /dev/mapper/$thin_name)" 'TYPE START END SIZE
+DATA 0 65536 65536
+HOLE 65536 134217728 134152192
+DATA 134217728 134283264 65536
+HOLE 134283264 1073676288 939393024
+DATA 1073676288 1073741824 65536'
--
2.44.0
On Thu, Mar 28, 2024 at 04:39:06PM -0400, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> .../selftests/block_seek_hole/Makefile | 2 +-
> .../testing/selftests/block_seek_hole/config | 2 ++
> .../selftests/block_seek_hole/dm_zero.sh | 31 +++++++++++++++++++
> 3 files changed, 34 insertions(+), 1 deletion(-)
> create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
>
> +++ b/tools/testing/selftests/block_seek_hole/dm_zero.sh
> @@ -0,0 +1,31 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# dm_zero.sh
> +#
> +# Test that dm-zero reports data because it does not have a custom
> +# SEEK_HOLE/SEEK_DATA implementation.
Why not? Wouldn't it make more sense to have dm-zero report the
entire device as a hole (that is, an in-range SEEK_HOLE always returns
the same offset, while an in-range SEEK_DATA returns the device size)?
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files.
As a minor point of clarity (perhaps as much for my own records for
documenting research I've done over the years, and not necessarily
something you need to change in the commit message):
Userspace apps generally use lseek(2) from glibc or similar (perhaps
via its alias lseek64(), depending on whether userspace is using large
file offsets), rather than directly calling the _llseek() syscall.
But it all boils down to the same notion of seeking information about
various special offsets.
Also, in past history, coreutils cp(1) and dd(1) did experiment with
using FS_IOC_FIEMAP ioctls when SEEK_HOLE was not available, but it
proved to cause more problems than it solved, so that is not currently
in favor. Yes, we could teach more and more block devices to expose
specific ioctls for querying sparseness boundaries, and then teach
userspace apps a list of ioctls to try; but as cp(1) already learned,
having one common interface is much easier than an ever-growing ioctl
ladder to be copied across every client that would benefit from
knowing where the unallocated portions are.
> This can speed up the process by reducing the amount of data read and it
> preserves sparseness when writing to the output file.
>
> This patch series is an initial attempt at implementing
> llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this
> approach and suggestions for resolving the open issues.
One of your open issues was whether adjusting the offset of the block
device itself should also adjust the file offset of the underlying
file (at least in the case of loopback and dm-linear). What would the
community think of adding two additional constants to the entire
family of *seek() functions, that have the effect of returning the
same offset as their SEEK_HOLE/SEEK_DATA counterparts but without
moving the file offset?
Explaining the idea by example, although I'm not stuck on these names:
suppose you have an fd visiting a file description of 2MiB in size,
with the first 1MiB being a hole and the second being data.
#define MiB (1024*1024)
lseek64(fd, MiB, SEEK_SET); // returns MiB, file offset changed to MiB
lseek64(fd, 0, SEEK_HOLE); // returns 0, file offset changed to 0
lseek64(fd, 0, SEEK_DATA); // returns MiB, file offset changed to MiB
lseek64(fd, 0, SEEK_PEEK_HOLE); // returns 0, but file offset left at MiB
lseek64(fd, 0, SEEK_SET); // returns 0, file offset changed to MiB
lseek64(fd, 0, SEEK_PEEK_DATA); // returns MiB, but file offset left at MiB
With semantics like that, it might be easier to implement just
SEEK_PEEK* in devices (don't worry about changing offsets, just about
reporting where the requested offset is), and then have a common layer
do the translation from llseek(...,offs,SEEK_HOLE) into a 2-step
llseek(...,llseek(...,offs,SEEK_PEEK_HOLE),SEEK_SET) if that makes life
easier under the hood.
>
> In the block device world there are similar concepts to holes:
> - SCSI has Logical Block Provisioning where the "mapped" state would be
> considered data and other states would be considered holes.
BIG caveat here: the SCSI spec does not necessarily guarantee that
unmapped regions read as all zeroes; compare the difference between
FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE. While lseek(SEEK_HOLE)
on a regular file guarantees that future read() in that hole will see
NUL bytes, I'm not sure whether we want to make that guarantee for
block devices. This may be yet another case where we might want to
add new SEEK_* constants to the *seek() family of functions that lets
the caller indicate whether they want offsets that are guaranteed to
read as zero, vs. merely offsets that are not allocated but may or may
not read as zero. Skipping unallocated portions, even when you don't
know if the contents reliably read as zero, is still a useful goal in
some userspace programs.
> - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.
However, utilizing it in nbd.ko would require teaching the kernel to
handle structured or extended headers (right now, that is an extension
only supported in user-space implementations of the NBD protocol). I
can see why you did not tackle that in this RFC series, even though
you mention it in the cover letter.
> - Linux loop block devices and dm-linear targets can pass through queries to
> the backing file.
> - dm-thin targets can query metadata to find holes.
> - ...and you may be able to think of more examples.
>
> Therefore it is possible to offer this functionality in block drivers.
>
> In my use case a QEMU process in userspace copies the contents of a dm-thin
> target. QEMU already uses SEEK_HOLE but that doesn't work on dm-thin targets
> without this patch series.
>
> Others have also wished for block device support for SEEK_HOLE. Here is an open
> issue from the BorgBackup project:
> https://github.com/borgbackup/borg/issues/5609
>
> With these patches userspace can identify holes in loop, dm-linear, and dm-thin
> devices. This is done by adding a seek_hole_data() callback to struct
> block_device_operations. When the callback is NULL the entire device is
> considered data. Device-mapper is extended along the same lines so that targets
> can provide seek_hole_data() callbacks.
>
> I'm unfamiliar with much of this code and have probably missed locking
> requirements. Since llseek() executes synchronously like ioctl() and is not an
> asynchronous I/O request it's possible that my changes to the loop block driver
> and dm-thin are broken (e.g. what if the loop device fd is changed during
> llseek()?).
>
> To run the tests:
>
> # make TARGETS=block_seek_hole -C tools/testing/selftests run_tests
>
> The code is also available here:
> https://gitlab.com/stefanha/linux/-/tree/block-seek-hole
>
> Please take a look and let me know your thoughts. Thanks!
>
> Stefan Hajnoczi (9):
> block: add llseek(SEEK_HOLE/SEEK_DATA) support
> loop: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add loop block driver tests
> dm: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add dm-zero test
> dm-linear: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add dm-linear test
> dm thin: add llseek(SEEK_HOLE/SEEK_DATA) support
> selftests: block_seek_hole: add dm-thin test
>
> tools/testing/selftests/Makefile | 1 +
> .../selftests/block_seek_hole/Makefile | 17 +++
> include/linux/blkdev.h | 7 ++
> include/linux/device-mapper.h | 5 +
> block/fops.c | 43 ++++++-
> drivers/block/loop.c | 36 +++++-
> drivers/md/dm-linear.c | 25 ++++
> drivers/md/dm-thin.c | 77 ++++++++++++
> drivers/md/dm.c | 68 ++++++++++
> .../testing/selftests/block_seek_hole/config | 3 +
> .../selftests/block_seek_hole/dm_thin.sh | 80 ++++++++++++
> .../selftests/block_seek_hole/dm_zero.sh | 31 +++++
> .../selftests/block_seek_hole/map_holes.py | 37 ++++++
> .../testing/selftests/block_seek_hole/test.py | 117 ++++++++++++++++++
> 14 files changed, 540 insertions(+), 7 deletions(-)
> create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
> create mode 100644 tools/testing/selftests/block_seek_hole/config
> create mode 100755 tools/testing/selftests/block_seek_hole/dm_thin.sh
> create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
> create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
> create mode 100755 tools/testing/selftests/block_seek_hole/test.py
>
> --
> 2.44.0
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
Replying to myself,
On Thu, Mar 28, 2024 at 05:17:18PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > cp(1) and backup tools use llseek(SEEK_HOLE/SEEK_DATA) to skip holes in files.
>
> >
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> > considered data and other states would be considered holes.
>
> BIG caveat here: the SCSI spec does not necessarily guarantee that
> unmapped regions read as all zeroes; compare the difference between
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE. While lseek(SEEK_HOLE)
> on a regular file guarantees that future read() in that hole will see
> NUL bytes, I'm not sure whether we want to make that guarantee for
> block devices. This may be yet another case where we might want to
> add new SEEK_* constants to the *seek() family of functions that lets
> the caller indicate whether they want offsets that are guaranteed to
> read as zero, vs. merely offsets that are not allocated but may or may
> not read as zero. Skipping unallocated portions, even when you don't
> know if the contents reliably read as zero, is still a useful goal in
> some userspace programs.
>
> > - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.
The upstream NBD spec[1] took the time to represent two bits of
information per extent, _because_ of the knowledge that not all SCSI
devices with TRIM support actually guarantee a read of zeroes after
trimming. That is, NBD chose to convey both:
NBD_STATE_HOLE: 1<<0 if region is unallocated, 0 if region has not been trimmed
NBD_STATE_ZERO: 1<<1 if region reads as zeroes, 0 if region contents might be nonzero
it is always safe to describe an extent as value 0 (both bits clear),
whether or not lseek(SEEK_DATA) returns the same offset; meanwhile,
traditional lseek(SEEK_HOLE) on filesystems generally translates to a
status of 3 (both bits set), but as it is (sometimes) possible to
determine that allocated data still reads as zero, or that unallocated
data may not necessarily read as zero, it is also possible to
implement NBD servers that don't report both bits in parallel.
If we are going to enhance llseek(2) to expose more information about
underlying block devices, possibly by adding more SEEK_ constants for
use in the entire family of *seek() API, it may be worth thinking
about whether it is worth userspace being able to query this
additional distinction between unallocated vs reads-as-zero.
[1] https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md#baseallocation-metadata-context
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 05:19:26PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:06PM -0400, Stefan Hajnoczi wrote:
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > .../selftests/block_seek_hole/Makefile | 2 +-
> > .../testing/selftests/block_seek_hole/config | 2 ++
> > .../selftests/block_seek_hole/dm_zero.sh | 31 +++++++++++++++++++
> > 3 files changed, 34 insertions(+), 1 deletion(-)
> > create mode 100755 tools/testing/selftests/block_seek_hole/dm_zero.sh
> >
>
> > +++ b/tools/testing/selftests/block_seek_hole/dm_zero.sh
> > @@ -0,0 +1,31 @@
> > +#!/bin/sh
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +#
> > +# dm_zero.sh
> > +#
> > +# Test that dm-zero reports data because it does not have a custom
> > +# SEEK_HOLE/SEEK_DATA implementation.
>
> Why not? Wouldn't it make more sense to have dm-zero report the
> entire device as a hole (that is, an in-range SEEK_HOLE always returns
> the same offset, while an in-range SEEK_DATA returns the device size)?
Yes, dm-zero could report a hole. I just added this test to verify that
targets that do not implement seek_hole_data() work and the dm-zero
target was convenient for testing.
Stefan
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.
> Virtualization: qemu.org | libguestfs.org
>
On Thu, Mar 28, 2024 at 05:16:54PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > This can speed up the process by reducing the amount of data read and it
> > preserves sparseness when writing to the output file.
> >
> > This patch series is an initial attempt at implementing
> > llseek(SEEK_HOLE/SEEK_DATA) for block devices. I'm looking for feedback on this
> > approach and suggestions for resolving the open issues.
>
> One of your open issues was whether adjusting the offset of the block
> device itself should also adjust the file offset of the underlying
> file (at least in the case of loopback and dm-linear). What would the
Only the loop block driver has this issue. The dm-linear driver uses
blkdev_seek_hole_data(), which does not update the file offset because
it operates on a struct block_device instead of a struct file.
>
> >
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> > considered data and other states would be considered holes.
>
> BIG caveat here: the SCSI spec does not necessarily guarantee that
> unmapped regions read as all zeroes; compare the difference between
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE. While lseek(SEEK_HOLE)
> on a regular file guarantees that future read() in that hole will see
> NUL bytes, I'm not sure whether we want to make that guarantee for
> block devices. This may be yet another case where we might want to
> add new SEEK_* constants to the *seek() family of functions that lets
> the caller indicate whether they want offsets that are guaranteed to
> read as zero, vs. merely offsets that are not allocated but may or may
> not read as zero. Skipping unallocated portions, even when you don't
> know if the contents reliably read as zero, is still a useful goal in
> some userspace programs.
SCSI initiators can check the Logical Block Provisioning Read Zeroes
(LBPRZ) field to determine whether or not zeroes are guaranteed. The sd
driver would only rely on the device when LPBRZ indicates that zeroes
will be read. Otherwise the driver would report that the device is
filled with data.
>
> > - NBD has NBD_CMD_BLOCK_STATUS for querying whether blocks are present.
>
> However, utilizing it in nbd.ko would require teaching the kernel to
> handle structured or extended headers (right now, that is an extension
> only supported in user-space implementations of the NBD protocol). I
> can see why you did not tackle that in this RFC series, even though
> you mention it in the cover letter.
Yes, I'm mostly interested in dm-thin. The loop block driver and
dm-linear are useful for testing so I modified them. I didn't try SCSI
or NBD.
Thanks,
Stefan
On Thu, Mar 28, 2024 at 04:39:02PM -0400, Stefan Hajnoczi wrote:
> The SEEK_HOLE/SEEK_DATA interface is used by userspace applications to
> detect sparseness. This makes copying and backup applications faster and
> reduces space consumption because only ranges that do not contain data
> can be skipped.
s/only //
>
> Handle SEEK_HOLE/SEEK_DATA for block devices. No block drivers implement
> the new callback yet so the entire block device will appear to contain
> data. Later patches will add support to drivers so this actually becomes
> useful.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> include/linux/blkdev.h | 7 +++++++
> block/fops.c | 43 +++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 49 insertions(+), 1 deletion(-)
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:03PM -0400, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> Open issues:
> - The file offset is updated on both the blkdev file and the backing
> file. Is there a way to avoid updating the backing file offset so the
> file opened by userspace is not affected?
> - Should this run in the worker or use the cgroups?
> ---
> drivers/block/loop.c | 36 ++++++++++++++++++++++++++++++------
> 1 file changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 28a95fd366fea..6a89375de82e8 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -750,6 +750,29 @@ static void loop_sysfs_exit(struct loop_device *lo)
> &loop_attribute_group);
> }
>
> +static loff_t lo_seek_hole_data(struct block_device *bdev, loff_t offset,
> + int whence)
> +{
> + /* TODO need to activate cgroups or use worker? */
> + /* TODO locking? */
> + struct loop_device *lo = bdev->bd_disk->private_data;
> + struct file *file = lo->lo_backing_file;
> +
> + if (lo->lo_offset > 0)
> + offset += lo->lo_offset; /* TODO underflow/overflow? */
> +
> + /* TODO backing file offset is modified! */
> + offset = vfs_llseek(file, offset, whence);
Not only did you modify the underlying offset...
> + if (offset < 0)
> + return offset;
> +
> + if (lo->lo_offset > 0)
> + offset -= lo->lo_offset; /* TODO underflow/overflow? */
> + if (lo->lo_sizelimit > 0 && offset > lo->lo_sizelimit)
> + offset = lo->lo_sizelimit;
..but if this code kicks in, you have clamped the return result to
EOF of the loop device while leaving the underlying offset beyond the
limit, which may mess up assumptions of other code expecting the loop
to always have an in-bounds offset for the underlying file (offhand, I
don't know if there is any such code; but since loop_ctl_fops.llseek =
noop_lseek, there may be code relying on an even-tighter restriction
that the offset of the underlying file never changes, not even within
bounds).
Furthermore, this is inconsistent with all other seek-beyond-end code
that fails with -ENXIO instead of returning size.
But for an RFC, the idea of being able to seek to HOLEs in a loop
device is awesome!
> @@ -2140,7 +2164,7 @@ static int loop_control_remove(int idx)
> pr_warn_once("deleting an unspecified loop device is not supported.\n");
> return -EINVAL;
> }
> -
> +
> /* Hide this loop device for serialization. */
> ret = mutex_lock_killable(&loop_ctl_mutex);
> if (ret)
Unrelated whitespace change?
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:04PM -0400, Stefan Hajnoczi wrote:
> Run the tests with:
>
> $ make TARGETS=block_seek_hole -C tools/selftests run_tests
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> tools/testing/selftests/Makefile | 1 +
> .../selftests/block_seek_hole/Makefile | 17 +++
> .../testing/selftests/block_seek_hole/config | 1 +
> .../selftests/block_seek_hole/map_holes.py | 37 +++++++
> .../testing/selftests/block_seek_hole/test.py | 103 ++++++++++++++++++
> 5 files changed, 159 insertions(+)
> create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
> create mode 100644 tools/testing/selftests/block_seek_hole/config
> create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
> create mode 100755 tools/testing/selftests/block_seek_hole/test.py
>
> +++ b/tools/testing/selftests/block_seek_hole/test.py
> +
> +# Different data layouts to test
> +
> +def data_at_beginning_and_end(f):
> + f.write(b'A' * 4 * KB)
> + f.seek(256 * MB)
> +
> + f.write(b'B' * 64 * KB)
> +
> + f.seek(1024 * MB - KB)
> + f.write(b'C' * KB)
> +
> +def holes_at_beginning_and_end(f):
> + f.seek(128 * MB)
> + f.write(b'A' * 4 * KB)
> +
> + f.seek(512 * MB)
> + f.write(b'B' * 64 * KB)
> +
> + f.truncate(1024 * MB)
> +
> +def no_holes(f):
> + # Just 1 MB so test file generation is quick
> + mb = b'A' * MB
> + f.write(mb)
> +
> +def empty_file(f):
> + f.truncate(1024 * MB)
Is it also worth attempting to test a (necessarily sparse!) file
larger than 2GiB to prove that we are 64-bit clean, even on a 32-bit
system where lseek is different than lseek64? (I honestly have no
idea if python always uses 64-bit seek even on 32-bit systems,
although I would be surprised if it were not)
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:05PM -0400, Stefan Hajnoczi wrote:
> Delegate SEEK_HOLE/SEEK_DATA to device-mapper targets. The new
> dm_seek_hole_data() callback allows target types to customize behavior.
> The default implementation treats the target as all data with no holes.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> include/linux/device-mapper.h | 5 +++
> drivers/md/dm.c | 68 +++++++++++++++++++++++++++++++++++
> 2 files changed, 73 insertions(+)
>
> +/* Default implementation for targets that do not implement the callback */
> +static loff_t dm_blk_seek_hole_data_default(loff_t offset, int whence,
> + loff_t size)
> +{
> + switch (whence) {
> + case SEEK_DATA:
> + if ((unsigned long long)offset >= size)
> + return -ENXIO;
> + return offset;
> + case SEEK_HOLE:
> + if ((unsigned long long)offset >= size)
> + return -ENXIO;
> + return size;
These fail with -ENXIO if offset == size (matching what we do on files)...
> + default:
> + return -EINVAL;
> + }
> +}
> +
> +static loff_t dm_blk_do_seek_hole_data(struct dm_table *table, loff_t offset,
> + int whence)
> +{
> + struct dm_target *ti;
> + loff_t end;
> +
> + /* Loop when the end of a target is reached */
> + do {
> + ti = dm_table_find_target(table, offset >> SECTOR_SHIFT);
> + if (!ti)
> + return whence == SEEK_DATA ? -ENXIO : offset;
..but this blindly returns offset for SEEK_HOLE, even when offset is
beyond the end of the dm. I think you want 'return -ENXIO;'
unconditionally here.
> +
> + end = (ti->begin + ti->len) << SECTOR_SHIFT;
> +
> + if (ti->type->seek_hole_data)
> + offset = ti->type->seek_hole_data(ti, offset, whence);
Are we guaranteed that ti->type->seek_hole_data will not return a
value exceeding end? Or can dm be used to truncate the view of an
underlying device, and the underlying seek_hold_data can now return an
answer beyond where dm_table_find_target should look for the next part
of the dm's view?
In which case, should the blkdev_seek_hole_data callback be passed a
max size parameter everywhere, similar to how fixed_size_llseek does
things?
> + else
> + offset = dm_blk_seek_hole_data_default(offset, whence, end);
> +
> + if (whence == SEEK_DATA && offset == -ENXIO)
> + offset = end;
You have a bug here. If I have a dm contructed of two underlying targets:
|A |B |
and A is all data, then whence == SEEK_HOLE will have offset = -ENXIO
at this point, and you fail to check whether B is also data. That is,
you have silently treated the rest of the block device as data, which
is semantically not wrong (as that is always a safe fallback), but not
optimal.
I think the correct logic is s/whence == SEEK_DATA &&//.
> + } while (offset == end);
I'm trying to make sure that we can never return the equivalent of
lseek(dm, 0, SEEK_END). If you make my above suggested changes, we
will iterate through the do loop once more at EOF, and
dm_table_find_target() will then fail to match at which point we do
get the desired -ENXIO for both SEEK_HOLE and SEEK_DATA.
> +
> + return offset;
> +}
> +
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:07PM -0400, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> drivers/md/dm-linear.c | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 2d3e186ca87e3..9b6cdfa4f951d 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -147,6 +147,30 @@ static int linear_report_zones(struct dm_target *ti,
> #define linear_report_zones NULL
> #endif
>
> +static loff_t linear_seek_hole_data(struct dm_target *ti, loff_t offset,
> + int whence)
> +{
> + struct linear_c *lc = ti->private;
> + loff_t ti_begin = ti->begin << SECTOR_SHIFT;
> + loff_t ti_len = ti->len << SECTOR_SHIFT;
> + loff_t bdev_start = lc->start << SECTOR_SHIFT;
> + loff_t bdev_offset;
Okay, given my questions in 4/9, it looks like your intent is that
each callback for dm_seek_hole_data will obey its own ti-> limits.
> +
> + /* TODO underflow/overflow? */
> + bdev_offset = offset - ti_begin + bdev_start;
> +
> + bdev_offset = blkdev_seek_hole_data(lc->dev->bdev, bdev_offset,
> + whence);
> + if (bdev_offset < 0)
> + return bdev_offset;
> +
> + offset = bdev_offset - bdev_start;
> + if (offset >= ti_len)
> + return whence == SEEK_DATA ? -ENXIO : ti_begin + ti_len;
However, this is inconsistent with dm_blk_seek_hole_data_default in
4/9; I think you want to unconditionally return -ENXIO here, and let
the caller figure out when to turn -ENXIO back into end to proceed
with the next ti in the list.
OR, you may want to document the semantics that dm_seek_hole_data
callbacks must NOT return -ENXIO, but always return ti_begin + ti_len
when the answer (either SEEK_HOLE or SEEK_END) did not lie within the
current ti - it is DIFFERENT than the semantics for
blkdev_seek_hole_data, but gets normalized back into the expected
-ENXIO answer when dm_blk_do_seek_hole_data finally advances past the
last ti.
At any rate, I know this is an RFC series, but it goes to show that
comments will be essential, whichever interface you decide for
callbacks to honor (both a guarantee that callbacks will only ever see
SEEK_HOLE/SEEK_DATA in bounds, because earlier points in the call
stack have filtered out out-of-bounds and SEEK_SET; and constraints on
what the return value(s) must be for the various callbacks, especially
if it is different from the eventual return value of the overall
llseek syscall)
> +
> + return offset + ti_begin;
> +}
> +
> static int linear_iterate_devices(struct dm_target *ti,
> iterate_devices_callout_fn fn, void *data)
> {
> @@ -212,6 +236,7 @@ static struct target_type linear_target = {
> .direct_access = linear_dax_direct_access,
> .dax_zero_page_range = linear_dax_zero_page_range,
> .dax_recovery_write = linear_dax_recovery_write,
> + .seek_hole_data = linear_seek_hole_data,
> };
>
> int __init dm_linear_init(void)
> --
> 2.44.0
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:08PM -0400, Stefan Hajnoczi wrote:
> The dm-linear linear target passes through SEEK_HOLE/SEEK_DATA. Extend
> the test case to check that the same holes/data are reported as for the
> underlying file.
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> tools/testing/selftests/block_seek_hole/test.py | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/block_seek_hole/test.py b/tools/testing/selftests/block_seek_hole/test.py
> index 4f7c2d01ab3d3..6360b72aee338 100755
> --- a/tools/testing/selftests/block_seek_hole/test.py
> +++ b/tools/testing/selftests/block_seek_hole/test.py
> @@ -45,6 +45,20 @@ def loop_device(file_path):
> finally:
> run(['losetup', '-d', loop_path])
>
> +@contextmanager
> +def dm_linear(file_path):
> + file_size = os.path.getsize(file_path)
> +
> + with loop_device(file_path) as loop_path:
> + dm_name = f'test-{os.getpid()}'
> + run(['dmsetup', 'create', dm_name, '--table',
> + f'0 {file_size // 512} linear {loop_path} 0'])
Would it be worth tryiing to create the dm with two copies of
loop_path concatenated one after the other? You'd have to do more
work on expected output (coalescing adjacent data or holes between the
tail of the first copy and the head of the second), but without that
in place, I worry that you are missing logic bugs for when there is
more than one table in the overall dm (as evidenced by my review in
4/9).
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:09PM -0400, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> Open issues:
> - Locking?
> - thin_seek_hole_data() does not run as a bio or request. This patch
> assumes dm_thin_find_mapped_range() synchronously performs I/O if
> metadata needs to be loaded from disk. Is that a valid assumption?
> ---
> drivers/md/dm-thin.c | 77 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 77 insertions(+)
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 4793ad2aa1f7e..3c5dc4f0fe8a3 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -4501,6 +4501,82 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> }
> }
>
> +static dm_block_t loff_to_block(struct pool *pool, loff_t offset)
> +{
> + sector_t offset_sectors = offset >> SECTOR_SHIFT;
> + dm_block_t ret;
> +
> + if (block_size_is_power_of_two(pool))
> + ret = offset_sectors >> pool->sectors_per_block_shift;
> + else {
> + ret = offset_sectors;
> + (void) sector_div(ret, pool->sectors_per_block);
> + }
> +
> + return ret;
> +}
> +
> +static loff_t block_to_loff(struct pool *pool, dm_block_t block)
> +{
> + return block_to_sectors(pool, block) << SECTOR_SHIFT;
> +}
> +
> +static loff_t thin_seek_hole_data(struct dm_target *ti, loff_t offset,
> + int whence)
> +{
> + struct thin_c *tc = ti->private;
> + struct dm_thin_device *td = tc->td;
> + struct pool *pool = tc->pool;
> + dm_block_t begin;
> + dm_block_t end;
> + dm_block_t mapped_begin;
> + dm_block_t mapped_end;
> + dm_block_t pool_begin;
> + bool maybe_shared;
> + int ret;
> +
> + /* TODO locking? */
> +
> + if (block_size_is_power_of_two(pool))
> + end = ti->len >> pool->sectors_per_block_shift;
> + else {
> + end = ti->len;
> + (void) sector_div(end, pool->sectors_per_block);
> + }
> +
> + offset -= ti->begin << SECTOR_SHIFT;
> +
> + while (true) {
> + begin = loff_to_block(pool, offset);
> + ret = dm_thin_find_mapped_range(td, begin, end,
> + &mapped_begin, &mapped_end,
> + &pool_begin, &maybe_shared);
> + if (ret == -ENODATA) {
> + if (whence == SEEK_DATA)
> + return -ENXIO;
> + break;
> + } else if (ret < 0) {
> + /* TODO handle EWOULDBLOCK? */
> + return -ENXIO;
This should probably be -EIO, not -ENXIO.
> + }
> +
> + /* SEEK_DATA finishes here... */
> + if (whence == SEEK_DATA) {
> + if (mapped_begin != begin)
> + offset = block_to_loff(pool, mapped_begin);
> + break;
> + }
> +
> + /* ...while SEEK_HOLE may need to look further */
> + if (mapped_begin != begin)
> + break; /* offset is in a hole */
> +
> + offset = block_to_loff(pool, mapped_end);
> + }
> +
> + return offset + (ti->begin << SECTOR_SHIFT);
It's hard to follow, but I'm fairly certain that if whence ==
SEEK_HOLE, you end up returning ti->begin + ti->len instead of -ENXIO
if the range from begin to end is fully mapped; which is inconsistent
with the semantics you have in 4/9 (although in 6/9 I argue that
having all of the dm callbacks return ti->begin + ti->len instead of
-ENXIO might make logic easier for iterating through consecutive ti,
and then convert to -ENXIO only in the caller).
> +}
> +
> static struct target_type thin_target = {
> .name = "thin",
> .version = {1, 23, 0},
> @@ -4515,6 +4591,7 @@ static struct target_type thin_target = {
> .status = thin_status,
> .iterate_devices = thin_iterate_devices,
> .io_hints = thin_io_hints,
> + .seek_hole_data = thin_seek_hole_data,
> };
>
> /*----------------------------------------------------------------*/
> --
> 2.44.0
>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 04:39:04PM -0400, Stefan Hajnoczi wrote:
> Run the tests with:
>
> $ make TARGETS=block_seek_hole -C tools/selftests run_tests
>
> Signed-off-by: Stefan Hajnoczi <[email protected]>
> ---
> tools/testing/selftests/Makefile | 1 +
> .../selftests/block_seek_hole/Makefile | 17 +++
> .../testing/selftests/block_seek_hole/config | 1 +
> .../selftests/block_seek_hole/map_holes.py | 37 +++++++
> .../testing/selftests/block_seek_hole/test.py | 103 ++++++++++++++++++
> 5 files changed, 159 insertions(+)
> create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
> create mode 100644 tools/testing/selftests/block_seek_hole/config
> create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
> create mode 100755 tools/testing/selftests/block_seek_hole/test.py
>
> +
> +def map_holes(fd):
> + end = os.lseek(fd, 0, os.SEEK_END)
> + offset = 0
> +
> + print('TYPE START END SIZE')
> +
> + while offset < end:
> + contents = 'DATA'
> + new_offset = os.lseek(fd, offset, os.SEEK_HOLE)
> + if new_offset == offset:
> + contents = 'HOLE'
> + try:
> + new_offset = os.lseek(fd, offset, os.SEEK_DATA)
> + except OSError as err:
> + if err.errno == errno.ENXIO:
> + new_offset = end
> + else:
> + raise err
> + assert new_offset != offset
> + print(f'{contents} {offset} {new_offset} {new_offset - offset}')
> + offset = new_offset
Over the years, I've seen various SEEK_HOLE implementation bugs where
things work great on the initial boundary, but fail when requested on
an offset not aligned to the start of the extent boundary. It would
probably be worth enhancing the test to prove that:
if lseek(fd, offset, SEEK_HOLE) == offset:
new_offset = lseek(fd, offset, SEEK_DATA)
assert new_offset > offset
assert lseek(fd, new_offset - 1, SEEK_HOLE) == new_offset - 1
else:
assert lseek(fd, offset, SEEK_DATA) == offset
new_offset = lseek(fd, offset, SEEK_HOLE)
assert new_offset > offset
assert lseek(fd, new_offset - 1, SEEK_DATA) == new_offset - 1
Among other things, this would prove that even though block devices
generally operate on a minimum granularity of a sector, lseek() still
gives byte-accurate results for a random offset that falls in the
middle of a sector, and doesn't accidentally round down reporting an
offset less than the value passed in to the request.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 07:00:26PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:03PM -0400, Stefan Hajnoczi wrote:
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > Open issues:
> > - The file offset is updated on both the blkdev file and the backing
> > file. Is there a way to avoid updating the backing file offset so the
> > file opened by userspace is not affected?
> > - Should this run in the worker or use the cgroups?
> > ---
> > drivers/block/loop.c | 36 ++++++++++++++++++++++++++++++------
> > 1 file changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 28a95fd366fea..6a89375de82e8 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -750,6 +750,29 @@ static void loop_sysfs_exit(struct loop_device *lo)
> > &loop_attribute_group);
> > }
> >
> > +static loff_t lo_seek_hole_data(struct block_device *bdev, loff_t offset,
> > + int whence)
> > +{
> > + /* TODO need to activate cgroups or use worker? */
> > + /* TODO locking? */
> > + struct loop_device *lo = bdev->bd_disk->private_data;
> > + struct file *file = lo->lo_backing_file;
> > +
> > + if (lo->lo_offset > 0)
> > + offset += lo->lo_offset; /* TODO underflow/overflow? */
> > +
> > + /* TODO backing file offset is modified! */
> > + offset = vfs_llseek(file, offset, whence);
>
> Not only did you modify the underlying offset...
>
> > + if (offset < 0)
> > + return offset;
> > +
> > + if (lo->lo_offset > 0)
> > + offset -= lo->lo_offset; /* TODO underflow/overflow? */
> > + if (lo->lo_sizelimit > 0 && offset > lo->lo_sizelimit)
> > + offset = lo->lo_sizelimit;
>
> ...but if this code kicks in, you have clamped the return result to
> EOF of the loop device while leaving the underlying offset beyond the
> limit, which may mess up assumptions of other code expecting the loop
> to always have an in-bounds offset for the underlying file (offhand, I
> don't know if there is any such code; but since loop_ctl_fops.llseek =
> noop_lseek, there may be code relying on an even-tighter restriction
> that the offset of the underlying file never changes, not even within
> bounds).
I don't think anything relies on the file offset. Requests coming from
the block device contain their own offset (which may have been based on
the block device file's offset, but not the backing file's offset).
> Furthermore, this is inconsistent with all other seek-beyond-end code
> that fails with -ENXIO instead of returning size.
You're right, in the SEEK_DATA case the return value should be -ENXIO.
The SEEK_HOLE case is correct with lo_sizelimit. There is also an
off-by-one error in the comparison.
It should be:
if (lo->lo_sizelimit > 0 && offset >= lo->lo_sizelimit) {
if (whence == SEEK_DATA)
offset = -ENXIO;
else
offset = lo->lo_sizelimit;
}
> But for an RFC, the idea of being able to seek to HOLEs in a loop
> device is awesome!
>
> > @@ -2140,7 +2164,7 @@ static int loop_control_remove(int idx)
> > pr_warn_once("deleting an unspecified loop device is not supported.\n");
> > return -EINVAL;
> > }
> > -
> > +
> > /* Hide this loop device for serialization. */
> > ret = mutex_lock_killable(&loop_ctl_mutex);
> > if (ret)
>
> Unrelated whitespace change?
Yes, I'll drop it.
Stefan
On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> In the block device world there are similar concepts to holes:
> - SCSI has Logical Block Provisioning where the "mapped" state would be
> considered data and other states would be considered holes.
But for SCSI (and ATA and NVMe) unmapped/delallocated/etc blocks do
not have to return zeroes. They could also return some other
initialization pattern pattern. So they are (unfortunately) not a 1:1
mapping to holes in sparse files.
On Tue, Apr 02, 2024 at 02:26:17PM +0200, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> > considered data and other states would be considered holes.
>
> But for SCSI (and ATA and NVMe) unmapped/delallocated/etc blocks do
> not have to return zeroes. They could also return some other
> initialization pattern pattern. So they are (unfortunately) not a 1:1
> mapping to holes in sparse files.
Hi Christoph,
There is a 1:1 mapping when when the Logical Block Provisioning Read
Zeroes (LBPRZ) field is set to xx1b in the Logical Block Provisioning
VPD page.
Otherwise SEEK_HOLE/SEEK_DATA has to treat the device as filled with
data because it doesn't know where the holes are.
Stefan
On Tue, Apr 02, 2024 at 02:26:17PM +0200, Christoph Hellwig wrote:
> On Thu, Mar 28, 2024 at 04:39:01PM -0400, Stefan Hajnoczi wrote:
> > In the block device world there are similar concepts to holes:
> > - SCSI has Logical Block Provisioning where the "mapped" state would be
> > considered data and other states would be considered holes.
>
> But for SCSI (and ATA and NVMe) unmapped/delallocated/etc blocks do
> not have to return zeroes. They could also return some other
> initialization pattern pattern. So they are (unfortunately) not a 1:1
> mapping to holes in sparse files.
Yes, and Stefan already answered that:
https://lore.kernel.org/dm-devel/e2lcp3n5gpf7zmlpyn4nj7wsr36sffn23z5bmzlsghu6oapi5u@sdkcbpimi5is/t/#m58146a45951ec086966497e179a2b2715692712d
>> SCSI initiators can check the Logical Block Provisioning Read Zeroes
>> (LBPRZ) field to determine whether or not zeroes are guaranteed. The sd
>> driver would only rely on the device when LPBRZ indicates that zeroes
>> will be read. Otherwise the driver would report that the device is
>> filled with data.
As well as my question on whether the community would be open to
introducing new SEEK_* constants to allow orthogonality between
searching for zeroes (known to read as zero, whether or not it was
allocated) vs. sparseness (known to be unallocated, whether or not it
reads as zero), where the existing SEEK_HOLE seeks for both properties
at once.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Thu, Mar 28, 2024 at 07:11:30PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:04PM -0400, Stefan Hajnoczi wrote:
> > Run the tests with:
> >
> > $ make TARGETS=block_seek_hole -C tools/selftests run_tests
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > tools/testing/selftests/Makefile | 1 +
> > .../selftests/block_seek_hole/Makefile | 17 +++
> > .../testing/selftests/block_seek_hole/config | 1 +
> > .../selftests/block_seek_hole/map_holes.py | 37 +++++++
> > .../testing/selftests/block_seek_hole/test.py | 103 ++++++++++++++++++
> > 5 files changed, 159 insertions(+)
> > create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
> > create mode 100644 tools/testing/selftests/block_seek_hole/config
> > create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
> > create mode 100755 tools/testing/selftests/block_seek_hole/test.py
> >
> > +++ b/tools/testing/selftests/block_seek_hole/test.py
>
> > +
> > +# Different data layouts to test
> > +
> > +def data_at_beginning_and_end(f):
> > + f.write(b'A' * 4 * KB)
> > + f.seek(256 * MB)
> > +
> > + f.write(b'B' * 64 * KB)
> > +
> > + f.seek(1024 * MB - KB)
> > + f.write(b'C' * KB)
> > +
> > +def holes_at_beginning_and_end(f):
> > + f.seek(128 * MB)
> > + f.write(b'A' * 4 * KB)
> > +
> > + f.seek(512 * MB)
> > + f.write(b'B' * 64 * KB)
> > +
> > + f.truncate(1024 * MB)
> > +
> > +def no_holes(f):
> > + # Just 1 MB so test file generation is quick
> > + mb = b'A' * MB
> > + f.write(mb)
> > +
> > +def empty_file(f):
> > + f.truncate(1024 * MB)
>
> Is it also worth attempting to test a (necessarily sparse!) file
> larger than 2GiB to prove that we are 64-bit clean, even on a 32-bit
> system where lseek is different than lseek64? (I honestly have no
> idea if python always uses 64-bit seek even on 32-bit systems,
> although I would be surprised if it were not)
Yes, it wouldn't hurt to add a test case for that. I'll do that.
Stefan
On Fri, Mar 29, 2024 at 07:38:17AM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:04PM -0400, Stefan Hajnoczi wrote:
> > Run the tests with:
> >
> > $ make TARGETS=block_seek_hole -C tools/selftests run_tests
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > tools/testing/selftests/Makefile | 1 +
> > .../selftests/block_seek_hole/Makefile | 17 +++
> > .../testing/selftests/block_seek_hole/config | 1 +
> > .../selftests/block_seek_hole/map_holes.py | 37 +++++++
> > .../testing/selftests/block_seek_hole/test.py | 103 ++++++++++++++++++
> > 5 files changed, 159 insertions(+)
> > create mode 100644 tools/testing/selftests/block_seek_hole/Makefile
> > create mode 100644 tools/testing/selftests/block_seek_hole/config
> > create mode 100755 tools/testing/selftests/block_seek_hole/map_holes.py
> > create mode 100755 tools/testing/selftests/block_seek_hole/test.py
> >
>
> > +
> > +def map_holes(fd):
> > + end = os.lseek(fd, 0, os.SEEK_END)
> > + offset = 0
> > +
> > + print('TYPE START END SIZE')
> > +
> > + while offset < end:
> > + contents = 'DATA'
> > + new_offset = os.lseek(fd, offset, os.SEEK_HOLE)
> > + if new_offset == offset:
> > + contents = 'HOLE'
> > + try:
> > + new_offset = os.lseek(fd, offset, os.SEEK_DATA)
> > + except OSError as err:
> > + if err.errno == errno.ENXIO:
> > + new_offset = end
> > + else:
> > + raise err
> > + assert new_offset != offset
> > + print(f'{contents} {offset} {new_offset} {new_offset - offset}')
> > + offset = new_offset
>
> Over the years, I've seen various SEEK_HOLE implementation bugs where
> things work great on the initial boundary, but fail when requested on
> an offset not aligned to the start of the extent boundary. It would
> probably be worth enhancing the test to prove that:
>
> if lseek(fd, offset, SEEK_HOLE) == offset:
> new_offset = lseek(fd, offset, SEEK_DATA)
> assert new_offset > offset
> assert lseek(fd, new_offset - 1, SEEK_HOLE) == new_offset - 1
> else:
> assert lseek(fd, offset, SEEK_DATA) == offset
> new_offset = lseek(fd, offset, SEEK_HOLE)
> assert new_offset > offset
> assert lseek(fd, new_offset - 1, SEEK_DATA) == new_offset - 1
>
> Among other things, this would prove that even though block devices
> generally operate on a minimum granularity of a sector, lseek() still
> gives byte-accurate results for a random offset that falls in the
> middle of a sector, and doesn't accidentally round down reporting an
> offset less than the value passed in to the request.
Sure. I'll add a test for that.
Stefan
On Thu, Mar 28, 2024 at 07:54:21PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:07PM -0400, Stefan Hajnoczi wrote:
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > drivers/md/dm-linear.c | 25 +++++++++++++++++++++++++
> > 1 file changed, 25 insertions(+)
> >
> > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> > index 2d3e186ca87e3..9b6cdfa4f951d 100644
> > --- a/drivers/md/dm-linear.c
> > +++ b/drivers/md/dm-linear.c
> > @@ -147,6 +147,30 @@ static int linear_report_zones(struct dm_target *ti,
> > #define linear_report_zones NULL
> > #endif
> >
> > +static loff_t linear_seek_hole_data(struct dm_target *ti, loff_t offset,
> > + int whence)
> > +{
> > + struct linear_c *lc = ti->private;
> > + loff_t ti_begin = ti->begin << SECTOR_SHIFT;
> > + loff_t ti_len = ti->len << SECTOR_SHIFT;
> > + loff_t bdev_start = lc->start << SECTOR_SHIFT;
> > + loff_t bdev_offset;
>
> Okay, given my questions in 4/9, it looks like your intent is that
> each callback for dm_seek_hole_data will obey its own ti-> limits.
Yes, exactly.
>
> > +
> > + /* TODO underflow/overflow? */
> > + bdev_offset = offset - ti_begin + bdev_start;
> > +
> > + bdev_offset = blkdev_seek_hole_data(lc->dev->bdev, bdev_offset,
> > + whence);
> > + if (bdev_offset < 0)
> > + return bdev_offset;
> > +
> > + offset = bdev_offset - bdev_start;
> > + if (offset >= ti_len)
> > + return whence == SEEK_DATA ? -ENXIO : ti_begin + ti_len;
>
> However, this is inconsistent with dm_blk_seek_hole_data_default in
> 4/9; I think you want to unconditionally return -ENXIO here, and let
> the caller figure out when to turn -ENXIO back into end to proceed
> with the next ti in the list.
>
> OR, you may want to document the semantics that dm_seek_hole_data
> callbacks must NOT return -ENXIO, but always return ti_begin + ti_len
> when the answer (either SEEK_HOLE or SEEK_END) did not lie within the
> current ti - it is DIFFERENT than the semantics for
> blkdev_seek_hole_data, but gets normalized back into the expected
> -ENXIO answer when dm_blk_do_seek_hole_data finally advances past the
> last ti.
>
> At any rate, I know this is an RFC series, but it goes to show that
> comments will be essential, whichever interface you decide for
> callbacks to honor (both a guarantee that callbacks will only ever see
> SEEK_HOLE/SEEK_DATA in bounds, because earlier points in the call
> stack have filtered out out-of-bounds and SEEK_SET; and constraints on
> what the return value(s) must be for the various callbacks, especially
> if it is different from the eventual return value of the overall
> llseek syscall)
It's easier to understand the code when lseek function has the same
semantics, so I'd rather not customize the semantics for certain lseek
functions.
I'll make sure that the device-mapper targets match the
dm_blk_seek_hole_data_default() behavior. To be honest, I relied on dm.c
always passing offset values that are within the target, but that in
itself is customizing the semantics :).
On Thu, Mar 28, 2024 at 07:38:20PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:05PM -0400, Stefan Hajnoczi wrote:
> > Delegate SEEK_HOLE/SEEK_DATA to device-mapper targets. The new
> > dm_seek_hole_data() callback allows target types to customize behavior.
> > The default implementation treats the target as all data with no holes.
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > include/linux/device-mapper.h | 5 +++
> > drivers/md/dm.c | 68 +++++++++++++++++++++++++++++++++++
> > 2 files changed, 73 insertions(+)
> >
>
> > +/* Default implementation for targets that do not implement the callback */
> > +static loff_t dm_blk_seek_hole_data_default(loff_t offset, int whence,
> > + loff_t size)
> > +{
> > + switch (whence) {
> > + case SEEK_DATA:
> > + if ((unsigned long long)offset >= size)
> > + return -ENXIO;
> > + return offset;
> > + case SEEK_HOLE:
> > + if ((unsigned long long)offset >= size)
> > + return -ENXIO;
> > + return size;
>
> These fail with -ENXIO if offset == size (matching what we do on files)...
>
> > + default:
> > + return -EINVAL;
> > + }
> > +}
> > +
> > +static loff_t dm_blk_do_seek_hole_data(struct dm_table *table, loff_t offset,
> > + int whence)
> > +{
> > + struct dm_target *ti;
> > + loff_t end;
> > +
> > + /* Loop when the end of a target is reached */
> > + do {
> > + ti = dm_table_find_target(table, offset >> SECTOR_SHIFT);
> > + if (!ti)
> > + return whence == SEEK_DATA ? -ENXIO : offset;
>
> ...but this blindly returns offset for SEEK_HOLE, even when offset is
> beyond the end of the dm. I think you want 'return -ENXIO;'
> unconditionally here.
If the initial offset is beyond the end of the table, then SEEK_HOLE
should return -ENXIO. I agree that the code doesn't handle this case.
However, returning offset here is correct when there is data at the end
with SEEK_HOLE.
I'll update the code to address the out-of-bounds offset case, perhaps
by checking the initial offset before entering the loop.
>
> > +
> > + end = (ti->begin + ti->len) << SECTOR_SHIFT;
> > +
> > + if (ti->type->seek_hole_data)
> > + offset = ti->type->seek_hole_data(ti, offset, whence);
>
> Are we guaranteed that ti->type->seek_hole_data will not return a
> value exceeding end? Or can dm be used to truncate the view of an
> underlying device, and the underlying seek_hold_data can now return an
> answer beyond where dm_table_find_target should look for the next part
> of the dm's view?
ti->type->seek_hole_data() must not return a value larger than
(ti->begin + ti->len) << SECTOR_SHIFT.
>
> In which case, should the blkdev_seek_hole_data callback be passed a
> max size parameter everywhere, similar to how fixed_size_llseek does
> things?
>
> > + else
> > + offset = dm_blk_seek_hole_data_default(offset, whence, end);
> > +
> > + if (whence == SEEK_DATA && offset == -ENXIO)
> > + offset = end;
>
> You have a bug here. If I have a dm contructed of two underlying targets:
>
> |A |B |
>
> and A is all data, then whence == SEEK_HOLE will have offset = -ENXIO
> at this point, and you fail to check whether B is also data. That is,
> you have silently treated the rest of the block device as data, which
> is semantically not wrong (as that is always a safe fallback), but not
> optimal.
>
> I think the correct logic is s/whence == SEEK_DATA &&//.
No, with whence == SEEK_HOLE and an initial offset in A, the new offset
will be (A->begin + A->end) << SECTOR_SHIFT. The loop will iterate and
continue seeking into B.
The if statement you commented on ensures that we also continue looping
with whence == SEEK_DATA, because that would otherwise prematurely end
with the new offset = -ENXIO.
>
> > + } while (offset == end);
>
> I'm trying to make sure that we can never return the equivalent of
> lseek(dm, 0, SEEK_END). If you make my above suggested changes, we
> will iterate through the do loop once more at EOF, and
> dm_table_find_target() will then fail to match at which point we do
> get the desired -ENXIO for both SEEK_HOLE and SEEK_DATA.
Wait, lseek() is supposed to return the equivalent of lseek(dm, 0,
SEEK_END) when whence == SEEK_HOLE and there is data at the end.
>
> > +
> > + return offset;
> > +}
> > +
>
> --
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.
> Virtualization: qemu.org | libguestfs.org
>
On Thu, Mar 28, 2024 at 07:59:14PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:08PM -0400, Stefan Hajnoczi wrote:
> > The dm-linear linear target passes through SEEK_HOLE/SEEK_DATA. Extend
> > the test case to check that the same holes/data are reported as for the
> > underlying file.
> >
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > tools/testing/selftests/block_seek_hole/test.py | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/block_seek_hole/test.py b/tools/testing/selftests/block_seek_hole/test.py
> > index 4f7c2d01ab3d3..6360b72aee338 100755
> > --- a/tools/testing/selftests/block_seek_hole/test.py
> > +++ b/tools/testing/selftests/block_seek_hole/test.py
> > @@ -45,6 +45,20 @@ def loop_device(file_path):
> > finally:
> > run(['losetup', '-d', loop_path])
> >
> > +@contextmanager
> > +def dm_linear(file_path):
> > + file_size = os.path.getsize(file_path)
> > +
> > + with loop_device(file_path) as loop_path:
> > + dm_name = f'test-{os.getpid()}'
> > + run(['dmsetup', 'create', dm_name, '--table',
> > + f'0 {file_size // 512} linear {loop_path} 0'])
>
> Would it be worth tryiing to create the dm with two copies of
> loop_path concatenated one after the other? You'd have to do more
> work on expected output (coalescing adjacent data or holes between the
> tail of the first copy and the head of the second), but without that
> in place, I worry that you are missing logic bugs for when there is
> more than one table in the overall dm (as evidenced by my review in
> 4/9).
Yes, I agree that more tests are needed to cover transitions between
adjacent targets.
Stefan
On Thu, Mar 28, 2024 at 08:31:21PM -0500, Eric Blake wrote:
> On Thu, Mar 28, 2024 at 04:39:09PM -0400, Stefan Hajnoczi wrote:
> > Signed-off-by: Stefan Hajnoczi <[email protected]>
> > ---
> > Open issues:
> > - Locking?
> > - thin_seek_hole_data() does not run as a bio or request. This patch
> > assumes dm_thin_find_mapped_range() synchronously performs I/O if
> > metadata needs to be loaded from disk. Is that a valid assumption?
> > ---
> > drivers/md/dm-thin.c | 77 ++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 77 insertions(+)
> >
> > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> > index 4793ad2aa1f7e..3c5dc4f0fe8a3 100644
> > --- a/drivers/md/dm-thin.c
> > +++ b/drivers/md/dm-thin.c
> > @@ -4501,6 +4501,82 @@ static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
> > }
> > }
> >
> > +static dm_block_t loff_to_block(struct pool *pool, loff_t offset)
> > +{
> > + sector_t offset_sectors = offset >> SECTOR_SHIFT;
> > + dm_block_t ret;
> > +
> > + if (block_size_is_power_of_two(pool))
> > + ret = offset_sectors >> pool->sectors_per_block_shift;
> > + else {
> > + ret = offset_sectors;
> > + (void) sector_div(ret, pool->sectors_per_block);
> > + }
> > +
> > + return ret;
> > +}
> > +
> > +static loff_t block_to_loff(struct pool *pool, dm_block_t block)
> > +{
> > + return block_to_sectors(pool, block) << SECTOR_SHIFT;
> > +}
> > +
> > +static loff_t thin_seek_hole_data(struct dm_target *ti, loff_t offset,
> > + int whence)
> > +{
> > + struct thin_c *tc = ti->private;
> > + struct dm_thin_device *td = tc->td;
> > + struct pool *pool = tc->pool;
> > + dm_block_t begin;
> > + dm_block_t end;
> > + dm_block_t mapped_begin;
> > + dm_block_t mapped_end;
> > + dm_block_t pool_begin;
> > + bool maybe_shared;
> > + int ret;
> > +
> > + /* TODO locking? */
> > +
> > + if (block_size_is_power_of_two(pool))
> > + end = ti->len >> pool->sectors_per_block_shift;
> > + else {
> > + end = ti->len;
> > + (void) sector_div(end, pool->sectors_per_block);
> > + }
> > +
> > + offset -= ti->begin << SECTOR_SHIFT;
> > +
> > + while (true) {
> > + begin = loff_to_block(pool, offset);
> > + ret = dm_thin_find_mapped_range(td, begin, end,
> > + &mapped_begin, &mapped_end,
> > + &pool_begin, &maybe_shared);
> > + if (ret == -ENODATA) {
> > + if (whence == SEEK_DATA)
> > + return -ENXIO;
> > + break;
> > + } else if (ret < 0) {
> > + /* TODO handle EWOULDBLOCK? */
> > + return -ENXIO;
>
> This should probably be -EIO, not -ENXIO.
Yes. XFS also returns -EIO, so I guess it's okay to do so.
I still need to get to the bottom of whether calling
dm_thin_find_mapped_range() is sane here and what to do when/if it
returns EWOULDBLOCK.
> > + }
> > +
> > + /* SEEK_DATA finishes here... */
> > + if (whence == SEEK_DATA) {
> > + if (mapped_begin != begin)
> > + offset = block_to_loff(pool, mapped_begin);
> > + break;
> > + }
> > +
> > + /* ...while SEEK_HOLE may need to look further */
> > + if (mapped_begin != begin)
> > + break; /* offset is in a hole */
> > +
> > + offset = block_to_loff(pool, mapped_end);
> > + }
> > +
> > + return offset + (ti->begin << SECTOR_SHIFT);
>
> It's hard to follow, but I'm fairly certain that if whence ==
> SEEK_HOLE, you end up returning ti->begin + ti->len instead of -ENXIO
> if the range from begin to end is fully mapped; which is inconsistent
> with the semantics you have in 4/9 (although in 6/9 I argue that
> having all of the dm callbacks return ti->begin + ti->len instead of
> -ENXIO might make logic easier for iterating through consecutive ti,
> and then convert to -ENXIO only in the caller).
Returning (ti->begin + ti->len) << SECTOR_SHIFT for SEEK_HOLE when there
is data at the end of the target is intentional. This matches the
semantics of lseek().
I agree there is adjustment necessary in dm.c, but I want to seek the
semantics of all lseek() functions identical to avoid confusion.
Stefan
On Wed, Apr 03, 2024 at 10:11:47AM -0400, Stefan Hajnoczi wrote:
..
> > > +static loff_t dm_blk_do_seek_hole_data(struct dm_table *table, loff_t offset,
> > > + int whence)
> > > +{
> > > + struct dm_target *ti;
> > > + loff_t end;
> > > +
> > > + /* Loop when the end of a target is reached */
> > > + do {
> > > + ti = dm_table_find_target(table, offset >> SECTOR_SHIFT);
> > > + if (!ti)
> > > + return whence == SEEK_DATA ? -ENXIO : offset;
> >
> > ...but this blindly returns offset for SEEK_HOLE, even when offset is
> > beyond the end of the dm. I think you want 'return -ENXIO;'
> > unconditionally here.
>
> If the initial offset is beyond the end of the table, then SEEK_HOLE
> should return -ENXIO. I agree that the code doesn't handle this case.
>
> However, returning offset here is correct when there is data at the end
> with SEEK_HOLE.
>
> I'll update the code to address the out-of-bounds offset case, perhaps
> by checking the initial offset before entering the loop.
You are correct that if we are on the second loop iteration of
SEEK_HOLE (because the first iteration saw all data), then we have
found the offset of the start of a hole and should return that offset,
not -ENXIO. This may be a case where we just have to be careful on
whether the initial pass might have any corner cases different from
later times through the loop, and that we end the loop with correct
results for both SEEK_HOLE and SEEK_DATA.
>
> >
> > > +
> > > + end = (ti->begin + ti->len) << SECTOR_SHIFT;
> > > +
> > > + if (ti->type->seek_hole_data)
> > > + offset = ti->type->seek_hole_data(ti, offset, whence);
> >
> > Are we guaranteed that ti->type->seek_hole_data will not return a
> > value exceeding end? Or can dm be used to truncate the view of an
> > underlying device, and the underlying seek_hold_data can now return an
> > answer beyond where dm_table_find_target should look for the next part
> > of the dm's view?
>
> ti->type->seek_hole_data() must not return a value larger than
> (ti->begin + ti->len) << SECTOR_SHIFT.
Worth adding as documentation then.
>
> >
> > In which case, should the blkdev_seek_hole_data callback be passed a
> > max size parameter everywhere, similar to how fixed_size_llseek does
> > things?
> >
> > > + else
> > > + offset = dm_blk_seek_hole_data_default(offset, whence, end);
> > > +
> > > + if (whence == SEEK_DATA && offset == -ENXIO)
> > > + offset = end;
> >
> > You have a bug here. If I have a dm contructed of two underlying targets:
> >
> > |A |B |
> >
> > and A is all data, then whence == SEEK_HOLE will have offset = -ENXIO
> > at this point, and you fail to check whether B is also data. That is,
> > you have silently treated the rest of the block device as data, which
> > is semantically not wrong (as that is always a safe fallback), but not
> > optimal.
> >
> > I think the correct logic is s/whence == SEEK_DATA &&//.
>
> No, with whence == SEEK_HOLE and an initial offset in A, the new offset
> will be (A->begin + A->end) << SECTOR_SHIFT. The loop will iterate and
> continue seeking into B.
>
> The if statement you commented on ensures that we also continue looping
> with whence == SEEK_DATA, because that would otherwise prematurely end
> with the new offset = -ENXIO.
>
> >
> > > + } while (offset == end);
> >
> > I'm trying to make sure that we can never return the equivalent of
> > lseek(dm, 0, SEEK_END). If you make my above suggested changes, we
> > will iterate through the do loop once more at EOF, and
> > dm_table_find_target() will then fail to match at which point we do
> > get the desired -ENXIO for both SEEK_HOLE and SEEK_DATA.
>
> Wait, lseek() is supposed to return the equivalent of lseek(dm, 0,
> SEEK_END) when whence == SEEK_HOLE and there is data at the end.
It was confusing enough for me to write my initial review, I apologize
if I'm making it harder for you. Yes, we want to ensure that:
off1 = lseek(fd, -1, SEEK_END);
off2 = off1 + 1; // == lseek(fd, 0, SEEK_END)
if off1 belongs to a data extent:
- lseek(fd, off1, SEEK_DATA) == off1
- lseek(fd, off1, SEEK_HOLE) == off2
- lseek(fd, off2, SEEK_DATA) == -ENXIO
- lseek(fd, off2, SEEK_HOLE) == -ENXIO
if off1 belongs to a hole:
- lseek(fd, off1, SEEK_DATA) == -ENXIO
- lseek(fd, off1, SEEK_HOLE) == off1
- lseek(fd, off2, SEEK_DATA) == -ENXIO
- lseek(fd, off2, SEEK_HOLE) == -ENXIO
Anything in my wall of text from the earlier message inconsistent with
this table can be ignored; but at the same time, I was not able to
quickly convince myself that your code properly had those properties,
even after writing up the table.
Reiterating what I said elsewhere, it may be smarter to document that
for callbacks, it is wiser to require intermediate behavior that the
input value 'offset' is always between the half-open range
[ti->begin<<SECTOR_SHIFT, (ti->begin+ti->len)<<SECTOR_SHIFT), and on
success, the output must be in the fully-closed range [offset,
(ti->begin+ti->len)<<SECTOR_SHIFT], errors like -EIO are permitted but
-ENXIO should not be returned; and let the caller worry about
synthesizing -ENXIO from that (since the caller knows whether or not
there is a successor ti where adjacency concerns come into play).
That is, we can never pass in off2 (beyond the bounds of the table),
and when passing in off1, I think this interface may be easier to work
with in the intermediate layers, even though it differs from the
lseek() interface above. For off1 in data:
- dm_blk_do_seek_hole_data(dm, off1, SEEK_DATA) == off1
- dm_blk_do_seek_hole_data(dm, off1, SEEK_HOLE) == off2
and for a hole:
- dm_blk_do_seek_hole_data(dm, off1, SEEK_DATA) == off2
- dm_blk_do_seek_hole_data(dm, off1, SEEK_HOLE) == off1
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Wed, Apr 03, 2024 at 12:02:19PM -0500, Eric Blake wrote:
> On Wed, Apr 03, 2024 at 10:11:47AM -0400, Stefan Hajnoczi wrote:
> ...
> > > > +static loff_t dm_blk_do_seek_hole_data(struct dm_table *table, loff_t offset,
> > > > + int whence)
> > > > +{
> > > > + struct dm_target *ti;
> > > > + loff_t end;
> > > > +
> > > > + /* Loop when the end of a target is reached */
> > > > + do {
> > > > + ti = dm_table_find_target(table, offset >> SECTOR_SHIFT);
> > > > + if (!ti)
> > > > + return whence == SEEK_DATA ? -ENXIO : offset;
> > >
> > > ...but this blindly returns offset for SEEK_HOLE, even when offset is
> > > beyond the end of the dm. I think you want 'return -ENXIO;'
> > > unconditionally here.
> >
> > If the initial offset is beyond the end of the table, then SEEK_HOLE
> > should return -ENXIO. I agree that the code doesn't handle this case.
> >
> > However, returning offset here is correct when there is data at the end
> > with SEEK_HOLE.
> >
> > I'll update the code to address the out-of-bounds offset case, perhaps
> > by checking the initial offset before entering the loop.
>
> You are correct that if we are on the second loop iteration of
> SEEK_HOLE (because the first iteration saw all data), then we have
> found the offset of the start of a hole and should return that offset,
> not -ENXIO. This may be a case where we just have to be careful on
> whether the initial pass might have any corner cases different from
> later times through the loop, and that we end the loop with correct
> results for both SEEK_HOLE and SEEK_DATA.
>
> >
> > >
> > > > +
> > > > + end = (ti->begin + ti->len) << SECTOR_SHIFT;
> > > > +
> > > > + if (ti->type->seek_hole_data)
> > > > + offset = ti->type->seek_hole_data(ti, offset, whence);
> > >
> > > Are we guaranteed that ti->type->seek_hole_data will not return a
> > > value exceeding end? Or can dm be used to truncate the view of an
> > > underlying device, and the underlying seek_hold_data can now return an
> > > answer beyond where dm_table_find_target should look for the next part
> > > of the dm's view?
> >
> > ti->type->seek_hole_data() must not return a value larger than
> > (ti->begin + ti->len) << SECTOR_SHIFT.
>
> Worth adding as documentation then.
>
> >
> > >
> > > In which case, should the blkdev_seek_hole_data callback be passed a
> > > max size parameter everywhere, similar to how fixed_size_llseek does
> > > things?
> > >
> > > > + else
> > > > + offset = dm_blk_seek_hole_data_default(offset, whence, end);
> > > > +
> > > > + if (whence == SEEK_DATA && offset == -ENXIO)
> > > > + offset = end;
> > >
> > > You have a bug here. If I have a dm contructed of two underlying targets:
> > >
> > > |A |B |
> > >
> > > and A is all data, then whence == SEEK_HOLE will have offset = -ENXIO
> > > at this point, and you fail to check whether B is also data. That is,
> > > you have silently treated the rest of the block device as data, which
> > > is semantically not wrong (as that is always a safe fallback), but not
> > > optimal.
> > >
> > > I think the correct logic is s/whence == SEEK_DATA &&//.
> >
> > No, with whence == SEEK_HOLE and an initial offset in A, the new offset
> > will be (A->begin + A->end) << SECTOR_SHIFT. The loop will iterate and
> > continue seeking into B.
> >
> > The if statement you commented on ensures that we also continue looping
> > with whence == SEEK_DATA, because that would otherwise prematurely end
> > with the new offset = -ENXIO.
> >
> > >
> > > > + } while (offset == end);
> > >
> > > I'm trying to make sure that we can never return the equivalent of
> > > lseek(dm, 0, SEEK_END). If you make my above suggested changes, we
> > > will iterate through the do loop once more at EOF, and
> > > dm_table_find_target() will then fail to match at which point we do
> > > get the desired -ENXIO for both SEEK_HOLE and SEEK_DATA.
> >
> > Wait, lseek() is supposed to return the equivalent of lseek(dm, 0,
> > SEEK_END) when whence == SEEK_HOLE and there is data at the end.
>
> It was confusing enough for me to write my initial review, I apologize
> if I'm making it harder for you.
No worries, if my code is hard to understand I can learn from your
feedback.
> Yes, we want to ensure that:
>
> off1 = lseek(fd, -1, SEEK_END);
> off2 = off1 + 1; // == lseek(fd, 0, SEEK_END)
>
> if off1 belongs to a data extent:
> - lseek(fd, off1, SEEK_DATA) == off1
> - lseek(fd, off1, SEEK_HOLE) == off2
> - lseek(fd, off2, SEEK_DATA) == -ENXIO
> - lseek(fd, off2, SEEK_HOLE) == -ENXIO
Agreed.
> if off1 belongs to a hole:
> - lseek(fd, off1, SEEK_DATA) == -ENXIO
> - lseek(fd, off1, SEEK_HOLE) == off1
> - lseek(fd, off2, SEEK_DATA) == -ENXIO
> - lseek(fd, off2, SEEK_HOLE) == -ENXIO
Agreed.
>
> Anything in my wall of text from the earlier message inconsistent with
> this table can be ignored; but at the same time, I was not able to
> quickly convince myself that your code properly had those properties,
> even after writing up the table.
>
> Reiterating what I said elsewhere, it may be smarter to document that
> for callbacks, it is wiser to require intermediate behavior that the
> input value 'offset' is always between the half-open range
> [ti->begin<<SECTOR_SHIFT, (ti->begin+ti->len)<<SECTOR_SHIFT), and on
> success, the output must be in the fully-closed range [offset,
> (ti->begin+ti->len)<<SECTOR_SHIFT], errors like -EIO are permitted but
> -ENXIO should not be returned; and let the caller worry about
> synthesizing -ENXIO from that (since the caller knows whether or not
> there is a successor ti where adjacency concerns come into play).
>
> That is, we can never pass in off2 (beyond the bounds of the table),
> and when passing in off1, I think this interface may be easier to work
> with in the intermediate layers, even though it differs from the
> lseek() interface above. For off1 in data:
> - dm_blk_do_seek_hole_data(dm, off1, SEEK_DATA) == off1
> - dm_blk_do_seek_hole_data(dm, off1, SEEK_HOLE) == off2
> and for a hole:
> - dm_blk_do_seek_hole_data(dm, off1, SEEK_DATA) == off2
> - dm_blk_do_seek_hole_data(dm, off1, SEEK_HOLE) == off1
I'll take a look again starting from block/fops.c, through dm.c, and
into dm-linear.c to see how to make things clearest. Although I would
like to have the same semantics for every seek function, maybe in the
end your suggestion will make the code clearer. Let's see.
Stefan
On Wed, Apr 03, 2024 at 01:58:38PM -0400, Stefan Hajnoczi wrote:
> > Yes, we want to ensure that:
> >
> > off1 = lseek(fd, -1, SEEK_END);
> > off2 = off1 + 1; // == lseek(fd, 0, SEEK_END)
> >
> > if off1 belongs to a data extent:
> > - lseek(fd, off1, SEEK_DATA) == off1
> > - lseek(fd, off1, SEEK_HOLE) == off2
> > - lseek(fd, off2, SEEK_DATA) == -ENXIO
> > - lseek(fd, off2, SEEK_HOLE) == -ENXIO
>
> Agreed.
>
> > if off1 belongs to a hole:
> > - lseek(fd, off1, SEEK_DATA) == -ENXIO
> > - lseek(fd, off1, SEEK_HOLE) == off1
> > - lseek(fd, off2, SEEK_DATA) == -ENXIO
> > - lseek(fd, off2, SEEK_HOLE) == -ENXIO
>
> Agreed.
>
..
> >
> > That is, we can never pass in off2 (beyond the bounds of the table),
> > and when passing in off1, I think this interface may be easier to work
> > with in the intermediate layers, even though it differs from the
> > lseek() interface above. For off1 in data:
> > - dm_blk_do_seek_hole_data(dm, off1, SEEK_DATA) == off1
> > - dm_blk_do_seek_hole_data(dm, off1, SEEK_HOLE) == off2
> > and for a hole:
> > - dm_blk_do_seek_hole_data(dm, off1, SEEK_DATA) == off2
> > - dm_blk_do_seek_hole_data(dm, off1, SEEK_HOLE) == off1
>
In the caller, we already need to know both off1 and off2 (that is,
the offset we are asking about, AND the offset at which the underlying
component ends its map at, since that is where we are then in charge
of whether to concatenate that with the next component or give the
overall result).
If we already guarantee that we never call into a lower layer with
off2 (because it was beyond bounds), then the only difference between
the two semantics is whether SEEK_DATA in a final hole returns -ENXIO
or off2; and it looks like we can do a conversion from either style
into the other.
So designing the caller logic, it looks like I would want:
if off1 >= EOF return -ENXIO (out of bounds)
while off1 < EOF:
determine off2 of current lower region
at this point, we are guaranteed off1 < off2
we also know that off2 < EOF unless we are on last lower region
call result=lower_layer(off1, SEEK_X)
it is a bug if result is non-negative but < off1, or if result > off2
if result == off1, return result (we are already in the right HOLE
or DATA)
if result > off1 and < off2, return result (we had to advance to get
to the right region, but the right region was within the lower
layer, and we don't need to inspect further layers)
Note - the above two paragraphs can be collapsed into one: if result
< off2, return result (the current subregion gave us an adequate
answer)
if result == off2, continue to the next region with off1=result (in
first semantics, this is only possible for SEEK_HOLE for a lower
region ending in data; in the second semantics, it is possible for
either SEEK_HOLE or SEEK_DATA for a lower region ending in the type
opposite of the request)
if result == -ENXIO, continue to the next region by using off1=off2
(only possible in the first semantics for SEEK_DATA on a lower
region ending in a hole)
any other result is necessarily a negative errno like -EIO; return it
if the loop completes without an early return, we are out of lower
regions, and we should be left with off1 == EOF. Because we advanced,
we know now to:
return whence == SEEK_HOLE ? off1 : -ENXIO
> I'll take a look again starting from block/fops.c, through dm.c, and
> into dm-linear.c to see how to make things clearest. Although I would
> like to have the same semantics for every seek function, maybe in the
> end your suggestion will make the code clearer. Let's see.
Keeping lseek semantics may require a couple more lines of code
duplicated at each layer, but less maintainer headaches in the long
run of converting between slightly different semantics depending on
which layer you are at.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org
On Tue, Apr 02, 2024 at 09:04:46AM -0400, Stefan Hajnoczi wrote:
> Hi Christoph,
> There is a 1:1 mapping when when the Logical Block Provisioning Read
> Zeroes (LBPRZ) field is set to xx1b in the Logical Block Provisioning
> VPD page.
Yes. NVMe also has a similar field, but ATA does not.
On Tue, Apr 02, 2024 at 08:31:09AM -0500, Eric Blake wrote:
> As well as my question on whether the community would be open to
> introducing new SEEK_* constants to allow orthogonality between
> searching for zeroes (known to read as zero, whether or not it was
> allocated) vs. sparseness (known to be unallocated, whether or not it
> reads as zero), where the existing SEEK_HOLE seeks for both properties
> at once.
That seems like quite an effort. Is is worth it?