2023-04-05 16:16:38

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v3 00/11] blksnap - block devices snapshots module

Hi Jens. Hi Christoph. Hi Jonathan. Hi Mike. Hi all.

I am happy to offer a modified version of the Block Devices Snapshots
Module. It allows to create non-persistent snapshots of any block devices.
The main purpose of such snapshots is to provide backups of block devices.
See more in Documentation/block/blksnap.rst.

The Block Device Filtering Mechanism is added to the block layer. This
allows to attach and detach block device filters to the block layer.
Filters allow to extend the functionality of the block layer.
See more in Documentation/block/blkfilter.rst.

The tool, library and tests for working with blksnap can be found on github.
Link: https://github.com/veeam/blksnap/tree/stable-v2.0

The v2 version was suggested at 9 December 2022.
Link: https://patchwork.kernel.org/project/linux-block/list/?series=703315&archive=both
Since then, in collaboration with Christoph, work was carried out to optimize
COW algorithms for snapshots, the algorithm for reading images of snapshots,
and the control interface was redesigned.

Changes:
- new block device I/O contols BLKFILTER_ATTACH and BLKFILTER_DETACH allow
to attach and detach filters
- new block device I/O contol BLKFILTER_CTL allow send command to attached
block device filter
- the copy-on-write algorithm for processing I/O units has been optimized and
has become asynchronous
- the snapshot image reading algorithm has been optimized and has become
asynchronous
- optimized the finite state machine for processing chunks
- fixed a tracking block size calculation bug.

The v1 version was suggested at 2 November 2022.
Link: https://patchwork.kernel.org/project/linux-block/list/?series=691286&archive=both
Since then, documentation has been added describing the filtering mechanism and
the snapshot module of block devices. Many thanks to Fabio Fantoni for his for
his participation in the "blksnap" project on github and Jonathan Corbet for
his article.
Link: https://lwn.net/Articles/914031/.

Changes:
- added documentation for Block Device Filtering Mechanism
- added documentation for Block Devices Snapshots Module (blksnap)
- the MAINTAINERS file has been updated
- optimized queue code for snapshot images
- fixed comments, log messages and code for better readability.

The first version was suggested at 13 June 2022.
Link: https://patchwork.kernel.org/project/linux-block/list/?series=649931&archive=both
Many thanks to Christoph Hellwig and Randy Dunlap for the review of that
version.

Changes:
- forgotten "static" declarations have been added
- the text of the comments has been corrected.
- it is possible to connect only one filter, since there are no others in
upstream.
- do not have additional locks for attach/detach filter.
- blksnap.h moved to include/uapi/.
- #pragma once and commented code removed.
- uuid_t removed from user API.
- removed default values for module parameters from the configuration file.
- the debugging code for tracking memory leaks has been removed.
- simplified Makefile.
- optimized work with large memory buffers, CBT tables are now in virtual
memory.
- the allocation code of minor numbers has been optimized.
- the implementation of the snapshot image block device has been
simplified, now it is a bio-based block device.
- removed initialization of global variables with null values.
- only one bio is used to copy one chunk.
- checked on ppc64le.

Sergei Shtepa (11):
documentation: Block Device Filtering Mechanism
block: Block Device Filtering Mechanism
documentation: Block Devices Snapshots Module
blksnap: header file of the module interface
blksnap: module management interface functions
blksnap: handling and tracking I/O units
blksnap: minimum data storage unit of the original block device
blksnap: difference storage
blksnap: event queue from the difference storage
blksnap: snapshot and snapshot image block device
blksnap: Kconfig and Makefile

Documentation/block/blkfilter.rst | 64 ++++
Documentation/block/blksnap.rst | 345 ++++++++++++++++++++
Documentation/block/index.rst | 2 +
MAINTAINERS | 17 +
block/Makefile | 2 +-
block/bdev.c | 1 +
block/blk-core.c | 40 ++-
block/blk-filter.c | 199 ++++++++++++
block/blk.h | 10 +
block/genhd.c | 2 +
block/ioctl.c | 7 +
block/partitions/core.c | 2 +
drivers/block/Kconfig | 2 +
drivers/block/Makefile | 2 +
drivers/block/blksnap/Kconfig | 12 +
drivers/block/blksnap/Makefile | 15 +
drivers/block/blksnap/cbt_map.c | 228 +++++++++++++
drivers/block/blksnap/cbt_map.h | 90 +++++
drivers/block/blksnap/chunk.c | 470 +++++++++++++++++++++++++++
drivers/block/blksnap/chunk.h | 106 ++++++
drivers/block/blksnap/diff_area.c | 440 +++++++++++++++++++++++++
drivers/block/blksnap/diff_area.h | 133 ++++++++
drivers/block/blksnap/diff_buffer.c | 127 ++++++++
drivers/block/blksnap/diff_buffer.h | 37 +++
drivers/block/blksnap/diff_storage.c | 329 +++++++++++++++++++
drivers/block/blksnap/diff_storage.h | 111 +++++++
drivers/block/blksnap/event_queue.c | 87 +++++
drivers/block/blksnap/event_queue.h | 64 ++++
drivers/block/blksnap/main.c | 428 ++++++++++++++++++++++++
drivers/block/blksnap/params.h | 16 +
drivers/block/blksnap/snapimage.c | 120 +++++++
drivers/block/blksnap/snapimage.h | 10 +
drivers/block/blksnap/snapshot.c | 433 ++++++++++++++++++++++++
drivers/block/blksnap/snapshot.h | 68 ++++
drivers/block/blksnap/tracker.c | 320 ++++++++++++++++++
drivers/block/blksnap/tracker.h | 71 ++++
include/linux/blk-filter.h | 51 +++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 1 +
include/uapi/linux/blk-filter.h | 35 ++
include/uapi/linux/blksnap.h | 421 ++++++++++++++++++++++++
include/uapi/linux/fs.h | 5 +
42 files changed, 4922 insertions(+), 3 deletions(-)
create mode 100644 Documentation/block/blkfilter.rst
create mode 100644 Documentation/block/blksnap.rst
create mode 100644 block/blk-filter.c
create mode 100644 drivers/block/blksnap/Kconfig
create mode 100644 drivers/block/blksnap/Makefile
create mode 100644 drivers/block/blksnap/cbt_map.c
create mode 100644 drivers/block/blksnap/cbt_map.h
create mode 100644 drivers/block/blksnap/chunk.c
create mode 100644 drivers/block/blksnap/chunk.h
create mode 100644 drivers/block/blksnap/diff_area.c
create mode 100644 drivers/block/blksnap/diff_area.h
create mode 100644 drivers/block/blksnap/diff_buffer.c
create mode 100644 drivers/block/blksnap/diff_buffer.h
create mode 100644 drivers/block/blksnap/diff_storage.c
create mode 100644 drivers/block/blksnap/diff_storage.h
create mode 100644 drivers/block/blksnap/event_queue.c
create mode 100644 drivers/block/blksnap/event_queue.h
create mode 100644 drivers/block/blksnap/main.c
create mode 100644 drivers/block/blksnap/params.h
create mode 100644 drivers/block/blksnap/snapimage.c
create mode 100644 drivers/block/blksnap/snapimage.h
create mode 100644 drivers/block/blksnap/snapshot.c
create mode 100644 drivers/block/blksnap/snapshot.h
create mode 100644 drivers/block/blksnap/tracker.c
create mode 100644 drivers/block/blksnap/tracker.h
create mode 100644 include/linux/blk-filter.h
create mode 100644 include/uapi/linux/blk-filter.h
create mode 100644 include/uapi/linux/blksnap.h

--
2.20.1


2023-04-05 16:18:47

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v3 03/11] documentation: Block Devices Snapshots Module

The document contains:
* Describes the purpose of the mechanism
* Description of features
* Description of algorithms
* Recommendations about using the module from the user-space side
* Reference to module interface description

Signed-off-by: Sergei Shtepa <[email protected]>
---
Documentation/block/blksnap.rst | 345 ++++++++++++++++++++++++++++++++
Documentation/block/index.rst | 1 +
MAINTAINERS | 6 +
3 files changed, 352 insertions(+)
create mode 100644 Documentation/block/blksnap.rst

diff --git a/Documentation/block/blksnap.rst b/Documentation/block/blksnap.rst
new file mode 100644
index 000000000000..7752f33809bb
--- /dev/null
+++ b/Documentation/block/blksnap.rst
@@ -0,0 +1,345 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+Block Devices Snapshots Module (blksnap)
+========================================
+
+Introduction
+============
+
+At first glance, there is no novelty in the idea of creating snapshots for
+block devices. The Linux kernel already has mechanisms for creating snapshots.
+Device Mapper includes dm-snap, which allows to create snapshots of block
+devices. BTRFS supports snapshots at the file system level. However, both
+of these options have flaws that do not allow to use them as a universal
+tool for creating backups.
+
+The main properties that a backup tool should have are:
+
+- Simplicity and versatility of use
+- Reliability
+- Minimal consumption of system resources during backup
+- Minimal time required for recovery or replication of the entire system
+
+Therefore, the features of the blksnap module are:
+
+- Change tracker
+- Snapshots at the block device level
+- Dynamic allocation of space for storing differences
+- Snapshot overflow resistance
+- Coherent snapshot of multiple block devices
+
+Features
+========
+
+Change tracker
+--------------
+
+The change tracker allows to determine which blocks were changed during the
+time between the last snapshot created and any of the previous snapshots.
+Having a map of changes, it is enough to copy only the changed blocks, and
+no need to reread the entire block device completely. The change tracker
+allows to implement the logic of both incremental and differential backups.
+Incremental backup is critical for large file repositories whose size can be
+hundreds of terabytes and whose full backup time can take more than a day.
+On such servers, the use of backup tools without a change tracker becomes
+practically impossible.
+
+Snapshot at the block device level
+----------------------------------
+
+A snapshot at the block device level allows to simplify the backup algorithm
+and reduce consumption of system resources. It also allows to perform linear
+reading of disk space directly, which allows to achieve maximum reading speed
+with minimal use of processor time. At the same time, the versatility of
+creating snapshots for any block device is achieved, regardless of the file
+system located on it. The exceptions are BTRFS, ZFS and cluster file systems.
+
+Dynamic allocation of storage space for differences
+---------------------------------------------------
+
+To store differences, the module does not require a pre-reserved block
+device range. A range of sectors can be allocated on any block device
+immediately before creating a snapshot in individual files on the file
+system. In addition, the size of the difference storage can be increased
+after the snapshot is created by adding new sector ranges on block devices.
+Sector ranges can be allocated on any block devices of the system, including
+those on which the snapshot was created. A shared difference storage for
+all images of snapshot block devices allows to optimize the use of disk space.
+
+Snapshot overflow resistance
+----------------------------
+
+To create images of snapshots of block devices, the module stores blocks
+of the original block device that have been changed since the snapshot
+was taken. To do this, the module handles write requests and reads blocks
+that need to be overwritten. This algorithm guarantees safety of the data
+of the original block device in the event of an overflow of the snapshot,
+and even in the case of unpredictable critical errors. If a problem occurs
+during backup, the difference storage is released, the snapshot is closed,
+no backup is created, but the server continues to work.
+
+Coherent snapshot of multiple block devices
+-------------------------------------------
+
+A snapshot is created simultaneously for all block devices for which a backup
+is being created, ensuring their coherent state.
+
+
+Algorithms
+==========
+
+Overview
+--------
+
+The blksnap module is a block-level filter. It handles all write I/O units.
+The filter is attached to the block device when the snapshot is created
+for the first time. The change tracker marks all overwritten blocks.
+Information about the history of changes on the block device is available
+while holding the snapshot. The module reads the blocks that need to be
+overwritten and stores them in the difference storage. When reading from
+a snapshot image, reading is performed either from the original device or
+from the difference storage.
+
+Change tracking
+---------------
+
+A change tracker map is created for each block device. One byte
+of this map corresponds to one block. The block size is set by the
+``tracking_block_minimum_shift`` and ``tracking_block_maximum_count``
+module parameters. The ``tracking_block_minimum_shift`` parameter limits
+the minimum block size for tracking, while ``tracking_block_maximum_count``
+defines the maximum allowed number of blocks. The size of the change tracker
+block is determined depending on the size of the block device when adding
+a tracking device, that is, when the snapshot is taken for the first time.
+The block size must be a power of two. The ``tracking_block_maximum_shift``
+module parameter allows to limit the maximum block size for tracking. If the
+block size reaches the allowable limit, the number of blocks will exceed the
+``tracking_block_maximum_count`` parameter.
+
+The byte of the change map stores a number from 0 to 255. This is the
+snapshot number, since the creation of which there have been changes in
+the block. Each time a snapshot is created, the number of the current
+snapshot is increased by one. This number is written to the cell of the
+change map when writing to the block. Thus, knowing the number of one of
+the previous snapshots and the number of the last snapshot, one can determine
+from the change map which blocks have been changed. When the number of the
+current change reaches the maximum allowed value for the map of 255, at the
+time when the next snapshot is created, the map of changes is reset to zero,
+and the number of the current snapshot is assigned the value 1. The change
+tracker is reset, and a new UUID is generated - a unique identifier of the
+snapshot generation. The snapshot generation identifier allows to identify
+that a change tracking reset has been performed.
+
+The change map has two copies. One copy is active, it tracks the current
+changes on the block device. The second copy is available for reading
+while the snapshot is being held, and contains the history up to the moment
+the snapshot is taken. Copies are synchronized at the moment of snapshot
+creation. After the snapshot is released, a second copy of the map is not
+needed, but it is not released, so as not to allocate memory for it again
+the next time the snapshot is created.
+
+Copy on write
+-------------
+
+Data is copied in blocks, or rather in chunks. The term "chunk" is used to
+avoid confusion with change tracker blocks and I/O blocks. In addition,
+the "chunk" in the blksnap module means about the same as the "chunk" in
+the dm-snap module.
+
+The size of the chunk is determined by the ``chunk_minimum_shift`` and
+``chunk_maximum_count`` module parameters. The ``chunk_minimum_shift``
+parameter limits the minimum size of the chunk, while ``chunk_maximum_count``
+defines the maximum allowed number of chunks. The size of the chunk is
+determined depending on the size of the block device at the time of taking the
+snapshot. The size of the chunk must be a power of two. The module parameter
+``chunk_maximum_shift`` allows to limit the maximum chunk size. If the chunk
+size reaches the allowable limit, the number of chunks will exceed the
+``chunk_maximum_count`` parameter.
+
+One chunk is described by the ``struct chunk`` structure. An array of structures
+is created for each block device. The structure contains all the necessary
+information to copy the chunks data from the original block device to the
+difference storage. This information allows to describe the snapshot image.
+A semaphore is located in the structure, which allows synchronization of threads
+accessing the chunk.
+
+The block level has a feature. If a read I/O unit was sent, and a write I/O
+unit was sent after it, then a write can be performed first, and only then
+a read. Therefore, the copy-on-write algorithm is executed synchronously.
+If a write request is handled, the execution of this I/O unit will be
+delayed until the overwritten chunks are copied to the difference storage.
+But if, when handling a write I/O unit, it turns out that the recorded range
+of sectors has already been copied to the difference storage, then the I/O
+unit is simply passed.
+
+This algorithm allows to efficiently perform backups of systems that run
+Round Robin Database. Such databases can be overwritten several times during
+the system backup. Of course, the value of a backup of the RRD monitoring
+system data can be questioned. However, it is often a task to make a backup
+of the entire enterprise infrastructure in order to restore or replicate it
+entirely in case of problems.
+
+There is also a flaw in the algorithm. When overwriting at least one sector,
+an entire chunk is copied. Thus, a situation of rapid filling of the difference
+storage when writing data to a block device in small portions in random order
+is possible. This situation is possible in case of strong fragmentation of
+data on the file system. But it must be borne in mind that with such data
+fragmentation, performance of systems usually degrades greatly. So, this
+problem does not occur on real servers, although it can easily be created
+by artificial tests.
+
+Difference storage
+------------------
+
+The difference storage is a pool of disk space areas, and it is shared with
+all block devices in the snapshot. Therefore, there is no need to divide
+the difference storage area between block devices, and the difference storage
+itself can be located on different block devices.
+
+There is no need to allocate a large disk space immediately before creating
+a snapshot. Even while the snapshot is being held, the difference storage
+can be expanded. It is enough to have free space on the file system.
+
+Areas of disk space can be allocated on the file system using fallocate(),
+and the file location can be requested using Fiemap Ioctl or Fibmap Ioctl.
+Unfortunately, not all file systems support these mechanisms, but the most
+common XFS, EXT4 and BTRFS file systems support it. BTRFS requires additional
+conversion of virtual offsets to physical ones.
+
+While holding the snapshot, the user process can poll the status of the module.
+When free space in the difference storage is reduced to a threshold value, the
+module generates an event about it. The user process can prepare a new area
+and pass it to the module to expand the difference storage. The threshold
+value is determined as half of the value of the ``diff_storage_minimum``
+module parameter.
+
+If free space in the difference storage runs out, an event is generated about
+the overflow of the snapshot. Such a snapshot is considered corrupted, and
+read I/O units to snapshot images will be terminated with an error code.
+The difference storage stores outdated data required for snapshot images,
+so when the snapshot is overflowed, the backup process is interrupted,
+but the system maintains its operability without data loss.
+
+Performing I/O for a snapshot image
+-----------------------------------
+
+To read snapshot data, when taking a snapshot, block devices of snapshot images
+are created. The snapshot image block devices support the write operation.
+This allows to perform additional data preparation on the file system before
+creating a backup.
+
+To process the I/O unit, clones of the I/O unit are created, which redirect
+the I/O unit either to the original block device or to the difference storage.
+When processing of cloned I/O units is completed, the original I/O unit is
+marked as completed too.
+
+An I/O unit can be partially processed without accessing to block devices if
+the I/O unit refers to a chunk that is in the queue for storing to the
+difference storage. In this case, the data is read or written in a buffer in
+memory.
+
+If, when processing the write I/O unit, it turns out that the data of the
+referred chunk has not yet been stored to the difference storage or has not
+even been read from the original device, then an I/O unit to read data from the
+original device is initiated beforehand. After the reading from original device
+is performed, their data from the I/O unit is partially overwritten directly in
+the buffer of the chunk in memory, and the chunk is scheduled to be saved to the
+difference storage.
+
+How to use
+==========
+
+Depending on the needs and the selected license, you can choose different
+options for managing the module:
+
+- Using ioctl directly
+- Using a static C++ library
+- Using the blksnap console tool
+
+Using a BLKFILTER_CTL for block device
+--------------------------------------
+
+BLKFILTER_CTL allows to send a filter-specific command to the filter on block
+device and get the result of its execution. The module provides the
+``include/uapi/blksnap.h`` header file with a description of the commands and
+their data structures.
+
+1. ``blkfilter_ctl_blksnap_cbtinfo`` allows to get information from the
+ change tracker.
+2. ``blkfilter_ctl_blksnap_cbtmap`` reads the change tracker table. If a write
+ operation was performed for the snapshot, then the change tracker takes this
+ into account. Therefore, it is necessary to receive tracker data after write
+ operations have been completed.
+3. ``blkfilter_ctl_blksnap_cbtdirty`` mark blocks as changed in the change
+ tracker table. This is necessary if post-processing is performed after the
+ backup is created, which changes the backup blocks.
+4. ``blkfilter_ctl_blksnap_snapshotadd`` adds a block device to the snapshot.
+5. ``blkfilter_ctl_blksnap_snapshotinfo`` allows to get the name of the snapshot
+ image block device and the presence of an error.
+
+Using ioctl
+-----------
+
+Using a BLKFILTER_CTL ioctl does not allow to fully implement the management of
+the blksnap module. A control file ``blksnap-control`` is created to manage
+snapshots. The control commands are also described in the file
+``include/uapi/blksnap.h``.
+
+1. ``blksnap_ioctl_version`` get the version number.
+2. ``blk_snap_ioctl_snapshot_create`` initiates the snapshot creation process.
+3. ``blk_snap_ioctl_snapshot_append_storage`` add the range of blocks to
+ difference storage.
+4. ``blk_snap_ioctl_snapshot_take`` creates block devices of block device
+ snapshot images.
+5. ``blk_snap_ioctl_snapshot_collect`` collect all created snapshots.
+6. ``blk_snap_ioctl_snapshot_wait_event`` allows to track the status of
+ snapshots and receive events about the requirement to expand the difference
+ storage or about snapshot overflow.
+7. ``blk_snap_ioctl_snapshot_destroy`` releases the snapshot.
+
+Static C++ library
+------------------
+
+The [#userspace_libs]_ library was created primarily to simplify creation of
+tests in C++, and it is also a good example of using the module interface.
+When creating applications, direct use of control calls is preferable.
+However, the library can be used in an application with a GPL-2+ license,
+or a library with an LGPL-2+ license can be created, with which even a
+proprietary application can be dynamically linked.
+
+blksnap console tool
+--------------------
+
+The blksnap [#userspace_tools]_ console tool allows to control the module
+from the command line. The tool contains detailed built-in help. To get
+the list of commands, enter the ``blksnap --help`` command. The ``blksnap
+<command name> --help`` command allows to get detailed information about the
+parameters of each command call. This option may be convenient when creating
+proprietary software, as it allows not to compile with the open source code.
+At the same time, the blksnap tool can be used for creating backup scripts.
+For example, rsync can be called to synchronize files on the file system of
+the mounted snapshot image and files in the archive on a file system that
+supports compression.
+
+Tests
+-----
+
+A set of tests was created for regression testing [#userspace_tests]_.
+Tests with simple algorithms that use the ``blksnap`` console tool to
+control the module are written in Bash. More complex testing algorithms
+are implemented in C++.
+
+References
+==========
+
+.. [#userspace_libs] https://github.com/veeam/blksnap/tree/stable-v2.0/lib
+
+.. [#userspace_tools] https://github.com/veeam/blksnap/tree/stable-v2.0/tools
+
+.. [#userspace_tests] https://github.com/veeam/blksnap/tree/stable-v2.0/tests
+
+Module interface description
+============================
+
+.. kernel-doc:: include/uapi/linux/blksnap.h
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index e56d89db7b85..34937516c865 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -11,6 +11,7 @@ Block
biovecs
blk-mq
blkfilter
+ blksnap
cmdline-partition
data-integrity
deadline-iosched
diff --git a/MAINTAINERS b/MAINTAINERS
index fb6b7abe83e1..4bdb30369a74 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3580,6 +3580,12 @@ F: block/blk-filter.c
F: include/linux/blk-filter.h
F: include/uapi/linux/blk-filter.h

+BLOCK DEVICE SNAPSHOTS MODULE
+M: Sergei Shtepa <[email protected]>
+L: [email protected]
+S: Supported
+F: Documentation/block/blksnap.rst
+
BLOCK LAYER
M: Jens Axboe <[email protected]>
L: [email protected]
--
2.20.1

2023-04-05 16:19:42

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v3 02/11] block: Block Device Filtering Mechanism

The block device filtering mechanism is an API that allows to attach
block device filters. Block device filters allow perform additional
processing for I/O units.

The idea of handling I/O units on block devices is not new. Back in the
2.6 kernel, there was an undocumented possibility of handling I/O units
by substituting the make_request_fn() function, which belonged to the
request_queue structure. But none of the in-tree kernel modules used
this feature, and it was eliminated in the 5.10 kernel.

The block device filtering mechanism returns the ability to handle I/O
units. It is possible to safely attach filter to a block device "on the
fly" without changing the structure of block devices stack.

Co-developed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sergei Shtepa <[email protected]>
---
MAINTAINERS | 3 +
block/Makefile | 2 +-
block/bdev.c | 1 +
block/blk-core.c | 40 ++++++-
block/blk-filter.c | 199 ++++++++++++++++++++++++++++++++
block/blk.h | 10 ++
block/genhd.c | 2 +
block/ioctl.c | 7 ++
block/partitions/core.c | 2 +
include/linux/blk-filter.h | 51 ++++++++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 1 +
include/uapi/linux/blk-filter.h | 35 ++++++
include/uapi/linux/fs.h | 5 +
14 files changed, 357 insertions(+), 3 deletions(-)
create mode 100644 block/blk-filter.c
create mode 100644 include/linux/blk-filter.h
create mode 100644 include/uapi/linux/blk-filter.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2cbe4331ac97..fb6b7abe83e1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3576,6 +3576,9 @@ M: Sergei Shtepa <[email protected]>
L: [email protected]
S: Supported
F: Documentation/block/blkfilter.rst
+F: block/blk-filter.c
+F: include/linux/blk-filter.h
+F: include/uapi/linux/blk-filter.h

BLOCK LAYER
M: Jens Axboe <[email protected]>
diff --git a/block/Makefile b/block/Makefile
index 4e01bb71ad6e..d4671c7e499c 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,7 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
- disk-events.o blk-ia-ranges.o
+ disk-events.o blk-ia-ranges.o blk-filter.o

obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
diff --git a/block/bdev.c b/block/bdev.c
index 1795c7d4b99e..e290020810dd 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -424,6 +424,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
return NULL;
}
bdev->bd_disk = disk;
+ bdev->bd_filter = NULL;
return bdev;
}

diff --git a/block/blk-core.c b/block/blk-core.c
index 42926e6cb83c..179a1c9ecc90 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -18,6 +18,7 @@
#include <linux/blkdev.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
+#include <linux/blk-filter.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -591,10 +592,32 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
return BLK_STS_OK;
}

+static bool submit_bio_filter(struct bio *bio)
+{
+ /*
+ * If this bio came from the filter driver, send it straight down to the
+ * actual device and clear the filtered flag, as the bio could be passed
+ * on to another device that might have a filter attached again.
+ */
+ if (bio_flagged(bio, BIO_FILTERED)) {
+ bio_clear_flag(bio, BIO_FILTERED);
+ return false;
+ }
+ bio_set_flag(bio, BIO_FILTERED);
+ return bio->bi_bdev->bd_filter->ops->submit_bio(bio);
+}
+
static void __submit_bio(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;

+ /*
+ * If there is a filter driver attached, check if the BIO needs to go to
+ * the filter driver first, which can then pass on the bio or consume it.
+ */
+ if (bio->bi_bdev->bd_filter && submit_bio_filter(bio))
+ return;
+
if (unlikely(!blk_crypto_bio_prep(&bio)))
return;

@@ -682,6 +705,15 @@ static void __submit_bio_noacct_mq(struct bio *bio)
current->bio_list = NULL;
}

+/**
+ * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O
+ * from block device filter.
+ * @bio: The bio describing the location in memory and on the device.
+ *
+ * This is a version of submit_bio() that shall only be used for I/O that is
+ * resubmitted to lower level by block device filters. All file systems and
+ * other upper level users of the block layer should use submit_bio() instead.
+ */
void submit_bio_noacct_nocheck(struct bio *bio)
{
blk_cgroup_bio_start(bio);
@@ -702,13 +734,17 @@ void submit_bio_noacct_nocheck(struct bio *bio)
* to collect a list of requests submited by a ->submit_bio method while
* it is active, and then process them after it returned.
*/
- if (current->bio_list)
+ if (current->bio_list) {
bio_list_add(&current->bio_list[0], bio);
- else if (!bio->bi_bdev->bd_disk->fops->submit_bio)
+ return;
+ }
+
+ if (!bio->bi_bdev->bd_disk->fops->submit_bio)
__submit_bio_noacct_mq(bio);
else
__submit_bio_noacct(bio);
}
+EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck);

/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
diff --git a/block/blk-filter.c b/block/blk-filter.c
new file mode 100644
index 000000000000..5e9d884fad4d
--- /dev/null
+++ b/block/blk-filter.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#include <linux/blk-filter.h>
+#include <linux/blk-mq.h>
+#include <linux/module.h>
+
+#include "blk.h"
+
+static LIST_HEAD(blkfilters);
+static DEFINE_SPINLOCK(blkfilters_lock);
+
+static inline struct blkfilter_operations *__blkfilter_find(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ list_for_each_entry(ops, &blkfilters, link)
+ if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0)
+ return ops;
+
+ return NULL;
+}
+
+static inline struct blkfilter_operations *blkfilter_find_get(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ spin_lock(&blkfilters_lock);
+ ops = __blkfilter_find(name);
+ if (ops && !try_module_get(ops->owner))
+ ops = NULL;
+ spin_unlock(&blkfilters_lock);
+
+ return ops;
+}
+
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ struct blkfilter_operations *ops;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ ops = blkfilter_find_get(name.name);
+ if (!ops)
+ return -ENOENT;
+
+ ret = freeze_bdev(bdev);
+ if (ret)
+ goto out_put_module;
+ blk_mq_freeze_queue(bdev->bd_queue);
+
+ if (bdev->bd_filter) {
+ if (bdev->bd_filter->ops == ops)
+ ret = -EALREADY;
+ else
+ ret = -EBUSY;
+ goto out_unfreeze;
+ }
+
+ flt = ops->attach(bdev);
+ if (IS_ERR(flt)) {
+ ret = PTR_ERR(flt);
+ goto out_unfreeze;
+ }
+
+ flt->ops = ops;
+ bdev->bd_filter = flt;
+
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ thaw_bdev(bdev);
+out_put_module:
+ if (ret)
+ module_put(ops->owner);
+ return ret;
+}
+
+static void __blkfilter_detach(struct block_device *bdev)
+{
+ struct blkfilter *flt = bdev->bd_filter;
+ const struct blkfilter_operations *ops = flt->ops;
+
+ bdev->bd_filter = NULL;
+ ops->detach(flt);
+ module_put(ops->owner);
+}
+
+void blkfilter_detach(struct block_device *bdev)
+{
+ if (bdev->bd_filter) {
+ blk_mq_freeze_queue(bdev->bd_queue);
+ __blkfilter_detach(bdev);
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ }
+}
+
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ int error = 0;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (!bdev->bd_filter) {
+ error = -ENOENT;
+ goto out_unfreeze;
+ }
+ if (strncmp(bdev->bd_filter->ops->name, name.name,
+ BLKFILTER_NAME_LENGTH)) {
+ error = -EINVAL;
+ goto out_unfreeze;
+ }
+
+ __blkfilter_detach(bdev);
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ return error;
+}
+
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp)
+{
+ struct blkfilter_ctl ctl;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&ctl, argp, sizeof(ctl)))
+ return -EFAULT;
+
+ ret = blk_queue_enter(bdev_get_queue(bdev), 0);
+ if (ret)
+ return ret;
+
+ flt = bdev->bd_filter;
+ if (!flt || strncmp(flt->ops->name, ctl.name, BLKFILTER_NAME_LENGTH)) {
+ ret = -ENOENT;
+ goto out_queue_exit;
+ }
+
+ if (!flt->ops->ctl) {
+ ret = -ENOTTY;
+ goto out_queue_exit;
+ }
+
+ ret = flt->ops->ctl(flt, ctl.cmd, u64_to_user_ptr(ctl.opt),
+ &ctl.optlen);
+out_queue_exit:
+ blk_queue_exit(bdev_get_queue(bdev));
+ return ret;
+}
+
+/**
+ * blkfilter_register() - Register block device filter operations
+ * @ops: The operations to register.
+ *
+ * Return:
+ * 0 if succeeded,
+ * -EBUSY if a block device filter with the same name is already
+ * registered.
+ */
+int blkfilter_register(struct blkfilter_operations *ops)
+{
+ struct blkfilter_operations *found;
+ int ret = 0;
+
+ spin_lock(&blkfilters_lock);
+ found = __blkfilter_find(ops->name);
+ if (found)
+ ret = -EBUSY;
+ else
+ list_add_tail(&ops->link, &blkfilters);
+ spin_unlock(&blkfilters_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkfilter_register);
+
+/**
+ * blkfilter_unregister() - Unregister block device filter operations
+ * @ops: The operations to unregister.
+ *
+ * Important: before unloading, it is necessary to detach the filter from all
+ * block devices.
+ *
+ */
+void blkfilter_unregister(struct blkfilter_operations *ops)
+{
+ spin_lock(&blkfilters_lock);
+ list_del(&ops->link);
+ spin_unlock(&blkfilters_lock);
+}
+EXPORT_SYMBOL_GPL(blkfilter_unregister);
diff --git a/block/blk.h b/block/blk.h
index cc4e8873dfde..3500e46368e3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,8 @@
#include <xen/xen.h>
#include "blk-crypto-internal.h"

+struct blkfilter_ctl;
+struct blkfilter_name;
struct elevator_type;

/* Max future timer expiry for timeouts */
@@ -454,6 +456,14 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);

extern const struct address_space_operations def_blk_aops;

+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp);
+void blkfilter_detach(struct block_device *bdev);
+
int disk_register_independent_access_ranges(struct gendisk *disk);
void disk_unregister_independent_access_ranges(struct gendisk *disk);

diff --git a/block/genhd.c b/block/genhd.c
index 02d9cfb9e077..b23ceea895de 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -25,6 +25,7 @@
#include <linux/pm_runtime.h>
#include <linux/badblocks.h>
#include <linux/part_stat.h>
+#include <linux/blk-filter.h>
#include "blk-throttle.h"

#include "blk.h"
@@ -625,6 +626,7 @@ void del_gendisk(struct gendisk *disk)

fsync_bdev(disk->part0);
__invalidate_device(disk->part0, true);
+ blkfilter_detach(disk->part0);

/*
* Fail any new I/O.
diff --git a/block/ioctl.c b/block/ioctl.c
index 9c5f637ff153..e840150e1aa8 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -2,6 +2,7 @@
#include <linux/capability.h>
#include <linux/compat.h>
#include <linux/blkdev.h>
+#include <linux/blk-filter.h>
#include <linux/export.h>
#include <linux/gfp.h>
#include <linux/blkpg.h>
@@ -545,6 +546,12 @@ static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
return blkdev_pr_preempt(bdev, argp, true);
case IOC_PR_CLEAR:
return blkdev_pr_clear(bdev, argp);
+ case BLKFILTER_ATTACH:
+ return blkfilter_ioctl_attach(bdev, argp);
+ case BLKFILTER_DETACH:
+ return blkfilter_ioctl_detach(bdev, argp);
+ case BLKFILTER_CTL:
+ return blkfilter_ioctl_ctl(bdev, argp);
default:
return -ENOIOCTLCMD;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index 7b8ef6296abd..0b5e3a3f7c31 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
#include <linux/ctype.h>
#include <linux/vmalloc.h>
#include <linux/raid/detect.h>
+#include <linux/blk-filter.h>
#include "check.h"

static int (*check_part[])(struct parsed_partitions *) = {
@@ -277,6 +278,7 @@ static void delete_partition(struct block_device *part)

fsync_bdev(part);
__invalidate_device(part, true);
+ blkfilter_detach(part);

xa_erase(&part->bd_disk->part_tbl, part->bd_partno);
kobject_put(part->bd_holder_dir);
diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
new file mode 100644
index 000000000000..0afdb40f3bab
--- /dev/null
+++ b/include/linux/blk-filter.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _LINUX_BLK_FILTER_H
+#define _LINUX_BLK_FILTER_H
+
+#include <uapi/linux/blk-filter.h>
+
+struct bio;
+struct block_device;
+struct blkfilter_operations;
+
+/**
+ * struct blkfilter - Block device filter.
+ *
+ * @ops: Block device filter operations.
+ *
+ * For each filtered block device, the filter creates a data structure
+ * associated with this device. The data in this structure is specific to the
+ * filter, but it must contain a pointer to the block device filter account.
+ */
+struct blkfilter {
+ const struct blkfilter_operations *ops;
+};
+
+/**
+ * struct blkfilter_operations - Block device filter operations.
+ *
+ * @link: Entry in the global list of filter drivers
+ * (must not be accessed by the driver).
+ * @owner: Module implementing the filter driver.
+ * @name: Name of the filter driver.
+ * @attach: Attach the filter driver to the block device.
+ * @detach: Detach the filter driver from the block device.
+ * @ctl: Send a control command to the filter driver.
+ * @submit_bio: Handle bio submissions to the filter driver.
+ */
+struct blkfilter_operations {
+ struct list_head link;
+ struct module *owner;
+ const char *name;
+ struct blkfilter *(*attach)(struct block_device *bdev);
+ void (*detach)(struct blkfilter *flt);
+ int (*ctl)(struct blkfilter *flt, const unsigned int cmd,
+ __u8 __user *buf, __u32 *plen);
+ bool (*submit_bio)(struct bio *bio);
+};
+
+int blkfilter_register(struct blkfilter_operations *ops);
+void blkfilter_unregister(struct blkfilter_operations *ops);
+
+#endif /* _UAPI_LINUX_BLK_FILTER_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..b140ddd9b7ab 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -68,6 +68,7 @@ struct block_device {
#ifdef CONFIG_FAIL_MAKE_REQUEST
bool bd_make_it_fail;
#endif
+ struct blkfilter *bd_filter;
} __randomize_layout;

#define bdev_whole(_bdev) \
@@ -333,6 +334,7 @@ enum {
BIO_QOS_MERGED, /* but went through rq_qos merge path */
BIO_REMAPPED,
BIO_ZONE_WRITE_LOCKED, /* Owns a zoned device zone write lock */
+ BIO_FILTERED, /* bio has already been filtered */
BIO_FLAG_LAST
};

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 941304f17492..25ebbf296f35 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -854,6 +854,7 @@ void blk_request_module(dev_t devt);

extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
+void submit_bio_noacct_nocheck(struct bio *bio);
void submit_bio_noacct(struct bio *bio);
struct bio *bio_split_to_limits(struct bio *bio);

diff --git a/include/uapi/linux/blk-filter.h b/include/uapi/linux/blk-filter.h
new file mode 100644
index 000000000000..18885dc1b717
--- /dev/null
+++ b/include/uapi/linux/blk-filter.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLK_FILTER_H
+#define _UAPI_LINUX_BLK_FILTER_H
+
+#include <linux/types.h>
+
+#define BLKFILTER_NAME_LENGTH 32
+
+/**
+ * struct blkfilter_name - parameter for BLKFILTER_ATTACH and BLKFILTER_DETACH
+ * ioctl.
+ *
+ * @name: Name of block device filter.
+ */
+struct blkfilter_name {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+};
+
+/**
+ * struct blkfilter_ctl - parameter for BLKFILTER_CTL ioctl
+ *
+ * @name: Name of block device filter.
+ * @cmd: The filter-specific operation code of the command.
+ * @optlen: Size of data at @opt.
+ * @opt: Userspace buffer with options.
+ */
+struct blkfilter_ctl {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+ __u32 cmd;
+ __u32 optlen;
+ __u64 opt;
+};
+
+#endif /* _UAPI_LINUX_BLK_FILTER_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..1848d62979a4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -185,6 +185,11 @@ struct fsxattr {
#define BLKROTATIONAL _IO(0x12,126)
#define BLKZEROOUT _IO(0x12,127)
#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
+/* 13* is defined in linux/blkzoned.h */
+#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
+#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)
+#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl)
+
/*
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
--
2.20.1

2023-04-05 16:21:51

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v3 04/11] blksnap: header file of the module interface

The header file contains a set of declarations, structures and control
requests (ioctl) that allows to manage the module from the user space.

Co-developed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sergei Shtepa <[email protected]>
---
MAINTAINERS | 1 +
include/uapi/linux/blksnap.h | 421 +++++++++++++++++++++++++++++++++++
2 files changed, 422 insertions(+)
create mode 100644 include/uapi/linux/blksnap.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 4bdb30369a74..d4a9b44521dd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3585,6 +3585,7 @@ M: Sergei Shtepa <[email protected]>
L: [email protected]
S: Supported
F: Documentation/block/blksnap.rst
+F: include/uapi/linux/blksnap.h

BLOCK LAYER
M: Jens Axboe <[email protected]>
diff --git a/include/uapi/linux/blksnap.h b/include/uapi/linux/blksnap.h
new file mode 100644
index 000000000000..2bfcf9031e95
--- /dev/null
+++ b/include/uapi/linux/blksnap.h
@@ -0,0 +1,421 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLKSNAP_H
+#define _UAPI_LINUX_BLKSNAP_H
+
+#include <linux/types.h>
+
+#define BLKSNAP_CTL "blksnap-control"
+#define BLKSNAP_IMAGE_NAME "blksnap-image"
+#define BLKSNAP 'V'
+
+/**
+ * DOC: Block device filter interface.
+ *
+ * Control commands that are transmitted through the block device filter
+ * interface.
+ */
+
+/**
+ * enum blkfilter_ctl_blksnap - List of commands for BLKFILTER_CTL ioctl
+ *
+ * @blkfilter_ctl_blksnap_cbtinfo:
+ * Get CBT information.
+ * The result of executing the command is a &struct blksnap_cbtinfo.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_cbtmap:
+ * Read the CBT map.
+ * The option passes the &struct blksnap_cbtmap.
+ * The size of the table can be quite large. Thus, the table is read in
+ * a loop, in each cycle of which the next offset is set to
+ * &blksnap_tracker_read_cbt_bitmap.offset.
+ * Return a count of bytes read if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_cbtdirty:
+ * Set dirty blocks in the CBT map.
+ * The option passes the &struct blksnap_cbtdirty.
+ * There are cases when some blocks need to be marked as changed.
+ * This ioctl allows to do this.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_snapshotadd:
+ * Add device to snapshot.
+ * The option passes the &struct blksnap_snapshotadd.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_snapshotinfo:
+ * Get information about snapshot.
+ * The result of executing the command is a &struct blksnap_snapshotinfo.
+ * Return 0 if succeeded, negative errno otherwise.
+ */
+enum blkfilter_ctl_blksnap {
+ blkfilter_ctl_blksnap_cbtinfo,
+ blkfilter_ctl_blksnap_cbtmap,
+ blkfilter_ctl_blksnap_cbtdirty,
+ blkfilter_ctl_blksnap_snapshotadd,
+ blkfilter_ctl_blksnap_snapshotinfo,
+};
+
+#ifndef UUID_SIZE
+#define UUID_SIZE 16
+#endif
+
+/**
+ * struct blksnap_uuid - Unique 16-byte identifier.
+ *
+ * @b:
+ * An array of 16 bytes.
+ */
+struct blksnap_uuid {
+ __u8 b[UUID_SIZE];
+};
+
+/**
+ * struct blksnap_cbtinfo - Result for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtinfo.
+ *
+ * @device_capacity:
+ * Device capacity in bytes.
+ * @block_size:
+ * Block size in bytes.
+ * @block_count:
+ * Number of blocks.
+ * @generation_id:
+ * Unique identifier of change tracking generation.
+ * @changes_number:
+ * Current changes number.
+ */
+struct blksnap_cbtinfo {
+ __u64 device_capacity;
+ __u32 block_size;
+ __u32 block_count;
+ struct blksnap_uuid generation_id;
+ __u8 changes_number;
+};
+
+/**
+ * struct blksnap_cbtmap - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtmap.
+ *
+ * @offset:
+ * Offset from the beginning of the CBT bitmap in bytes.
+ * @length:
+ * Size of @buff in bytes.
+ * @buffer:
+ * Pointer to the buffer for output.
+ */
+struct blksnap_cbtmap {
+ __u32 offset;
+ __u32 length;
+ __u8 *buffer;
+};
+
+/**
+ * struct blksnap_sectors - Description of the block device region.
+ *
+ * @offset:
+ * Offset from the beginning of the disk in sectors.
+ * @count:
+ * Count of sectors.
+ */
+struct blksnap_sectors {
+ __u64 offset;
+ __u64 count;
+};
+
+/**
+ * struct blksnap_cbtdirty - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtdirty.
+ *
+ * @count:
+ * Count of elements in the @dirty_sectors.
+ * @dirty_sectors:
+ * Pointer to the array of &struct blksnap_sectors.
+ */
+struct blksnap_cbtdirty {
+ __u32 count;
+ struct blksnap_sectors *dirty_sectors;
+};
+
+/**
+ * struct blksnap_snapshotadd - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotadd.
+ *
+ * @id:
+ * ID of the snapshot to which the block device should be added.
+ */
+struct blksnap_snapshotadd {
+ struct blksnap_uuid id;
+};
+
+#define IMAGE_DISK_NAME_LEN 32
+
+/**
+ * struct blksnap_snapshotinfo - Result for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotinfo.
+ *
+ * @error_code:
+ * Zero if there were no errors while holding the snapshot.
+ * The error code -ENOSPC means that while holding the snapshot, a snapshot
+ * overflow situation has occurred. Other error codes mean other reasons
+ * for failure.
+ * The error code is reset when the device is added to a new snapshot.
+ * @image:
+ * If the snapshot was taken, it stores the block device name of the
+ * image, or empty string otherwise.
+ */
+struct blksnap_snapshotinfo {
+ __s32 error_code;
+ __u8 image[IMAGE_DISK_NAME_LEN];
+};
+
+/**
+ * DOC: Interface for managing snapshots
+ *
+ * Control commands that are transmitted through the blksnap module interface.
+ */
+enum blksnap_ioctl {
+ blksnap_ioctl_version,
+ blksnap_ioctl_snapshot_create,
+ blksnap_ioctl_snapshot_destroy,
+ blksnap_ioctl_snapshot_append_storage,
+ blksnap_ioctl_snapshot_take,
+ blksnap_ioctl_snapshot_collect,
+ blksnap_ioctl_snapshot_wait_event,
+};
+
+/**
+ * struct blksnap_version - Module version.
+ *
+ * @major:
+ * Version major part.
+ * @minor:
+ * Version minor part.
+ * @revision:
+ * Revision number.
+ * @build:
+ * Build number. Should be zero.
+ */
+struct blksnap_version {
+ __u16 major;
+ __u16 minor;
+ __u16 revision;
+ __u16 build;
+};
+
+/**
+ * define IOCTL_BLKSNAP_VERSION - Get module version.
+ *
+ * The version may increase when the API changes. But linking the user space
+ * behavior to the version code does not seem to be a good idea.
+ * To ensure backward compatibility, API changes should be made by adding new
+ * ioctl without changing the behavior of existing ones. The version should be
+ * used for logs.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_VERSION \
+ _IOW(BLKSNAP, blksnap_ioctl_version, struct blksnap_version)
+
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_CREATE - Create snapshot.
+ *
+ * Creates a snapshot structure in the memory and allocates an identifier for
+ * it. Further interaction with the snapshot is possible by this identifier.
+ * A snapshot is created for several block devices at once.
+ * Several snapshots can be created at the same time, but with the condition
+ * that one block device can only be included in one snapshot.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_CREATE \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_create, \
+ struct blksnap_uuid)
+
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_DESTROY - Release and destroy the snapshot.
+ *
+ * Destroys snapshot with &blksnap_snapshot_destroy.id. This leads to the
+ * deletion of all block device images of the snapshot. The difference storage
+ * is being released. But the change tracker keeps tracking.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_DESTROY \
+ _IOR(BLKSNAP, blksnap_ioctl_snapshot_destroy, \
+ struct blksnap_uuid)
+
+/**
+ * struct blksnap_snapshot_append_storage - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE control.
+ *
+ * @id:
+ * Snapshot ID.
+ * @bdev_path:
+ * Device path string buffer.
+ * @bdev_path_size:
+ * Device path string buffer size.
+ * @count:
+ * Size of @ranges in the number of &struct blksnap_sectors.
+ * @ranges:
+ * Pointer to the array of &struct blksnap_sectors.
+ */
+struct blksnap_snapshot_append_storage {
+ struct blksnap_uuid id;
+ __u8 *bdev_path;
+ __u32 bdev_path_size;
+ __u32 count;
+ struct blksnap_sectors *ranges;
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE - Append storage to the
+ * difference storage of the snapshot.
+ *
+ * The snapshot difference storage can be set either before or after creating
+ * the snapshot images. This allows to dynamically expand the difference
+ * storage while holding the snapshot.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_APPEND_STORAGE \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_append_storage, \
+ struct blksnap_snapshot_append_storage)
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_TAKE - Take snapshot.
+ *
+ * Creates snapshot images of block devices and switches change trackers tables.
+ * The snapshot must be created before this call, and the areas of block
+ * devices should be added to the difference storage.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_TAKE \
+ _IOR(BLKSNAP, blksnap_ioctl_snapshot_take, \
+ struct blksnap_uuid)
+
+/**
+ * struct blksnap_snapshot_collect - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_COLLECT control.
+ *
+ * @count:
+ * Size of &blksnap_snapshot_collect.ids in the number of 16-byte UUID.
+ * @ids:
+ * Pointer to the array with the snapshot ID for output.
+ */
+struct blksnap_snapshot_collect {
+ __u32 count;
+ struct blksnap_uuid *ids;
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_COLLECT - Get collection of created snapshots.
+ *
+ * Multiple snapshots can be created at the same time. This allows for one
+ * system to create backups for different data with a independent schedules.
+ *
+ * If in &blksnap_snapshot_collect.count is less than required to store the
+ * &blksnap_snapshot_collect.ids, the array is not filled, and the ioctl
+ * returns the required count for &blksnap_snapshot_collect.ids.
+ *
+ * So, it is recommended to call the ioctl twice. The first call with an null
+ * pointer &blksnap_snapshot_collect.ids and a zero value in
+ * &blksnap_snapshot_collect.count. It will set the required array size in
+ * &blksnap_snapshot_collect.count. The second call with a pointer
+ * &blksnap_snapshot_collect.ids to an array of the required size will allow to
+ * get collection of active snapshots.
+ *
+ * Return: 0 if succeeded, -ENODATA if there is not enough space in the array
+ * to store collection of active snapshots, or negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_COLLECT \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_collect, \
+ struct blksnap_snapshot_collect)
+
+/**
+ * enum blksnap_event_codes - Variants of event codes.
+ *
+ * @blksnap_event_code_low_free_space:
+ * Low free space in difference storage event.
+ * If the free space in the difference storage is reduced to the specified
+ * limit, the module generates this event.
+ * @blksnap_event_code_corrupted:
+ * Snapshot image is corrupted event.
+ * If a chunk could not be allocated when trying to save data to the
+ * difference storage, this event is generated. However, this does not mean
+ * that the backup process was interrupted with an error. If the snapshot
+ * image has been read to the end by this time, the backup process is
+ * considered successful.
+ */
+enum blksnap_event_codes {
+ blksnap_event_code_low_free_space,
+ blksnap_event_code_corrupted,
+};
+
+/**
+ * struct blksnap_snapshot_event - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT control.
+ *
+ * @id:
+ * Snapshot ID.
+ * @timeout_ms:
+ * Timeout for waiting in milliseconds.
+ * @time_label:
+ * Timestamp of the received event.
+ * @code:
+ * Code of the received event &enum blksnap_event_codes.
+ * @data:
+ * The received event body.
+ */
+struct blksnap_snapshot_event {
+ struct blksnap_uuid id;
+ __u32 timeout_ms;
+ __u32 code;
+ __s64 time_label;
+ __u8 data[4096 - 32];
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT - Wait and get the event from the
+ * snapshot.
+ *
+ * While holding the snapshot, the kernel module can transmit information about
+ * changes in its state in the form of events to the user level.
+ * It is very important to receive these events as quickly as possible, so the
+ * user's thread is in the state of interruptable sleep.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_wait_event, \
+ struct blksnap_snapshot_event)
+
+/**
+ * struct blksnap_event_low_free_space - Data for the
+ * &blksnap_event_code_low_free_space event.
+ *
+ * @requested_nr_sect:
+ * The required number of sectors.
+ */
+struct blksnap_event_low_free_space {
+ __u64 requested_nr_sect;
+};
+
+/**
+ * struct blksnap_event_corrupted - Data for the
+ * &blksnap_event_code_corrupted event.
+ *
+ * @dev_id_mj:
+ * Major part of original device ID.
+ * @dev_id_mn:
+ * Minor part of original device ID.
+ * @err_code:
+ * Error code.
+ */
+struct blksnap_event_corrupted {
+ __u32 dev_id_mj;
+ __u32 dev_id_mn;
+ __s32 err_code;
+};
+
+#endif /* _UAPI_LINUX_BLKSNAP_H */
--
2.20.1

2023-04-08 15:24:02

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism

Hi, Sergei,

On 4/4/23 16:08, Sergei Shtepa wrote:
> The block device filtering mechanism is an API that allows to attach
> block device filters. Block device filters allow perform additional
> processing for I/O units.
> [...]
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..1848d62979a4 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -185,6 +185,11 @@ struct fsxattr {
> #define BLKROTATIONAL _IO(0x12,126)
> #define BLKZEROOUT _IO(0x12,127)
> #define BLKGETDISKSEQ _IOR(0x12,128,__u64)
> +/* 13* is defined in linux/blkzoned.h */

nit: This is already explained in the comment below your insert.

Best

Donald

> +#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
> +#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)
> +#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl)
> +
> /*
> * A jump here: 130-136 are reserved for zoned block devices
> * (see uapi/linux/blkzoned.h)

2023-04-08 15:32:00

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism

Dear Sergei,

On 4/4/23 16:08, Sergei Shtepa wrote:
> The block device filtering mechanism is an API that allows to attach
> block device filters. Block device filters allow perform additional
> processing for I/O units.
>
> The idea of handling I/O units on block devices is not new. Back in the
> 2.6 kernel, there was an undocumented possibility of handling I/O units
> by substituting the make_request_fn() function, which belonged to the
> request_queue structure. But none of the in-tree kernel modules used
> this feature, and it was eliminated in the 5.10 kernel.
>
> The block device filtering mechanism returns the ability to handle I/O
> units. It is possible to safely attach filter to a block device "on the
> fly" without changing the structure of block devices stack.
>
> Co-developed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Sergei Shtepa <[email protected]>
> ---
> MAINTAINERS | 3 +
> block/Makefile | 2 +-
> block/bdev.c | 1 +
> block/blk-core.c | 40 ++++++-
> block/blk-filter.c | 199 ++++++++++++++++++++++++++++++++
> block/blk.h | 10 ++
> block/genhd.c | 2 +
> block/ioctl.c | 7 ++
> block/partitions/core.c | 2 +
> include/linux/blk-filter.h | 51 ++++++++
> include/linux/blk_types.h | 2 +
> include/linux/blkdev.h | 1 +
> include/uapi/linux/blk-filter.h | 35 ++++++
> include/uapi/linux/fs.h | 5 +
> 14 files changed, 357 insertions(+), 3 deletions(-)
> create mode 100644 block/blk-filter.c
> create mode 100644 include/linux/blk-filter.h
> create mode 100644 include/uapi/linux/blk-filter.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2cbe4331ac97..fb6b7abe83e1 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3576,6 +3576,9 @@ M: Sergei Shtepa <[email protected]>
> L: [email protected]
> S: Supported
> F: Documentation/block/blkfilter.rst
> +F: block/blk-filter.c
> +F: include/linux/blk-filter.h
> +F: include/uapi/linux/blk-filter.h
>
> BLOCK LAYER
> M: Jens Axboe <[email protected]>
> diff --git a/block/Makefile b/block/Makefile
> index 4e01bb71ad6e..d4671c7e499c 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -9,7 +9,7 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
> blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
> blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
> genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
> - disk-events.o blk-ia-ranges.o
> + disk-events.o blk-ia-ranges.o blk-filter.o
>
> obj-$(CONFIG_BOUNCE) += bounce.o
> obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
> diff --git a/block/bdev.c b/block/bdev.c
> index 1795c7d4b99e..e290020810dd 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -424,6 +424,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
> return NULL;
> }
> bdev->bd_disk = disk;
> + bdev->bd_filter = NULL;
> return bdev;
> }
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 42926e6cb83c..179a1c9ecc90 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -18,6 +18,7 @@
> #include <linux/blkdev.h>
> #include <linux/blk-pm.h>
> #include <linux/blk-integrity.h>
> +#include <linux/blk-filter.h>
> #include <linux/highmem.h>
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> @@ -591,10 +592,32 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
> return BLK_STS_OK;
> }
>
> +static bool submit_bio_filter(struct bio *bio)
> +{
> + /*
> + * If this bio came from the filter driver, send it straight down to the
> + * actual device and clear the filtered flag, as the bio could be passed
> + * on to another device that might have a filter attached again.
> + */
> + if (bio_flagged(bio, BIO_FILTERED)) {
> + bio_clear_flag(bio, BIO_FILTERED);
> + return false;
> + }
> + bio_set_flag(bio, BIO_FILTERED);
> + return bio->bi_bdev->bd_filter->ops->submit_bio(bio);
> +}
> +
> static void __submit_bio(struct bio *bio)
> {
> struct gendisk *disk = bio->bi_bdev->bd_disk;
>
> + /*
> + * If there is a filter driver attached, check if the BIO needs to go to
> + * the filter driver first, which can then pass on the bio or consume it.
> + */
> + if (bio->bi_bdev->bd_filter && submit_bio_filter(bio))
> + return;
> +
> if (unlikely(!blk_crypto_bio_prep(&bio)))
> return;
>
> @@ -682,6 +705,15 @@ static void __submit_bio_noacct_mq(struct bio *bio)
> current->bio_list = NULL;
> }
>
> +/**
> + * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O
> + * from block device filter.
> + * @bio: The bio describing the location in memory and on the device.
> + *
> + * This is a version of submit_bio() that shall only be used for I/O that is
> + * resubmitted to lower level by block device filters. All file systems and
> + * other upper level users of the block layer should use submit_bio() instead.
> + */
> void submit_bio_noacct_nocheck(struct bio *bio)
> {
> blk_cgroup_bio_start(bio);
> @@ -702,13 +734,17 @@ void submit_bio_noacct_nocheck(struct bio *bio)
> * to collect a list of requests submited by a ->submit_bio method while
> * it is active, and then process them after it returned.
> */
> - if (current->bio_list)
> + if (current->bio_list) {
> bio_list_add(&current->bio_list[0], bio);
> - else if (!bio->bi_bdev->bd_disk->fops->submit_bio)
> + return;
> + }
> +
> + if (!bio->bi_bdev->bd_disk->fops->submit_bio)
> __submit_bio_noacct_mq(bio);
> else
> __submit_bio_noacct(bio);
> }
> +EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck);
>
> /**
> * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> diff --git a/block/blk-filter.c b/block/blk-filter.c
> new file mode 100644
> index 000000000000..5e9d884fad4d
> --- /dev/null
> +++ b/block/blk-filter.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (C) 2023 Veeam Software Group GmbH */
> +#include <linux/blk-filter.h>
> +#include <linux/blk-mq.h>
> +#include <linux/module.h>
> +
> +#include "blk.h"
> +
> +static LIST_HEAD(blkfilters);
> +static DEFINE_SPINLOCK(blkfilters_lock);
> +
> +static inline struct blkfilter_operations *__blkfilter_find(const char *name)
> +{
> + struct blkfilter_operations *ops;
> +
> + list_for_each_entry(ops, &blkfilters, link)
> + if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0)
> + return ops;
> +
> + return NULL;
> +}
> +
> +static inline struct blkfilter_operations *blkfilter_find_get(const char *name)
> +{
> + struct blkfilter_operations *ops;
> +
> + spin_lock(&blkfilters_lock);
> + ops = __blkfilter_find(name);
> + if (ops && !try_module_get(ops->owner))
> + ops = NULL;
> + spin_unlock(&blkfilters_lock);
> +
> + return ops;
> +}
> +
> +int blkfilter_ioctl_attach(struct block_device *bdev,
> + struct blkfilter_name __user *argp)
> +{
> + struct blkfilter_name name;
> + struct blkfilter_operations *ops;
> + struct blkfilter *flt;
> + int ret;
> +
> + if (copy_from_user(&name, argp, sizeof(name)))
> + return -EFAULT;
> +
> + ops = blkfilter_find_get(name.name);
> + if (!ops)
> + return -ENOENT;
> +
> + ret = freeze_bdev(bdev);
> + if (ret)
> + goto out_put_module;
> + blk_mq_freeze_queue(bdev->bd_queue);
> +
> + if (bdev->bd_filter) {
> + if (bdev->bd_filter->ops == ops)
> + ret = -EALREADY;
> + else
> + ret = -EBUSY;
> + goto out_unfreeze;
> + }

Maybe detach the old filter and attach the new one instead? An atomic replace might be usefull and it wouldn't complicate the code to do that instead. If its the same filter, maybe just return success and don't go through ops->detach and ops->attach?

D.

> [...]

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-04-10 05:03:55

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module

On Tue, Apr 04, 2023 at 04:08:27PM +0200, Sergei Shtepa wrote:
> +The main properties that a backup tool should have are:
> +
> +- Simplicity and versatility of use
> +- Reliability
> +- Minimal consumption of system resources during backup
> +- Minimal time required for recovery or replication of the entire system
> +
> +Therefore, the features of the blksnap module are:
"Taking above properties into account, blksnap module features:"

> +The change tracker allows to determine which blocks were changed during the
> +time between the last snapshot created and any of the previous snapshots.
> +Having a map of changes, it is enough to copy only the changed blocks, and
"With a map of changes, ..."

> +3. ``blkfilter_ctl_blksnap_cbtdirty`` mark blocks as changed in the change
marks

> +The blksnap [#userspace_tools]_ console tool allows to control the module
> +from the command line. The tool contains detailed built-in help. To get
> +the list of commands, enter the ``blksnap --help`` command. The ``blksnap
> +<command name> --help`` command allows to get detailed information about the
"To get list of commands with usage description, see ``blksnap --help``."

Thanks.

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (1.28 kB)
signature.asc (235.00 B)
Download all attachments

2023-04-11 06:25:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism

On Sat, Apr 08, 2023 at 05:16:42PM +0200, Donald Buczek wrote:
> Hi, Sergei,
>
> On 4/4/23 16:08, Sergei Shtepa wrote:
> > The block device filtering mechanism is an API that allows to attach
> > block device filters. Block device filters allow perform additional
> > processing for I/O units.
> > [...]
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index b7b56871029c..1848d62979a4 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -185,6 +185,11 @@ struct fsxattr {
> > #define BLKROTATIONAL _IO(0x12,126)
> > #define BLKZEROOUT _IO(0x12,127)
> > #define BLKGETDISKSEQ _IOR(0x12,128,__u64)
> > +/* 13* is defined in linux/blkzoned.h */
>
> nit: This is already explained in the comment below your insert.

My faul. But indeed, no need to duplicate that, instead the new
ioctl opcodes should move below that existing comment.

2023-04-11 06:26:33

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism

On Sat, Apr 08, 2023 at 05:30:19PM +0200, Donald Buczek wrote:
> Maybe detach the old filter and attach the new one instead? An atomic replace might be usefull and it wouldn't complicate the code to do that instead. If its the same filter, maybe just return success and don't go through ops->detach and ops->attach?

I don't think a replace makes any sense. We might want multiple
filters eventually, but unless we have a good use case for even just
more than a single driver we can deal with that once needed. The
interface is prepared to support multiple attached filters already.

2023-04-12 11:04:34

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism



On 4/11/23 08:25, Christoph Hellwig wrote:
> Subject:
> Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism
> From:
> Christoph Hellwig <[email protected]>
> Date:
> 4/11/23, 08:25
>
> To:
> Donald Buczek <[email protected]>
> CC:
> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>
>
> On Sat, Apr 08, 2023 at 05:30:19PM +0200, Donald Buczek wrote:
>> Maybe detach the old filter and attach the new one instead? An atomic replace might be usefull and it wouldn't complicate the code to do that instead. If its the same filter, maybe just return success and don't go through ops->detach and ops->attach?
> I don't think a replace makes any sense. We might want multiple
> filters eventually, but unless we have a good use case for even just
> more than a single driver we can deal with that once needed. The
> interface is prepared to support multiple attached filters already.
>


Thank you Donald for your comment. It got me thinking.

Despite the fact that only one filter is currently offered for the kernel,
I think that out-of-tree filters of block devices may appear very soon.
It would be good to think about it in advance.
And, I agree with Christophe, we would not like to redo the blk-filter interface
when new filters appear in the tree.

We can consider a block device as a resource that two actor want to take over.
There are two possible behavioral strategies:
1. If one owner occupies a resource, then for other actors, the ownership
request will end with a refusal. The owner will not lose his resource.
2. Any actor can take away a resource from the owner and inform him about its
loss using a callback.

I think the first strategy is safer. When calling ioctl BLKFILTER_ATTACH, the
kernel informs the actor that the resource is busy.
Of course, there is still an option to grab someone else's occupied resource.
To do this, he will have to call ioctl BLKFILTER_DETACH, specifying the name
of the filter that needs to be detached. It is assumed that such detached
should be performed by the same actor that attached it there.

If we replace the owner at each ioctl BLKFILTER_ATTACH, then we can get a
situation of competition between two actors. At the same time, they won't
even get a message that something is going wrong.

An example from life. The user compares different backup tools. Install one,
then another. Each uses its own filter (And why not? this is technically
possible).
With the first strategy, the second tool will make it clear to the user that
it cannot work, since the resource is already occupied by another.
The user will have to experiment first with one tool, uninstall it, and then
experiment with another.
With the second strategy, both tools will unload each other's filters. In the
best case, this will lead to disruption of their work. At a minimum, blksnap,
when detached, will reset the change tracker and each backup will perform a
full read of the block device. As a result, the user will receive distorted
data, the system will not work as planned, although there will be no error
message.

2023-04-12 19:46:45

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module

I think, you can trigger all kind of user-after-free when userspace deletes a snapshot image or the snapshot image and the tracker while the disk device snapshot image is kept alive (mounted or just opened) and doing I/O.

Here is what I did to provoke that:

root@dose:~# s=$(blksnap snapshot_create -d /dev/vdb)
root@dose:~# blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
device path: '/dev/block/253:2'
allocate range: ofs=11264624 cnt=2097152
root@dose:~# blksnap snapshot_take -i $s
root@dose:~# mount /dev/blksnap-image_253\:16 /mnt
root@dose:~# dd if=/dev/zero of=/mnt/x.x &
[1] 2514
root@dose:~# blksnap snapshot_destroy -i $s
dd: writing to '/mnt/x.x': No space left on device
1996041+0 records in
1996040+0 records out
1021972480 bytes (1.0 GB, 975 MiB) copied, 8.48923 s, 120 MB/s
[1]+ Exit 1 dd if=/dev/zero of=/mnt/x.x

And here's the UAF:

[ 4508.526091] [2475] diff_storage_event_low:64: blksnap-diff-storage: Diff storage low free space. Portion: 2097152 sectors, requested: 2097152
[ 4508.526141] [2475] event_gen:44: blksnap-event_queue: Generate event: time=4507748140846 code=0 data_size=8
[ 4508.526158] blksnap-snapshot: Snapshot aa986b45-bf07-46f7-a52d-8d7829221f24 was created
[ 4512.731380] [2478] ioctl_snapshot_append_storage:195: blksnap: Append difference storage
[ 4512.731417] [2478] diff_storage_append_block:223: blksnap-diff-storage: Append 1 blocks
[ 4512.731485] [2478] diff_storage_add_range:193: blksnap-diff-storage: Add range to diff storage: [253:2] 11264624:2097152
[ 4512.780757] [2481] diff_area_new:180: blksnap-diff-area: Open device [253:16]
[ 4512.780786] [2481] diff_area_calculate_chunk_size:57: blksnap-diff-area: Minimal IO block 1 sectors
[ 4512.780794] [2481] diff_area_calculate_chunk_size:58: blksnap-diff-area: Device capacity 2097152 sectors
[ 4512.780801] [2481] diff_area_calculate_chunk_size:61: blksnap-diff-area: Chunks count 4096
[ 4512.780808] [2481] diff_area_calculate_chunk_size:76: blksnap-diff-area: The optimal chunk size was calculated as 262144 bytes for device [253:16]
[ 4512.780817] [2481] diff_area_new:200: blksnap-diff-area: Chunk size 262144 in bytes
[ 4512.780824] [2481] diff_area_new:201: blksnap-diff-area: Chunk count 4096
[ 4512.828814] [2481] snapshot_take_trackers:250: blksnap-snapshot: Device [253:16] was frozen
[ 4512.834029] [2481] cbt_map_switch:142: blksnap-cbt_map: CBT map switch
[ 4512.834051] [2481] snapshot_take_trackers:277: blksnap-snapshot: Device [253:16] was unfrozen
[ 4512.834058] blksnap-image: Create snapshot image device for original device [253:16]
[ 4512.835437] [2481] snapimage_create:97: blksnap-image: Snapshot image disk name [blksnap-image_253:16]
[ 4512.838499] [2481] snapimage_create:112: blksnap-image: Image block device [259:1] has been created
[ 4512.838508] blksnap-snapshot: Snapshot aa986b45-bf07-46f7-a52d-8d7829221f24 was taken successfully
[ 4525.592286] XFS (blksnap-image_253:16): Mounting V5 Filesystem 35f0c7a2-27fa-4183-8ede-7462ac31a97d
[ 4525.619281] XFS (blksnap-image_253:16): Ending clean mount
[ 4558.292030] clocksource: timekeeping watchdog on CPU10: hpet retried 2 times before success
[ 4558.486074] blksnap-snapshot: Destroy snapshot aa986b45-bf07-46f7-a52d-8d7829221f24
[ 4558.488731] blksnap-snapshot: Release snapshot aa986b45-bf07-46f7-a52d-8d7829221f24
[ 4558.505931] [2527] tracker_release_snapshot:293: blksnap-tracker: Tracker for device [253:16] release snapshot
[ 4558.505959] [2527] snapimage_free:63: blksnap-image: Snapshot image disk blksnap-image_253:16 delete
[ 4558.548358] [1899] diff_storage_event_low:64: blksnap-diff-storage: Diff storage low free space. Portion: 2097152 sectors, requested: 4194304
[ 4558.548378] [1899] event_gen:44: blksnap-event_queue: Generate event: time=4557770519985 code=0 data_size=8
[ 4561.444548] ==================================================================
[ 4561.446224] BUG: KASAN: slab-use-after-free in chunk_notify_store+0x40/0x190 [blksnap]
[ 4561.448018] Read of size 8 at addr ffff888112d21500 by task kworker/13:0/1504

[ 4561.449965] CPU: 13 PID: 1504 Comm: kworker/13:0 Not tainted 6.3.0-rc5.mx64.428-00094-g21dc08a94f59 #40
[ 4561.452025] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[ 4561.454538] Workqueue: events chunk_notify_store [blksnap]
[ 4561.455809] Call Trace:
[ 4561.456393] <TASK>
[ 4561.456915] dump_stack_lvl+0x37/0x50
[ 4561.457807] print_report+0xcc/0x630
[ 4561.458650] ? __virt_addr_valid+0xf5/0x180
[ 4561.459607] ? chunk_notify_store+0x40/0x190 [blksnap]
[ 4561.460785] kasan_report+0xb2/0xe0
[ 4561.468472] ? chunk_notify_store+0x40/0x190 [blksnap]
[ 4561.476365] chunk_notify_store+0x40/0x190 [blksnap]
[ 4561.483755] process_one_work+0x407/0x790
[ 4561.490118] worker_thread+0x2ab/0x700
[ 4561.495707] ? __pfx_set_cpus_allowed_ptr+0x10/0x10
[ 4561.500883] ? __pfx_worker_thread+0x10/0x10
[ 4561.505664] kthread+0x15d/0x190
[ 4561.509864] ? __pfx_kthread+0x10/0x10
[ 4561.513824] ret_from_fork+0x2c/0x50
[ 4561.517546] </TASK>

[ 4561.524020] Allocated by task 2481:
[ 4561.527147] kasan_save_stack+0x22/0x50
[ 4561.530226] kasan_set_track+0x25/0x30
[ 4561.533096] __kasan_kmalloc+0x80/0x90
[ 4561.535868] chunk_alloc+0x37/0xf0 [blksnap]
[ 4561.538569] diff_area_new+0x42d/0x690 [blksnap]
[ 4561.541251] snapshot_take+0x13b/0x530 [blksnap]
[ 4561.543853] ioctl_snapshot_take+0x7a/0xc0 [blksnap]
[ 4561.546394] ctrl_unlocked_ioctl+0x3a/0x60 [blksnap]
[ 4561.548898] __x64_sys_ioctl+0xc6/0xe0
[ 4561.551276] do_syscall_64+0x47/0xa0
[ 4561.553621] entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 4561.558154] Freed by task 2527:
[ 4561.560350] kasan_save_stack+0x22/0x50
[ 4561.562569] kasan_set_track+0x25/0x30
[ 4561.564788] kasan_save_free_info+0x2b/0x50
[ 4561.567012] ____kasan_slab_free+0xf9/0x1a0
[ 4561.569210] __kmem_cache_free+0x141/0x200
[ 4561.571370] diff_area_free+0xab/0x150 [blksnap]
[ 4561.573546] tracker_release_snapshot+0xc8/0x110 [blksnap]
[ 4561.575763] snapshot_free+0x9f/0x170 [blksnap]
[ 4561.577882] snapshot_destroy+0x119/0x170 [blksnap]
[ 4561.579992] ioctl_snapshot_destroy+0x7a/0xc0 [blksnap]
[ 4561.582155] ctrl_unlocked_ioctl+0x3a/0x60 [blksnap]
[ 4561.584270] __x64_sys_ioctl+0xc6/0xe0
[ 4561.586247] do_syscall_64+0x47/0xa0
[ 4561.588176] entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 4561.591915] Last potentially related work creation:
[ 4561.593833] kasan_save_stack+0x22/0x50
[ 4561.595611] __kasan_record_aux_stack+0x60/0x70
[ 4561.597442] kvfree_call_rcu+0x2e/0x460
[ 4561.599235] cache_clean+0x46d/0x500 [sunrpc]
[ 4561.601181] cache_flush+0x15/0x40 [sunrpc]
[ 4561.603096] ip_map_parse+0x2ca/0x300 [sunrpc]
[ 4561.605035] cache_do_downcall+0x59/0x90 [sunrpc]
[ 4561.606993] cache_write_procfs+0x90/0xd0 [sunrpc]
[ 4561.608946] proc_reg_write+0xe0/0x140
[ 4561.610728] vfs_write+0x186/0x680
[ 4561.612465] ksys_write+0xbd/0x160
[ 4561.614186] do_syscall_64+0x47/0xa0
[ 4561.615922] entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 4561.619340] The buggy address belongs to the object at ffff888112d21500
which belongs to the cache kmalloc-96 of size 96
[ 4561.623265] The buggy address is located 0 bytes inside of
freed 96-byte region [ffff888112d21500, ffff888112d21560)

[ 4561.628960] The buggy address belongs to the physical page:
[ 4561.631000] page:00000000b32dc240 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x112d21
[ 4561.633393] flags: 0x17fffc000000200(slab|node=0|zone=2|lastcpupid=0x1ffff)
[ 4561.635646] raw: 017fffc000000200 ffff888100040300 ffffea0004317ed0 ffffea000406a350
[ 4561.637982] raw: 0000000000000000 ffff888112d21000 0000000100000020 0000000000000000
[ 4561.640338] page dumped because: kasan: bad access detected

[ 4561.644448] Memory state around the buggy address:
[ 4561.646610] ffff888112d21400: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4561.649000] ffff888112d21480: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4561.651367] >ffff888112d21500: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4561.653709] ^
[ 4561.655754] ffff888112d21580: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4561.658132] ffff888112d21600: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4561.660503] ==================================================================
[ 4561.662931] Disabling lock debugging due to kernel taint
[ 4561.665449] general protection fault, probably for non-canonical address 0x79a00c8000009df: 0000 [#1] PREEMPT SMP KASAN PTI
[ 4561.668387] CPU: 13 PID: 1504 Comm: kworker/13:0 Tainted: G B 6.3.0-rc5.mx64.428-00094-g21dc08a94f59 #40
[ 4561.671243] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[ 4561.674216] Workqueue: events chunk_notify_store [blksnap]
[ 4561.676778] RIP: 0010:chunk_notify_store+0x66/0x190 [blksnap]
[ 4561.679365] Code: d0 f5 90 e0 4c 8b 75 00 48 8d 7d 08 e8 c3 f5 90 e0 4c 8b 65 08 49 8d 7e 08 e8 66 f6 90 e0 4d 89 66 08 4c 89 e7 e8 5a f6 90 e0 <4d> 89 34 24 48 89 ef 4c 8d 73 68 e8 4a f6 90 e0 48 89 6d 00 48 89
[ 4561.685274] RSP: 0000:ffff88810e49fdd0 EFLAGS: 00010282
[ 4561.687972] RAX: 0000000000000000 RBX: ffff888181be3300 RCX: ffffffffa0b98f56
[ 4561.690821] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 079a00c8000009df
[ 4561.693676] RBP: ffff888112d21500 R08: 0000000000000001 R09: ffffffff84286a47
[ 4561.696550] R10: fffffbfff0850d48 R11: 0000000000000001 R12: 079a00c8000009df
[ 4561.699418] R13: ffff888181be3320 R14: ffff88817f708800 R15: ffff888181be3308
[ 4561.702307] FS: 0000000000000000(0000) GS:ffff888261c80000(0000) knlGS:0000000000000000
[ 4561.705292] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4561.708105] CR2: 00007f7189264c58 CR3: 0000000111dec001 CR4: 0000000000170ee0
[ 4561.711052] Call Trace:
[ 4561.713647] <TASK>
[ 4561.716196] process_one_work+0x407/0x790
[ 4561.718912] worker_thread+0x2ab/0x700
[ 4561.721597] ? __pfx_set_cpus_allowed_ptr+0x10/0x10
[ 4561.724372] ? __pfx_worker_thread+0x10/0x10
[ 4561.727101] kthread+0x15d/0x190
[ 4561.729745] ? __pfx_kthread+0x10/0x10
[ 4561.732418] ret_from_fork+0x2c/0x50
[ 4561.735097] </TASK>
[ 4561.737655] Modules linked in: blksnap rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc bochs kvm_intel drm_vram_helper drm_ttm_helper ttm kvm drm_kms_helper input_leds led_class drm virtio_net irqbypass syscopyarea net_failover sysfillrect intel_agp crc32c_intel failover floppy sysimgblt intel_gtt i2c_piix4 nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables unix ipv6 autofs4
[ 4561.748159] ---[ end trace 0000000000000000 ]---
[ 4561.751205] RIP: 0010:chunk_notify_store+0x66/0x190 [blksnap]
[ 4561.754393] Code: d0 f5 90 e0 4c 8b 75 00 48 8d 7d 08 e8 c3 f5 90 e0 4c 8b 65 08 49 8d 7e 08 e8 66 f6 90 e0 4d 89 66 08 4c 89 e7 e8 5a f6 90 e0 <4d> 89 34 24 48 89 ef 4c 8d 73 68 e8 4a f6 90 e0 48 89 6d 00 48 89
[ 4561.761441] RSP: 0000:ffff88810e49fdd0 EFLAGS: 00010282
[ 4561.764695] RAX: 0000000000000000 RBX: ffff888181be3300 RCX: ffffffffa0b98f56
[ 4561.768108] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 079a00c8000009df
[ 4561.771488] RBP: ffff888112d21500 R08: 0000000000000001 R09: ffffffff84286a47
[ 4561.774845] R10: fffffbfff0850d48 R11: 0000000000000001 R12: 079a00c8000009df
[ 4561.778140] R13: ffff888181be3320 R14: ffff88817f708800 R15: ffff888181be3308
[ 4561.781476] FS: 0000000000000000(0000) GS:ffff888261c80000(0000) knlGS:0000000000000000
[ 4561.784909] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4561.788164] CR2: 00007f7189264c58 CR3: 0000000111dec001 CR4: 0000000000170ee0
[ 4586.218197] XFS (blksnap-image_253:16): log I/O error -5
[ 4586.218889] XFS (blksnap-image_253:16): metadata I/O error in "xfs_buf_ioend+0x3ea/0xb50" at daddr 0x1 len 1 error 5
[ 4586.225814] XFS (blksnap-image_253:16): Filesystem has been shut down due to log error (0x2).
[ 4586.246191] XFS (blksnap-image_253:16): Please unmount the filesystem and rectify the problem(s).


I was actually targeting this deref in snapimage.c:

+static void snapimage_submit_bio(struct bio *bio)
+{
+ struct tracker *tracker = bio->bi_bdev->bd_disk->private_data;
+ struct diff_area *diff_area = tracker->diff_area;

but didn't even get to delete the tracker...


Best

Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-04-12 20:07:12

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism

On 4/12/23 12:43, Sergei Shtepa wrote:
>
>
> On 4/11/23 08:25, Christoph Hellwig wrote:
>> Subject:
>> Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism
>> From:
>> Christoph Hellwig <[email protected]>
>> Date:
>> 4/11/23, 08:25
>>
>> To:
>> Donald Buczek <[email protected]>
>> CC:
>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>
>>
>> On Sat, Apr 08, 2023 at 05:30:19PM +0200, Donald Buczek wrote:
>>> Maybe detach the old filter and attach the new one instead? An atomic replace might be usefull and it wouldn't complicate the code to do that instead. If its the same filter, maybe just return success and don't go through ops->detach and ops->attach?
>> I don't think a replace makes any sense. We might want multiple
>> filters eventually, but unless we have a good use case for even just
>> more than a single driver we can deal with that once needed. The
>> interface is prepared to support multiple attached filters already.
>>
>
>
> Thank you Donald for your comment. It got me thinking.
>
> Despite the fact that only one filter is currently offered for the kernel,
> I think that out-of-tree filters of block devices may appear very soon.
> It would be good to think about it in advance.
> And, I agree with Christophe, we would not like to redo the blk-filter interface
> when new filters appear in the tree.
>
> We can consider a block device as a resource that two actor want to take over.
> There are two possible behavioral strategies:
> 1. If one owner occupies a resource, then for other actors, the ownership
> request will end with a refusal. The owner will not lose his resource.
> 2. Any actor can take away a resource from the owner and inform him about its
> loss using a callback.
>
> I think the first strategy is safer. When calling ioctl BLKFILTER_ATTACH, the
> kernel informs the actor that the resource is busy.
> Of course, there is still an option to grab someone else's occupied resource.
> To do this, he will have to call ioctl BLKFILTER_DETACH, specifying the name
> of the filter that needs to be detached. It is assumed that such detached
> should be performed by the same actor that attached it there.
>
> If we replace the owner at each ioctl BLKFILTER_ATTACH, then we can get a
> situation of competition between two actors. At the same time, they won't
> even get a message that something is going wrong.
>
> An example from life. The user compares different backup tools. Install one,
> then another. Each uses its own filter (And why not? this is technically
> possible).
> With the first strategy, the second tool will make it clear to the user that
> it cannot work, since the resource is already occupied by another.
> The user will have to experiment first with one tool, uninstall it, and then
> experiment with another.
> With the second strategy, both tools will unload each other's filters. In the
> best case, this will lead to disruption of their work. At a minimum, blksnap,
> when detached, will reset the change tracker and each backup will perform a
> full read of the block device. As a result, the user will receive distorted
> data, the system will not work as planned, although there will be no error
> message.

I had a more complicated scenario in mind. For example, some kind of live migration
from one block device to another, when you switch from the filter which clones from the
source device to the target device to the filter which just redirects from the source
device to the target device as the last step.

OTOH, that may be a very distant vision. Plus, one single and simple filter, which
redirects I/O into a DM stack, would be enough or better anyway to do the more
complicated things using the DM features, which include atomic replacement and
stacking and everything.

I don't have a strong opinion.

Best

Donald
--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-04-14 12:47:05

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module



On 4/12/23 21:38, Donald Buczek wrote:
> Subject:
> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
> From:
> Donald Buczek <[email protected]>
> Date:
> 4/12/23, 21:38
>
> To:
> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
> CC:
> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>
>
> I think, you can trigger all kind of user-after-free when userspace deletes a snapshot image or the snapshot image and the tracker while the disk device snapshot image is kept alive (mounted or just opened) and doing I/O.
>
> Here is what I did to provoke that:
>
> root@dose:~# s=$(blksnap snapshot_create -d /dev/vdb)
> root@dose:~# blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
> device path: '/dev/block/253:2'
> allocate range: ofs=11264624 cnt=2097152
> root@dose:~# blksnap snapshot_take -i $s
> root@dose:~# mount /dev/blksnap-image_253\:16 /mnt
> root@dose:~# dd if=/dev/zero of=/mnt/x.x &
> [1] 2514
> root@dose:~# blksnap snapshot_destroy -i $s
> dd: writing to '/mnt/x.x': No space left on device
> 1996041+0 records in
> 1996040+0 records out
> 1021972480 bytes (1.0 GB, 975 MiB) copied, 8.48923 s, 120 MB/s
> [1]+  Exit 1                  dd if=/dev/zero of=/mnt/x.x
>

Thanks!
I am very glad that the blksnap tool turned out to be useful in the review.
This snapshot deletion scenario is not the most typical, but of course it is
quite possible.
I will need to solve this problem and add such a scenario to the test suite.

2023-04-14 13:48:55

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism



On 4/12/23 21:59, Donald Buczek wrote:
> Subject:
> Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism
> From:
> Donald Buczek <[email protected]>
> Date:
> 4/12/23, 21:59
>
> To:
> Sergei Shtepa <[email protected]>, Christoph Hellwig <[email protected]>
> CC:
> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>
>
> On 4/12/23 12:43, Sergei Shtepa wrote:
>>
>>
>> On 4/11/23 08:25, Christoph Hellwig wrote:
>>> Subject:
>>> Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism
>>> From:
>>> Christoph Hellwig <[email protected]>
>>> Date:
>>> 4/11/23, 08:25
>>>
>>> To:
>>> Donald Buczek <[email protected]>
>>> CC:
>>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>>
>>>
>>> On Sat, Apr 08, 2023 at 05:30:19PM +0200, Donald Buczek wrote:
>>>> Maybe detach the old filter and attach the new one instead? An atomic replace might be usefull and it wouldn't complicate the code to do that instead. If its the same filter, maybe just return success and don't go through ops->detach and ops->attach?
>>> I don't think a replace makes any sense.  We might want multiple
>>> filters eventually, but unless we have a good use case for even just
>>> more than a single driver we can deal with that once needed.  The
>>> interface is prepared to support multiple attached filters already.
>>>
>>
>>
>> Thank you Donald for your comment. It got me thinking.
>>
>> Despite the fact that only one filter is currently offered for the kernel,
>> I think that out-of-tree filters of block devices may appear very soon.
>> It would be good to think about it in advance.
>> And, I agree with Christophe, we would not like to redo the blk-filter interface
>> when new filters appear in the tree.
>>
>> We can consider a block device as a resource that two actor want to take over.
>> There are two possible behavioral strategies:
>> 1. If one owner occupies a resource, then for other actors, the ownership
>> request will end with a refusal. The owner will not lose his resource.
>> 2. Any actor can take away a resource from the owner and inform him about its
>> loss using a callback.
>>
>> I think the first strategy is safer. When calling ioctl BLKFILTER_ATTACH, the
>> kernel informs the actor that the resource is busy.
>> Of course, there is still an option to grab someone else's occupied resource.
>> To do this, he will have to call ioctl BLKFILTER_DETACH, specifying the name
>> of the filter that needs to be detached. It is assumed that such detached
>> should be performed by the same actor that attached it there.
>>
>> If we replace the owner at each ioctl BLKFILTER_ATTACH, then we can get a
>> situation of competition between two actors. At the same time, they won't
>> even get a message that something is going wrong.
>>
>> An example from life. The user compares different backup tools. Install one,
>> then another. Each uses its own filter (And why not? this is technically
>> possible).
>> With the first strategy, the second tool will make it clear to the user that
>> it cannot work, since the resource is already occupied by another.
>> The user will have to experiment first with one tool, uninstall it, and then
>> experiment with another.
>> With the second strategy, both tools will unload each other's filters. In the
>> best case, this will lead to disruption of their work. At a minimum, blksnap,
>> when detached, will reset the change tracker and each backup will perform a
>> full read of the block device. As a result, the user will receive distorted
>> data, the system will not work as planned, although there will be no error
>> message.
>
> I had a more complicated scenario in mind. For example, some kind of live migration
> from one block device to another, when you switch from the filter which clones from the
> source device to the target device to the filter which just redirects from the source
> device to the target device as the last step.

'Live migration' - I've heard that idea before.
I think it makes sense to create a description of what 'live migration' should
look like for Linux. Describe the purpose, what cases it would be useful, the
general algorithm of work.
It seems to me that the implementation of 'live migration' may be similar to
what is implemented in blksnap. Perhaps I could be useful in such a project.

> OTOH, that may be a very distant vision. Plus, one single and simple filter, which
> redirects I/O into a DM stack, would be enough or better anyway to do the more
> complicated things using the DM features, which include atomic replacement and
> stacking and everything.
>
> I don't have a strong opinion.

About one single and simple filter, which redirects I/O into a DM stack.
Yes, at first glance, such an implementation looks simple and obvious.
I tried to go this way - I failed. Maybe someone can do it. I'll be glad.

2023-04-16 06:20:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 02/11] block: Block Device Filtering Mechanism

On Wed, Apr 12, 2023 at 12:43:40PM +0200, Sergei Shtepa wrote:
> We can consider a block device as a resource that two actor want to take over.
> There are two possible behavioral strategies:
> 1. If one owner occupies a resource, then for other actors, the ownership
> request will end with a refusal. The owner will not lose his resource.
> 2. Any actor can take away a resource from the owner and inform him about its
> loss using a callback.
>
> I think the first strategy is safer. When calling ioctl BLKFILTER_ATTACH, the
> kernel informs the actor that the resource is busy.
> Of course, there is still an option to grab someone else's occupied resource.
> To do this, he will have to call ioctl BLKFILTER_DETACH, specifying the name
> of the filter that needs to be detached. It is assumed that such detached
> should be performed by the same actor that attached it there.

Yes.

> If we replace the owner at each ioctl BLKFILTER_ATTACH, then we can get a
> situation of competition between two actors. At the same time, they won't
> even get a message that something is going wrong.

> With the second strategy, both tools will unload each other's filters. In the
> best case, this will lead to disruption of their work. At a minimum, blksnap,
> when detached, will reset the change tracker and each backup will perform a
> full read of the block device. As a result, the user will receive distorted
> data, the system will not work as planned, although there will be no error
> message.

Exactly. Silent replacement is a bad idea. Maybe we can stupport
multiple filters, but I'm not entirely sold on that either. But
silently replacing an existing one is a bad idea.

2023-04-18 10:35:06

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module



On 4/14/23 14:34, Sergei Shtepa wrote:
> Subject:
> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
> From:
> Sergei Shtepa <[email protected]>
> Date:
> 4/14/23, 14:34
>
> To:
> Donald Buczek <[email protected]>, [email protected], [email protected], [email protected], [email protected]
> CC:
> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>
>
>
> On 4/12/23 21:38, Donald Buczek wrote:
>> Subject:
>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>> From:
>> Donald Buczek <[email protected]>
>> Date:
>> 4/12/23, 21:38
>>
>> To:
>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>> CC:
>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>
>>
>> I think, you can trigger all kind of user-after-free when userspace deletes a snapshot image or the snapshot image and the tracker while the disk device snapshot image is kept alive (mounted or just opened) and doing I/O.
>>
>> Here is what I did to provoke that:
>>
>> root@dose:~# s=$(blksnap snapshot_create -d /dev/vdb)
>> root@dose:~# blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
>> device path: '/dev/block/253:2'
>> allocate range: ofs=11264624 cnt=2097152
>> root@dose:~# blksnap snapshot_take -i $s
>> root@dose:~# mount /dev/blksnap-image_253\:16 /mnt
>> root@dose:~# dd if=/dev/zero of=/mnt/x.x &
>> [1] 2514
>> root@dose:~# blksnap snapshot_destroy -i $s
>> dd: writing to '/mnt/x.x': No space left on device
>> 1996041+0 records in
>> 1996040+0 records out
>> 1021972480 bytes (1.0 GB, 975 MiB) copied, 8.48923 s, 120 MB/s
>> [1]+  Exit 1                  dd if=/dev/zero of=/mnt/x.x
>>
> Thanks!
> I am very glad that the blksnap tool turned out to be useful in the review.
> This snapshot deletion scenario is not the most typical, but of course it is
> quite possible.
> I will need to solve this problem and add such a scenario to the test suite.
>

Hi!

I have redesign the logic of ownership of the diff_area structure.
See patch in attach or commit.
Link: https://github.com/SergeiShtepa/linux/commit/7e927c381dcd2b2293be8315897a224d111b6f88
A test script for such a scenario has been added.
Link: https://github.com/veeam/blksnap/commit/fd0559dfedf094901d08bbf185fed288f0156433

I will be glad of any feedback.


Attachments:
fix_diff_area_ownership.patch (16.33 kB)

2023-04-18 14:53:51

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module

On 4/18/23 12:31, Sergei Shtepa wrote:
>
>
> On 4/14/23 14:34, Sergei Shtepa wrote:
>> Subject:
>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>> From:
>> Sergei Shtepa <[email protected]>
>> Date:
>> 4/14/23, 14:34
>>
>> To:
>> Donald Buczek <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>> CC:
>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>
>>
>>
>> On 4/12/23 21:38, Donald Buczek wrote:
>>> Subject:
>>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>>> From:
>>> Donald Buczek <[email protected]>
>>> Date:
>>> 4/12/23, 21:38
>>>
>>> To:
>>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>>> CC:
>>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>>
>>>
>>> I think, you can trigger all kind of user-after-free when userspace deletes a snapshot image or the snapshot image and the tracker while the disk device snapshot image is kept alive (mounted or just opened) and doing I/O.
>>>
>>> Here is what I did to provoke that:
>>>
>>> root@dose:~# s=$(blksnap snapshot_create -d /dev/vdb)
>>> root@dose:~# blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
>>> device path: '/dev/block/253:2'
>>> allocate range: ofs=11264624 cnt=2097152
>>> root@dose:~# blksnap snapshot_take -i $s
>>> root@dose:~# mount /dev/blksnap-image_253\:16 /mnt
>>> root@dose:~# dd if=/dev/zero of=/mnt/x.x &
>>> [1] 2514
>>> root@dose:~# blksnap snapshot_destroy -i $s
>>> dd: writing to '/mnt/x.x': No space left on device
>>> 1996041+0 records in
>>> 1996040+0 records out
>>> 1021972480 bytes (1.0 GB, 975 MiB) copied, 8.48923 s, 120 MB/s
>>> [1]+  Exit 1                  dd if=/dev/zero of=/mnt/x.x
>>>
>> Thanks!
>> I am very glad that the blksnap tool turned out to be useful in the review.
>> This snapshot deletion scenario is not the most typical, but of course it is
>> quite possible.
>> I will need to solve this problem and add such a scenario to the test suite.
>>
>
> Hi!
>
> I have redesign the logic of ownership of the diff_area structure.
> See patch in attach or commit.
> Link: https://github.com/SergeiShtepa/linux/commit/7e927c381dcd2b2293be8315897a224d111b6f88
> A test script for such a scenario has been added.
> Link: https://github.com/veeam/blksnap/commit/fd0559dfedf094901d08bbf185fed288f0156433
>
> I will be glad of any feedback.

Great, Thanks!

However, there are two leftover calls to diff_area_free() with its old prototype:

CC [M] drivers/block/blksnap/diff_area.o
drivers/block/blksnap/diff_area.c: In function ‘diff_area_new’:
drivers/block/blksnap/diff_area.c:283:18: error: passing argument 1 of ‘diff_area_free’ from incompatible pointer type [-Werror=incompatible-pointer-types]
283 | diff_area_free(diff_area);
| ^~~~~~~~~
| |
| struct diff_area *
drivers/block/blksnap/diff_area.c:110:34: note: expected ‘struct kref *’ but argument is of type ‘struct diff_area *’
110 | void diff_area_free(struct kref *kref)
| ~~~~~~~~~~~~~^~~~
cc1: some warnings being treated as errors
make[4]: *** [scripts/Makefile.build:252: drivers/block/blksnap/diff_area.o] Error 1
make[3]: *** [scripts/Makefile.build:494: drivers/block/blksnap] Error 2
make[2]: *** [scripts/Makefile.build:494: drivers/block] Error 2
make[1]: *** [scripts/Makefile.build:494: drivers] Error 2
make: *** [Makefile:2025: .] Error 2

The other one:

buczek@dose:/scratch/local/linux (blksnap-test)$ make drivers/block/blksnap/tracker.o
CALL scripts/checksyscalls.sh
DESCEND objtool
INSTALL libsubcmd_headers
CC [M] drivers/block/blksnap/tracker.o
drivers/block/blksnap/tracker.c: In function ‘tracker_free’:
drivers/block/blksnap/tracker.c:26:25: error: passing argument 1 of ‘diff_area_free’ from incompatible pointer type [-Werror=incompatible-pointer-types]
26 | diff_area_free(tracker->diff_area);
| ~~~~~~~^~~~~~~~~~~
| |
| struct diff_area *
In file included from drivers/block/blksnap/tracker.c:12:
drivers/block/blksnap/diff_area.h:116:34: note: expected ‘struct kref *’ but argument is of type ‘struct diff_area *’
116 | void diff_area_free(struct kref *kref);
| ~~~~~~~~~~~~~^~~~
cc1: some warnings being treated as errors
make[4]: *** [scripts/Makefile.build:252: drivers/block/blksnap/tracker.o] Error 1
make[3]: *** [scripts/Makefile.build:494: drivers/block/blksnap] Error 2
make[2]: *** [scripts/Makefile.build:494: drivers/block] Error 2
make[1]: *** [scripts/Makefile.build:494: drivers] Error 2
make: *** [Makefile:2025: .] Error 2

Am I missing something?

Best
Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-04-19 13:14:11

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module



On 4/18/23 16:48, Donald Buczek wrote:
> Subject:
> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
> From:
> Donald Buczek <[email protected]>
> Date:
> 4/18/23, 16:48
>
> To:
> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
> CC:
> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>
>
> On 4/18/23 12:31, Sergei Shtepa wrote:
>>
>>
>> On 4/14/23 14:34, Sergei Shtepa wrote:
>>> Subject:
>>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>>> From:
>>> Sergei Shtepa <[email protected]>
>>> Date:
>>> 4/14/23, 14:34
>>>
>>> To:
>>> Donald Buczek <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>>> CC:
>>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>>
>>>
>>>
>>> On 4/12/23 21:38, Donald Buczek wrote:
>>>> Subject:
>>>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>>>> From:
>>>> Donald Buczek <[email protected]>
>>>> Date:
>>>> 4/12/23, 21:38
>>>>
>>>> To:
>>>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>>>> CC:
>>>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>>>
>>>>
>>>> I think, you can trigger all kind of user-after-free when userspace deletes a snapshot image or the snapshot image and the tracker while the disk device snapshot image is kept alive (mounted or just opened) and doing I/O.
>>>>
>>>> Here is what I did to provoke that:
>>>>
>>>> root@dose:~# s=$(blksnap snapshot_create -d /dev/vdb)
>>>> root@dose:~# blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
>>>> device path: '/dev/block/253:2'
>>>> allocate range: ofs=11264624 cnt=2097152
>>>> root@dose:~# blksnap snapshot_take -i $s
>>>> root@dose:~# mount /dev/blksnap-image_253\:16 /mnt
>>>> root@dose:~# dd if=/dev/zero of=/mnt/x.x &
>>>> [1] 2514
>>>> root@dose:~# blksnap snapshot_destroy -i $s
>>>> dd: writing to '/mnt/x.x': No space left on device
>>>> 1996041+0 records in
>>>> 1996040+0 records out
>>>> 1021972480 bytes (1.0 GB, 975 MiB) copied, 8.48923 s, 120 MB/s
>>>> [1]+  Exit 1                  dd if=/dev/zero of=/mnt/x.x
>>>>
>>> Thanks!
>>> I am very glad that the blksnap tool turned out to be useful in the review.
>>> This snapshot deletion scenario is not the most typical, but of course it is
>>> quite possible.
>>> I will need to solve this problem and add such a scenario to the test suite.
>>>
>>
>> Hi!
>>
>> I have redesign the logic of ownership of the diff_area structure.
>> See patch in attach or commit.
>> Link: https://github.com/SergeiShtepa/linux/commit/7e927c381dcd2b2293be8315897a224d111b6f88
>> A test script for such a scenario has been added.
>> Link: https://github.com/veeam/blksnap/commit/fd0559dfedf094901d08bbf185fed288f0156433
>>
>> I will be glad of any feedback.
>
> Great, Thanks!
>
> However, there are two leftover calls to diff_area_free() with its old prototype:
>
>  CC [M]  drivers/block/blksnap/diff_area.o
> drivers/block/blksnap/diff_area.c: In function ‘diff_area_new’:
> drivers/block/blksnap/diff_area.c:283:18: error: passing argument 1 of ‘diff_area_free’ from incompatible pointer type [-Werror=incompatible-pointer-types]
>   283 |   diff_area_free(diff_area);
>       |                  ^~~~~~~~~
>       |                  |
>       |                  struct diff_area *
> drivers/block/blksnap/diff_area.c:110:34: note: expected ‘struct kref *’ but argument is of type ‘struct diff_area *’
>   110 | void diff_area_free(struct kref *kref)
>       |                     ~~~~~~~~~~~~~^~~~
> cc1: some warnings being treated as errors
> make[4]: *** [scripts/Makefile.build:252: drivers/block/blksnap/diff_area.o] Error 1
> make[3]: *** [scripts/Makefile.build:494: drivers/block/blksnap] Error 2
> make[2]: *** [scripts/Makefile.build:494: drivers/block] Error 2
> make[1]: *** [scripts/Makefile.build:494: drivers] Error 2
> make: *** [Makefile:2025: .] Error 2
>
> The other one:
>
> buczek@dose:/scratch/local/linux (blksnap-test)$ make drivers/block/blksnap/tracker.o
>   CALL    scripts/checksyscalls.sh
>   DESCEND objtool
>   INSTALL libsubcmd_headers
>   CC [M]  drivers/block/blksnap/tracker.o
> drivers/block/blksnap/tracker.c: In function ‘tracker_free’:
> drivers/block/blksnap/tracker.c:26:25: error: passing argument 1 of ‘diff_area_free’ from incompatible pointer type [-Werror=incompatible-pointer-types]
>    26 |   diff_area_free(tracker->diff_area);
>       |                  ~~~~~~~^~~~~~~~~~~
>       |                         |
>       |                         struct diff_area *
> In file included from drivers/block/blksnap/tracker.c:12:
> drivers/block/blksnap/diff_area.h:116:34: note: expected ‘struct kref *’ but argument is of type ‘struct diff_area *’
>   116 | void diff_area_free(struct kref *kref);
>       |                     ~~~~~~~~~~~~~^~~~
> cc1: some warnings being treated as errors
> make[4]: *** [scripts/Makefile.build:252: drivers/block/blksnap/tracker.o] Error 1
> make[3]: *** [scripts/Makefile.build:494: drivers/block/blksnap] Error 2
> make[2]: *** [scripts/Makefile.build:494: drivers/block] Error 2
> make[1]: *** [scripts/Makefile.build:494: drivers] Error 2
> make: *** [Makefile:2025: .] Error 2
>
> Am I missing something?

Thanks!

It seems to me that I missed something.
The biggest mystery for me is why I was able to build and test the kernel.
I think it's some kind of incremental build effect.
I was only able to see the problem after 'make clean'.

Patches in attach and https://github.com/SergeiShtepa/linux/tree/blksnap-master


Attachments:
blksnap_cleanup_comment.patch (511.00 B)
blksnap_fix_diff_area_put.patch (860.00 B)
Download all attachments

2023-04-19 19:47:13

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module

Dear Sergei,

On 4/19/23 15:05, Sergei Shtepa wrote:
>
>
> On 4/18/23 16:48, Donald Buczek wrote:
>> Subject:
>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>> From:
>> Donald Buczek <[email protected]>
>> Date:
>> 4/18/23, 16:48
>>
>> To:
>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>> CC:
>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>
>>
>> On 4/18/23 12:31, Sergei Shtepa wrote:
>>>
>>>
>>> On 4/14/23 14:34, Sergei Shtepa wrote:
>>>> Subject:
>>>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>>>> From:
>>>> Sergei Shtepa <[email protected]>
>>>> Date:
>>>> 4/14/23, 14:34
>>>>
>>>> To:
>>>> Donald Buczek <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>>>> CC:
>>>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>>>
>>>>
>>>>
>>>> On 4/12/23 21:38, Donald Buczek wrote:
>>>>> Subject:
>>>>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>>>>> From:
>>>>> Donald Buczek <[email protected]>
>>>>> Date:
>>>>> 4/12/23, 21:38
>>>>>
>>>>> To:
>>>>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>>>>> CC:
>>>>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>>>>
>>>>>
>>>>> I think, you can trigger all kind of user-after-free when userspace deletes a snapshot image or the snapshot image and the tracker while the disk device snapshot image is kept alive (mounted or just opened) and doing I/O.
>>>>>
>>>>> Here is what I did to provoke that:
>>>>>
>>>>> root@dose:~# s=$(blksnap snapshot_create -d /dev/vdb)
>>>>> root@dose:~# blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
>>>>> device path: '/dev/block/253:2'
>>>>> allocate range: ofs=11264624 cnt=2097152
>>>>> root@dose:~# blksnap snapshot_take -i $s
>>>>> root@dose:~# mount /dev/blksnap-image_253\:16 /mnt
>>>>> root@dose:~# dd if=/dev/zero of=/mnt/x.x &
>>>>> [1] 2514
>>>>> root@dose:~# blksnap snapshot_destroy -i $s
>>>>> dd: writing to '/mnt/x.x': No space left on device
>>>>> 1996041+0 records in
>>>>> 1996040+0 records out
>>>>> 1021972480 bytes (1.0 GB, 975 MiB) copied, 8.48923 s, 120 MB/s
>>>>> [1]+  Exit 1                  dd if=/dev/zero of=/mnt/x.x
>>>>>
>>>> Thanks!
>>>> I am very glad that the blksnap tool turned out to be useful in the review.
>>>> This snapshot deletion scenario is not the most typical, but of course it is
>>>> quite possible.
>>>> I will need to solve this problem and add such a scenario to the test suite.
>>>>
>>>
>>> Hi!
>>>
>>> I have redesign the logic of ownership of the diff_area structure.
>>> See patch in attach or commit.
>>> Link: https://github.com/SergeiShtepa/linux/commit/7e927c381dcd2b2293be8315897a224d111b6f88
>>> A test script for such a scenario has been added.
>>> Link: https://github.com/veeam/blksnap/commit/fd0559dfedf094901d08bbf185fed288f0156433
>>>
>>> I will be glad of any feedback.
>>
>> Great, Thanks!
>>
>> However, there are two leftover calls to diff_area_free() with its old prototype:
>>
>>  CC [M]  drivers/block/blksnap/diff_area.o
>> drivers/block/blksnap/diff_area.c: In function ‘diff_area_new’:
>> drivers/block/blksnap/diff_area.c:283:18: error: passing argument 1 of ‘diff_area_free’ from incompatible pointer type [-Werror=incompatible-pointer-types]
>>   283 |   diff_area_free(diff_area);
>>       |                  ^~~~~~~~~
>>       |                  |
>>       |                  struct diff_area *
>> drivers/block/blksnap/diff_area.c:110:34: note: expected ‘struct kref *’ but argument is of type ‘struct diff_area *’
>>   110 | void diff_area_free(struct kref *kref)
>>       |                     ~~~~~~~~~~~~~^~~~
>> cc1: some warnings being treated as errors
>> make[4]: *** [scripts/Makefile.build:252: drivers/block/blksnap/diff_area.o] Error 1
>> make[3]: *** [scripts/Makefile.build:494: drivers/block/blksnap] Error 2
>> make[2]: *** [scripts/Makefile.build:494: drivers/block] Error 2
>> make[1]: *** [scripts/Makefile.build:494: drivers] Error 2
>> make: *** [Makefile:2025: .] Error 2
>>
>> The other one:
>>
>> buczek@dose:/scratch/local/linux (blksnap-test)$ make drivers/block/blksnap/tracker.o
>>   CALL    scripts/checksyscalls.sh
>>   DESCEND objtool
>>   INSTALL libsubcmd_headers
>>   CC [M]  drivers/block/blksnap/tracker.o
>> drivers/block/blksnap/tracker.c: In function ‘tracker_free’:
>> drivers/block/blksnap/tracker.c:26:25: error: passing argument 1 of ‘diff_area_free’ from incompatible pointer type [-Werror=incompatible-pointer-types]
>>    26 |   diff_area_free(tracker->diff_area);
>>       |                  ~~~~~~~^~~~~~~~~~~
>>       |                         |
>>       |                         struct diff_area *
>> In file included from drivers/block/blksnap/tracker.c:12:
>> drivers/block/blksnap/diff_area.h:116:34: note: expected ‘struct kref *’ but argument is of type ‘struct diff_area *’
>>   116 | void diff_area_free(struct kref *kref);
>>       |                     ~~~~~~~~~~~~~^~~~
>> cc1: some warnings being treated as errors
>> make[4]: *** [scripts/Makefile.build:252: drivers/block/blksnap/tracker.o] Error 1
>> make[3]: *** [scripts/Makefile.build:494: drivers/block/blksnap] Error 2
>> make[2]: *** [scripts/Makefile.build:494: drivers/block] Error 2
>> make[1]: *** [scripts/Makefile.build:494: drivers] Error 2
>> make: *** [Makefile:2025: .] Error 2
>>
>> Am I missing something?
>
> Thanks!
>
> It seems to me that I missed something.
> The biggest mystery for me is why I was able to build and test the kernel.
> I think it's some kind of incremental build effect.
> I was only able to see the problem after 'make clean'.
>
> Patches in attach and https://github.com/SergeiShtepa/linux/tree/blksnap-master

Thanks. I can confirm that this fixes the reported problem and I no longer can trigger the UAF. :-)

Tested-Bny: Donald Buczek <[email protected]>

Maybe you can add me to the cc list for v4 as I'm not subscribed to the lists.

Best

Donald
--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-04-20 14:58:58

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module

On 4/19/23 21:42, Donald Buczek wrote:
> Dear Sergei,
>
> On 4/19/23 15:05, Sergei Shtepa wrote:
>> [...]
>>
>> Patches in attach and https://github.com/SergeiShtepa/linux/tree/blksnap-master
>
> Thanks. I can confirm that this fixes the reported problem and I no longer can trigger the UAF. :-)
>
> Tested-Bny: Donald Buczek <[email protected]>
>
> Maybe you can add me to the cc list for v4 as I'm not subscribed to the lists.


Sorry, found another one. Reproducer:

=====
#! /bin/bash
set -xe
modprobe blksnap
test -e /scratch/local/test.dat || fallocate -l 1G /scratch/local/test.dat
s=$(blksnap snapshot_create -d /dev/vdb)
blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
blksnap snapshot_take -i $s
s2=$(blksnap snapshot_create -d /dev/vdb)
blksnap snapshot_destroy -i $s2
blksnap snapshot_destroy -i $s
=====


[20382.402921] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was created
[20382.535933] blksnap-image: Create snapshot image device for original device [253:16]
[20382.542405] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was taken successfully
[20382.572564] blksnap-snapshot: Snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966 was created
[20382.600521] blksnap-snapshot: Destroy snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
[20382.602373] blksnap-snapshot: Release snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
[20382.722137] blksnap-snapshot: Destroy snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
[20382.724033] blksnap-snapshot: Release snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
[20382.725850] ==================================================================
[20382.727641] BUG: KASAN: wild-memory-access in snapshot_free+0x73/0x170 [blksnap]
[20382.729326] Write of size 8 at addr dead000000000108 by task blksnap/8297

[20382.731212] CPU: 4 PID: 8297 Comm: blksnap Not tainted 6.3.0-rc5.mx64.428-00094-g21dc08a94f59-dirty #41
[20382.733293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[20382.735807] Call Trace:
[20382.736395] <TASK>
[20382.736900] dump_stack_lvl+0x37/0x50
[20382.737767] ? snapshot_free+0x73/0x170 [blksnap]
[20382.738873] kasan_report+0xb2/0xe0
[20382.739690] ? snapshot_free+0x73/0x170 [blksnap]
[20382.740799] snapshot_free+0x73/0x170 [blksnap]
[20382.741868] snapshot_destroy+0x119/0x170 [blksnap]
[20382.743009] ioctl_snapshot_destroy+0x7a/0xc0 [blksnap]
[20382.744241] ? __pfx_ioctl_snapshot_destroy+0x10/0x10 [blksnap]
[20382.745606] ? __fget_light+0x1ca/0x200
[20382.746493] ctrl_unlocked_ioctl+0x3a/0x60 [blksnap]
[20382.747654] __x64_sys_ioctl+0xc6/0xe0
[20382.748528] do_syscall_64+0x47/0xa0
[20382.749369] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[20382.750524] RIP: 0033:0x7f27fbf7f4db
[20382.751351] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
[20382.760773] RSP: 002b:00007ffcf157de50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[20382.767031] RAX: ffffffffffffffda RBX: 00007ffcf157df00 RCX: 00007f27fbf7f4db
[20382.772597] RDX: 00007ffcf157dec0 RSI: 0000000080105602 RDI: 0000000000000003
[20382.777677] RBP: 00007ffcf157df60 R08: 00007ffcf157df30 R09: 0000000000000000
[20382.782318] R10: 00007f27fc18d8f0 R11: 0000000000000246 R12: 00007ffcf157e2f8
[20382.786617] R13: 00000000004079f6 R14: 00000000005653f8 R15: 00007f27fc19e040
[20382.790634] </TASK>
[20382.794012] ==================================================================
[20382.797799] Disabling lock debugging due to kernel taint
[20382.801245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP KASAN PTI
[20382.805060] CPU: 4 PID: 8297 Comm: blksnap Tainted: G B 6.3.0-rc5.mx64.428-00094-g21dc08a94f59-dirty #41
[20382.808757] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[20382.812441] RIP: 0010:snapshot_free+0x73/0x170 [blksnap]
[20382.815569] Code: 4d 8b 74 24 50 4c 89 f7 49 8d 6e f8 e8 56 a7 f3 e0 49 8b 1e 49 8d 7e 08 e8 4a a7 f3 e0 4d 8b 7e 08 48 8d 7b 08 e8 ed a7 f3 e0 <4c> 89 7b 08 4c 89 ff e8 e1 a7 f3 e0 49 89 1f 48 89 ef 48 b8 00 01
[20382.822394] RSP: 0018:ffff888120fafe18 EFLAGS: 00010292
[20382.825587] RAX: 0000000000000001 RBX: dead000000000100 RCX: ffffffff810cb82a
[20382.828866] RDX: fffffbfff0850d49 RSI: 0000000000000008 RDI: ffffffff84286a40
[20382.832142] RBP: ffff888105a4f400 R08: 0000000000000001 R09: ffffffff84286a47
[20382.835366] R10: fffffbfff0850d48 R11: 0000000000000001 R12: ffff888124ac9b10
[20382.838557] R13: ffff888124ac9b60 R14: ffff888105a4f408 R15: dead000000000122
[20382.841759] FS: 00007f27fbe7c780(0000) GS:ffff888261800000(0000) knlGS:0000000000000000
[20382.845049] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20382.848186] CR2: 00007f27fbed0d00 CR3: 0000000104608004 CR4: 0000000000170ee0
[20382.851439] Call Trace:
[20382.854366] <TASK>
[20382.857219] snapshot_destroy+0x119/0x170 [blksnap]
[20382.860286] ioctl_snapshot_destroy+0x7a/0xc0 [blksnap]
[20382.863356] ? __pfx_ioctl_snapshot_destroy+0x10/0x10 [blksnap]
[20382.866471] ? __fget_light+0x1ca/0x200
[20382.869377] ctrl_unlocked_ioctl+0x3a/0x60 [blksnap]
[20382.872322] __x64_sys_ioctl+0xc6/0xe0
[20382.875124] do_syscall_64+0x47/0xa0
[20382.877921] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[20382.880831] RIP: 0033:0x7f27fbf7f4db
[20382.883628] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
[20382.890371] RSP: 002b:00007ffcf157de50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[20382.893694] RAX: ffffffffffffffda RBX: 00007ffcf157df00 RCX: 00007f27fbf7f4db
[20382.897004] RDX: 00007ffcf157dec0 RSI: 0000000080105602 RDI: 0000000000000003
[20382.900301] RBP: 00007ffcf157df60 R08: 00007ffcf157df30 R09: 0000000000000000
[20382.903572] R10: 00007f27fc18d8f0 R11: 0000000000000246 R12: 00007ffcf157e2f8
[20382.906879] R13: 00000000004079f6 R14: 00000000005653f8 R15: 00007f27fc19e040
[20382.910171] </TASK>
[20382.913069] Modules linked in: blksnap rpcsec_gss_krb5 nfsv4 nfs 8021q garp stp mrp llc kvm_intel bochs drm_vram_helper drm_ttm_helper virtio_net kvm ttm net_failover drm_kms_helper input_leds led_class irqbypass drm failover crc32c_intel syscopyarea sysfillrect i2c_piix4 intel_agp sysimgblt intel_gtt floppy nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables unix ipv6 autofs4 [last unloaded: blksnap]
[20382.924858] ---[ end trace 0000000000000000 ]---
[20382.928229] RIP: 0010:snapshot_free+0x73/0x170 [blksnap]
[20382.931564] Code: 4d 8b 74 24 50 4c 89 f7 49 8d 6e f8 e8 56 a7 f3 e0 49 8b 1e 49 8d 7e 08 e8 4a a7 f3 e0 4d 8b 7e 08 48 8d 7b 08 e8 ed a7 f3 e0 <4c> 89 7b 08 4c 89 ff e8 e1 a7 f3 e0 49 89 1f 48 89 ef 48 b8 00 01
[20382.939104] RSP: 0018:ffff888120fafe18 EFLAGS: 00010292
[20382.942536] RAX: 0000000000000001 RBX: dead000000000100 RCX: ffffffff810cb82a
[20382.946224] RDX: fffffbfff0850d49 RSI: 0000000000000008 RDI: ffffffff84286a40
[20382.949927] RBP: ffff888105a4f400 R08: 0000000000000001 R09: ffffffff84286a47
[20382.953591] R10: fffffbfff0850d48 R11: 0000000000000001 R12: ffff888124ac9b10
[20382.957253] R13: ffff888124ac9b60 R14: ffff888105a4f408 R15: dead000000000122
[20382.960914] FS: 00007f27fbe7c780(0000) GS:ffff888261800000(0000) knlGS:0000000000000000
[20382.964666] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[20382.968148] CR2: 00007f27fbed0d00 CR3: 0000000104608004 CR4: 0000000000170ee0


Best
Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433

2023-04-20 19:28:26

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module



On 4/20/23 16:44, Donald Buczek wrote:
> Subject:
> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
> From:
> Donald Buczek <[email protected]>
> Date:
> 4/20/23, 16:44
>
> To:
> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
> CC:
> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>
>
> On 4/19/23 21:42, Donald Buczek wrote:
>> Dear Sergei,
>>
>> On 4/19/23 15:05, Sergei Shtepa wrote:
>>> [...]
>>>
>>> Patches in attach and https://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FSergeiShtepa%2Flinux%2Ftree%2Fblksnap-master&data=05%7C01%7Csergei.shtepa%40veeam.com%7Cccc78e2cdf7845c6c0cd08db41add281%7Cba07baab431b49edadd7cbc3542f5140%7C1%7C0%7C638175987085694967%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=RdrWqUwvk7gjfSRYvrPfz2E0%2BIOY6IQxK4xvpzJqcnk%3D&reserved=0
>>
>> Thanks. I can confirm that this fixes the reported problem and I no longer can trigger the UAF. ????
>>
>> Tested-Bny: Donald Buczek <[email protected]>
>>
>> Maybe you can add me to the cc list for v4 as I'm not subscribed to the lists.
>
>
> Sorry, found another one. Reproducer:
>
> =====
> #! /bin/bash
> set -xe
> modprobe blksnap
> test -e /scratch/local/test.dat || fallocate -l 1G /scratch/local/test.dat
> s=$(blksnap snapshot_create -d /dev/vdb)
> blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
> blksnap snapshot_take -i $s
> s2=$(blksnap snapshot_create -d /dev/vdb)
> blksnap snapshot_destroy -i $s2
> blksnap snapshot_destroy -i $s
> =====
>
>
> [20382.402921] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was created
> [20382.535933] blksnap-image: Create snapshot image device for original device [253:16]
> [20382.542405] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was taken successfully
> [20382.572564] blksnap-snapshot: Snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966 was created
> [20382.600521] blksnap-snapshot: Destroy snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
> [20382.602373] blksnap-snapshot: Release snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
> [20382.722137] blksnap-snapshot: Destroy snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
> [20382.724033] blksnap-snapshot: Release snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
> [20382.725850] ==================================================================
> [20382.727641] BUG: KASAN: wild-memory-access in snapshot_free+0x73/0x170 [blksnap]
> [20382.729326] Write of size 8 at addr dead000000000108 by task blksnap/8297
> ...

Great! Thanks.

There is no protection against re-adding a block device to the snapshot.
I'll take care of it.

And small update. I have made a correction to the bio allocation algorithm
for saving and loading chunks. Please, see attach and commit.
Link: https://github.com/SergeiShtepa/linux/commit/2628dd193fd3d563d26d5ccc82d35b2e11bbda38
But cases of large chunks or very large disks have not yet been covered
by tests, yet.

I also had concerns that the snapshot writing algorithm was not working
correctly. But the concerns were in vain. The new test is working.
Link: https://github.com/veeam/blksnap/blob/stable-v2.0/tests/6000-snapimage_write.sh


Attachments:
fix_page_inx_increment.patch (2.29 kB)

2023-04-21 17:38:16

by Sergei Shtepa

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module



On 4/20/23 21:17, Sergei Shtepa wrote:
> Subject:
> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
> From:
> Sergei Shtepa <[email protected]>
> Date:
> 4/20/23, 21:17
>
> To:
> Donald Buczek <[email protected]>, [email protected], [email protected], [email protected], [email protected]
> CC:
> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>
>
>
> On 4/20/23 16:44, Donald Buczek wrote:
>> Subject:
>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>> From:
>> Donald Buczek <[email protected]>
>> Date:
>> 4/20/23, 16:44
>>
>> To:
>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>> CC:
>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>
>>
>> On 4/19/23 21:42, Donald Buczek wrote:
>>> Dear Sergei,
>>>
>>> On 4/19/23 15:05, Sergei Shtepa wrote:
>>>> [...]
>>>>
>>>> Patches in attach and https://github.com/SergeiShtepa/linux/tree/blksnap-master
>>> Thanks. I can confirm that this fixes the reported problem and I no longer can trigger the UAF. ????
>>>
>>> Tested-Bny: Donald Buczek <[email protected]>
>>>
>>> Maybe you can add me to the cc list for v4 as I'm not subscribed to the lists.
>>
>> Sorry, found another one. Reproducer:
>>
>> =====
>> #! /bin/bash
>> set -xe
>> modprobe blksnap
>> test -e /scratch/local/test.dat || fallocate -l 1G /scratch/local/test.dat
>> s=$(blksnap snapshot_create -d /dev/vdb)
>> blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
>> blksnap snapshot_take -i $s
>> s2=$(blksnap snapshot_create -d /dev/vdb)
>> blksnap snapshot_destroy -i $s2
>> blksnap snapshot_destroy -i $s
>> =====
>>
>>
>> [20382.402921] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was created
>> [20382.535933] blksnap-image: Create snapshot image device for original device [253:16]
>> [20382.542405] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was taken successfully
>> [20382.572564] blksnap-snapshot: Snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966 was created
>> [20382.600521] blksnap-snapshot: Destroy snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
>> [20382.602373] blksnap-snapshot: Release snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
>> [20382.722137] blksnap-snapshot: Destroy snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
>> [20382.724033] blksnap-snapshot: Release snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
>> [20382.725850] ==================================================================
>> [20382.727641] BUG: KASAN: wild-memory-access in snapshot_free+0x73/0x170 [blksnap]
>> [20382.729326] Write of size 8 at addr dead000000000108 by task blksnap/8297
>> ...
> Great! Thanks.
>
> There is no protection against re-adding a block device to the snapshot.
> I'll take care of it.
>

Hi!

I think the fix turned out to be quite beautiful.
Now you will get an error "Device or resource busy".
Fix in attach and on github.
Link: https://github.com/SergeiShtepa/linux/commit/43a5d3dd9858f092b734187b6a62ce75acaa47c7


Attachments:
fix_no_protection_re-adding.patch (2.56 kB)

2023-04-22 20:18:47

by Donald Buczek

[permalink] [raw]
Subject: Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module



On 4/21/23 19:32, Sergei Shtepa wrote:
>
>
> On 4/20/23 21:17, Sergei Shtepa wrote:
>> Subject:
>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>> From:
>> Sergei Shtepa <[email protected]>
>> Date:
>> 4/20/23, 21:17
>>
>> To:
>> Donald Buczek <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>> CC:
>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>
>>
>>
>> On 4/20/23 16:44, Donald Buczek wrote:
>>> Subject:
>>> Re: [PATCH v3 03/11] documentation: Block Devices Snapshots Module
>>> From:
>>> Donald Buczek <[email protected]>
>>> Date:
>>> 4/20/23, 16:44
>>>
>>> To:
>>> Sergei Shtepa <[email protected]>, [email protected], [email protected], [email protected], [email protected]
>>> CC:
>>> [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
>>>
>>>
>>> On 4/19/23 21:42, Donald Buczek wrote:
>>>> Dear Sergei,
>>>>
>>>> On 4/19/23 15:05, Sergei Shtepa wrote:
>>>>> [...]
>>>>>
>>>>> Patches in attach and https://github.com/SergeiShtepa/linux/tree/blksnap-master
>>>> Thanks. I can confirm that this fixes the reported problem and I no longer can trigger the UAF. ????
>>>>
>>>> Tested-Bny: Donald Buczek <[email protected]>
>>>>
>>>> Maybe you can add me to the cc list for v4 as I'm not subscribed to the lists.
>>>
>>> Sorry, found another one. Reproducer:
>>>
>>> =====
>>> #! /bin/bash
>>> set -xe
>>> modprobe blksnap
>>> test -e /scratch/local/test.dat || fallocate -l 1G /scratch/local/test.dat
>>> s=$(blksnap snapshot_create -d /dev/vdb)
>>> blksnap snapshot_appendstorage -i $s -f /scratch/local/test.dat
>>> blksnap snapshot_take -i $s
>>> s2=$(blksnap snapshot_create -d /dev/vdb)
>>> blksnap snapshot_destroy -i $s2
>>> blksnap snapshot_destroy -i $s
>>> =====
>>>
>>>
>>> [20382.402921] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was created
>>> [20382.535933] blksnap-image: Create snapshot image device for original device [253:16]
>>> [20382.542405] blksnap-snapshot: Snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa was taken successfully
>>> [20382.572564] blksnap-snapshot: Snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966 was created
>>> [20382.600521] blksnap-snapshot: Destroy snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
>>> [20382.602373] blksnap-snapshot: Release snapshot 4b2d571d-9a24-419d-96c2-8d64a07c4966
>>> [20382.722137] blksnap-snapshot: Destroy snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
>>> [20382.724033] blksnap-snapshot: Release snapshot ff1c54f1-3e8c-4c99-bb26-35e82dc1c9fa
>>> [20382.725850] ==================================================================
>>> [20382.727641] BUG: KASAN: wild-memory-access in snapshot_free+0x73/0x170 [blksnap]
>>> [20382.729326] Write of size 8 at addr dead000000000108 by task blksnap/8297
>>> ...
>> Great! Thanks.
>>
>> There is no protection against re-adding a block device to the snapshot.
>> I'll take care of it.
>>
>
> Hi!
>
> I think the fix turned out to be quite beautiful.
> Now you will get an error "Device or resource busy".
> Fix in attach and on github.
> Link: https://github.com/SergeiShtepa/linux/commit/43a5d3dd9858f092b734187b6a62ce75acaa47c7

I can confirm, that this fixes the problem.

root@dose:~# blksnap snapshot_create -d /dev/vda -d /dev/vda
fdcd3ee3-a25f-4c2a-93d7-2d951520e938
Operation already in progress
root@dose:~# echo $?
1


Tested-By: Donald Buczek <[email protected]>

Best
Donald

--
Donald Buczek
[email protected]
Tel: +49 30 8413 1433