2023-06-12 13:40:54

by Sergei Shtepa

Subject: [PATCH v5 00/11] blksnap - block devices snapshots module

Hi all.

I am happy to offer an improved version of the Block Devices Snapshots
Module. It allows creating non-persistent snapshots of any block device.
The main purpose of such snapshots is to provide backups of block devices.
See Documentation/block/blksnap.rst for details.

The Block Device Filtering Mechanism is added to the block layer. It
allows attaching and detaching block device filters, which extend the
functionality of the block layer.
See Documentation/block/blkfilter.rst for details.
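
For illustration, a minimal userspace sketch of attaching the blksnap
filter to a block device with the new BLKFILTER_ATTACH ioctl (the header
location and the layout of struct blkfilter_name are assumptions here;
the authoritative definitions are in include/uapi/linux/blk-filter.h
from this series):

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>
  #include <linux/blk-filter.h>	/* assumed: struct blkfilter_name */

  /* Attach the "blksnap" filter to a block device; BLKFILTER_DETACH
   * works the same way. Error handling is omitted for brevity. */
  static int attach_blksnap(const char *bdev_path)
  {
          struct blkfilter_name name = {0};
          int fd = open(bdev_path, O_RDONLY);

          if (fd < 0)
                  return -1;
          strncpy((char *)name.name, "blksnap", sizeof(name.name) - 1);
          return ioctl(fd, BLKFILTER_ATTACH, &name);
  }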

The tool, library and tests for working with blksnap can be found on GitHub.
Link: https://github.com/veeam/blksnap/tree/stable-v2.0

There are a few changes in this patch version. The experience of using the
out-of-tree version of the blksnap module on real servers has been taken
into account.

v5 changes:
- Rebased onto the "kernel/git/axboe/linux-block.git" branch "for-6.5/block".
Link: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=for-6.5/block

v4 changes:
- Structures for describing the state of chunks are allocated dynamically.
This reduces memory consumption, since the struct chunk is allocated only
for those blocks for which the snapshot image state differs from the
original block device.
- The algorithm for calculating the chunk size depending on the size of the
block device has been changed. For large block devices, it is now
possible to allocate a larger number of chunks, and their size is smaller.
- For block devices, a 'filter' file has been added to /sys/block/<device>.
It displays the name of the filter that is attached to the block device.
- Fixed a problem with the lack of protection against re-adding a block
device to a snapshot.
- Fixed a bug in the algorithm of allocating the next bio for a chunk.
This problem occurred on large disks, for which a chunk consists of
at least two bios.
- The ownership mechanism of the diff_area structure has been changed.
This fixed the error of prematurely releasing the diff_area structure
when destroying the snapshot.
- Documentation corrected.
- The code now passes the Sparse analyzer checks.
- Use the __u64 type instead of pointers in the UAPI.

v3 changes:
- The new block device I/O controls BLKFILTER_ATTACH and BLKFILTER_DETACH
allow attaching and detaching filters.
- The new block device I/O control BLKFILTER_CTL allows sending a command
to the attached block device filter.
- The copy-on-write algorithm for processing I/O units has been optimized
and has become asynchronous.
- The snapshot image reading algorithm has been optimized and has become
asynchronous.
- Optimized the finite state machine for processing chunks.
- Fixed a tracking block size calculation bug.

v2 changes:
- Added documentation for Block Device Filtering Mechanism.
- Added documentation for Block Devices Snapshots Module (blksnap).
- The MAINTAINERS file has been updated.
- Optimized queue code for snapshot images.
- Fixed comments, log messages and code for better readability.

v1 changes:
- Forgotten "static" declarations have been added.
- The text of the comments has been corrected.
- Only one filter can be attached, since there are no others upstream.
- No additional locks are used for attaching/detaching a filter.
- blksnap.h moved to include/uapi/.
- #pragma once and commented code removed.
- uuid_t removed from user API.
- Removed default values for module parameters from the configuration file.
- The debugging code for tracking memory leaks has been removed.
- Simplified Makefile.
- Optimized work with large memory buffers; CBT tables are now in virtual
memory.
- The allocation code of minor numbers has been optimized.
- The implementation of the snapshot image block device has been
simplified, now it is a bio-based block device.
- Removed initialization of global variables with null values.
- Only one bio is used to copy one chunk.
- Checked on ppc64le.

Thanks for helping to prepare the v4 patch:
- Christoph Hellwig <[email protected]> for his significant contribution
to the project.
- Fabio Fantoni <[email protected]> for his participation in the
project, useful advice and faith in the success of the project.
- Donald Buczek <[email protected]> for researching the module and
user-space tool. His fresh look revealed a number of flaws.
- Bagas Sanjaya <[email protected]> for comments on the documentation.


Sergei Shtepa (11):
documentation: Block Device Filtering Mechanism
block: Block Device Filtering Mechanism
documentation: Block Devices Snapshots Module
blksnap: header file of the module interface
blksnap: module management interface functions
blksnap: handling and tracking I/O units
blksnap: minimum data storage unit of the original block device
blksnap: difference storage
blksnap: event queue from the difference storage
blksnap: snapshot and snapshot image block device
blksnap: Kconfig and Makefile

Documentation/block/blkfilter.rst | 64 ++++
Documentation/block/blksnap.rst | 345 +++++++++++++++++
Documentation/block/index.rst | 2 +
MAINTAINERS | 17 +
block/Makefile | 3 +-
block/bdev.c | 1 +
block/blk-core.c | 27 ++
block/blk-filter.c | 213 ++++++++++
block/blk.h | 11 +
block/genhd.c | 10 +
block/ioctl.c | 7 +
block/partitions/core.c | 10 +
drivers/block/Kconfig | 2 +
drivers/block/Makefile | 2 +
drivers/block/blksnap/Kconfig | 12 +
drivers/block/blksnap/Makefile | 15 +
drivers/block/blksnap/cbt_map.c | 227 +++++++++++
drivers/block/blksnap/cbt_map.h | 90 +++++
drivers/block/blksnap/chunk.c | 454 ++++++++++++++++++++++
drivers/block/blksnap/chunk.h | 114 ++++++
drivers/block/blksnap/diff_area.c | 554 +++++++++++++++++++++++++++
drivers/block/blksnap/diff_area.h | 144 +++++++
drivers/block/blksnap/diff_buffer.c | 127 ++++++
drivers/block/blksnap/diff_buffer.h | 37 ++
drivers/block/blksnap/diff_storage.c | 316 +++++++++++++++
drivers/block/blksnap/diff_storage.h | 111 ++++++
drivers/block/blksnap/event_queue.c | 87 +++++
drivers/block/blksnap/event_queue.h | 65 ++++
drivers/block/blksnap/main.c | 483 +++++++++++++++++++++++
drivers/block/blksnap/params.h | 16 +
drivers/block/blksnap/snapimage.c | 124 ++++++
drivers/block/blksnap/snapimage.h | 10 +
drivers/block/blksnap/snapshot.c | 443 +++++++++++++++++++++
drivers/block/blksnap/snapshot.h | 68 ++++
drivers/block/blksnap/tracker.c | 339 ++++++++++++++++
drivers/block/blksnap/tracker.h | 75 ++++
include/linux/blk-filter.h | 51 +++
include/linux/blk_types.h | 2 +
include/linux/blkdev.h | 1 +
include/uapi/linux/blk-filter.h | 35 ++
include/uapi/linux/blksnap.h | 421 ++++++++++++++++++++
include/uapi/linux/fs.h | 3 +
42 files changed, 5137 insertions(+), 1 deletion(-)
create mode 100644 Documentation/block/blkfilter.rst
create mode 100644 Documentation/block/blksnap.rst
create mode 100644 block/blk-filter.c
create mode 100644 drivers/block/blksnap/Kconfig
create mode 100644 drivers/block/blksnap/Makefile
create mode 100644 drivers/block/blksnap/cbt_map.c
create mode 100644 drivers/block/blksnap/cbt_map.h
create mode 100644 drivers/block/blksnap/chunk.c
create mode 100644 drivers/block/blksnap/chunk.h
create mode 100644 drivers/block/blksnap/diff_area.c
create mode 100644 drivers/block/blksnap/diff_area.h
create mode 100644 drivers/block/blksnap/diff_buffer.c
create mode 100644 drivers/block/blksnap/diff_buffer.h
create mode 100644 drivers/block/blksnap/diff_storage.c
create mode 100644 drivers/block/blksnap/diff_storage.h
create mode 100644 drivers/block/blksnap/event_queue.c
create mode 100644 drivers/block/blksnap/event_queue.h
create mode 100644 drivers/block/blksnap/main.c
create mode 100644 drivers/block/blksnap/params.h
create mode 100644 drivers/block/blksnap/snapimage.c
create mode 100644 drivers/block/blksnap/snapimage.h
create mode 100644 drivers/block/blksnap/snapshot.c
create mode 100644 drivers/block/blksnap/snapshot.h
create mode 100644 drivers/block/blksnap/tracker.c
create mode 100644 drivers/block/blksnap/tracker.h
create mode 100644 include/linux/blk-filter.h
create mode 100644 include/uapi/linux/blk-filter.h
create mode 100644 include/uapi/linux/blksnap.h

--
2.20.1



2023-06-12 13:51:43

by Sergei Shtepa

Subject: [PATCH v5 06/11] blksnap: handling and tracking I/O units

The struct tracker contains callback functions for handling the I/O units
of a block device. When a write request is handled, the change block
tracking (CBT) map functions are called, which initiates the process of
copying data from the original block device to the difference storage.
Registering and unregistering the tracker is provided by the functions
blkfilter_register() and blkfilter_unregister().
The struct cbt_map stores the history of block device changes.
Co-developed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sergei Shtepa <[email protected]>
---
drivers/block/blksnap/cbt_map.c | 227 +++++++++++++++++++++
drivers/block/blksnap/cbt_map.h | 90 +++++++++
drivers/block/blksnap/tracker.c | 339 ++++++++++++++++++++++++++++++++
drivers/block/blksnap/tracker.h | 75 +++++++
4 files changed, 731 insertions(+)
create mode 100644 drivers/block/blksnap/cbt_map.c
create mode 100644 drivers/block/blksnap/cbt_map.h
create mode 100644 drivers/block/blksnap/tracker.c
create mode 100644 drivers/block/blksnap/tracker.h

diff --git a/drivers/block/blksnap/cbt_map.c b/drivers/block/blksnap/cbt_map.c
new file mode 100644
index 000000000000..a0aeef8c2e94
--- /dev/null
+++ b/drivers/block/blksnap/cbt_map.c
@@ -0,0 +1,227 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#define pr_fmt(fmt) KBUILD_MODNAME "-cbt_map: " fmt
+
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <uapi/linux/blksnap.h>
+#include "cbt_map.h"
+#include "params.h"
+
+static inline unsigned long long count_by_shift(sector_t capacity,
+ unsigned long long shift)
+{
+ sector_t blk_size = 1ull << (shift - SECTOR_SHIFT);
+
+ return round_up(capacity, blk_size) / blk_size;
+}
+
+static void cbt_map_calculate_block_size(struct cbt_map *cbt_map)
+{
+ unsigned long long count;
+ unsigned long long shift = get_tracking_block_minimum_shift();
+
+ pr_debug("Device capacity %llu sectors\n", cbt_map->device_capacity);
+ /*
+ * The size of the tracking block is calculated based on the size of the disk
+ * so that the CBT table does not exceed a reasonable size.
+ */
+ count = count_by_shift(cbt_map->device_capacity, shift);
+ pr_debug("Blocks count %llu\n", count);
+ while (count > get_tracking_block_maximum_count()) {
+ if (shift >= get_tracking_block_maximum_shift()) {
+ pr_info("The maximum allowable CBT block size has been reached.\n");
+ break;
+ }
+ shift = shift + 1ull;
+ count = count_by_shift(cbt_map->device_capacity, shift);
+ pr_debug("Blocks count %llu\n", count);
+ }
+
+ cbt_map->blk_size_shift = shift;
+ cbt_map->blk_count = count;
+ pr_debug("The optimal CBT block size was calculated as %llu bytes\n",
+ (1ull << cbt_map->blk_size_shift));
+}
+
+static int cbt_map_allocate(struct cbt_map *cbt_map)
+{
+ unsigned char *read_map = NULL;
+ unsigned char *write_map = NULL;
+ size_t size = cbt_map->blk_count;
+
+ pr_debug("Allocate CBT map of %zu blocks\n", size);
+
+ if (cbt_map->read_map || cbt_map->write_map)
+ return -EINVAL;
+
+ read_map = __vmalloc(size, GFP_NOIO | __GFP_ZERO);
+ if (!read_map)
+ return -ENOMEM;
+
+ write_map = __vmalloc(size, GFP_NOIO | __GFP_ZERO);
+ if (!write_map) {
+ vfree(read_map);
+ return -ENOMEM;
+ }
+
+ cbt_map->read_map = read_map;
+ cbt_map->write_map = write_map;
+
+ cbt_map->snap_number_previous = 0;
+ cbt_map->snap_number_active = 1;
+ generate_random_uuid(cbt_map->generation_id.b);
+ cbt_map->is_corrupted = false;
+
+ return 0;
+}
+
+static void cbt_map_deallocate(struct cbt_map *cbt_map)
+{
+ cbt_map->is_corrupted = false;
+
+ if (cbt_map->read_map) {
+ vfree(cbt_map->read_map);
+ cbt_map->read_map = NULL;
+ }
+
+ if (cbt_map->write_map) {
+ vfree(cbt_map->write_map);
+ cbt_map->write_map = NULL;
+ }
+}
+
+int cbt_map_reset(struct cbt_map *cbt_map, sector_t device_capacity)
+{
+ cbt_map_deallocate(cbt_map);
+
+ cbt_map->device_capacity = device_capacity;
+ cbt_map_calculate_block_size(cbt_map);
+
+ return cbt_map_allocate(cbt_map);
+}
+
+void cbt_map_destroy(struct cbt_map *cbt_map)
+{
+ pr_debug("CBT map destroy\n");
+
+ cbt_map_deallocate(cbt_map);
+ kfree(cbt_map);
+}
+
+struct cbt_map *cbt_map_create(struct block_device *bdev)
+{
+ struct cbt_map *cbt_map = NULL;
+ int ret;
+
+ pr_debug("CBT map create\n");
+
+ cbt_map = kzalloc(sizeof(struct cbt_map), GFP_KERNEL);
+ if (cbt_map == NULL)
+ return NULL;
+
+ cbt_map->device_capacity = bdev_nr_sectors(bdev);
+ cbt_map_calculate_block_size(cbt_map);
+
+ ret = cbt_map_allocate(cbt_map);
+ if (ret) {
+ pr_err("Failed to create tracker. errno=%d\n", abs(ret));
+ cbt_map_destroy(cbt_map);
+ return NULL;
+ }
+
+ spin_lock_init(&cbt_map->locker);
+ cbt_map->is_corrupted = false;
+
+ return cbt_map;
+}
+
+void cbt_map_switch(struct cbt_map *cbt_map)
+{
+ pr_debug("CBT map switch\n");
+ spin_lock(&cbt_map->locker);
+
+ cbt_map->snap_number_previous = cbt_map->snap_number_active;
+ ++cbt_map->snap_number_active;
+ if (cbt_map->snap_number_active == 256) {
+ cbt_map->snap_number_active = 1;
+
+ memset(cbt_map->write_map, 0, cbt_map->blk_count);
+
+ generate_random_uuid(cbt_map->generation_id.b);
+
+ pr_debug("CBT reset\n");
+ } else
+ memcpy(cbt_map->read_map, cbt_map->write_map, cbt_map->blk_count);
+ spin_unlock(&cbt_map->locker);
+}
+
+static inline int _cbt_map_set(struct cbt_map *cbt_map, sector_t sector_start,
+ sector_t sector_cnt, u8 snap_number,
+ unsigned char *map)
+{
+ int res = 0;
+ u8 num;
+ size_t inx;
+ size_t cbt_block_first = (size_t)(
+ sector_start >> (cbt_map->blk_size_shift - SECTOR_SHIFT));
+ size_t cbt_block_last = (size_t)(
+ (sector_start + sector_cnt - 1) >>
+ (cbt_map->blk_size_shift - SECTOR_SHIFT));
+
+ for (inx = cbt_block_first; inx <= cbt_block_last; ++inx) {
+ if (unlikely(inx >= cbt_map->blk_count)) {
+ pr_err("Block index is too large\n");
+ pr_err("Block #%zu was demanded, map size %zu blocks\n",
+ inx, cbt_map->blk_count);
+ res = -EINVAL;
+ break;
+ }
+
+ num = map[inx];
+ if (num < snap_number)
+ map[inx] = snap_number;
+ }
+ return res;
+}
+
+int cbt_map_set(struct cbt_map *cbt_map, sector_t sector_start,
+ sector_t sector_cnt)
+{
+ int res;
+
+ spin_lock(&cbt_map->locker);
+ if (unlikely(cbt_map->is_corrupted)) {
+ spin_unlock(&cbt_map->locker);
+ return -EINVAL;
+ }
+ res = _cbt_map_set(cbt_map, sector_start, sector_cnt,
+ (u8)cbt_map->snap_number_active, cbt_map->write_map);
+ if (unlikely(res))
+ cbt_map->is_corrupted = true;
+
+ spin_unlock(&cbt_map->locker);
+
+ return res;
+}
+
+int cbt_map_set_both(struct cbt_map *cbt_map, sector_t sector_start,
+ sector_t sector_cnt)
+{
+ int res;
+
+ spin_lock(&cbt_map->locker);
+ if (unlikely(cbt_map->is_corrupted)) {
+ spin_unlock(&cbt_map->locker);
+ return -EINVAL;
+ }
+ res = _cbt_map_set(cbt_map, sector_start, sector_cnt,
+ (u8)cbt_map->snap_number_active, cbt_map->write_map);
+ if (!res)
+ res = _cbt_map_set(cbt_map, sector_start, sector_cnt,
+ (u8)cbt_map->snap_number_previous,
+ cbt_map->read_map);
+ spin_unlock(&cbt_map->locker);
+
+ return res;
+}
diff --git a/drivers/block/blksnap/cbt_map.h b/drivers/block/blksnap/cbt_map.h
new file mode 100644
index 000000000000..f87bffd5b3a7
--- /dev/null
+++ b/drivers/block/blksnap/cbt_map.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef __BLKSNAP_CBT_MAP_H
+#define __BLKSNAP_CBT_MAP_H
+
+#include <linux/kernel.h>
+#include <linux/kref.h>
+#include <linux/uuid.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+
+struct blksnap_sectors;
+
+/**
+ * struct cbt_map - The table of changes for a block device.
+ *
+ * @locker:
+ * Locking for atomic modification of structure members.
+ * @blk_size_shift:
+ * The power of 2 used to specify the change tracking block size.
+ * @blk_count:
+ * The number of change tracking blocks.
+ * @device_capacity:
+ * The actual capacity of the device.
+ * @read_map:
+ * A table of changes available for reading. This is the table that can
+ * be read after taking a snapshot.
+ * @write_map:
+ * The current table for tracking changes.
+ * @snap_number_active:
+ * The current sequential number of changes. This is the number that is written to
+ * the current table when the block data changes.
+ * @snap_number_previous:
+ * The previous sequential number of changes. This number is used to identify the
+ * blocks that were changed between the penultimate snapshot and the last snapshot.
+ * @generation_id:
+ * UUID of the generation of changes.
+ * @is_corrupted:
+ * A flag that the change tracking data is no longer reliable.
+ *
+ * The change block tracking map is a byte table. Each byte stores the
+ * sequential number of changes for one block. To determine which blocks have changed
+ * since the previous snapshot with the change number 4, it is enough to
+ * find all bytes with a number greater than 4.
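+ *
+ * For example, if the read table contains the bytes {0, 3, 5, 4, 2} and
+ * the previous snapshot was taken at change number 4, then only block #2
+ * (the byte with the value 5, which is greater than 4) has changed since
+ * that snapshot and needs to be read to build the increment.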
+ *
+ * Since one byte is allocated to track changes in one block, the change
+ * table is created again at the 255th snapshot. At the same time, a new
+ * unique generation identifier is generated. Tracking changes is
+ * possible only for tables of the same generation.
+ *
+ * There are two tables in the change block tracking map. One is
+ * available for reading, and the other is available for writing. At the moment of taking
+ * a snapshot, the tables are synchronized. The user's process, when
+ * calling the corresponding ioctl, can read the readable table.
+ * At the same time, the change tracking mechanism continues to work with
+ * the writable table.
+ *
+ * To provide the ability to mount a snapshot image as writeable, it is
+ * possible to make changes to both of these tables simultaneously.
+ *
+ */
+struct cbt_map {
+ spinlock_t locker;
+
+ size_t blk_size_shift;
+ size_t blk_count;
+ sector_t device_capacity;
+
+ unsigned char *read_map;
+ unsigned char *write_map;
+
+ unsigned long snap_number_active;
+ unsigned long snap_number_previous;
+ uuid_t generation_id;
+
+ bool is_corrupted;
+};
+
+struct cbt_map *cbt_map_create(struct block_device *bdev);
+int cbt_map_reset(struct cbt_map *cbt_map, sector_t device_capacity);
+
+void cbt_map_destroy(struct cbt_map *cbt_map);
+
+void cbt_map_switch(struct cbt_map *cbt_map);
+int cbt_map_set(struct cbt_map *cbt_map, sector_t sector_start,
+ sector_t sector_cnt);
+int cbt_map_set_both(struct cbt_map *cbt_map, sector_t sector_start,
+ sector_t sector_cnt);
+
+#endif /* __BLKSNAP_CBT_MAP_H */
diff --git a/drivers/block/blksnap/tracker.c b/drivers/block/blksnap/tracker.c
new file mode 100644
index 000000000000..da6539fb6f54
--- /dev/null
+++ b/drivers/block/blksnap/tracker.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#define pr_fmt(fmt) KBUILD_MODNAME "-tracker: " fmt
+
+#include <linux/slab.h>
+#include <linux/blk-mq.h>
+#include <linux/sched/mm.h>
+#include <linux/build_bug.h>
+#include <uapi/linux/blksnap.h>
+#include "tracker.h"
+#include "cbt_map.h"
+#include "diff_area.h"
+#include "snapimage.h"
+#include "snapshot.h"
+
+void tracker_free(struct kref *kref)
+{
+ struct tracker *tracker = container_of(kref, struct tracker, kref);
+
+ might_sleep();
+
+ pr_debug("Free tracker for device [%u:%u]\n", MAJOR(tracker->dev_id),
+ MINOR(tracker->dev_id));
+
+ if (tracker->diff_area)
+ diff_area_put(tracker->diff_area);
+ if (tracker->cbt_map)
+ cbt_map_destroy(tracker->cbt_map);
+
+ kfree(tracker);
+}
+
+static bool tracker_submit_bio(struct bio *bio)
+{
+ struct blkfilter *flt = bio->bi_bdev->bd_filter;
+ struct tracker *tracker = container_of(flt, struct tracker, filter);
+ sector_t count = bio_sectors(bio);
+ struct bvec_iter copy_iter;
+
+ if (!op_is_write(bio_op(bio)) || !count)
+ return false;
+
+ copy_iter = bio->bi_iter;
+ if (bio_flagged(bio, BIO_REMAPPED))
+ copy_iter.bi_sector -= bio->bi_bdev->bd_start_sect;
+
+ if (cbt_map_set(tracker->cbt_map, copy_iter.bi_sector, count) ||
+ !atomic_read(&tracker->snapshot_is_taken))
+ return false;
+ /*
+ * The diff_area is not additionally protected from being released
+ * here, because the snapshot_is_taken flag is only changed while
+ * the block device queue is frozen in tracker_release_snapshot().
+ */
+ if (diff_area_is_corrupted(tracker->diff_area))
+ return false;
+
+ return diff_area_cow(bio, tracker->diff_area, &copy_iter);
+}
+
+static struct blkfilter *tracker_attach(struct block_device *bdev)
+{
+ struct tracker *tracker = NULL;
+ struct cbt_map *cbt_map;
+
+ pr_debug("Creating tracker for device [%u:%u]\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
+
+ cbt_map = cbt_map_create(bdev);
+ if (!cbt_map) {
+ pr_err("Failed to create CBT map for device [%u:%u]\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev));
+ return ERR_PTR(-ENOMEM);
+ }
+
+ tracker = kzalloc(sizeof(struct tracker), GFP_KERNEL);
+ if (tracker == NULL) {
+ cbt_map_destroy(cbt_map);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ mutex_init(&tracker->ctl_lock);
+ INIT_LIST_HEAD(&tracker->link);
+ kref_init(&tracker->kref);
+ tracker->dev_id = bdev->bd_dev;
+ atomic_set(&tracker->snapshot_is_taken, false);
+ tracker->cbt_map = cbt_map;
+ tracker->diff_area = NULL;
+
+ pr_debug("New tracker for device [%u:%u] was created\n",
+ MAJOR(tracker->dev_id), MINOR(tracker->dev_id));
+
+ return &tracker->filter;
+}
+
+static void tracker_detach(struct blkfilter *flt)
+{
+ struct tracker *tracker = container_of(flt, struct tracker, filter);
+
+ pr_debug("Detach tracker from device [%u:%u]\n",
+ MAJOR(tracker->dev_id), MINOR(tracker->dev_id));
+
+ tracker_put(tracker);
+}
+
+static int ctl_cbtinfo(struct tracker *tracker, __u8 __user *buf, __u32 *plen)
+{
+ struct cbt_map *cbt_map = tracker->cbt_map;
+ struct blksnap_cbtinfo arg;
+
+ if (!cbt_map)
+ return -ESRCH;
+
+ if (*plen < sizeof(arg))
+ return -EINVAL;
+
+ arg.device_capacity = (__u64)(cbt_map->device_capacity << SECTOR_SHIFT);
+ arg.block_size = (__u32)(1 << cbt_map->blk_size_shift);
+ arg.block_count = (__u32)cbt_map->blk_count;
+ export_uuid(arg.generation_id.b, &cbt_map->generation_id);
+ arg.changes_number = (__u8)cbt_map->snap_number_previous;
+
+ if (copy_to_user(buf, &arg, sizeof(arg)))
+ return -ENODATA;
+
+ *plen = sizeof(arg);
+ return 0;
+}
+
+static int ctl_cbtmap(struct tracker *tracker, __u8 __user *buf, __u32 *plen)
+{
+ struct cbt_map *cbt_map = tracker->cbt_map;
+ struct blksnap_cbtmap arg;
+
+ if (!cbt_map)
+ return -ESRCH;
+
+ if (unlikely(cbt_map->is_corrupted)) {
+ pr_err("CBT table was corrupted\n");
+ return -EFAULT;
+ }
+
+ if (*plen < sizeof(arg))
+ return -EINVAL;
+
+ if (copy_from_user(&arg, buf, sizeof(arg)))
+ return -ENODATA;
+
+ if (arg.length > (cbt_map->blk_count - arg.offset))
+ return -ENODATA;
+
+ if (copy_to_user(u64_to_user_ptr(arg.buffer),
+ cbt_map->read_map + arg.offset, arg.length))
+ return -EINVAL;
+
+ *plen = 0;
+ return 0;
+}
+static int ctl_cbtdirty(struct tracker *tracker, __u8 __user *buf, __u32 *plen)
+{
+ struct cbt_map *cbt_map = tracker->cbt_map;
+ struct blksnap_cbtdirty arg;
+ unsigned int inx;
+
+ if (!cbt_map)
+ return -ESRCH;
+
+ if (*plen < sizeof(arg))
+ return -EINVAL;
+
+ if (copy_from_user(&arg, buf, sizeof(arg)))
+ return -ENODATA;
+
+ for (inx = 0; inx < arg.count; inx++) {
+ struct blksnap_sectors range;
+ int ret;
+
+ if (copy_from_user(&range, u64_to_user_ptr(arg.dirty_sectors),
+ sizeof(range)))
+ return -ENODATA;
+
+ ret = cbt_map_set_both(cbt_map, range.offset, range.count);
+ if (ret)
+ return ret;
+ }
+ *plen = 0;
+ return 0;
+}
+static int ctl_snapshotadd(struct tracker *tracker,
+ __u8 __user *buf, __u32 *plen)
+{
+ struct blksnap_snapshotadd arg;
+
+ if (*plen < sizeof(arg))
+ return -EINVAL;
+
+ if (copy_from_user(&arg, buf, sizeof(arg)))
+ return -ENODATA;
+
+ *plen = 0;
+ return snapshot_add_device((uuid_t *)&arg.id, tracker);
+}
+static int ctl_snapshotinfo(struct tracker *tracker,
+ __u8 __user *buf, __u32 *plen)
+{
+ struct blksnap_snapshotinfo arg = {0};
+
+ if (*plen < sizeof(arg))
+ return -EINVAL;
+
+ if (copy_from_user(&arg, buf, sizeof(arg)))
+ return -ENODATA;
+
+ if (tracker->diff_area && diff_area_is_corrupted(tracker->diff_area))
+ arg.error_code = tracker->diff_area->error_code;
+ else
+ arg.error_code = 0;
+
+ if (tracker->snap_disk)
+ strncpy(arg.image, tracker->snap_disk->disk_name, IMAGE_DISK_NAME_LEN);
+
+ if (copy_to_user(buf, &arg, sizeof(arg)))
+ return -ENODATA;
+
+ *plen = sizeof(arg);
+ return 0;
+}
+
+static int (*const ctl_table[])(struct tracker *tracker,
+ __u8 __user *buf, __u32 *plen) = {
+ ctl_cbtinfo,
+ ctl_cbtmap,
+ ctl_cbtdirty,
+ ctl_snapshotadd,
+ ctl_snapshotinfo,
+};
+
+static int tracker_ctl(struct blkfilter *flt, const unsigned int cmd,
+ __u8 __user *buf, __u32 *plen)
+{
+ int ret = 0;
+ struct tracker *tracker = container_of(flt, struct tracker, filter);
+
+ if (cmd >= ARRAY_SIZE(ctl_table))
+ return -ENOTTY;
+
+ mutex_lock(&tracker->ctl_lock);
+ ret = ctl_table[cmd](tracker, buf, plen);
+ mutex_unlock(&tracker->ctl_lock);
+
+ return ret;
+}
+
+static struct blkfilter_operations tracker_ops = {
+ .owner = THIS_MODULE,
+ .name = "blksnap",
+ .attach = tracker_attach,
+ .detach = tracker_detach,
+ .ctl = tracker_ctl,
+ .submit_bio = tracker_submit_bio,
+};
+
+int tracker_take_snapshot(struct tracker *tracker)
+{
+ int ret = 0;
+ bool cbt_reset_needed = false;
+ struct block_device *orig_bdev = tracker->diff_area->orig_bdev;
+ sector_t capacity;
+ unsigned int current_flag;
+
+ blk_mq_freeze_queue(orig_bdev->bd_queue);
+ current_flag = memalloc_noio_save();
+
+ if (tracker->cbt_map->is_corrupted) {
+ cbt_reset_needed = true;
+ pr_warn("Corrupted CBT table detected. CBT fault\n");
+ }
+
+ capacity = bdev_nr_sectors(orig_bdev);
+ if (tracker->cbt_map->device_capacity != capacity) {
+ cbt_reset_needed = true;
+ pr_warn("Device resize detected. CBT fault\n");
+ }
+
+ if (cbt_reset_needed) {
+ ret = cbt_map_reset(tracker->cbt_map, capacity);
+ if (ret) {
+ pr_err("Failed to reset the CBT map. errno=%d\n",
+ abs(ret));
+ goto out;
+ }
+ }
+
+ cbt_map_switch(tracker->cbt_map);
+ atomic_set(&tracker->snapshot_is_taken, true);
+out:
+ memalloc_noio_restore(current_flag);
+ blk_mq_unfreeze_queue(orig_bdev->bd_queue);
+
+ return ret;
+}
+
+void tracker_release_snapshot(struct tracker *tracker)
+{
+ struct diff_area *diff_area = tracker->diff_area;
+
+ if (unlikely(!diff_area))
+ return;
+
+ snapimage_free(tracker);
+
+ blk_mq_freeze_queue(diff_area->orig_bdev->bd_queue);
+
+ pr_debug("Tracker for device [%u:%u] release snapshot\n",
+ MAJOR(tracker->dev_id), MINOR(tracker->dev_id));
+
+ atomic_set(&tracker->snapshot_is_taken, false);
+ tracker->diff_area = NULL;
+
+ blk_mq_unfreeze_queue(diff_area->orig_bdev->bd_queue);
+
+ diff_area_put(diff_area);
+}
+
+int __init tracker_init(void)
+{
+ pr_debug("Register filter '%s'", tracker_ops.name);
+
+ return blkfilter_register(&tracker_ops);
+}
+
+void tracker_done(void)
+{
+ pr_debug("Unregister filter '%s'", tracker_ops.name);
+
+ blkfilter_unregister(&tracker_ops);
+}
diff --git a/drivers/block/blksnap/tracker.h b/drivers/block/blksnap/tracker.h
new file mode 100644
index 000000000000..dbf8295f9518
--- /dev/null
+++ b/drivers/block/blksnap/tracker.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef __BLKSNAP_TRACKER_H
+#define __BLKSNAP_TRACKER_H
+
+#include <linux/blk-filter.h>
+#include <linux/kref.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/rwsem.h>
+#include <linux/blkdev.h>
+#include <linux/fs.h>
+
+struct cbt_map;
+struct diff_area;
+
+/**
+ * struct tracker - Tracker for a block device.
+ *
+ * @filter:
+ * The block device filter structure.
+ * @ctl_lock:
+ * The mutex blocks simultaneous management of the tracker from different
+ * threads.
+ * @link:
+ * List header. Allows combining trackers into a list in a snapshot.
+ * @kref:
+ * The reference counter allows controlling the lifetime of the tracker.
+ * @dev_id:
+ * Original block device ID.
+ * @snapshot_is_taken:
+ * Indicates that a snapshot was taken for the device whose I/O units are
+ * handled by this tracker.
+ * @cbt_map:
+ * Pointer to a change block tracker map.
+ * @diff_area:
+ * Pointer to a difference area.
+ * @snap_disk:
+ * Snapshot image disk.
+ *
+ * The goal of the tracker is to handle I/O units. The tracker detects
+ * the range of sectors that will change and transmits it to the CBT map
+ * and to the difference area.
+ */
+struct tracker {
+ struct blkfilter filter;
+ struct mutex ctl_lock;
+ struct list_head link;
+ struct kref kref;
+ dev_t dev_id;
+
+ atomic_t snapshot_is_taken;
+
+ struct cbt_map *cbt_map;
+ struct diff_area *diff_area;
+ struct gendisk *snap_disk;
+};
+
+int __init tracker_init(void);
+void tracker_done(void);
+
+void tracker_free(struct kref *kref);
+static inline void tracker_put(struct tracker *tracker)
+{
+ if (likely(tracker))
+ kref_put(&tracker->kref, tracker_free);
+};
+static inline void tracker_get(struct tracker *tracker)
+{
+ kref_get(&tracker->kref);
+};
+int tracker_take_snapshot(struct tracker *tracker);
+void tracker_release_snapshot(struct tracker *tracker);
+
+#endif /* __BLKSNAP_TRACKER_H */
--
2.20.1


2023-06-12 13:57:22

by Sergei Shtepa

Subject: [PATCH v5 05/11] blksnap: module management interface functions

Contains the callback functions for loading and unloading the module and
the implementation of the module management interface functions. The module
parameters and other mandatory declarations for the kernel module are
also defined.
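
For orientation, a hypothetical userspace sketch of the snapshot lifecycle
driven through the control device implemented here. The device node path
and the IOCTL_BLKSNAP_* macro names are assumptions for illustration; the
authoritative definitions are in include/uapi/linux/blksnap.h, and the
blksnap tool linked in the cover letter is the reference consumer:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/blksnap.h>

  int main(void)
  {
          /* assumed device node name created by the misc device */
          int ctl = open("/dev/blksnap-control", O_RDWR);
          struct blksnap_version ver;
          struct blksnap_uuid id;

          if (ctl < 0)
                  return 1;
          /* macro names below are assumed; see enum blksnap_ioctl in the
           * UAPI header. Error handling is omitted for brevity. */
          ioctl(ctl, IOCTL_BLKSNAP_VERSION, &ver);
          printf("blksnap %u.%u.%u.%u\n",
                 ver.major, ver.minor, ver.revision, ver.build);

          ioctl(ctl, IOCTL_BLKSNAP_SNAPSHOT_CREATE, &id);
          /* ... attach the filter to block devices, add them to the
           * snapshot and append difference storage here ... */
          ioctl(ctl, IOCTL_BLKSNAP_SNAPSHOT_TAKE, &id);
          /* ... back up the snapshot images, then ... */
          ioctl(ctl, IOCTL_BLKSNAP_SNAPSHOT_DESTROY, &id);
          close(ctl);
          return 0;
  }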

Co-developed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sergei Shtepa <[email protected]>
---
MAINTAINERS | 1 +
drivers/block/blksnap/main.c | 483 +++++++++++++++++++++++++++++++++
drivers/block/blksnap/params.h | 16 ++
3 files changed, 500 insertions(+)
create mode 100644 drivers/block/blksnap/main.c
create mode 100644 drivers/block/blksnap/params.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 76b14ad604dc..a26eee956aec 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3594,6 +3594,7 @@ M: Sergei Shtepa <[email protected]>
L: [email protected]
S: Supported
F: Documentation/block/blksnap.rst
+F: drivers/block/blksnap/*
F: include/uapi/linux/blksnap.h

BLOCK LAYER
diff --git a/drivers/block/blksnap/main.c b/drivers/block/blksnap/main.c
new file mode 100644
index 000000000000..7bb37c191fda
--- /dev/null
+++ b/drivers/block/blksnap/main.c
@@ -0,0 +1,483 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/miscdevice.h>
+#include <linux/build_bug.h>
+#include <uapi/linux/blksnap.h>
+#include "snapimage.h"
+#include "snapshot.h"
+#include "tracker.h"
+#include "chunk.h"
+#include "params.h"
+
+/*
+ * The power of 2 for minimum tracking block size.
+ * If we make the tracking block size small, we will get detailed information
+ * about the changes, but the size of the change tracker table will be too
+ * large, which will lead to inefficient memory usage.
+ */
+static unsigned int tracking_block_minimum_shift = 16;
+
+/*
+ * The maximum number of tracking blocks.
+ * A table is created to store information about the status of all tracking
+ * blocks in RAM. So, if the size of the tracking block is small, then the size
+ * of the table turns out to be large and memory is consumed inefficiently.
+ * As the size of the block device grows, the size of the tracking block
+ * should also grow. For this purpose, a limit on the maximum number of
+ * blocks is set.
+ */
+static unsigned int tracking_block_maximum_count = 2097152;
+
+/*
+ * The power of 2 for maximum tracking block size.
+ * On very large capacity disks, the block size may be too large. To prevent
+ * this, the maximum block size is limited.
+ * If the limit on the maximum block size has been reached, then the number of
+ * blocks may exceed the tracking_block_maximum_count.
+ */
+static unsigned int tracking_block_maximum_shift = 26;
+
+/*
+ * The power of 2 for minimum chunk size.
+ * The size of the chunk depends on how much data will be copied to the
+ * difference storage when at least one sector of the block device is changed.
+ * If the size is small, then small I/O units will be generated, which will
+ * reduce performance. Too large a chunk size will lead to inefficient use of
+ * the difference storage.
+ */
+static unsigned int chunk_minimum_shift = 18;
+
+/*
+ * The power of 2 for maximum number of chunks.
+ * To store information about the state of the chunks, a table is created
+ * in RAM. So, if the size of the chunk is small, then the size of the table
+ * turns out to be large and memory is consumed inefficiently.
+ * As the size of the block device grows, the size of the chunk should also
+ * grow. For this purpose, the maximum number of chunks is set.
+ * The table expands dynamically when new chunks are allocated. Therefore,
+ * memory consumption also depends on the intensity of writing to the block
+ * device under the snapshot.
+ */
+static unsigned int chunk_maximum_count_shift = 40;
+
+/*
+ * The power of 2 for maximum chunk size.
+ * On very large capacity disks, the chunk size may be too large. To prevent
+ * this, the maximum chunk size is limited.
+ * If the limit on the maximum chunk size has been reached, then the number of
+ * chunks may exceed the chunk_maximum_count.
+ */
+static unsigned int chunk_maximum_shift = 26;
+/*
+ * The maximum number of chunks in the store queue.
+ * The chunk is not immediately stored to the difference storage. The chunks
+ * are put in a store queue. The store queue allows postponing the operation
+ * of storing a chunk's data to the difference storage and performing it later
+ * in the worker thread.
+ */
+static unsigned int chunk_maximum_in_queue = 16;
+
+/*
+ * The size of the pool of preallocated difference buffers.
+ * A buffer can be allocated for each chunk. After use, this buffer is not
+ * released immediately, but is sent to the pool of free buffers.
+ * However, if there are too many free buffers in the pool, then these free
+ * buffers will be released immediately.
+ */
+static unsigned int free_diff_buffer_pool_size = 128;
+
+/*
+ * The minimum allowable size of the difference storage in sectors.
+ * The difference storage is a part of the disk space allocated for storing
+ * snapshot data. If there is less free space in the storage than the minimum,
+ * an event is generated about the lack of free space.
+ */
+static unsigned int diff_storage_minimum = 2097152;
+
+#define VERSION_STR "2.0.0.0"
+static const struct blksnap_version version = {
+ .major = 2,
+ .minor = 0,
+ .revision = 0,
+ .build = 0,
+};
+
+unsigned int get_tracking_block_minimum_shift(void)
+{
+ return tracking_block_minimum_shift;
+}
+
+unsigned int get_tracking_block_maximum_shift(void)
+{
+ return tracking_block_maximum_shift;
+}
+
+unsigned int get_tracking_block_maximum_count(void)
+{
+ return tracking_block_maximum_count;
+}
+
+unsigned int get_chunk_minimum_shift(void)
+{
+ return chunk_minimum_shift;
+}
+
+unsigned int get_chunk_maximum_shift(void)
+{
+ return chunk_maximum_shift;
+}
+
+unsigned long get_chunk_maximum_count(void)
+{
+ return (1ul << chunk_maximum_count_shift);
+}
+
+unsigned int get_chunk_maximum_in_queue(void)
+{
+ return chunk_maximum_in_queue;
+}
+
+unsigned int get_free_diff_buffer_pool_size(void)
+{
+ return free_diff_buffer_pool_size;
+}
+
+unsigned int get_diff_storage_minimum(void)
+{
+ return diff_storage_minimum;
+}
+
+static int ioctl_version(unsigned long arg)
+{
+ struct blksnap_version __user *user_version =
+ (struct blksnap_version __user *)arg;
+
+ if (copy_to_user(user_version, &version, sizeof(version))) {
+ pr_err("Unable to get version: invalid user buffer\n");
+ return -ENODATA;
+ }
+
+ return 0;
+}
+
+static_assert(sizeof(uuid_t) == sizeof(struct blksnap_uuid),
+ "Invalid size of struct blksnap_uuid.");
+
+static int ioctl_snapshot_create(unsigned long arg)
+{
+ struct blksnap_uuid __user *user_id = (struct blksnap_uuid __user *)arg;
+ uuid_t kernel_id;
+ int ret;
+
+ ret = snapshot_create(&kernel_id);
+ if (ret)
+ return ret;
+
+ if (copy_to_user(user_id->b, kernel_id.b, sizeof(uuid_t))) {
+ pr_err("Unable to create snapshot: invalid user buffer\n");
+ return -ENODATA;
+ }
+
+ return 0;
+}
+
+static int ioctl_snapshot_destroy(unsigned long arg)
+{
+ struct blksnap_uuid __user *user_id = (struct blksnap_uuid __user *)arg;
+ uuid_t kernel_id;
+
+ if (copy_from_user(kernel_id.b, user_id->b, sizeof(uuid_t))) {
+ pr_err("Unable to destroy snapshot: invalid user buffer\n");
+ return -ENODATA;
+ }
+
+ return snapshot_destroy(&kernel_id);
+}
+
+static int ioctl_snapshot_append_storage(unsigned long arg)
+{
+ int ret;
+ struct blksnap_snapshot_append_storage __user *uarg =
+ (struct blksnap_snapshot_append_storage __user *)arg;
+ struct blksnap_snapshot_append_storage karg;
+ char *bdev_path = NULL;
+
+ pr_debug("Append difference storage\n");
+
+ if (copy_from_user(&karg, uarg, sizeof(karg))) {
+ pr_err("Unable to append difference storage: invalid user buffer\n");
+ return -EINVAL;
+ }
+
+ bdev_path = strndup_user(u64_to_user_ptr(karg.bdev_path),
+ karg.bdev_path_size);
+ if (IS_ERR(bdev_path)) {
+ pr_err("Unable to append difference storage: invalid block device name buffer\n");
+ return PTR_ERR(bdev_path);
+ }
+
+ ret = snapshot_append_storage((uuid_t *)karg.id.b, bdev_path,
+ u64_to_user_ptr(karg.ranges), karg.count);
+ kfree(bdev_path);
+ return ret;
+}
+
+static int ioctl_snapshot_take(unsigned long arg)
+{
+ struct blksnap_uuid __user *user_id = (struct blksnap_uuid __user *)arg;
+ uuid_t kernel_id;
+
+ if (copy_from_user(kernel_id.b, user_id->b, sizeof(uuid_t))) {
+ pr_err("Unable to take snapshot: invalid user buffer\n");
+ return -ENODATA;
+ }
+
+ return snapshot_take(&kernel_id);
+}
+
+static int ioctl_snapshot_collect(unsigned long arg)
+{
+ int ret;
+ struct blksnap_snapshot_collect karg;
+
+ if (copy_from_user(&karg, (const void __user *)arg, sizeof(karg))) {
+ pr_err("Unable to collect available snapshots: invalid user buffer\n");
+ return -ENODATA;
+ }
+
+ ret = snapshot_collect(&karg.count, u64_to_user_ptr(karg.ids));
+
+ if (copy_to_user((void __user *)arg, &karg, sizeof(karg))) {
+ pr_err("Unable to collect available snapshots: invalid user buffer\n");
+ return -ENODATA;
+ }
+
+ return ret;
+}
+
+static_assert(sizeof(struct blksnap_snapshot_event) == 4096,
+ "The size struct blksnap_snapshot_event should be equal to the size of the page.");
+
+static int ioctl_snapshot_wait_event(unsigned long arg)
+{
+ int ret = 0;
+ struct blksnap_snapshot_event __user *uarg =
+ (struct blksnap_snapshot_event __user *)arg;
+ struct blksnap_snapshot_event *karg;
+ struct event *ev;
+
+ karg = kzalloc(sizeof(struct blksnap_snapshot_event), GFP_KERNEL);
+ if (!karg)
+ return -ENOMEM;
+
+ /* Copy only the snapshot ID and the timeout */
+ if (copy_from_user(karg, uarg, sizeof(uuid_t) + sizeof(__u32))) {
+ pr_err("Unable to get snapshot event. Invalid user buffer\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ev = snapshot_wait_event((uuid_t *)karg->id.b, karg->timeout_ms);
+ if (IS_ERR(ev)) {
+ ret = PTR_ERR(ev);
+ goto out;
+ }
+
+ pr_debug("Received event=%lld code=%d data_size=%d\n", ev->time,
+ ev->code, ev->data_size);
+ karg->code = ev->code;
+ karg->time_label = ev->time;
+
+ if (ev->data_size > sizeof(karg->data)) {
+ pr_err("Event size %d is too big\n", ev->data_size);
+ ret = -ENOSPC;
+ /* If we cannot copy all the data, copy only the part that fits. */
+ ev->data_size = sizeof(karg->data);
+ }
+ memcpy(karg->data, ev->data, ev->data_size);
+ event_free(ev);
+
+ if (copy_to_user(uarg, karg, sizeof(struct blksnap_snapshot_event))) {
+ pr_err("Unable to get snapshot event. Invalid user buffer\n");
+ ret = -EINVAL;
+ }
+out:
+ kfree(karg);
+
+ return ret;
+}
+
+static int (*const blksnap_ioctl_table[])(unsigned long arg) = {
+ ioctl_version,
+ ioctl_snapshot_create,
+ ioctl_snapshot_destroy,
+ ioctl_snapshot_append_storage,
+ ioctl_snapshot_take,
+ ioctl_snapshot_collect,
+ ioctl_snapshot_wait_event,
+};
+
+static_assert(
+ sizeof(blksnap_ioctl_table) ==
+ ((blksnap_ioctl_snapshot_wait_event + 1) * sizeof(void *)),
+ "The size of table blksnap_ioctl_table does not match the enum blksnap_ioctl.");
+
+static long ctrl_unlocked_ioctl(struct file *filp, unsigned int cmd,
+ unsigned long arg)
+{
+ int nr = _IOC_NR(cmd);
+
+ if (nr >= ARRAY_SIZE(blksnap_ioctl_table))
+ return -ENOTTY;
+
+ if (!blksnap_ioctl_table[nr])
+ return -ENOTTY;
+
+ return blksnap_ioctl_table[nr](arg);
+}
+
+static const struct file_operations blksnap_ctrl_fops = {
+ .owner = THIS_MODULE,
+ .unlocked_ioctl = ctrl_unlocked_ioctl,
+};
+
+static struct miscdevice blksnap_ctrl_misc = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = BLKSNAP_CTL,
+ .fops = &blksnap_ctrl_fops,
+};
+
+static int __init parameters_init(void)
+{
+ pr_debug("tracking_block_minimum_shift: %d\n",
+ tracking_block_minimum_shift);
+ pr_debug("tracking_block_maximum_shift: %d\n",
+ tracking_block_maximum_shift);
+ pr_debug("tracking_block_maximum_count: %d\n",
+ tracking_block_maximum_count);
+
+ pr_debug("chunk_minimum_shift: %d\n", chunk_minimum_shift);
+ pr_debug("chunk_maximum_shift: %d\n", chunk_maximum_shift);
+ pr_debug("chunk_maximum_count_shift: %u\n", chunk_maximum_count_shift);
+
+ pr_debug("chunk_maximum_in_queue: %d\n", chunk_maximum_in_queue);
+ pr_debug("free_diff_buffer_pool_size: %d\n",
+ free_diff_buffer_pool_size);
+ pr_debug("diff_storage_minimum: %d\n", diff_storage_minimum);
+
+ if (tracking_block_maximum_shift < tracking_block_minimum_shift) {
+ tracking_block_maximum_shift = tracking_block_minimum_shift;
+ pr_warn("fixed tracking_block_maximum_shift: %d\n",
+ tracking_block_maximum_shift);
+ }
+
+ if (chunk_maximum_shift < chunk_minimum_shift) {
+ chunk_maximum_shift = chunk_minimum_shift;
+ pr_warn("fixed chunk_maximum_shift: %d\n",
+ chunk_maximum_shift);
+ }
+
+ /*
+ * The XArray is used to store chunks. And 'unsigned long' is used as
+ * chunk number parameter. So, the number of chunks cannot exceed the
+ * limits of ULONG_MAX.
+ */
+ if (sizeof(unsigned long) < 4u)
+ chunk_maximum_count_shift = min(16u, chunk_maximum_count_shift);
+ else if (sizeof(unsigned long) == 4)
+ chunk_maximum_count_shift = min(32u, chunk_maximum_count_shift);
+ else if (sizeof(unsigned long) >= 8)
+ chunk_maximum_count_shift = min(64u, chunk_maximum_count_shift);
+
+ return 0;
+}
+
+static int __init blksnap_init(void)
+{
+ int ret;
+
+ pr_debug("Loading\n");
+ pr_debug("Version: %s\n", VERSION_STR);
+
+ ret = parameters_init();
+ if (ret)
+ return ret;
+
+ ret = chunk_init();
+ if (ret)
+ goto fail_chunk_init;
+
+ ret = tracker_init();
+ if (ret)
+ goto fail_tracker_init;
+
+ ret = misc_register(&blksnap_ctrl_misc);
+ if (ret)
+ goto fail_misc_register;
+
+ return 0;
+
+fail_misc_register:
+ tracker_done();
+fail_tracker_init:
+ chunk_done();
+fail_chunk_init:
+
+ return ret;
+}
+
+static void __exit blksnap_exit(void)
+{
+ pr_debug("Unloading module\n");
+
+ misc_deregister(&blksnap_ctrl_misc);
+
+ chunk_done();
+ snapshot_done();
+ tracker_done();
+
+ pr_debug("Module was unloaded\n");
+}
+
+module_init(blksnap_init);
+module_exit(blksnap_exit);
+
+module_param_named(tracking_block_minimum_shift, tracking_block_minimum_shift,
+ uint, 0644);
+MODULE_PARM_DESC(tracking_block_minimum_shift,
+ "The power of 2 for minimum tracking block size");
+module_param_named(tracking_block_maximum_count, tracking_block_maximum_count,
+ uint, 0644);
+MODULE_PARM_DESC(tracking_block_maximum_count,
+ "The maximum number of tracking blocks");
+module_param_named(tracking_block_maximum_shift, tracking_block_maximum_shift,
+ uint, 0644);
+MODULE_PARM_DESC(tracking_block_maximum_shift,
+ "The power of 2 for maximum trackings block size");
+module_param_named(chunk_minimum_shift, chunk_minimum_shift, uint, 0644);
+MODULE_PARM_DESC(chunk_minimum_shift,
+ "The power of 2 for minimum chunk size");
+module_param_named(chunk_maximum_count_shift, chunk_maximum_count_shift,
+ uint, 0644);
+MODULE_PARM_DESC(chunk_maximum_count_shift,
+ "The power of 2 for maximum number of chunks");
+module_param_named(chunk_maximum_shift, chunk_maximum_shift, uint, 0644);
+MODULE_PARM_DESC(chunk_maximum_shift,
+ "The power of 2 for maximum snapshots chunk size");
+module_param_named(chunk_maximum_in_queue, chunk_maximum_in_queue, uint, 0644);
+MODULE_PARM_DESC(chunk_maximum_in_queue,
+ "The maximum number of chunks in store queue");
+module_param_named(free_diff_buffer_pool_size, free_diff_buffer_pool_size,
+ uint, 0644);
+MODULE_PARM_DESC(free_diff_buffer_pool_size,
+ "The size of the pool of preallocated difference buffers");
+module_param_named(diff_storage_minimum, diff_storage_minimum, uint, 0644);
+MODULE_PARM_DESC(diff_storage_minimum,
+ "The minimum allowable size of the difference storage in sectors");
+
+MODULE_DESCRIPTION("Block Device Snapshots Module");
+MODULE_VERSION(VERSION_STR);
+MODULE_AUTHOR("Veeam Software Group GmbH");
+MODULE_LICENSE("GPL");
diff --git a/drivers/block/blksnap/params.h b/drivers/block/blksnap/params.h
new file mode 100644
index 000000000000..85606e1a8746
--- /dev/null
+++ b/drivers/block/blksnap/params.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef __BLKSNAP_PARAMS_H
+#define __BLKSNAP_PARAMS_H
+
+unsigned int get_tracking_block_minimum_shift(void);
+unsigned int get_tracking_block_maximum_shift(void);
+unsigned int get_tracking_block_maximum_count(void);
+unsigned int get_chunk_minimum_shift(void);
+unsigned int get_chunk_maximum_shift(void);
+unsigned long get_chunk_maximum_count(void);
+unsigned int get_chunk_maximum_in_queue(void);
+unsigned int get_free_diff_buffer_pool_size(void);
+unsigned int get_diff_storage_minimum(void);
+
+#endif /* __BLKSNAP_PARAMS_H */
--
2.20.1


2023-06-12 13:57:57

by Sergei Shtepa

Subject: [PATCH v5 07/11] blksnap: minimum data storage unit of the original block device

The struct chunk describes the minimum data storage unit of the original
block device. The functions for working with these minimal blocks implement
the algorithms for reading and writing them.
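
For orientation, the chunk addressing used throughout chunk.c reduces to a
shift of the chunk number, mirroring chunk_sector(); below is a standalone
userspace sketch of that arithmetic. The value 18 for the minimum chunk
shift (256 KiB chunks) comes from the default module parameters in the
previous patch:

  #include <stdio.h>

  #define SECTOR_SHIFT 9

  /* First sector of a chunk on the original device. */
  static unsigned long long chunk_first_sector(unsigned long long number,
                                               unsigned int chunk_shift)
  {
          return number << (chunk_shift - SECTOR_SHIFT);
  }

  int main(void)
  {
          unsigned int chunk_shift = 18;          /* 256 KiB chunks */
          unsigned long long offset = 1ULL << 30; /* a write at 1 GiB */
          unsigned long long number = offset >> chunk_shift;

          /* chunk 4096 starts at sector 2097152 (4096 << 9) */
          printf("chunk %llu starts at sector %llu\n",
                 number, chunk_first_sector(number, chunk_shift));
          return 0;
  }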

Co-developed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sergei Shtepa <[email protected]>
---
drivers/block/blksnap/chunk.c | 454 ++++++++++++++++++++++++++++++++++
drivers/block/blksnap/chunk.h | 114 +++++++++
2 files changed, 568 insertions(+)
create mode 100644 drivers/block/blksnap/chunk.c
create mode 100644 drivers/block/blksnap/chunk.h

diff --git a/drivers/block/blksnap/chunk.c b/drivers/block/blksnap/chunk.c
new file mode 100644
index 000000000000..fe1e9b0e3323
--- /dev/null
+++ b/drivers/block/blksnap/chunk.c
@@ -0,0 +1,454 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#define pr_fmt(fmt) KBUILD_MODNAME "-chunk: " fmt
+
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include "chunk.h"
+#include "diff_buffer.h"
+#include "diff_storage.h"
+#include "params.h"
+
+struct chunk_bio {
+ struct work_struct work;
+ struct list_head chunks;
+ struct bio *orig_bio;
+ struct bvec_iter orig_iter;
+ struct bio bio;
+};
+
+static struct bio_set chunk_io_bioset;
+static struct bio_set chunk_clone_bioset;
+
+static inline sector_t chunk_sector(struct chunk *chunk)
+{
+ return (sector_t)(chunk->number)
+ << (chunk->diff_area->chunk_shift - SECTOR_SHIFT);
+}
+
+void chunk_store_failed(struct chunk *chunk, int error)
+{
+ struct diff_area *diff_area = diff_area_get(chunk->diff_area);
+
+ WARN_ON_ONCE(chunk->state != CHUNK_ST_NEW &&
+ chunk->state != CHUNK_ST_IN_MEMORY);
+ chunk->state = CHUNK_ST_FAILED;
+
+ if (likely(chunk->diff_buffer)) {
+ diff_buffer_release(diff_area, chunk->diff_buffer);
+ chunk->diff_buffer = NULL;
+ }
+ diff_storage_free_region(chunk->diff_region);
+ chunk->diff_region = NULL;
+
+ chunk_up(chunk);
+ if (error)
+ diff_area_set_corrupted(diff_area, error);
+ diff_area_put(diff_area);
+};
+
+static void chunk_schedule_storing(struct chunk *chunk)
+{
+ struct diff_area *diff_area = diff_area_get(chunk->diff_area);
+ int queue_count;
+
+ WARN_ON_ONCE(chunk->state != CHUNK_ST_NEW &&
+ chunk->state != CHUNK_ST_STORED);
+ chunk->state = CHUNK_ST_IN_MEMORY;
+
+ spin_lock(&diff_area->store_queue_lock);
+ list_add_tail(&chunk->link, &diff_area->store_queue);
+ queue_count = atomic_inc_return(&diff_area->store_queue_count);
+ spin_unlock(&diff_area->store_queue_lock);
+
+ chunk_up(chunk);
+
+ /* Initiate the queue clearing process */
+ if (queue_count > get_chunk_maximum_in_queue())
+ queue_work(system_wq, &diff_area->store_queue_work);
+ diff_area_put(diff_area);
+}
+
+void chunk_copy_bio(struct chunk *chunk, struct bio *bio,
+ struct bvec_iter *iter)
+{
+ unsigned int chunk_ofs, chunk_left;
+
+ chunk_ofs = (iter->bi_sector - chunk_sector(chunk)) << SECTOR_SHIFT;
+ chunk_left = chunk->diff_buffer->size - chunk_ofs;
+ while (chunk_left && iter->bi_size) {
+ struct bio_vec bvec = bio_iter_iovec(bio, *iter);
+ unsigned int page_ofs = offset_in_page(chunk_ofs);
+ struct page *page;
+ unsigned int len;
+
+ page = chunk->diff_buffer->pages[chunk_ofs >> PAGE_SHIFT];
+ len = min3(bvec.bv_len,
+ chunk_left,
+ (unsigned int)PAGE_SIZE - page_ofs);
+
+ if (op_is_write(bio_op(bio))) {
+ /* from bio to buffer */
+ memcpy_page(page, page_ofs,
+ bvec.bv_page, bvec.bv_offset,
+ len);
+ } else {
+ /* from buffer to bio */
+ memcpy_page(bvec.bv_page, bvec.bv_offset,
+ page, page_ofs,
+ len);
+ }
+
+ chunk_ofs += len;
+ chunk_left -= len;
+ bio_advance_iter_single(bio, iter, len);
+ }
+}
+
+static void chunk_clone_endio(struct bio *bio)
+{
+ struct bio *orig_bio = bio->bi_private;
+
+ if (unlikely(bio->bi_status != BLK_STS_OK))
+ bio_io_error(orig_bio);
+ else
+ bio_endio(orig_bio);
+}
+
+static inline sector_t chunk_offset(struct chunk *chunk, struct bio *bio)
+{
+ return bio->bi_iter.bi_sector - chunk_sector(chunk);
+}
+
+static inline void chunk_limit_iter(struct chunk *chunk, struct bio *bio,
+ sector_t sector, struct bvec_iter *iter)
+{
+ sector_t chunk_ofs = chunk_offset(chunk, bio);
+
+ iter->bi_sector = sector + chunk_ofs;
+ iter->bi_size = min_t(unsigned int,
+ bio->bi_iter.bi_size,
+ (chunk->sector_count - chunk_ofs) << SECTOR_SHIFT);
+}
+
+static inline unsigned int chunk_limit(struct chunk *chunk, struct bio *bio)
+{
+ unsigned int chunk_ofs, chunk_left;
+
+ chunk_ofs = (unsigned int)chunk_offset(chunk, bio) << SECTOR_SHIFT;
+ chunk_left = chunk->diff_buffer->size - chunk_ofs;
+
+ return min(bio->bi_iter.bi_size, chunk_left);
+}
+
+struct bio *chunk_alloc_clone(struct block_device *bdev, struct bio *bio)
+{
+ return bio_alloc_clone(bdev, bio, GFP_NOIO, &chunk_clone_bioset);
+}
+
+void chunk_clone_bio(struct chunk *chunk, struct bio *bio)
+{
+ struct bio *new_bio;
+ struct block_device *bdev;
+ sector_t sector;
+
+ if (chunk->state == CHUNK_ST_STORED) {
+ bdev = chunk->diff_region->bdev;
+ sector = chunk->diff_region->sector;
+ } else {
+ bdev = chunk->diff_area->orig_bdev;
+ sector = chunk_sector(chunk);
+ }
+
+ new_bio = chunk_alloc_clone(bdev, bio);
+ WARN_ON(!new_bio);
+
+ chunk_limit_iter(chunk, bio, sector, &new_bio->bi_iter);
+ bio_set_flag(new_bio, BIO_FILTERED);
+ new_bio->bi_end_io = chunk_clone_endio;
+ new_bio->bi_private = bio;
+
+ bio_advance(bio, new_bio->bi_iter.bi_size);
+ bio_inc_remaining(bio);
+
+ submit_bio_noacct(new_bio);
+}
+
+static inline struct chunk *get_chunk_from_cbio(struct chunk_bio *cbio)
+{
+ struct chunk *chunk = list_first_entry_or_null(&cbio->chunks,
+ struct chunk, link);
+
+ if (chunk)
+ list_del_init(&chunk->link);
+ return chunk;
+}
+
+static void notify_load_and_schedule_io(struct work_struct *work)
+{
+ struct chunk_bio *cbio = container_of(work, struct chunk_bio, work);
+ struct chunk *chunk;
+
+ while ((chunk = get_chunk_from_cbio(cbio))) {
+ if (unlikely(cbio->bio.bi_status != BLK_STS_OK)) {
+ chunk_store_failed(chunk, -EIO);
+ continue;
+ }
+ if (chunk->state == CHUNK_ST_FAILED) {
+ chunk_up(chunk);
+ continue;
+ }
+
+ chunk_copy_bio(chunk, cbio->orig_bio, &cbio->orig_iter);
+ bio_endio(cbio->orig_bio);
+
+ chunk_schedule_storing(chunk);
+ }
+
+ bio_put(&cbio->bio);
+}
+
+static void notify_load_and_postpone_io(struct work_struct *work)
+{
+ struct chunk_bio *cbio = container_of(work, struct chunk_bio, work);
+ struct chunk *chunk;
+
+ while ((chunk = get_chunk_from_cbio(cbio))) {
+ if (unlikely(cbio->bio.bi_status != BLK_STS_OK)) {
+ chunk_store_failed(chunk, -EIO);
+ continue;
+ }
+ if (chunk->state == CHUNK_ST_FAILED) {
+ chunk_up(chunk);
+ continue;
+ }
+
+ chunk_schedule_storing(chunk);
+ }
+
+ /* submit the original bio fed into the tracker */
+ submit_bio_noacct_nocheck(cbio->orig_bio);
+ bio_put(&cbio->bio);
+}
+
+static void chunk_notify_store(struct work_struct *work)
+{
+ struct chunk_bio *cbio = container_of(work, struct chunk_bio, work);
+ struct chunk *chunk;
+
+ while ((chunk = get_chunk_from_cbio(cbio))) {
+ if (unlikely(cbio->bio.bi_status != BLK_STS_OK)) {
+ chunk_store_failed(chunk, -EIO);
+ continue;
+ }
+
+ WARN_ON_ONCE(chunk->state != CHUNK_ST_IN_MEMORY);
+ chunk->state = CHUNK_ST_STORED;
+
+ if (chunk->diff_buffer) {
+ diff_buffer_release(chunk->diff_area,
+ chunk->diff_buffer);
+ chunk->diff_buffer = NULL;
+ }
+ chunk_up(chunk);
+ }
+
+ bio_put(&cbio->bio);
+}
+
+static void chunk_io_endio(struct bio *bio)
+{
+ struct chunk_bio *cbio = container_of(bio, struct chunk_bio, bio);
+
+ queue_work(system_wq, &cbio->work);
+}
+
+static void chunk_submit_bio(struct bio *bio)
+{
+ bio->bi_end_io = chunk_io_endio;
+ submit_bio_noacct(bio);
+}
+
+static inline unsigned short calc_max_vecs(sector_t left)
+{
+ return bio_max_segs(round_up(left, PAGE_SECTORS) / PAGE_SECTORS);
+}
+
+void chunk_store(struct chunk *chunk)
+{
+ struct block_device *bdev = chunk->diff_region->bdev;
+ sector_t sector = chunk->diff_region->sector;
+ sector_t count = chunk->diff_region->count;
+ unsigned int page_idx = 0;
+ struct bio *bio;
+ struct chunk_bio *cbio;
+
+ bio = bio_alloc_bioset(bdev, calc_max_vecs(count),
+ REQ_OP_WRITE | REQ_SYNC | REQ_FUA, GFP_NOIO,
+ &chunk_io_bioset);
+ bio->bi_iter.bi_sector = sector;
+ bio_set_flag(bio, BIO_FILTERED);
+
+ while (count) {
+ struct bio *next;
+ sector_t portion = min_t(sector_t, count, PAGE_SECTORS);
+ unsigned int bytes = portion << SECTOR_SHIFT;
+
+ if (bio_add_page(bio, chunk->diff_buffer->pages[page_idx],
+ bytes, 0) == bytes) {
+ page_idx++;
+ count -= portion;
+ continue;
+ }
+
+ /* Create next bio */
+ next = bio_alloc_bioset(bdev, calc_max_vecs(count),
+ REQ_OP_WRITE | REQ_SYNC | REQ_FUA,
+ GFP_NOIO, &chunk_io_bioset);
+ next->bi_iter.bi_sector = bio_end_sector(bio);
+ bio_set_flag(next, BIO_FILTERED);
+ bio_chain(bio, next);
+ submit_bio_noacct(bio);
+ bio = next;
+ }
+
+ cbio = container_of(bio, struct chunk_bio, bio);
+
+ INIT_WORK(&cbio->work, chunk_notify_store);
+ INIT_LIST_HEAD(&cbio->chunks);
+ list_add_tail(&chunk->link, &cbio->chunks);
+ cbio->orig_bio = NULL;
+ chunk_submit_bio(bio);
+}
+
+static struct bio *__chunk_load(struct chunk *chunk)
+{
+ struct diff_buffer *diff_buffer;
+ unsigned int page_idx = 0;
+ struct bio *bio;
+ struct block_device *bdev;
+ sector_t sector, count;
+
+ diff_buffer = diff_buffer_take(chunk->diff_area);
+ if (IS_ERR(diff_buffer))
+ return ERR_CAST(diff_buffer);
+ chunk->diff_buffer = diff_buffer;
+
+ if (chunk->state == CHUNK_ST_STORED) {
+ bdev = chunk->diff_region->bdev;
+ sector = chunk->diff_region->sector;
+ count = chunk->diff_region->count;
+ } else {
+ bdev = chunk->diff_area->orig_bdev;
+ sector = chunk_sector(chunk);
+ count = chunk->sector_count;
+ }
+
+ bio = bio_alloc_bioset(bdev, calc_max_vecs(count),
+ REQ_OP_READ, GFP_NOIO, &chunk_io_bioset);
+ bio->bi_iter.bi_sector = sector;
+ bio_set_flag(bio, BIO_FILTERED);
+
+ while (count) {
+ struct bio *next;
+ sector_t portion = min_t(sector_t, count, PAGE_SECTORS);
+ unsigned int bytes = portion << SECTOR_SHIFT;
+
+ if (bio_add_page(bio, chunk->diff_buffer->pages[page_idx],
+ bytes, 0) == bytes) {
+ page_idx++;
+ count -= portion;
+ continue;
+ }
+
+ /* Create next bio */
+ next = bio_alloc_bioset(bdev, calc_max_vecs(count),
+ REQ_OP_READ, GFP_NOIO,
+ &chunk_io_bioset);
+ next->bi_iter.bi_sector = bio_end_sector(bio);
+ bio_set_flag(next, BIO_FILTERED);
+ bio_chain(bio, next);
+ submit_bio_noacct(bio);
+ bio = next;
+ }
+ return bio;
+}
+
+int chunk_load_and_postpone_io(struct chunk *chunk, struct bio **chunk_bio)
+{
+ struct bio *prev = *chunk_bio, *bio;
+
+ bio = __chunk_load(chunk);
+ if (IS_ERR(bio))
+ return PTR_ERR(bio);
+
+ if (prev) {
+ bio_chain(prev, bio);
+ submit_bio_noacct(prev);
+ }
+
+ *chunk_bio = bio;
+ return 0;
+}
+
+void chunk_load_and_postpone_io_finish(struct list_head *chunks,
+ struct bio *chunk_bio, struct bio *orig_bio)
+{
+ struct chunk_bio *cbio;
+
+ cbio = container_of(chunk_bio, struct chunk_bio, bio);
+ INIT_LIST_HEAD(&cbio->chunks);
+ while (!list_empty(chunks)) {
+ struct chunk *it;
+
+ it = list_first_entry(chunks, struct chunk, link);
+ list_del_init(&it->link);
+
+ list_add_tail(&it->link, &cbio->chunks);
+ }
+ INIT_WORK(&cbio->work, notify_load_and_postpone_io);
+ cbio->orig_bio = orig_bio;
+ chunk_submit_bio(chunk_bio);
+}
+
+int chunk_load_and_schedule_io(struct chunk *chunk, struct bio *orig_bio)
+{
+ struct chunk_bio *cbio;
+ struct bio *bio;
+
+ bio = __chunk_load(chunk);
+ if (IS_ERR(bio))
+ return PTR_ERR(bio);
+
+ cbio = container_of(bio, struct chunk_bio, bio);
+ INIT_LIST_HEAD(&cbio->chunks);
+ list_add_tail(&chunk->link, &cbio->chunks);
+ INIT_WORK(&cbio->work, notify_load_and_schedule_io);
+ cbio->orig_bio = orig_bio;
+ cbio->orig_iter = orig_bio->bi_iter;
+ bio_advance_iter_single(orig_bio, &orig_bio->bi_iter,
+ chunk_limit(chunk, orig_bio));
+ bio_inc_remaining(orig_bio);
+
+ chunk_submit_bio(bio);
+ return 0;
+}
+
+int __init chunk_init(void)
+{
+ int ret;
+
+ ret = bioset_init(&chunk_io_bioset, 64,
+ offsetof(struct chunk_bio, bio),
+ BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER);
+ if (!ret)
+ ret = bioset_init(&chunk_clone_bioset, 64, 0,
+ BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER);
+ return ret;
+}
+
+void chunk_done(void)
+{
+ bioset_exit(&chunk_io_bioset);
+ bioset_exit(&chunk_clone_bioset);
+}
diff --git a/drivers/block/blksnap/chunk.h b/drivers/block/blksnap/chunk.h
new file mode 100644
index 000000000000..cd119ac729df
--- /dev/null
+++ b/drivers/block/blksnap/chunk.h
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef __BLKSNAP_CHUNK_H
+#define __BLKSNAP_CHUNK_H
+
+#include <linux/blk_types.h>
+#include <linux/blkdev.h>
+#include <linux/rwsem.h>
+#include <linux/atomic.h>
+#include "diff_area.h"
+
+struct diff_area;
+struct diff_region;
+
+/**
+ * enum chunk_st - Possible states for a chunk.
+ *
+ * @CHUNK_ST_NEW:
+ * No data is associated with the chunk.
+ * @CHUNK_ST_IN_MEMORY:
+ *	The data of the chunk is ready to be read from the RAM buffer.
+ *	The chunk leaves this state when it is removed from the store
+ *	queue and its buffer is released.
+ * @CHUNK_ST_STORED:
+ * The data of the chunk has been written to the difference storage.
+ * @CHUNK_ST_FAILED:
+ * An error occurred while processing the chunk data.
+ *
+ * Chunk's life cycle:
+ * CHUNK_ST_NEW -> CHUNK_ST_IN_MEMORY <-> CHUNK_ST_STORED
+ */
+enum chunk_st {
+ CHUNK_ST_NEW,
+ CHUNK_ST_IN_MEMORY,
+ CHUNK_ST_STORED,
+ CHUNK_ST_FAILED,
+};
+
+/**
+ * struct chunk - Minimum data storage unit.
+ *
+ * @link:
+ *	The list head allows the chunk to be placed into a queue.
+ * @number:
+ * Sequential number of the chunk.
+ * @sector_count:
+ *	Number of sectors in the chunk. All chunks have the same size,
+ *	except for the last one, which may contain fewer sectors.
+ * @lock:
+ *	Binary semaphore. Synchronizes access to the chunk's fields: state,
+ * diff_buffer and diff_region.
+ * @diff_area:
+ *	Pointer to the difference area - the difference storage area for a
+ *	specific device. This field is only set while the chunk is locked
+ *	and protects the difference area from being released prematurely.
+ * @state:
+ * Defines the state of a chunk.
+ * @diff_buffer:
+ * Pointer to &struct diff_buffer. Describes a buffer in the memory
+ * for storing the chunk data.
+ * @diff_region:
+ * Pointer to &struct diff_region. Describes a copy of the chunk data
+ * on the difference storage.
+ *
+ * This structure describes the block of data that the module operates
+ * with when executing the copy-on-write algorithm and when performing I/O
+ * to snapshot images.
+ *
+ * If the data of the chunk has been changed or has just been read, the
+ * chunk is placed into the store queue.
+ *
+ * The semaphore is held while the buffer does not yet contain valid data,
+ * that is, while a block of data is being read from the original device
+ * or from the difference storage. It must also be held whenever data is
+ * being read from or written to the diff_buffer.
+ */
+struct chunk {
+ struct list_head link;
+ unsigned long number;
+ sector_t sector_count;
+
+ struct semaphore lock;
+ struct diff_area *diff_area;
+
+ enum chunk_st state;
+ struct diff_buffer *diff_buffer;
+ struct diff_region *diff_region;
+};
+
+static inline void chunk_up(struct chunk *chunk)
+{
+ struct diff_area *diff_area = chunk->diff_area;
+
+ chunk->diff_area = NULL;
+ up(&chunk->lock);
+ diff_area_put(diff_area);
+};
+
+void chunk_store_failed(struct chunk *chunk, int error);
+struct bio *chunk_alloc_clone(struct block_device *bdev, struct bio *bio);
+
+void chunk_copy_bio(struct chunk *chunk, struct bio *bio,
+ struct bvec_iter *iter);
+void chunk_clone_bio(struct chunk *chunk, struct bio *bio);
+void chunk_store(struct chunk *chunk);
+int chunk_load_and_schedule_io(struct chunk *chunk, struct bio *orig_bio);
+int chunk_load_and_postpone_io(struct chunk *chunk, struct bio **chunk_bio);
+void chunk_load_and_postpone_io_finish(struct list_head *chunks,
+ struct bio *chunk_bio, struct bio *orig_bio);
+
+int __init chunk_init(void);
+void chunk_done(void);
+#endif /* __BLKSNAP_CHUNK_H */
--
2.20.1
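
For reference, the bio sizing done by calc_max_vecs() and the chaining
loops in chunk_store()/__chunk_load() can be modelled in user space. The
sketch below assumes 512-byte sectors, 4 KiB pages and a 256-entry bio
vector limit; these mirror the usual kernel values but are illustrative
assumptions here, not part of the patch.

/*
 * User-space model of calc_max_vecs(): how many page vectors a single
 * bio gets when a chunk is read or written. The constants are assumed
 * values for illustration only.
 */
#include <stdio.h>

#define SECTOR_SHIFT	9
#define PAGE_SECTORS	(4096 >> SECTOR_SHIFT)	/* 8 sectors per 4 KiB page */
#define BIO_MAX_VECS	256			/* assumed bio vector limit */

static unsigned int calc_max_vecs(unsigned long long left_sectors)
{
	unsigned long long pages =
		(left_sectors + PAGE_SECTORS - 1) / PAGE_SECTORS;

	return pages < BIO_MAX_VECS ? (unsigned int)pages : BIO_MAX_VECS;
}

int main(void)
{
	/* A 256 KiB chunk (512 sectors) fits into a single bio: 64 vectors. */
	printf("256 KiB chunk -> %u page vectors\n", calc_max_vecs(512));
	/*
	 * A 2 MiB chunk (4096 sectors) needs 512 pages, more than the
	 * limit, so chunk_store() chains a second bio for the remainder.
	 */
	printf("2 MiB chunk   -> %u page vectors\n", calc_max_vecs(4096));
	return 0;
}

The same arithmetic drives the while (count) loops above: whenever
bio_add_page() cannot take another page, a new bio is allocated and
chained to the previous one.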


2023-06-12 14:40:57

by Sergei Shtepa

[permalink] [raw]
Subject: [PATCH v5 08/11] blksnap: difference storage

Provides management of difference blocks of block devices: storing
difference blocks and reading them back to build snapshot images.

Co-developed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sergei Shtepa <[email protected]>
---
drivers/block/blksnap/diff_area.c | 554 +++++++++++++++++++++++++++
drivers/block/blksnap/diff_area.h | 144 +++++++
drivers/block/blksnap/diff_buffer.c | 127 ++++++
drivers/block/blksnap/diff_buffer.h | 37 ++
drivers/block/blksnap/diff_storage.c | 316 +++++++++++++++
drivers/block/blksnap/diff_storage.h | 111 ++++++
6 files changed, 1289 insertions(+)
create mode 100644 drivers/block/blksnap/diff_area.c
create mode 100644 drivers/block/blksnap/diff_area.h
create mode 100644 drivers/block/blksnap/diff_buffer.c
create mode 100644 drivers/block/blksnap/diff_buffer.h
create mode 100644 drivers/block/blksnap/diff_storage.c
create mode 100644 drivers/block/blksnap/diff_storage.h
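
The chunk size picked by diff_area_calculate_chunk_size() can be
reproduced with a small stand-alone program: starting from a minimum
shift, the shift grows until the number of chunks fits under a limit and
a chunk is not smaller than the device's minimal I/O unit. The minimum
shift and maximum count below are illustrative stand-ins for the module
parameters, not their actual defaults.

/*
 * Stand-alone sketch of the chunk size selection. The limits are
 * assumed example values.
 */
#include <stdio.h>

#define SECTOR_SHIFT		9
#define CHUNK_MINIMUM_SHIFT	16		/* 64 KiB, assumed */
#define CHUNK_MAXIMUM_COUNT	(1ULL << 21)	/* assumed chunk limit */

static unsigned long long count_by_shift(unsigned long long capacity_sect,
					 unsigned int shift)
{
	unsigned long long chunk_sect = 1ULL << (shift - SECTOR_SHIFT);

	return (capacity_sect + chunk_sect - 1) / chunk_sect;
}

int main(void)
{
	/* A 1 TiB device with a 4 KiB minimal I/O size, in 512-byte sectors */
	unsigned long long capacity = 1ULL << 31;
	unsigned long long min_io_sect = 8;
	unsigned int shift = CHUNK_MINIMUM_SHIFT;
	unsigned long long count = count_by_shift(capacity, shift);

	while (count > CHUNK_MAXIMUM_COUNT ||
	       (1ULL << (shift - SECTOR_SHIFT)) < min_io_sect) {
		shift++;
		count = count_by_shift(capacity, shift);
	}
	printf("chunk size %llu KiB, %llu chunks\n",
	       (1ULL << shift) >> 10, count);
	return 0;
}

With these example limits a 1 TiB device ends up with 512 KiB chunks
(2097152 of them); larger devices get proportionally larger chunks.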

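Similarly, diff_storage_new_region() carves fixed-size regions out of
the storage blocks granted by user space, and a block whose remaining
space cannot hold a whole region is treated as filled. A simplified
user-space model of that allocation (it scans an array instead of
maintaining the empty/filled lists, and the block and region sizes are
arbitrary example values):

/*
 * Toy model of carving diff regions out of storage blocks. Sizes are
 * example values; the kernel code keeps empty and filled blocks on
 * separate lists instead of scanning an array.
 */
#include <stdio.h>

struct storage_block {
	unsigned long long count;	/* sectors in the block */
	unsigned long long used;	/* sectors already handed out */
};

/* Take 'count' sectors from the first block that still has enough room. */
static int new_region(struct storage_block *blocks, int nr_blocks,
		      unsigned long long count, unsigned long long *out_sector)
{
	int i;

	for (i = 0; i < nr_blocks; i++) {
		struct storage_block *blk = &blocks[i];

		if (blk->count - blk->used < count)
			continue;	/* not enough room left in this block */
		*out_sector = blk->used;
		blk->used += count;
		return i;
	}
	return -1;			/* -ENOSPC in the kernel code */
}

int main(void)
{
	/* Two 1 MiB storage blocks, 512 KiB regions, in 512-byte sectors. */
	struct storage_block blocks[2] = { { .count = 2048 }, { .count = 2048 } };
	unsigned long long sector;
	int i;

	for (i = 0; i < 5; i++) {
		int blk = new_region(blocks, 2, 1024, &sector);

		if (blk < 0) {
			printf("region %d: no space left\n", i);
			break;
		}
		printf("region %d: block %d, offset %llu\n", i, blk, sector);
	}
	return 0;
}
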
diff --git a/drivers/block/blksnap/diff_area.c b/drivers/block/blksnap/diff_area.c
new file mode 100644
index 000000000000..169fa003b6d6
--- /dev/null
+++ b/drivers/block/blksnap/diff_area.c
@@ -0,0 +1,554 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#define pr_fmt(fmt) KBUILD_MODNAME "-diff-area: " fmt
+
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/build_bug.h>
+#include <uapi/linux/blksnap.h>
+#include "chunk.h"
+#include "diff_buffer.h"
+#include "diff_storage.h"
+#include "params.h"
+
+static inline sector_t diff_area_chunk_offset(struct diff_area *diff_area,
+ sector_t sector)
+{
+ return sector & ((1ull << (diff_area->chunk_shift - SECTOR_SHIFT)) - 1);
+}
+
+static inline unsigned long diff_area_chunk_number(struct diff_area *diff_area,
+ sector_t sector)
+{
+ return (unsigned long)(sector >>
+ (diff_area->chunk_shift - SECTOR_SHIFT));
+}
+
+static inline sector_t chunk_sector(struct chunk *chunk)
+{
+ return (sector_t)(chunk->number)
+ << (chunk->diff_area->chunk_shift - SECTOR_SHIFT);
+}
+
+static inline sector_t last_chunk_size(sector_t sector_count, sector_t capacity)
+{
+ sector_t capacity_rounded = round_down(capacity, sector_count);
+
+ if (capacity > capacity_rounded)
+ sector_count = capacity - capacity_rounded;
+
+ return sector_count;
+}
+
+static inline unsigned long long count_by_shift(sector_t capacity,
+ unsigned long long shift)
+{
+ unsigned long long shift_sector = (shift - SECTOR_SHIFT);
+
+ return round_up(capacity, (1ull << shift_sector)) >> shift_sector;
+}
+
+static inline struct chunk *chunk_alloc(struct diff_area *diff_area,
+ unsigned long number)
+{
+ struct chunk *chunk;
+
+ chunk = kzalloc(sizeof(struct chunk), GFP_NOIO);
+ if (!chunk)
+ return NULL;
+
+ INIT_LIST_HEAD(&chunk->link);
+ sema_init(&chunk->lock, 1);
+ chunk->diff_area = NULL;
+ chunk->number = number;
+ chunk->state = CHUNK_ST_NEW;
+
+ chunk->sector_count = diff_area_chunk_sectors(diff_area);
	/*
+	 * The last chunk may contain fewer sectors than the others.
+	 */
+ if (unlikely((number + 1) == diff_area->chunk_count)) {
+ chunk->sector_count = bdev_nr_sectors(diff_area->orig_bdev) -
+ (chunk->sector_count * number);
+ }
+
+ return chunk;
+}
+
+static inline void chunk_free(struct diff_area *diff_area, struct chunk *chunk)
+{
+ down(&chunk->lock);
+ if (chunk->diff_buffer)
+ diff_buffer_release(diff_area, chunk->diff_buffer);
+ diff_storage_free_region(chunk->diff_region);
+ up(&chunk->lock);
+ kfree(chunk);
+}
+
+static void diff_area_calculate_chunk_size(struct diff_area *diff_area)
+{
+ unsigned long count;
+ unsigned long shift = get_chunk_minimum_shift();
+ sector_t capacity;
+ sector_t min_io_sect;
+
+ min_io_sect = (sector_t)(bdev_io_min(diff_area->orig_bdev) >>
+ SECTOR_SHIFT);
+ capacity = bdev_nr_sectors(diff_area->orig_bdev);
+ pr_debug("Minimal IO block %llu sectors\n", min_io_sect);
+ pr_debug("Device capacity %llu sectors\n", capacity);
+
+ count = count_by_shift(capacity, shift);
+ pr_debug("Chunks count %lu\n", count);
+ while ((count > get_chunk_maximum_count()) ||
+ ((1ul << (shift - SECTOR_SHIFT)) < min_io_sect)) {
+ shift++;
+ count = count_by_shift(capacity, shift);
+ pr_debug("Chunks count %lu\n", count);
+ }
+
+ diff_area->chunk_shift = shift;
+ diff_area->chunk_count = (unsigned long)DIV_ROUND_UP_ULL(capacity,
+ (1ul << (shift - SECTOR_SHIFT)));
+}
+
+void diff_area_free(struct kref *kref)
+{
+ unsigned long inx = 0;
+ struct chunk *chunk;
+ struct diff_area *diff_area =
+ container_of(kref, struct diff_area, kref);
+
+ might_sleep();
+
+ flush_work(&diff_area->store_queue_work);
+ xa_for_each(&diff_area->chunk_map, inx, chunk)
+ if (chunk)
+ chunk_free(diff_area, chunk);
+ xa_destroy(&diff_area->chunk_map);
+
+ if (diff_area->orig_bdev) {
+ blkdev_put(diff_area->orig_bdev, FMODE_READ | FMODE_WRITE);
+ diff_area->orig_bdev = NULL;
+ }
+
+ /* Clean up free_diff_buffers */
+ diff_buffer_cleanup(diff_area);
+
+ kfree(diff_area);
+}
+
+static inline bool diff_area_store_one(struct diff_area *diff_area)
+{
+ struct chunk *iter, *chunk = NULL;
+
+ spin_lock(&diff_area->store_queue_lock);
+ list_for_each_entry(iter, &diff_area->store_queue, link) {
+ if (!down_trylock(&iter->lock)) {
+ chunk = iter;
+ atomic_dec(&diff_area->store_queue_count);
+ list_del_init(&chunk->link);
+ chunk->diff_area = diff_area_get(diff_area);
+ break;
+ }
+ /*
+ * If it is not possible to lock a chunk for writing,
+ * then it is currently in use, and we try to clean up the
+ * next chunk.
+ */
+ }
+ spin_unlock(&diff_area->store_queue_lock);
+ if (!chunk)
+ return false;
+
+ if (chunk->state != CHUNK_ST_IN_MEMORY) {
+ /*
+ * There cannot be a chunk in the store queue whose buffer has
+ * not been read into memory.
+ */
+ chunk_up(chunk);
		pr_warn("Cannot release empty buffer for chunk #%ld\n",
+			chunk->number);
+ return true;
+ }
+
+ if (diff_area_is_corrupted(diff_area)) {
+ chunk_store_failed(chunk, 0);
+ return true;
+ }
+
+ if (!chunk->diff_region) {
+ struct diff_region *diff_region;
+
+ diff_region = diff_storage_new_region(
+ diff_area->diff_storage,
+ diff_area_chunk_sectors(diff_area),
+ diff_area->logical_blksz);
+
+ if (IS_ERR(diff_region)) {
+ pr_debug("Cannot get store for chunk #%ld\n",
+ chunk->number);
+ chunk_store_failed(chunk, PTR_ERR(diff_region));
+ return true;
+ }
+ chunk->diff_region = diff_region;
+ }
+ chunk_store(chunk);
+ return true;
+}
+
+static void diff_area_store_queue_work(struct work_struct *work)
+{
+ struct diff_area *diff_area = container_of(
+ work, struct diff_area, store_queue_work);
+
+ while (diff_area_store_one(diff_area))
+ ;
+}
+
+struct diff_area *diff_area_new(dev_t dev_id, struct diff_storage *diff_storage)
+{
+ int ret = 0;
+ struct diff_area *diff_area = NULL;
+ struct block_device *bdev;
+
+ pr_debug("Open device [%u:%u]\n", MAJOR(dev_id), MINOR(dev_id));
+
+ bdev = blkdev_get_by_dev(dev_id, FMODE_READ | FMODE_WRITE, NULL, NULL);
+ if (IS_ERR(bdev)) {
+ int err = PTR_ERR(bdev);
+
+ pr_err("Failed to open device. errno=%d\n", abs(err));
+ return ERR_PTR(err);
+ }
+
+ diff_area = kzalloc(sizeof(struct diff_area), GFP_KERNEL);
+ if (!diff_area) {
+ blkdev_put(bdev, FMODE_READ | FMODE_WRITE);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ kref_init(&diff_area->kref);
+ diff_area->orig_bdev = bdev;
+ diff_area->diff_storage = diff_storage;
+
+ diff_area_calculate_chunk_size(diff_area);
+ if (diff_area->chunk_shift > get_chunk_maximum_shift()) {
+ pr_info("The maximum allowable chunk size has been reached.\n");
+ return ERR_PTR(-EFAULT);
+ }
+ pr_debug("The optimal chunk size was calculated as %llu bytes for device [%d:%d]\n",
+ (1ull << diff_area->chunk_shift),
+ MAJOR(diff_area->orig_bdev->bd_dev),
+ MINOR(diff_area->orig_bdev->bd_dev));
+
+ xa_init(&diff_area->chunk_map);
+
+ spin_lock_init(&diff_area->store_queue_lock);
+ INIT_LIST_HEAD(&diff_area->store_queue);
+ atomic_set(&diff_area->store_queue_count, 0);
+ INIT_WORK(&diff_area->store_queue_work, diff_area_store_queue_work);
+
+ spin_lock_init(&diff_area->free_diff_buffers_lock);
+ INIT_LIST_HEAD(&diff_area->free_diff_buffers);
+ atomic_set(&diff_area->free_diff_buffers_count, 0);
+
+ diff_area->physical_blksz = bdev->bd_queue->limits.physical_block_size;
+ diff_area->logical_blksz = bdev->bd_queue->limits.logical_block_size;
+ diff_area->corrupt_flag = 0;
+
+ if (!diff_storage->capacity) {
+ pr_err("Difference storage is empty\n");
+ pr_err("In-memory difference storage is not supported\n");
+ ret = -EFAULT;
+ }
+
+ if (ret) {
+ diff_area_put(diff_area);
+ return ERR_PTR(ret);
+ }
+
+ return diff_area;
+}
+
+static inline unsigned int chunk_limit(struct chunk *chunk,
+ struct bvec_iter *iter)
+{
+ sector_t chunk_ofs = iter->bi_sector - chunk_sector(chunk);
+ sector_t chunk_left = chunk->sector_count - chunk_ofs;
+
+ return min(iter->bi_size, (unsigned int)(chunk_left << SECTOR_SHIFT));
+}
+
+/*
+ * Implements the copy-on-write mechanism.
+ */
+bool diff_area_cow(struct bio *bio, struct diff_area *diff_area,
+ struct bvec_iter *iter)
+{
+ bool nowait = bio->bi_opf & REQ_NOWAIT;
+ struct bio *chunk_bio = NULL;
+ LIST_HEAD(chunks);
+ int ret = 0;
+
+ while (iter->bi_size) {
+ unsigned long nr = diff_area_chunk_number(diff_area,
+ iter->bi_sector);
+ struct chunk *chunk = xa_load(&diff_area->chunk_map, nr);
+ unsigned int len;
+
+ if (!chunk) {
+ chunk = chunk_alloc(diff_area, nr);
+ if (!chunk) {
+ diff_area_set_corrupted(diff_area, -EINVAL);
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ ret = xa_insert(&diff_area->chunk_map, nr, chunk,
+ GFP_NOIO);
+ if (likely(!ret)) {
+ /* new chunk has been added */
+ } else if (ret == -EBUSY) {
+ /* another chunk has just been created */
+ chunk_free(diff_area, chunk);
+ chunk = xa_load(&diff_area->chunk_map, nr);
+ WARN_ON_ONCE(!chunk);
+ if (unlikely(!chunk)) {
+ ret = -EINVAL;
+ diff_area_set_corrupted(diff_area, ret);
+ goto fail;
+ }
+ } else if (ret) {
				pr_err("Failed to insert chunk into the chunk map\n");
+ chunk_free(diff_area, chunk);
+ diff_area_set_corrupted(diff_area, ret);
+ goto fail;
+ }
+ }
+
+ if (nowait) {
+ if (down_trylock(&chunk->lock)) {
+ ret = -EAGAIN;
+ goto fail;
+ }
+ } else {
+ ret = down_killable(&chunk->lock);
+ if (unlikely(ret))
+ goto fail;
+ }
+ chunk->diff_area = diff_area_get(diff_area);
+
+ len = chunk_limit(chunk, iter);
+ if (chunk->state == CHUNK_ST_NEW) {
+ if (nowait) {
+ /*
+ * If the data of this chunk has not yet been
+ * copied to the difference storage, then it is
+ * impossible to process the I/O write unit with
+ * the NOWAIT flag.
+ */
+ chunk_up(chunk);
+ ret = -EAGAIN;
+ goto fail;
+ }
+
+ /*
+ * Load the chunk asynchronously.
+ */
+ ret = chunk_load_and_postpone_io(chunk, &chunk_bio);
+ if (ret) {
+ chunk_up(chunk);
+ goto fail;
+ }
+ list_add_tail(&chunk->link, &chunks);
+ } else {
+ /*
+ * The chunk has already been:
+ * - failed, when the snapshot is corrupted
+ * - read into the buffer
+ * - stored into the diff storage
+ * In this case, we do not change the chunk.
+ */
+ chunk_up(chunk);
+ }
+ bio_advance_iter_single(bio, iter, len);
+ }
+
+ if (chunk_bio) {
+ /* Postpone bio processing in a callback. */
+ chunk_load_and_postpone_io_finish(&chunks, chunk_bio, bio);
+ return true;
+ }
+ /* Pass bio to the low level */
+ return false;
+
+fail:
+ if (chunk_bio) {
+ chunk_bio->bi_status = errno_to_blk_status(ret);
+ bio_endio(chunk_bio);
+ }
+
+ if (ret == -EAGAIN) {
+ /*
		 * The -EAGAIN error code means that an I/O unit with the
+		 * REQ_NOWAIT flag cannot be processed without blocking.
+		 * Such an I/O unit is completed with this error code.
+ */
+ bio->bi_status = BLK_STS_AGAIN;
+ bio_endio(bio);
+ return true;
+ }
+ /* In any other case, the processing of the I/O unit continues. */
+ return false;
+}
+
+static void orig_clone_endio(struct bio *bio)
+{
+ struct bio *orig_bio = bio->bi_private;
+
+ if (unlikely(bio->bi_status != BLK_STS_OK))
+ bio_io_error(orig_bio);
+ else
+ bio_endio(orig_bio);
+}
+
+static void orig_clone_bio(struct diff_area *diff_area, struct bio *bio)
+{
+ struct bio *new_bio;
+ struct block_device *bdev = diff_area->orig_bdev;
+ sector_t chunk_limit;
+
+ new_bio = chunk_alloc_clone(bdev, bio);
+ WARN_ON(!new_bio);
+
+ chunk_limit = diff_area_chunk_sectors(diff_area) -
+ diff_area_chunk_offset(diff_area, bio->bi_iter.bi_sector);
+
+ new_bio->bi_iter.bi_sector = bio->bi_iter.bi_sector;
+ new_bio->bi_iter.bi_size = min_t(unsigned int,
+ bio->bi_iter.bi_size, chunk_limit << SECTOR_SHIFT);
+
+ bio_set_flag(new_bio, BIO_FILTERED);
+ new_bio->bi_end_io = orig_clone_endio;
+ new_bio->bi_private = bio;
+
+ bio_advance(bio, new_bio->bi_iter.bi_size);
+ bio_inc_remaining(bio);
+
+ submit_bio_noacct(new_bio);
+}
+
+bool diff_area_submit_chunk(struct diff_area *diff_area, struct bio *bio)
+{
+ int ret;
+ struct chunk *chunk;
+ unsigned long nr = diff_area_chunk_number(diff_area,
+ bio->bi_iter.bi_sector);
+
+ chunk = xa_load(&diff_area->chunk_map, nr);
+ /*
+ * If this chunk is not in the chunk map, then the COW algorithm did
+ * not access this part of the disk space, and writing to the snapshot
+ * in this part was also not performed.
+ */
+ if (!chunk) {
+ if (op_is_write(bio_op(bio))) {
+ /*
+ * To process a write bio, we need to allocate a new
+ * chunk.
+ */
+ chunk = chunk_alloc(diff_area, nr);
+ WARN_ON_ONCE(!chunk);
+ if (unlikely(!chunk))
+ return false;
+
+ ret = xa_insert(&diff_area->chunk_map, nr, chunk,
+ GFP_NOIO);
+ if (likely(!ret)) {
+ /* new chunk has been added */
+ } else if (ret == -EBUSY) {
+ /* another chunk has just been created */
+ chunk_free(diff_area, chunk);
+ chunk = xa_load(&diff_area->chunk_map, nr);
+ WARN_ON_ONCE(!chunk);
+ if (unlikely(!chunk))
+ return false;
+ } else if (ret) {
				pr_err("Failed to insert chunk into the chunk map\n");
+ chunk_free(diff_area, chunk);
+ return false;
+ }
+ } else {
+ /*
+ * To read, we simply redirect the bio to the original
+ * block device.
+ */
+ orig_clone_bio(diff_area, bio);
+ return true;
+ }
+ }
+
+ if (down_killable(&chunk->lock))
+ return false;
+ chunk->diff_area = diff_area_get(diff_area);
+
+ if (unlikely(chunk->state == CHUNK_ST_FAILED)) {
+ pr_err("Chunk #%ld corrupted\n", chunk->number);
+ chunk_up(chunk);
+ return false;
+ }
+ if (chunk->state == CHUNK_ST_IN_MEMORY) {
+ /*
+ * Directly copy data from the in-memory chunk or
+ * copy to the in-memory chunk for write operation.
+ */
+ chunk_copy_bio(chunk, bio, &bio->bi_iter);
+ chunk_up(chunk);
+ return true;
+ }
+ if ((chunk->state == CHUNK_ST_STORED) || !op_is_write(bio_op(bio))) {
+ /*
+ * Read data from the chunk on difference storage.
+ */
+ chunk_clone_bio(chunk, bio);
+ chunk_up(chunk);
+ return true;
+ }
+ /*
+ * Starts asynchronous loading of a chunk from the original block device
+ * or difference storage and schedule copying data to (or from) the
+ * in-memory chunk.
+ */
+ if (chunk_load_and_schedule_io(chunk, bio)) {
+ chunk_up(chunk);
+ return false;
+ }
+ return true;
+}
+
+static inline void diff_area_event_corrupted(struct diff_area *diff_area)
+{
+ struct blksnap_event_corrupted data = {
+ .dev_id_mj = MAJOR(diff_area->orig_bdev->bd_dev),
+ .dev_id_mn = MINOR(diff_area->orig_bdev->bd_dev),
+ .err_code = abs(diff_area->error_code),
+ };
+
+ event_gen(&diff_area->diff_storage->event_queue, GFP_NOIO,
+ blksnap_event_code_corrupted, &data,
+ sizeof(struct blksnap_event_corrupted));
+}
+
+void diff_area_set_corrupted(struct diff_area *diff_area, int err_code)
+{
+ if (test_and_set_bit(0, &diff_area->corrupt_flag))
+ return;
+
+ diff_area->error_code = err_code;
+ diff_area_event_corrupted(diff_area);
+
	pr_err("Snapshot for device [%u:%u] is marked as corrupted with error code %d\n",
+ MAJOR(diff_area->orig_bdev->bd_dev),
+ MINOR(diff_area->orig_bdev->bd_dev), abs(err_code));
+}
diff --git a/drivers/block/blksnap/diff_area.h b/drivers/block/blksnap/diff_area.h
new file mode 100644
index 000000000000..6ecec9390282
--- /dev/null
+++ b/drivers/block/blksnap/diff_area.h
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef __BLKSNAP_DIFF_AREA_H
+#define __BLKSNAP_DIFF_AREA_H
+
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/kref.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/xarray.h>
+#include "event_queue.h"
+
+struct diff_storage;
+struct chunk;
+
+/**
+ * struct diff_area - Describes the difference area for one original device.
+ *
+ * @kref:
+ *	The reference counter allows managing the lifetime of the object.
+ * @orig_bdev:
+ * A pointer to the structure of an opened block device.
+ * @diff_storage:
+ * Pointer to difference storage for storing difference data.
+ * @chunk_shift:
+ *	Power of 2 used to specify the chunk size. This allows setting
+ *	different chunk sizes for huge and small block devices.
+ * @chunk_count:
+ * Count of chunks. The number of chunks into which the block device
+ * is divided.
+ * @chunk_map:
+ * A map of chunks.
+ * @store_queue_lock:
+ *	This spinlock guarantees consistency of the store queue linked
+ *	list of chunks.
+ * @store_queue:
+ * The queue of chunks waiting to be stored to the difference storage.
+ * @store_queue_count:
+ * The number of chunks in the store queue.
+ * @store_queue_work:
+ * The workqueue work item. This worker limits the number of chunks
+ * that store their data in RAM.
+ * @free_diff_buffers_lock:
+ * This spinlock guarantees consistency of the linked lists of
+ * free difference buffers.
+ * @free_diff_buffers:
+ *	Linked list of free difference buffers. Keeping released buffers on
+ *	this list reduces the number of allocation and release operations.
+ * @physical_blksz:
+ * The physical block size for the snapshot image is equal to the
+ * physical block size of the original device.
+ * @logical_blksz:
+ * The logical block size for the snapshot image is equal to the
+ * logical block size of the original device.
+ * @free_diff_buffers_count:
+ * The number of free difference buffers in the linked list.
+ * @corrupt_flag:
+ * The flag is set if an error occurred in the operation of the data
+ * saving mechanism in the diff area. In this case, an error will be
+ * generated when reading from the snapshot image.
+ * @error_code:
+ * The error code that caused the snapshot to be corrupted.
+ *
+ * The &struct diff_area is created for each block device in the snapshot.
+ * It is used to save the differences between the original block device and
+ * the snapshot image. That is, when writing data to the original device,
+ * the differences are copied as chunks to the difference storage.
+ * Reading and writing from the snapshot image is also performed using
+ * &struct diff_area.
+ *
+ * The xarray has a limit on the maximum size. This can be especially
+ * noticeable on 32-bit systems. This creates a limit in the size of
+ * supported disks.
+ *
+ * For example, for a 256 TiB disk with a chunk size of 65536 bytes, the
+ * number of elements in the chunk map will be 2^32. Therefore, the
+ * number of chunks into which the block device is divided is limited.
+ *
+ * The store queue allows postponing the storing of a chunk's data to the
+ * difference storage, so that it can be performed later in the worker
+ * thread.
+ *
+ * The linked list of difference buffers keeps a certain number of "hot"
+ * buffers available, which reduces the number of memory allocations and
+ * releases.
+ */
+struct diff_area {
+ struct kref kref;
+ struct block_device *orig_bdev;
+ struct diff_storage *diff_storage;
+
+ unsigned long chunk_shift;
+ unsigned long chunk_count;
+ struct xarray chunk_map;
+
+ spinlock_t store_queue_lock;
+ struct list_head store_queue;
+ atomic_t store_queue_count;
+ struct work_struct store_queue_work;
+
+ spinlock_t free_diff_buffers_lock;
+ struct list_head free_diff_buffers;
+ atomic_t free_diff_buffers_count;
+
+ unsigned int physical_blksz;
+ unsigned int logical_blksz;
+
+ unsigned long corrupt_flag;
+ int error_code;
+};
+
+struct diff_area *diff_area_new(dev_t dev_id,
+ struct diff_storage *diff_storage);
+void diff_area_free(struct kref *kref);
+static inline struct diff_area *diff_area_get(struct diff_area *diff_area)
+{
+ kref_get(&diff_area->kref);
+ return diff_area;
+};
+static inline void diff_area_put(struct diff_area *diff_area)
+{
+ kref_put(&diff_area->kref, diff_area_free);
+};
+
+void diff_area_set_corrupted(struct diff_area *diff_area, int err_code);
+static inline bool diff_area_is_corrupted(struct diff_area *diff_area)
+{
+ return !!diff_area->corrupt_flag;
+};
+static inline sector_t diff_area_chunk_sectors(struct diff_area *diff_area)
+{
+ return (sector_t)(1ull << (diff_area->chunk_shift - SECTOR_SHIFT));
+};
+bool diff_area_cow(struct bio *bio, struct diff_area *diff_area,
+ struct bvec_iter *iter);
+
+bool diff_area_submit_chunk(struct diff_area *diff_area, struct bio *bio);
+void diff_area_rw_chunk(struct kref *kref);
+
+#endif /* __BLKSNAP_DIFF_AREA_H */
diff --git a/drivers/block/blksnap/diff_buffer.c b/drivers/block/blksnap/diff_buffer.c
new file mode 100644
index 000000000000..77ad59cc46b3
--- /dev/null
+++ b/drivers/block/blksnap/diff_buffer.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#define pr_fmt(fmt) KBUILD_MODNAME "-diff-buffer: " fmt
+
+#include "diff_buffer.h"
+#include "diff_area.h"
+#include "params.h"
+
+static void diff_buffer_free(struct diff_buffer *diff_buffer)
+{
+ size_t inx = 0;
+
+ if (unlikely(!diff_buffer))
+ return;
+
+ for (inx = 0; inx < diff_buffer->page_count; inx++) {
+ struct page *page = diff_buffer->pages[inx];
+
+ if (page)
+ __free_page(page);
+ }
+
+ kfree(diff_buffer);
+}
+
+static struct diff_buffer *
+diff_buffer_new(size_t page_count, size_t buffer_size, gfp_t gfp_mask)
+{
+ struct diff_buffer *diff_buffer;
+ size_t inx = 0;
+ struct page *page;
+
+ if (unlikely(page_count <= 0))
+ return NULL;
+
+ /*
+ * In case of overflow, it is better to get a null pointer
+ * than a pointer to some memory area. Therefore + 1.
+ */
+ diff_buffer = kzalloc(sizeof(struct diff_buffer) +
+ (page_count + 1) * sizeof(struct page *),
+ gfp_mask);
+ if (!diff_buffer)
+ return NULL;
+
+ INIT_LIST_HEAD(&diff_buffer->link);
+ diff_buffer->size = buffer_size;
+ diff_buffer->page_count = page_count;
+
+ for (inx = 0; inx < page_count; inx++) {
+ page = alloc_page(gfp_mask);
+ if (!page)
+ goto fail;
+
+ diff_buffer->pages[inx] = page;
+ }
+ return diff_buffer;
+fail:
+ diff_buffer_free(diff_buffer);
+ return NULL;
+}
+
+struct diff_buffer *diff_buffer_take(struct diff_area *diff_area)
+{
+ struct diff_buffer *diff_buffer = NULL;
+ sector_t chunk_sectors;
+ size_t page_count;
+ size_t buffer_size;
+
+ spin_lock(&diff_area->free_diff_buffers_lock);
+ diff_buffer = list_first_entry_or_null(&diff_area->free_diff_buffers,
+ struct diff_buffer, link);
+ if (diff_buffer) {
+ list_del(&diff_buffer->link);
+ atomic_dec(&diff_area->free_diff_buffers_count);
+ }
+ spin_unlock(&diff_area->free_diff_buffers_lock);
+
+ /* Return free buffer if it was found in a pool */
+ if (diff_buffer)
+ return diff_buffer;
+
+ /* Allocate new buffer */
+ chunk_sectors = diff_area_chunk_sectors(diff_area);
+ page_count = round_up(chunk_sectors, PAGE_SECTORS) / PAGE_SECTORS;
+ buffer_size = chunk_sectors << SECTOR_SHIFT;
+
+ diff_buffer =
+ diff_buffer_new(page_count, buffer_size, GFP_NOIO);
+ if (unlikely(!diff_buffer))
+ return ERR_PTR(-ENOMEM);
+ return diff_buffer;
+}
+
+void diff_buffer_release(struct diff_area *diff_area,
+ struct diff_buffer *diff_buffer)
+{
+ if (atomic_read(&diff_area->free_diff_buffers_count) >
+ get_free_diff_buffer_pool_size()) {
+ diff_buffer_free(diff_buffer);
+ return;
+ }
+ spin_lock(&diff_area->free_diff_buffers_lock);
+ list_add_tail(&diff_buffer->link, &diff_area->free_diff_buffers);
+ atomic_inc(&diff_area->free_diff_buffers_count);
+ spin_unlock(&diff_area->free_diff_buffers_lock);
+}
+
+void diff_buffer_cleanup(struct diff_area *diff_area)
+{
+ struct diff_buffer *diff_buffer = NULL;
+
+ do {
+ spin_lock(&diff_area->free_diff_buffers_lock);
+ diff_buffer =
+ list_first_entry_or_null(&diff_area->free_diff_buffers,
+ struct diff_buffer, link);
+ if (diff_buffer) {
+ list_del(&diff_buffer->link);
+ atomic_dec(&diff_area->free_diff_buffers_count);
+ }
+ spin_unlock(&diff_area->free_diff_buffers_lock);
+
+ if (diff_buffer)
+ diff_buffer_free(diff_buffer);
+ } while (diff_buffer);
+}
diff --git a/drivers/block/blksnap/diff_buffer.h b/drivers/block/blksnap/diff_buffer.h
new file mode 100644
index 000000000000..f81e56cf4b9a
--- /dev/null
+++ b/drivers/block/blksnap/diff_buffer.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef __BLKSNAP_DIFF_BUFFER_H
+#define __BLKSNAP_DIFF_BUFFER_H
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/blkdev.h>
+
+struct diff_area;
+
+/**
+ * struct diff_buffer - Difference buffer.
+ * @link:
+ * The list header allows to create a pool of the diff_buffer structures.
+ * @size:
+ * Count of bytes in the buffer.
+ * @page_count:
+ * The number of pages reserved for the buffer.
+ * @pages:
+ * An array of pointers to pages.
+ *
+ * Describes the memory buffer for a chunk in the memory.
+ */
+struct diff_buffer {
+ struct list_head link;
+ size_t size;
+ size_t page_count;
	struct page *pages[];
+};
+
+struct diff_buffer *diff_buffer_take(struct diff_area *diff_area);
+void diff_buffer_release(struct diff_area *diff_area,
+ struct diff_buffer *diff_buffer);
+void diff_buffer_cleanup(struct diff_area *diff_area);
+#endif /* __BLKSNAP_DIFF_BUFFER_H */
diff --git a/drivers/block/blksnap/diff_storage.c b/drivers/block/blksnap/diff_storage.c
new file mode 100644
index 000000000000..1787fa6931a8
--- /dev/null
+++ b/drivers/block/blksnap/diff_storage.c
@@ -0,0 +1,316 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#define pr_fmt(fmt) KBUILD_MODNAME "-diff-storage: " fmt
+
+#include <linux/slab.h>
+#include <linux/sched/mm.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/build_bug.h>
+#include <uapi/linux/blksnap.h>
+#include "chunk.h"
+#include "diff_buffer.h"
+#include "diff_storage.h"
+#include "params.h"
+
+/**
+ * struct storage_bdev - Information about the opened block device.
+ *
+ * @link:
+ *	Allows combining structures into a linked list.
+ * @bdev:
+ * A pointer to an open block device.
+ */
+struct storage_bdev {
+ struct list_head link;
+ struct block_device *bdev;
+};
+
+/**
+ * struct storage_block - A storage unit reserved for storing differences.
+ *
+ * @link:
+ *	Allows combining structures into a linked list.
+ * @bdev:
+ * A pointer to a block device.
+ * @sector:
+ * The number of the first sector of the range of allocated space for
+ * storing the difference.
+ * @count:
+ * The count of sectors in the range of allocated space for storing the
+ * difference.
+ * @used:
+ * The count of used sectors in the range of allocated space for storing
+ * the difference.
+ */
+struct storage_block {
+ struct list_head link;
+ struct block_device *bdev;
+ sector_t sector;
+ sector_t count;
+ sector_t used;
+};
+
+static inline void diff_storage_event_low(struct diff_storage *diff_storage)
+{
+ struct blksnap_event_low_free_space data = {
+ .requested_nr_sect = get_diff_storage_minimum(),
+ };
+
+ diff_storage->requested += data.requested_nr_sect;
+ pr_debug("Diff storage low free space. Portion: %llu sectors, requested: %llu\n",
+ data.requested_nr_sect, diff_storage->requested);
+ event_gen(&diff_storage->event_queue, GFP_NOIO,
+ blksnap_event_code_low_free_space, &data, sizeof(data));
+}
+
+struct diff_storage *diff_storage_new(void)
+{
+ struct diff_storage *diff_storage;
+
+ diff_storage = kzalloc(sizeof(struct diff_storage), GFP_KERNEL);
+ if (!diff_storage)
+ return NULL;
+
+ kref_init(&diff_storage->kref);
+ spin_lock_init(&diff_storage->lock);
+ INIT_LIST_HEAD(&diff_storage->storage_bdevs);
+ INIT_LIST_HEAD(&diff_storage->empty_blocks);
+ INIT_LIST_HEAD(&diff_storage->filled_blocks);
+
+ event_queue_init(&diff_storage->event_queue);
+ diff_storage_event_low(diff_storage);
+
+ return diff_storage;
+}
+
+static inline struct storage_block *
+first_empty_storage_block(struct diff_storage *diff_storage)
+{
+ return list_first_entry_or_null(&diff_storage->empty_blocks,
+ struct storage_block, link);
+};
+
+static inline struct storage_block *
+first_filled_storage_block(struct diff_storage *diff_storage)
+{
+ return list_first_entry_or_null(&diff_storage->filled_blocks,
+ struct storage_block, link);
+};
+
+static inline struct storage_bdev *
+first_storage_bdev(struct diff_storage *diff_storage)
+{
+ return list_first_entry_or_null(&diff_storage->storage_bdevs,
+ struct storage_bdev, link);
+};
+
+void diff_storage_free(struct kref *kref)
+{
+ struct diff_storage *diff_storage =
+ container_of(kref, struct diff_storage, kref);
+ struct storage_block *blk;
+ struct storage_bdev *storage_bdev;
+
+ while ((blk = first_empty_storage_block(diff_storage))) {
+ list_del(&blk->link);
+ kfree(blk);
+ }
+
+ while ((blk = first_filled_storage_block(diff_storage))) {
+ list_del(&blk->link);
+ kfree(blk);
+ }
+
+ while ((storage_bdev = first_storage_bdev(diff_storage))) {
+ blkdev_put(storage_bdev->bdev, FMODE_READ | FMODE_WRITE);
+ list_del(&storage_bdev->link);
+ kfree(storage_bdev);
+ }
+ event_queue_done(&diff_storage->event_queue);
+
+ kfree(diff_storage);
+}
+
+static struct block_device *diff_storage_add_storage_bdev(
+ struct diff_storage *diff_storage, const char *bdev_path)
+{
+ struct storage_bdev *storage_bdev, *existing_bdev = NULL;
+ struct block_device *bdev;
+
+ bdev = blkdev_get_by_path(bdev_path, FMODE_READ | FMODE_WRITE,
+ NULL, NULL);
+ if (IS_ERR(bdev)) {
+ pr_err("Failed to open device. errno=%ld\n", PTR_ERR(bdev));
+ return bdev;
+ }
+
	spin_lock(&diff_storage->lock);
+	list_for_each_entry(existing_bdev, &diff_storage->storage_bdevs, link) {
+		if (existing_bdev->bdev == bdev) {
+			/* The device has already been added to the storage. */
+			spin_unlock(&diff_storage->lock);
+			blkdev_put(bdev, FMODE_READ | FMODE_WRITE);
+			return bdev;
+		}
+	}
+	spin_unlock(&diff_storage->lock);
+
+ storage_bdev = kzalloc(sizeof(struct storage_bdev) +
+ strlen(bdev_path) + 1, GFP_KERNEL);
+ if (!storage_bdev) {
+ blkdev_put(bdev, FMODE_READ | FMODE_WRITE);
+ return ERR_PTR(-ENOMEM);
+ }
+
+ INIT_LIST_HEAD(&storage_bdev->link);
+ storage_bdev->bdev = bdev;
+
+ spin_lock(&diff_storage->lock);
+ list_add_tail(&storage_bdev->link, &diff_storage->storage_bdevs);
+ spin_unlock(&diff_storage->lock);
+
+ return bdev;
+}
+
+static inline int diff_storage_add_range(struct diff_storage *diff_storage,
+ struct block_device *bdev,
+ sector_t sector, sector_t count)
+{
+ struct storage_block *storage_block;
+
+ pr_debug("Add range to diff storage: [%u:%u] %llu:%llu\n",
+ MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev), sector, count);
+
+ storage_block = kzalloc(sizeof(struct storage_block), GFP_KERNEL);
+ if (!storage_block)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&storage_block->link);
+ storage_block->bdev = bdev;
+ storage_block->sector = sector;
+ storage_block->count = count;
+
+ spin_lock(&diff_storage->lock);
+ list_add_tail(&storage_block->link, &diff_storage->empty_blocks);
+ diff_storage->capacity += count;
+ spin_unlock(&diff_storage->lock);
+
+ return 0;
+}
+
+int diff_storage_append_block(struct diff_storage *diff_storage,
+ const char *bdev_path,
+ struct blksnap_sectors __user *ranges,
+ unsigned int range_count)
+{
+ int ret;
+ int inx;
+ struct block_device *bdev;
+ struct blksnap_sectors range;
+
+ pr_debug("Append %u blocks\n", range_count);
+
+ bdev = diff_storage_add_storage_bdev(diff_storage, bdev_path);
+ if (IS_ERR(bdev))
+ return PTR_ERR(bdev);
+
+ for (inx = 0; inx < range_count; inx++) {
+ if (unlikely(copy_from_user(&range, ranges+inx, sizeof(range))))
+ return -EINVAL;
+
+ ret = diff_storage_add_range(diff_storage, bdev,
+ range.offset,
+ range.count);
+ if (unlikely(ret))
+ return ret;
+ }
+
+ if (atomic_read(&diff_storage->low_space_flag) &&
+ (diff_storage->capacity >= diff_storage->requested))
+ atomic_set(&diff_storage->low_space_flag, 0);
+
+ return 0;
+}
+
+static inline bool is_halffull(const sector_t sectors_left)
+{
+ return sectors_left <=
+ ((get_diff_storage_minimum() >> 1) & ~(PAGE_SECTORS - 1));
+}
+
+struct diff_region *diff_storage_new_region(struct diff_storage *diff_storage,
+ sector_t count,
+ unsigned int logical_blksz)
+{
+ int ret = 0;
+ struct diff_region *diff_region;
+ sector_t sectors_left;
+
+ if (atomic_read(&diff_storage->overflow_flag))
+ return ERR_PTR(-ENOSPC);
+
+ diff_region = kzalloc(sizeof(struct diff_region), GFP_NOIO);
+ if (!diff_region)
+ return ERR_PTR(-ENOMEM);
+
+ spin_lock(&diff_storage->lock);
+ do {
+ struct storage_block *storage_block;
+ sector_t available;
+ struct request_queue *q;
+
+ storage_block = first_empty_storage_block(diff_storage);
+ if (unlikely(!storage_block)) {
+ atomic_inc(&diff_storage->overflow_flag);
+ ret = -ENOSPC;
+ break;
+ }
+
+ q = storage_block->bdev->bd_queue;
+ if (logical_blksz < q->limits.logical_block_size) {
			pr_err("Incompatible block sizes detected\n");
+ ret = -ENOTBLK;
+ break;
+ }
+
+ available = storage_block->count - storage_block->used;
+ if (likely(available >= count)) {
+ diff_region->bdev = storage_block->bdev;
+ diff_region->sector =
+ storage_block->sector + storage_block->used;
+ diff_region->count = count;
+
+ storage_block->used += count;
+ diff_storage->filled += count;
+ break;
+ }
+
+ list_del(&storage_block->link);
+ list_add_tail(&storage_block->link,
+ &diff_storage->filled_blocks);
+ /*
		 * If there is still free space in the storage block, but it
+		 * is not enough to store a whole region, the block is
+		 * considered used. The storage blocks are assumed to be
+		 * large enough to hold several regions entirely.
+ */
+ diff_storage->filled += available;
+ } while (1);
+ sectors_left = diff_storage->requested - diff_storage->filled;
+ spin_unlock(&diff_storage->lock);
+
+ if (ret) {
+ pr_err("Cannot get empty storage block\n");
+ diff_storage_free_region(diff_region);
+ return ERR_PTR(ret);
+ }
+
+ if (is_halffull(sectors_left) &&
+ (atomic_inc_return(&diff_storage->low_space_flag) == 1))
+ diff_storage_event_low(diff_storage);
+
+ return diff_region;
+}
diff --git a/drivers/block/blksnap/diff_storage.h b/drivers/block/blksnap/diff_storage.h
new file mode 100644
index 000000000000..0913a0114ac0
--- /dev/null
+++ b/drivers/block/blksnap/diff_storage.h
@@ -0,0 +1,111 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef __BLKSNAP_DIFF_STORAGE_H
+#define __BLKSNAP_DIFF_STORAGE_H
+
+#include "event_queue.h"
+
+struct blksnap_sectors;
+
+/**
+ * struct diff_region - Describes the location of a chunk's data on the
+ * difference storage.
+ * @bdev:
+ * The target block device.
+ * @sector:
+ *	The number of the first sector of the region.
+ * @count:
+ * The count of sectors in the region.
+ */
+struct diff_region {
+ struct block_device *bdev;
+ sector_t sector;
+ sector_t count;
+};
+
+/**
+ * struct diff_storage - Difference storage.
+ *
+ * @kref:
+ * The reference counter.
+ * @lock:
+ *	The spinlock guarantees the consistency of the linked lists.
+ * @storage_bdevs:
+ *	List of opened block devices. Blocks for storing snapshot data can
+ *	be located on different block devices, so all opened block devices
+ *	are kept in this list. Blocks on these devices are allocated for
+ *	storing the chunks' data.
+ * @empty_blocks:
+ *	List of empty blocks on storage. This list can be updated while a
+ *	snapshot is held, which allows dynamically increasing the storage
+ *	size for that snapshot.
+ * @filled_blocks:
+ *	List of filled blocks. When a block from the list of empty blocks
+ *	is filled, it is moved to this list.
+ * @capacity:
+ * Total amount of available storage space.
+ * @filled:
+ * The number of sectors already filled in.
+ * @requested:
+ * The number of sectors already requested from user space.
+ * @low_space_flag:
+ * The flag is set if the number of free regions available in the
+ * difference storage is less than the allowed minimum.
+ * @overflow_flag:
+ * The request for a free region failed due to the absence of free
+ * regions in the difference storage.
+ * @event_queue:
+ * A queue of events to pass events to user space. Diff storage and its
+ * owner can notify its snapshot about events like snapshot overflow,
+ * low free space and snapshot terminated.
+ *
+ * The difference storage manages the regions of block devices that are used
+ * to store the data of the original block devices in the snapshot.
+ * The difference storage is created one per snapshot and is used to store
+ * data from all the original snapshot block devices. At the same time, the
+ * difference storage itself can contain regions on various block devices.
+ */
+struct diff_storage {
+ struct kref kref;
+ spinlock_t lock;
+
+ struct list_head storage_bdevs;
+ struct list_head empty_blocks;
+ struct list_head filled_blocks;
+
+ sector_t capacity;
+ sector_t filled;
+ sector_t requested;
+
+ atomic_t low_space_flag;
+ atomic_t overflow_flag;
+
+ struct event_queue event_queue;
+};
+
+struct diff_storage *diff_storage_new(void);
+void diff_storage_free(struct kref *kref);
+
+static inline void diff_storage_get(struct diff_storage *diff_storage)
+{
+ kref_get(&diff_storage->kref);
+};
+static inline void diff_storage_put(struct diff_storage *diff_storage)
+{
+ if (likely(diff_storage))
+ kref_put(&diff_storage->kref, diff_storage_free);
+};
+
+int diff_storage_append_block(struct diff_storage *diff_storage,
+ const char *bdev_path,
+ struct blksnap_sectors __user *ranges,
+ unsigned int range_count);
+struct diff_region *diff_storage_new_region(struct diff_storage *diff_storage,
+ sector_t count,
+ unsigned int logical_blksz);
+
+static inline void diff_storage_free_region(struct diff_region *region)
+{
+ kfree(region);
+}
+#endif /* __BLKSNAP_DIFF_STORAGE_H */
--
2.20.1
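
The chunk addressing used throughout diff_area.c reduces to a few shift
operations on chunk_shift. The helpers diff_area_chunk_number(),
diff_area_chunk_offset() and chunk_sector() can be modelled in user
space as follows; the 256 KiB chunk size is an arbitrary example value:

/*
 * Minimal model of the chunk addressing helpers: a sector number is
 * split into a chunk number and an offset inside that chunk using only
 * the chunk shift.
 */
#include <stdio.h>

#define SECTOR_SHIFT	9

static unsigned int chunk_shift = 18;	/* 256 KiB chunks, example value */

static unsigned long chunk_number(unsigned long long sector)
{
	return (unsigned long)(sector >> (chunk_shift - SECTOR_SHIFT));
}

static unsigned long long chunk_offset(unsigned long long sector)
{
	return sector & ((1ULL << (chunk_shift - SECTOR_SHIFT)) - 1);
}

static unsigned long long chunk_sector(unsigned long number)
{
	return (unsigned long long)number << (chunk_shift - SECTOR_SHIFT);
}

int main(void)
{
	unsigned long long sector = 1000000;
	unsigned long nr = chunk_number(sector);

	printf("sector %llu -> chunk #%lu, offset %llu, chunk starts at sector %llu\n",
	       sector, nr, chunk_offset(sector), chunk_sector(nr));
	return 0;
}

For a 256 KiB chunk (512 sectors), sector 1000000 falls into chunk
number 1953 at offset 64, and that chunk starts at sector 999936.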