Received: by 2002:a05:7412:419a:b0:f3:1519:9f41 with SMTP id i26csp1273388rdh; Fri, 24 Nov 2023 08:39:49 -0800 (PST) X-Google-Smtp-Source: AGHT+IEdPEL7K1GhQIZ616weFk4xw1SFSaIDQf3wxB7sX0dMrquMVL+YLwoSMenzBKAowucZo9w3 X-Received: by 2002:a17:90a:f2d2:b0:27d:348:94a8 with SMTP id gt18-20020a17090af2d200b0027d034894a8mr3616173pjb.6.1700843989373; Fri, 24 Nov 2023 08:39:49 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1700843989; cv=none; d=google.com; s=arc-20160816; b=m53TFK7yLskuPyXHeyN8c0ATx3+aw4D/ib6FI3oOy+IYl8+GARqH6CRXR7wPXQ/a96 mhkRj7HO5jEz6DJM5cZUtdr40pFwe14Ic+yo328Np61482uXyNyu9ZIDJq3CbxBbGuYO eAdSZwMpLFLKq+BEytMlZ7J7GPxszGBZR3DczZvlzrXxG6kaeVU44r8TM36QomcqMRKi C3O9aMGQhfQo3fI2qAplKKn18y26evIge4HnBHLEqR5jy3RkjTJ/uS99oN1q7wub3PBR JVneNiiQamQIQ2XKgADZRFUijCRyJgMvPcDicnudgRyihE5GTLpG0SnwH16OcjydGQay aXKA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=MDyOlhsQ3H7gKFA4WT2r4Df9eQ2QXv4dHEaGU3xXd2I=; fh=NuK72W7JEUD09Pgi5wI8gU+tvX9a8fn9tWWz3Ia8N2E=; b=Xbzk0Yb7DuRqOXsHVnOkDXZ+G7jCxNkD5gNWaJrJgauUAWyDgPOq8QUhctTDydaYCk JcTtrcRDZNiXkUGHEDO7jgLF9pKbBmELQWg9lIPeIwFlI/yf9/nx0ehBsvfG+FQ7/uGx UYETbisPj0lS6M64qoKA3gZ974/xMmfsEbDae/cqpfqoJLbb/8QNzvMpVFDFRVj8facy cfVigsiJ1mR6+UOb74NFVbMnYtejw5hQYG+iT7vINYn22rOyOi/It9n5IIGwUi/rPm2A 3gxBl26mT+DZMJajMKMV0QY4efUMvYd65WRVBvulpvmh0B1Y4VNaHngm1epXQP4JZa7v nEvg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=CCNvNQpA; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Return-Path: Received: from morse.vger.email (morse.vger.email. [2620:137:e000::3:1]) by mx.google.com with ESMTPS id lb5-20020a17090b4a4500b002859ad34f6csi589822pjb.166.2023.11.24.08.39.48 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 24 Nov 2023 08:39:49 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) client-ip=2620:137:e000::3:1; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.dev header.s=key1 header.b=CCNvNQpA; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.dev Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by morse.vger.email (Postfix) with ESMTP id 7B64680B19EB; Fri, 24 Nov 2023 08:39:38 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at morse.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345505AbjKXQjA (ORCPT + 99 others); Fri, 24 Nov 2023 11:39:00 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44448 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231259AbjKXQi5 (ORCPT ); Fri, 24 Nov 2023 11:38:57 -0500 Received: from out-189.mta0.migadu.com (out-189.mta0.migadu.com [IPv6:2001:41d0:1004:224b::bd]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9D9F319A5; Fri, 24 Nov 2023 08:39:02 -0800 (PST) X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1700843941; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MDyOlhsQ3H7gKFA4WT2r4Df9eQ2QXv4dHEaGU3xXd2I=; b=CCNvNQpAUtYYO4BXwxddkMkAZGmbKESK269T+NKqFyTmvnPCV8Kol0Of+LZBjtjvRUQAD2 Y1Yly8zB8Gx8CHgcxQB2aeOgjDzFDywX3HQOMiLDNsnTJwh7aaGZySYqimUiyH9nDdnmhj pZVEm7ChI5RVnCsEt5JVH33N5JNeyZM= From: Sergei Shtepa To: axboe@kernel.dk, hch@infradead.org, corbet@lwn.net, snitzer@kernel.org Cc: mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, mgorman@suse.de, vschneid@redhat.com, viro@zeniv.linux.org.uk, brauner@kernel.org, linux-block@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Sergei Shtepa , Bagas Sanjaya , Fabio Fantoni Subject: [PATCH v6 03/11] documentation: Block Devices Snapshots Module Date: Fri, 24 Nov 2023 17:38:28 +0100 Message-Id: <20231124163836.27256-4-sergei.shtepa@linux.dev> In-Reply-To: <20231124163836.27256-1-sergei.shtepa@linux.dev> References: <20231124163836.27256-1-sergei.shtepa@linux.dev> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Migadu-Flow: FLOW_OUT X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]); Fri, 24 Nov 2023 08:39:38 -0800 (PST) From: Sergei Shtepa The document contains: * Describes the purpose of the mechanism * Description of features * Description of algorithms * Recommendations about using the module from the user-space side * Reference to module interface description Reviewed-by: Bagas Sanjaya Reviewed-by: Fabio Fantoni Signed-off-by: Sergei Shtepa --- Documentation/block/blksnap.rst | 352 ++++++++++++++++++++++++++++++++ Documentation/block/index.rst | 1 + MAINTAINERS | 6 + 3 files changed, 359 insertions(+) create mode 100644 Documentation/block/blksnap.rst diff --git a/Documentation/block/blksnap.rst b/Documentation/block/blksnap.rst new file mode 100644 index 000000000000..ef6010e46858 --- /dev/null +++ b/Documentation/block/blksnap.rst @@ -0,0 +1,352 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================================== +Block Devices Snapshots Module (blksnap) +======================================== + +Introduction +============ + +At first glance, there is no novelty in the idea of creating snapshots for +block devices. The Linux kernel already has mechanisms for creating snapshots. +Device Mapper includes dm-snap, which allows to create snapshots of block +devices. BTRFS supports snapshots at the filesystem level. However, both of +these options have flaws that do not allow to use them as a universal tool for +creating backups. + +The main properties that a backup tool should have are: + +- Simplicity and universality of use +- Reliability +- Minimal consumption of system resources during backup +- Minimal time required for recovery or replication of the entire system + +Taking above properties into account, blksnap module features: + +- Change tracker +- Snapshots at the block device level +- Dynamic allocation of space for storing differences +- Snapshot overflow resistance +- Coherent snapshot of multiple block devices + +Features +======== + +Change tracker +-------------- + +The change tracker allows to determine which blocks were changed during the +time between the last snapshot created and any of the previous snapshots. +With a map of changes, it is enough to copy only the changed blocks, and no +need to reread the entire block device completely. The change tracker allows +to implement the logic of both incremental and differential backups. +Incremental backup is critical for large file repositories whose size can be +hundreds of terabytes and whose full backup time can take more than a day. +On such servers, the use of backup tools without a change tracker becomes +practically impossible. + +Snapshot at the block device level +---------------------------------- + +A snapshot at the block device level allows to simplify the backup algorithm +and reduce consumption of system resources. It also allows to perform linear +reading of disk space directly, which allows to achieve maximum reading speed +with minimal use of processor time. At the same time, the universality of +creating snapshots for any block device is achieved, regardless of the file +system located on it. The exceptions are BTRFS, ZFS and cluster file systems. + +Dynamic allocation of storage space for differences +--------------------------------------------------- + +To store differences, the module does not require a pre-reserved space on +filesystem. The space for storing differences can be allocated in file in any +filesystem. In addition, the size of the difference storage can be increased +after the snapshot is created, but only for a filesystem that supports +fallocate. A shared difference storage for all images of snapshot block devices +allows to optimize the use of storage space. However, there is one limitation. +A snapshot cannot be taken from a block device on which the difference storage +is located. + +Snapshot overflow resistance +---------------------------- + +To create images of snapshots of block devices, the module stores blocks +of the original block device that have been changed since the snapshot +was taken. To do this, the module handles write requests and reads blocks +that need to be overwritten. This algorithm guarantees safety of the data +of the original block device in the event of an overflow of the snapshot, +and even in the case of unpredictable critical errors. If a problem occurs +during backup, the difference storage is released, the snapshot is closed, +no backup is created, but the server continues to work. + +Coherent snapshot of multiple block devices +------------------------------------------- + +A snapshot is created simultaneously for all block devices for which a backup +is being created, ensuring their coherent state. + + +Algorithms +========== + +Overview +-------- + +The blksnap module is a block-level filter. It handles all write I/O units. +The filter is attached to the block device when the snapshot is created +for the first time. The change tracker marks all overwritten blocks. +Information about the history of changes on the block device is available +while holding the snapshot. The module reads the blocks that need to be +overwritten and stores them in the difference storage. When reading from +a snapshot image, reading is performed either from the original device or +from the difference storage. + +Change tracking +--------------- + +A change tracker map is created for each block device. One byte of this map +corresponds to one block. The block size is set by the +``tracking_block_minimum_shift`` and ``tracking_block_maximum_count`` +module parameters. The ``tracking_block_minimum_shift`` parameter limits +the minimum block size for tracking, while ``tracking_block_maximum_count`` +defines the maximum allowed number of blocks. The size of the change tracker +block is determined depending on the size of the block device when adding +a tracking device, that is, when the snapshot is taken for the first time. +The block size must be a power of two. The ``tracking_block_maximum_shift`` +module parameter allows to limit the maximum block size for tracking. If the +block size reaches the allowable limit, the number of blocks will exceed the +``tracking_block_maximum_count`` parameter. + +The byte of the change map stores a number from 0 to 255. This is the +snapshot number, since the creation of which there have been changes in +the block. Each time a snapshot is created, the number of the current +snapshot is increased by one. This number is written to the cell of the +change map when writing to the block. Thus, knowing the number of one of +the previous snapshots and the number of the last snapshot, one can determine +from the change map which blocks have been changed. When the number of the +current change reaches the maximum allowed value for the map of 255, at the +time when the next snapshot is created, the map of changes is reset to zero, +and the number of the current snapshot is assigned the value 1. The change +tracker is reset, and a new UUID is generated - a unique identifier of the +snapshot generation. The snapshot generation identifier allows to identify +that a change tracking reset has been performed. + +The change map has two copies. One copy is active, it tracks the current +changes on the block device. The second copy is available for reading +while the snapshot is being held, and contains the history up to the moment +the snapshot is taken. Copies are synchronized at the moment of snapshot +creation. After the snapshot is released, a second copy of the map is not +needed, but it is not released, so as not to allocate memory for it again +the next time the snapshot is created. + +Copy on write +------------- + +Data is copied in blocks, or rather in chunks. The term "chunk" is used to +avoid confusion with change tracker blocks and I/O blocks. In addition, +the "chunk" in the blksnap module means about the same as the "chunk" in +the dm-snap module. + +The size of the chunk is determined by the ``chunk_minimum_shift`` and +``chunk_maximum_count`` module parameters. The ``chunk_minimum_shift`` +parameter limits the minimum size of the chunk, while ``chunk_maximum_count`` +defines the maximum allowed number of chunks. The size of the chunk is +determined depending on the size of the block device at the time of taking the +snapshot. The size of the chunk must be a power of two. The module parameter +``chunk_maximum_shift`` allows to limit the maximum chunk size. If the chunk +size reaches the allowable limit, the number of chunks will exceed the +``chunk_maximum_count`` parameter. + +One chunk is described by the ``struct chunk`` structure. A map of structures +is created for each block device. The structure contains all the necessary +information to copy the chunks data from the original block device to the +difference storage. This information allows to describe the snapshot image. +A semaphore is located in the structure, which allows synchronization of threads +accessing the chunk. + +The block level in Linux has a feature. If a read I/O unit was sent, and a +write I/O unit was sent after it, then a write can be performed first, and only +then a read. Therefore, the copy-on-write algorithm is executed synchronously. +If the write request is handled, the execution of this I/O unit will be delayed +until the overwritten chunks are read from the original device for later +storing to the difference store. But if, when handling a write I/O unit, it +turns out that the written range of sectors has already been prepared for +storing to the difference storage, then the I/O unit is simply passed. + +This algorithm makes it possible to efficiently perform backup even systems +with a Round-Robin databases. Such databases can be overwritten several times +during the system backup. Of course, the value of a backup of the RRD monitoring +system data can be questioned. However, it is often a task to make a backup +of the entire enterprise infrastructure in order to restore or replicate it +entirely in case of problems. + +There is also a flaw in the algorithm. When overwriting at least one sector, +an entire chunk is copied. Thus, a situation of rapid filling of the difference +storage when writing data to a block device in small portions in random order +is possible. This situation is possible in case of strong fragmentation of +data on the filesystem. But it must be borne in mind that with such data +fragmentation, performance of systems usually degrades greatly. So, this +problem does not occur on real servers, although it can easily be created +by artificial tests. + +Difference storage +------------------ + +The difference storage can be a block device or it can be a file on a +filesystem. Using a block device allows to achieve slightly higher performance, +but in this case, the block device is used by the kernel module exclusively. +Usually the disk space is marked up so that there is no available free space +for backup purposes. Using a file allows to place the difference storage on a +filesystem. + +The difference storage can be expanded already while the snapshot is being held, +but only if the filesystem supports fallocate(). If the free space in the +difference storage remains less than half of the value of the module parameter +``diff_storage_minimum``, then the kernel module can expand the difference +storage file within the specified limits. This limit is set when creating a +snapshot. + +If free space in the difference storage runs out, an event to user land is +generated about the overflow of the snapshot. Such a snapshot is considered +corrupted, and read I/O units to snapshot images will be terminated with an +error code. The difference storage stores outdated data required for snapshot +images, so when the snapshot is overflowed, the backup process is interrupted, +but the system maintains its operability without data loss. + +The difference storage has a limitation. The device cannot be added to the +snapshot where the difference storage is located. In this case, the difference +storage can be located in virtual memory, which consists of RAM and a swap +partition (or file). To do this, it is enough to use a file in /dev/shm, or a +new tmpfs filesystem can be created for this purpose. Obviously, this variant +can be useful if the system has a lot of RAM or a large swap. The good news is +that the modern Linux kernel allows to increase the size of the swap file "on +the fly" without changing the system configuration. + +A regular file or a block device file for the difference storage must be opened +with the O_EXCL flag. If an unnamed file with the O_TMPFILE flag is created, +then such a file will be automatically released when the snapshot is destroyed. +In addition, the use of an unnamed temporary file ensures that no one can open +this file and read its contents. + +Performing I/O for a snapshot image +----------------------------------- + +To read snapshot data, when taking a snapshot, block devices of snapshot images +are created. The snapshot image block devices support the write operation. +This allows to perform additional data preparation on the filesystem before +creating a backup. + +To process the I/O unit, clones of the I/O unit are created, which redirect +the I/O unit either to the original block device or to the difference storage. +When processing of cloned I/O units is completed, the original I/O unit is +marked as completed too. + +An I/O unit can be partially processed without accessing to block devices if +the I/O unit refers to a chunk that is in the queue for storing to the +difference storage. In this case, the data is read or written in a buffer in +memory. + +If, when processing the write I/O unit, it turns out that the data of the +referred chunk has not yet been stored to the difference storage or has not +even been read from the original device, then an I/O unit to read data from the +original device is initiated beforehand. After the reading from original device +is performed, their data from the I/O unit is partially overwritten directly in +the buffer of the chunk in memory, and the chunk is scheduled to be saved to the +difference storage. + +How to use +========== + +Depending on the needs and the selected license, you can choose different +options for managing the module: + +- Using ioctl directly +- Using a static C++ library +- Using the blksnap console tool + +Using a BLKFILTER_CTL for block device +-------------------------------------- + +BLKFILTER_CTL allows to send a filter-specific command to the filter on block +device and get the result of its execution. The module provides the +``include/uapi/blksnap.h`` header file with a description of the commands and +their data structures. + +1. ``blkfilter_ctl_blksnap_cbtinfo`` allows to get information from the + change tracker. +2. ``blkfilter_ctl_blksnap_cbtmap`` reads the change tracker table. If a write + operation was performed for the snapshot, then the change tracker takes this + into account. Therefore, it is necessary to receive tracker data after write + operations have been completed. +3. ``blkfilter_ctl_blksnap_cbtdirty`` mark blocks as changed in the change + tracker table. This is necessary if post-processing is performed after the + backup is created, which changes the backup blocks. +4. ``blkfilter_ctl_blksnap_snapshotadd`` adds a block device to the snapshot. +5. ``blkfilter_ctl_blksnap_snapshotinfo`` allows to get the name of the snapshot + image block device and the presence of an error. + +Using ioctl +----------- + +Using a BLKFILTER_CTL ioctl does not allow to fully implement the management of +the blksnap module. A control file ``blksnap-control`` is created to manage +snapshots. The control commands are also described in the file +``include/uapi/blksnap.h``. + +1. ``blksnap_ioctl_version`` get the version number. +2. ``blk_snap_ioctl_snapshot_create`` initiates the snapshot creation process. +3. ``blk_snap_ioctl_snapshot_append_storage`` add the range of blocks to + difference storage. +4. ``blk_snap_ioctl_snapshot_take`` creates block devices of block device + snapshot images. +5. ``blk_snap_ioctl_snapshot_collect`` collect all created snapshots. +6. ``blk_snap_ioctl_snapshot_wait_event`` allows to track the status of + snapshots and receive events about the requirement to expand the difference + storage or about snapshot overflow. +7. ``blk_snap_ioctl_snapshot_destroy`` releases the snapshot. + +Static C++ library +------------------ + +The [#userspace_libs]_ library was created primarily to simplify creation of +tests in C++, and it is also a good example of using the module interface. +When creating applications, direct use of control calls is preferable. +However, the library can be used in an application with a GPL-2+ license, +or a library with an LGPL-2+ license can be created, with which even a +proprietary application can be dynamically linked. + +blksnap console tool +-------------------- + +The blksnap [#userspace_tools]_ console tool allows to control the module from +the command line. The tool contains detailed built-in help. To get list of +commands with usage description, see ``blksnap --help`` command. The ``blksnap + --help`` command allows to get detailed information about the +parameters of each command call. This option may be convenient when creating +proprietary software, as it allows not to compile with the open source code. +At the same time, the blksnap tool can be used for creating backup scripts. +For example, rsync can be called to synchronize files on the filesystem of +the mounted snapshot image and files in the archive on a filesystem that +supports compression. + +Tests +----- + +A set of tests was created for regression testing [#userspace_tests]_. +Tests with simple algorithms that use the ``blksnap`` console tool to +control the module are written in Bash. More complex testing algorithms +are implemented in C++. + +References +========== + +.. [#userspace_libs] https://github.com/veeam/blksnap/tree/stable-v2.0/lib + +.. [#userspace_tools] https://github.com/veeam/blksnap/tree/stable-v2.0/tools + +.. [#userspace_tests] https://github.com/veeam/blksnap/tree/stable-v2.0/tests + +Module interface description +============================ + +.. kernel-doc:: include/uapi/linux/blksnap.h diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst index e9712f72cd6d..696ff150c6b7 100644 --- a/Documentation/block/index.rst +++ b/Documentation/block/index.rst @@ -11,6 +11,7 @@ Block biovecs blk-mq blkfilter + blksnap cmdline-partition data-integrity deadline-iosched diff --git a/MAINTAINERS b/MAINTAINERS index ef90cd0fec9c..9c81e4c83139 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3593,6 +3593,12 @@ F: block/blk-filter.c F: include/linux/blk-filter.h F: include/uapi/linux/blk-filter.h +BLOCK DEVICE SNAPSHOTS MODULE +M: Sergei Shtepa +L: linux-block@vger.kernel.org +S: Supported +F: Documentation/block/blksnap.rst + BLOCK LAYER M: Jens Axboe L: linux-block@vger.kernel.org -- 2.20.1