2011-06-07 15:08:30

by Amir G.

Subject: [PATCH v1 00/30] Ext4 snapshots

Hi All,

I am resending the snapshots patch series, as Lukas requested.
This time the snapshot*.c files are included; they were omitted from
the previous posting.

The series is still based on the ext4 dev branch as it stood during the
preparation for the 3.0 merge window. It has not yet been rebased onto
3.0-rc1, so the punch hole changes have not been addressed.

As always, I advocate online review of the patches at:
https://github.com/amir73il/ext4-snapshots/commits/for-ext4-v1
but if you insist on doing it the old way, I won't complain.

Thanks,
Amir.

[PATCH v1 01/36] ext4: EXT4 snapshots (Experimental)
[PATCH v1 02/36] ext4: snapshot debugging support
[PATCH v1 03/36] ext4: snapshot hooks - inside JBD hooks
[PATCH v1 04/36] ext4: snapshot hooks - block bitmap access
[PATCH v1 05/36] ext4: snapshot hooks - delete blocks
[PATCH v1 06/36] ext4: snapshot hooks - move data blocks
[PATCH v1 07/36] ext4: snapshot hooks - direct I/O
[PATCH v1 08/36] ext4: snapshot hooks - move extent file data blocks
[PATCH v1 09/36] ext4: snapshot file
[PATCH v1 10/36] ext4: snapshot file - read through to block device
[PATCH v1 11/36] ext4: snapshot file - permissions
[PATCH v1 12/36] ext4: snapshot file - store on disk
[PATCH v1 13/36] ext4: snapshot file - increase maximum file size limit to 16TB
[PATCH v1 14/36] ext4: snapshot block operations
[PATCH v1 15/36] ext4: snapshot block operation - copy blocks to snapshot
[PATCH v1 16/36] ext4: snapshot block operation - move blocks to snapshot
[PATCH v1 17/36] ext4: snapshot block operation - copy block bitmap to snapshot
[PATCH v1 18/36] ext4: snapshot control
[PATCH v1 19/36] ext4: snapshot control - init new snapshot
[PATCH v1 20/36] ext4: snapshot control - fix new snapshot
[PATCH v1 21/36] ext4: snapshot control - reserve disk space for snapshot
[PATCH v1 22/36] ext4: snapshot journaled - increase transaction credits
[PATCH v1 23/36] ext4: snapshot journaled - implement journal_release_buffer()
[PATCH v1 24/36] ext4: snapshot journaled - bypass to save credits
[PATCH v1 25/36] ext4: snapshot journaled - cache last COW tid in journal_head
[PATCH v1 26/36] ext4: snapshot journaled - trace COW/buffer credits
[PATCH v1 27/36] ext4: snapshot list support
[PATCH v1 28/36] ext4: snapshot list - read through to previous snapshot
[PATCH v1 29/36] ext4: snapshot race conditions - concurrent COW bitmap operations
[PATCH v1 30/36] ext4: snapshot race conditions - concurrent COW operations
[PATCH v1 31/36] ext4: snapshot race conditions - tracked reads
[PATCH v1 32/36] ext4: snapshot exclude - the exclude bitmap
[PATCH v1 33/36] ext4: snapshot cleanup
[PATCH v1 34/36] ext4: snapshot cleanup - shrink deleted snapshots
[PATCH v1 35/36] ext4: snapshot cleanup - merge shrunk snapshots
[PATCH v1 36/36] ext4: snapshot rocompat - enable rw mount

fs/ext4/Kconfig | 11 +
fs/ext4/Makefile | 3 +
fs/ext4/balloc.c | 132 +++
fs/ext4/ext4.h | 188 ++++-
fs/ext4/ext4_jbd2.c | 162 ++++-
fs/ext4/ext4_jbd2.h | 266 ++++++-
fs/ext4/extents.c | 157 ++++-
fs/ext4/file.c | 11 +-
fs/ext4/ialloc.c | 19 +-
fs/ext4/inode.c | 668 +++++++++++++--
fs/ext4/ioctl.c | 120 +++
fs/ext4/mballoc.c | 161 ++++-
fs/ext4/move_extent.c | 3 +-
fs/ext4/namei.c | 9 +
fs/ext4/resize.c | 19 +-
fs/ext4/snapshot.c | 1000 ++++++++++++++++++++++
fs/ext4/snapshot.h | 690 ++++++++++++++++
fs/ext4/snapshot_buffer.c | 393 +++++++++
fs/ext4/snapshot_ctl.c | 2002 +++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot_debug.c | 107 +++
fs/ext4/snapshot_debug.h | 105 +++
fs/ext4/snapshot_inode.c | 960 ++++++++++++++++++++++
fs/ext4/super.c | 157 ++++-
fs/ext4/xattr.c | 4 +-
24 files changed, 7182 insertions(+), 165 deletions(-)



2011-06-07 15:08:59

by Amir G.

Subject: [PATCH v1 01/36] ext4: EXT4 snapshots (Experimental)

From: Amir Goldstein <[email protected]>

Built-in snapshots support for ext4.
Requires that the filesystem has the has_snapshot and exclude_bitmap
features and that block size is equal to system page size.
Snapshots are not supported with 64bit and meta_bg features and the
filesystem must be mounted with ordered data mode.
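
The requirements above can be read as a set of mount-time checks. What follows is a minimal userspace sketch in C, not the kernel code from this series. Every demo_ name is hypothetical, the has_journal field is an assumption (ordered mode implies a journal), and only a read-write mount is modelled. The exclude_bitmap requirement is left out because the description does not say how it is verified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical data-journaling modes, named after the ext4 mount options. */
enum demo_data_mode {
	DEMO_DATA_JOURNAL,
	DEMO_DATA_ORDERED,
	DEMO_DATA_WRITEBACK
};

/* Hypothetical description of a filesystem about to be mounted read-write. */
struct demo_fs {
	bool has_snapshot;		/* has_snapshot feature is set */
	bool meta_bg;			/* meta_bg feature is set */
	bool is_64bit;			/* 64bit feature is set */
	bool has_journal;		/* assumption: ordered mode needs a journal */
	unsigned int block_size;	/* filesystem block size in bytes */
	unsigned int page_size;		/* system page size in bytes */
	enum demo_data_mode data_mode;
};

enum demo_mount_result {
	DEMO_MOUNT_OK,
	DEMO_ERR_FEATURES,	/* mixed with meta_bg or 64bit */
	DEMO_ERR_BLOCK_SIZE,	/* block size differs from page size */
	DEMO_ERR_DATA_MODE	/* not journalled in ordered mode */
};

/* Apply the stated snapshot requirements to a read-write mount. */
static enum demo_mount_result demo_check_rw_mount(const struct demo_fs *fs)
{
	assert(fs != NULL);

	/* Filesystems without the snapshot feature are not restricted. */
	if (!fs->has_snapshot)
		return DEMO_MOUNT_OK;
	if (fs->meta_bg || fs->is_64bit)
		return DEMO_ERR_FEATURES;
	if (fs->block_size != fs->page_size)
		return DEMO_ERR_BLOCK_SIZE;
	if (!fs->has_journal || fs->data_mode != DEMO_DATA_ORDERED)
		return DEMO_ERR_DATA_MODE;
	return DEMO_MOUNT_OK;
}
```

A caller would fill in a struct demo_fs and refuse the mount on any result other than DEMO_MOUNT_OK. The order of the checks is a choice made for this sketch.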


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/Kconfig | 11 ++
fs/ext4/Makefile | 2 +
fs/ext4/balloc.c | 2 +-
fs/ext4/ext4.h | 15 +++
fs/ext4/ext4_jbd2.c | 3 +
fs/ext4/ext4_jbd2.h | 25 +++++
fs/ext4/extents.c | 3 +
fs/ext4/file.c | 1 +
fs/ext4/ialloc.c | 1 +
fs/ext4/inode.c | 3 +
fs/ext4/ioctl.c | 3 +
fs/ext4/mballoc.c | 5 +
fs/ext4/namei.c | 1 +
fs/ext4/resize.c | 1 +
fs/ext4/snapshot.c | 18 ++++
fs/ext4/snapshot.h | 193 ++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot_buffer.c | 238 +++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot_ctl.c | 22 ++++
fs/ext4/snapshot_inode.c | 42 ++++++++
fs/ext4/super.c | 43 ++++++++
20 files changed, 631 insertions(+), 1 deletions(-)
create mode 100644 fs/ext4/snapshot.c
create mode 100644 fs/ext4/snapshot.h
create mode 100644 fs/ext4/snapshot_buffer.c
create mode 100644 fs/ext4/snapshot_ctl.c
create mode 100644 fs/ext4/snapshot_debug.c
create mode 100644 fs/ext4/snapshot_debug.h
create mode 100644 fs/ext4/snapshot_inode.c

diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index 9ed1bb1..8970525 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -83,3 +83,14 @@ config EXT4_DEBUG

If you select Y here, then you will be able to turn on debugging
with a command such as "echo 1 > /sys/kernel/debug/ext4/mballoc-debug"
+
+config EXT4_FS_SNAPSHOT
+ bool "EXT4 snapshots (Experimental)"
+ depends on EXT4_FS && EXPERIMENTAL
+ default n
+ help
+ Built-in snapshots support for ext4.
+ Requires that the filesystem has the has_snapshot and exclude_bitmap
+ features and that block size is equal to system page size.
+ Snapshots are not supported with 64bit and meta_bg features and the
+ filesystem must be mounted with ordered data mode.
diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index c947e36..a471c2e 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -11,3 +11,5 @@ ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o page-io.o \
ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
+ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot.o snapshot_ctl.o
+ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_inode.o snapshot_buffer.o
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index b2d10da..8f1803f 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -20,6 +20,7 @@
#include "ext4.h"
#include "ext4_jbd2.h"
#include "mballoc.h"
+#include "snapshot.h"

#include <trace/events/ext4.h>

@@ -156,7 +157,6 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
tmp = ext4_block_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
ext4_set_bit(tmp - start, bh->b_data);
-
tmp = ext4_inode_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
ext4_set_bit(tmp - start, bh->b_data);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 076c5d2..756848f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -873,6 +873,20 @@ struct ext4_inode_info {
#define EXT2_FLAGS_SIGNED_HASH 0x0001 /* Signed dirhash in use */
#define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
#define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */
+#define EXT4_FLAGS_IS_SNAPSHOT 0x0010 /* Is a snapshot image */
+#define EXT4_FLAGS_FIX_SNAPSHOT 0x0020 /* Corrupted snapshot */
+#define EXT4_FLAGS_FIX_EXCLUDE 0x0040 /* Bad exclude bitmap */
+
+#define EXT4_SET_FLAGS(sb, mask) \
+ do { \
+ EXT4_SB(sb)->s_es->s_flags |= cpu_to_le32(mask); \
+ } while (0)
+#define EXT4_CLEAR_FLAGS(sb, mask) \
+ do { \
+ EXT4_SB(sb)->s_es->s_flags &= ~cpu_to_le32(mask);\
+ } while (0)
+#define EXT4_TEST_FLAGS(sb, mask) \
+ (EXT4_SB(sb)->s_es->s_flags & cpu_to_le32(mask))

/*
* Mount flags
@@ -1338,6 +1352,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
+#define EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT 0x0080 /* Ext4 has snapshots */

#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 6e272ef..560020d 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -1,8 +1,11 @@
/*
* Interface between ext4 and JBD
+ *
+ * Snapshot metadata COW hooks, Amir Goldstein <[email protected]>, 2011
*/

#include "ext4_jbd2.h"
+#include "snapshot.h"

#include <trace/events/ext4.h>

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index d0f5353..3da2092 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -10,6 +10,8 @@
* option, any later version, incorporated herein by reference.
*
* Ext4-specific journaling extensions.
+ *
+ * Snapshot extra COW credits, Amir Goldstein <[email protected]>, 2011
*/

#ifndef _EXT4_JBD2_H
@@ -18,6 +20,7 @@
#include <linux/fs.h>
#include <linux/jbd2.h>
#include "ext4.h"
+#include "snapshot.h"

#define EXT4_JOURNAL(inode) (EXT4_SB((inode)->i_sb)->s_journal)

@@ -272,6 +275,11 @@ static inline int ext4_should_journal_data(struct inode *inode)
return 0;
if (!S_ISREG(inode->i_mode))
return 1;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* snapshots enforce ordered data */
+ return 0;
+#endif
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
return 1;
if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
@@ -285,6 +293,11 @@ static inline int ext4_should_order_data(struct inode *inode)
return 0;
if (!S_ISREG(inode->i_mode))
return 0;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* snapshots enforce ordered data */
+ return 1;
+#endif
if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return 0;
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
@@ -298,6 +311,11 @@ static inline int ext4_should_writeback_data(struct inode *inode)
return 0;
if (EXT4_JOURNAL(inode) == NULL)
return 1;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* snapshots enforce ordered data */
+ return 0;
+#endif
if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return 0;
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
@@ -320,6 +338,11 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 0;
if (!S_ISREG(inode->i_mode))
return 0;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* XXX: should snapshots support dioread_nolock? */
+ return 0;
+#endif
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
return 0;
if (ext4_should_journal_data(inode))
@@ -327,4 +350,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 1;
}

+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+#endif
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e363f21..7598224 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -18,6 +18,8 @@
* You should have received a copy of the GNU General Public Licens
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
+ *
+ * Snapshot move-on-write (MOW), Yongqiang Yang <[email protected]>, 2011
*/

/*
@@ -43,6 +45,7 @@
#include <linux/fiemap.h>
#include "ext4_jbd2.h"
#include "ext4_extents.h"
+#include "snapshot.h"

#include <trace/events/ext4.h>

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 7b80d54..60b3b19 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -28,6 +28,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "snapshot.h"

/*
* Called when an inode is released. Note that this is different
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 21bb2f6..40ca5bc 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -28,6 +28,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "snapshot.h"

#include <trace/events/ext4.h>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..9dbd806 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -20,6 +20,8 @@
* ([email protected])
*
* Assorted race fixes, rewrite of ext4_get_block() by Al Viro, 2000
+ *
+ * Snapshot inode extensions, Amir Goldstein <[email protected]>, 2011
*/

#include <linux/module.h>
@@ -49,6 +51,7 @@
#include "ext4_extents.h"

#include <trace/events/ext4.h>
+#include "snapshot.h"

#define MPAGE_DA_EXTENT_TAIL 0x01

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 808c554..a8b1254 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -5,6 +5,8 @@
* Remy Card ([email protected])
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
+ *
+ * Snapshot control API, Amir Goldstein <[email protected]>, 2011
*/

#include <linux/fs.h>
@@ -17,6 +19,7 @@
#include <asm/uaccess.h>
#include "ext4_jbd2.h"
#include "ext4.h"
+#include "snapshot.h"

long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 42fbca9..5a930d6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -25,6 +25,7 @@
#include <linux/debugfs.h>
#include <linux/slab.h>
#include <trace/events/ext4.h>
+#include "snapshot.h"

/*
* MUSTDO:
@@ -2740,6 +2741,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
sbi = EXT4_SB(sb);

err = -EIO;
+
bitmap_bh = ext4_read_block_bitmap(sb, ac->ac_b_ex.fe_group);
if (!bitmap_bh)
goto out_err;
@@ -2791,6 +2793,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
}
#endif
mb_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,ac->ac_b_ex.fe_len);
+
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
ext4_free_blks_set(sb, gdp,
@@ -2820,6 +2823,8 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
if (err)
goto out_err;
+
+
err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);

out_err:
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 3c7a06e..93196b6 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -39,6 +39,7 @@

#include "xattr.h"
#include "acl.h"
+#include "snapshot.h"

#include <trace/events/ext4.h>
/*
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 80bbc9c..ebff8a1 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>

#include "ext4_jbd2.h"
+#include "snapshot.h"

#define outside(b, first, last) ((b) < (first) || (b) >= (last))
#define inside(b, first, last) ((b) >= (first) && (b) < (last))
diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
new file mode 100644
index 0000000..e8db8ca
--- /dev/null
+++ b/fs/ext4/snapshot.c
@@ -0,0 +1,18 @@
+/*
+ * linux/fs/ext4/snapshot.c
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshots core functions.
+ */
+
+#include <linux/quotaops.h>
+#include "snapshot.h"
+#include "ext4.h"
+#include "mballoc.h"
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
new file mode 100644
index 0000000..8a60ae1
--- /dev/null
+++ b/fs/ext4/snapshot.h
@@ -0,0 +1,193 @@
+/*
+ * linux/fs/ext4/snapshot.h
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshot extensions.
+ */
+
+#ifndef _LINUX_EXT4_SNAPSHOT_H
+#define _LINUX_EXT4_SNAPSHOT_H
+
+#include <linux/version.h>
+#include <linux/delay.h>
+#include "ext4.h"
+
+
+/*
+ * use signed 64bit for snapshot image addresses
+ * negative addresses are used to reference snapshot meta blocks
+ */
+#define ext4_snapblk_t long long
+
+/*
+ * We assert that file system block size == page size (at mount time)
+ * and that the first file system block is block 0 (at snapshot create).
+ * Snapshot inode direct blocks are reserved for snapshot meta blocks.
+ * Snapshot inode single indirect blocks are not used.
+ * Snapshot image starts at the first double indirect block, so all blocks
+ * of a snapshot image block group are mapped by a single DIND block:
+ * 4k: 32k blocks_per_group = 32 IND (4k) blocks = 32 groups per DIND
+ * 8k: 64k blocks_per_group = 32 IND (8k) blocks = 64 groups per DIND
+ * 16k: 128k blocks_per_group = 32 IND (16k) blocks = 128 groups per DIND
+ */
+#define SNAPSHOT_BLOCK_SIZE PAGE_SIZE
+#define SNAPSHOT_BLOCK_SIZE_BITS PAGE_SHIFT
+#define SNAPSHOT_ADDR_PER_BLOCK (SNAPSHOT_BLOCK_SIZE / sizeof(__u32))
+#define SNAPSHOT_ADDR_PER_BLOCK_BITS (SNAPSHOT_BLOCK_SIZE_BITS - 2)
+#define SNAPSHOT_DIR_BLOCKS EXT4_NDIR_BLOCKS
+#define SNAPSHOT_IND_BLOCKS SNAPSHOT_ADDR_PER_BLOCK
+
+#define SNAPSHOT_BLOCKS_PER_GROUP_BITS (SNAPSHOT_BLOCK_SIZE_BITS + 3)
+#define SNAPSHOT_BLOCKS_PER_GROUP \
+ (1<<SNAPSHOT_BLOCKS_PER_GROUP_BITS) /* 8*PAGE_SIZE */
+#define SNAPSHOT_BLOCK_GROUP(block) \
+ ((block)>>SNAPSHOT_BLOCKS_PER_GROUP_BITS)
+#define SNAPSHOT_BLOCK_GROUP_OFFSET(block) \
+ ((block)&(SNAPSHOT_BLOCKS_PER_GROUP-1))
+#define SNAPSHOT_BLOCK_TUPLE(block) \
+ (ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP_OFFSET(block), \
+ (ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP(block)
+#define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS \
+ (SNAPSHOT_BLOCKS_PER_GROUP_BITS-SNAPSHOT_ADDR_PER_BLOCK_BITS)
+#define SNAPSHOT_IND_PER_BLOCK_GROUP \
+ (1<<SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */
+#define SNAPSHOT_DIND_BLOCK_GROUPS_BITS \
+ (SNAPSHOT_ADDR_PER_BLOCK_BITS-SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
+#define SNAPSHOT_DIND_BLOCK_GROUPS \
+ (1<<SNAPSHOT_DIND_BLOCK_GROUPS_BITS)
+
+#define SNAPSHOT_BLOCK_OFFSET \
+ (SNAPSHOT_DIR_BLOCKS+SNAPSHOT_IND_BLOCKS)
+#define SNAPSHOT_BLOCK(iblock) \
+ ((ext4_snapblk_t)(iblock) - SNAPSHOT_BLOCK_OFFSET)
+#define SNAPSHOT_IBLOCK(block) \
+ (ext4_fsblk_t)((block) + SNAPSHOT_BLOCK_OFFSET)
+
+
+
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+#define EXT4_SNAPSHOT_VERSION "ext4 snapshot v1.0.13-7 (1-Jun-2010)"
+
+#define SNAPSHOT_BYTES_OFFSET \
+ (SNAPSHOT_BLOCK_OFFSET << SNAPSHOT_BLOCK_SIZE_BITS)
+#define SNAPSHOT_ISIZE(size) \
+ ((size) + SNAPSHOT_BYTES_OFFSET)
+/* Snapshot block device size is recorded in i_disksize */
+#define SNAPSHOT_SET_SIZE(inode, size) \
+ (EXT4_I(inode)->i_disksize = SNAPSHOT_ISIZE(size))
+#define SNAPSHOT_SIZE(inode) \
+ (EXT4_I(inode)->i_disksize - SNAPSHOT_BYTES_OFFSET)
+#define SNAPSHOT_SET_BLOCKS(inode, blocks) \
+ SNAPSHOT_SET_SIZE((inode), \
+ (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS)
+#define SNAPSHOT_BLOCKS(inode) \
+ (ext4_fsblk_t)(SNAPSHOT_SIZE(inode) >> SNAPSHOT_BLOCK_SIZE_BITS)
+/* Snapshot shrink/merge/clean progress is exported via i_size */
+#define SNAPSHOT_PROGRESS(inode) \
+ (ext4_fsblk_t)((inode)->i_size >> SNAPSHOT_BLOCK_SIZE_BITS)
+#define SNAPSHOT_SET_ENABLED(inode) \
+ i_size_write((inode), SNAPSHOT_SIZE(inode))
+#define SNAPSHOT_SET_PROGRESS(inode, blocks) \
+ snapshot_size_extend((inode), (blocks))
+/* Disabled/deleted snapshot i_size is 1 block, to allow read of super block */
+#define SNAPSHOT_SET_DISABLED(inode) \
+ snapshot_size_truncate((inode), 1)
+/* Removed snapshot i_size and i_disksize are 0, since all blocks were freed */
+#define SNAPSHOT_SET_REMOVED(inode) \
+ do { \
+ EXT4_I(inode)->i_disksize = 0; \
+ snapshot_size_truncate((inode), 0); \
+ } while (0)
+
+static inline void snapshot_size_extend(struct inode *inode,
+ ext4_fsblk_t blocks)
+{
+ i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
+}
+
+static inline void snapshot_size_truncate(struct inode *inode,
+ ext4_fsblk_t blocks)
+{
+ loff_t i_size = (loff_t)blocks << SNAPSHOT_BLOCK_SIZE_BITS;
+
+ i_size_write(inode, i_size);
+ truncate_inode_pages(&inode->i_data, i_size);
+}
+
+/* Is ext4 configured for snapshots support? */
+static inline int EXT4_SNAPSHOTS(struct super_block *sb)
+{
+ return EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT);
+}
+
+#define ext4_snapshot_cow(handle, inode, block, bh, cow) 0
+
+#define ext4_snapshot_move(handle, inode, block, pcount, move) (0)
+
+/*
+ * Block access functions
+ */
+
+
+
+/* snapshot_ctl.c */
+
+
+static inline int init_ext4_snapshot(void)
+{
+ return 0;
+}
+
+static inline void exit_ext4_snapshot(void)
+{
+}
+
+
+
+
+
+#else /* CONFIG_EXT4_FS_SNAPSHOT */
+
+/* Snapshot NOP macros */
+#define EXT4_SNAPSHOTS(sb) (0)
+#define SNAPMAP_ISCOW(cmd) (0)
+#define SNAPMAP_ISMOVE(cmd) (0)
+#define SNAPMAP_ISSYNC(cmd) (0)
+#define IS_COWING(handle) (0)
+
+#define ext4_snapshot_load(sb, es, ro) (0)
+#define ext4_snapshot_destroy(sb)
+#define init_ext4_snapshot() (0)
+#define exit_ext4_snapshot()
+#define ext4_snapshot_active(sbi) (0)
+#define ext4_snapshot_file(inode) (0)
+#define ext4_snapshot_should_move_data(inode) (0)
+#define ext4_snapshot_test_excluded(handle, inode, block_to_free, count) (0)
+#define ext4_snapshot_list(inode) (0)
+#define ext4_snapshot_get_flags(ei, filp)
+#define ext4_snapshot_set_flags(handle, inode, flags) (0)
+#define ext4_snapshot_take(inode) (0)
+#define ext4_snapshot_update(inode_i_sb, cleanup, zero) (0)
+#define ext4_snapshot_has_active(sb) (NULL)
+#define ext4_snapshot_get_bitmap_access(handle, sb, grp, bh) (0)
+#define ext4_snapshot_get_write_access(handle, inode, bh) (0)
+#define ext4_snapshot_get_create_access(handle, bh) (0)
+#define ext4_snapshot_excluded(ac_inode) (0)
+#define ext4_snapshot_get_delete_access(handle, inode, block, pcount) (0)
+
+#define ext4_snapshot_get_move_access(handle, inode, block, pcount, move) (0)
+#define ext4_snapshot_start_pending_cow(sbh)
+#define ext4_snapshot_end_pending_cow(sbh)
+#define ext4_snapshot_is_active(inode) (0)
+#define ext4_snapshot_mow_in_tid(inode) (1)
+
+#endif /* CONFIG_EXT4_FS_SNAPSHOT */
+#endif /* _LINUX_EXT4_SNAPSHOT_H */
diff --git a/fs/ext4/snapshot_buffer.c b/fs/ext4/snapshot_buffer.c
new file mode 100644
index 0000000..acea9a3
--- /dev/null
+++ b/fs/ext4/snapshot_buffer.c
@@ -0,0 +1,238 @@
+/*
+ * linux/fs/ext4/snapshot_buffer.c
+ *
+ * Tracked buffer read implementation for ext4 snapshots
+ * by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * from
+ *
+ * linux/fs/buffer.c
+ *
+ * Copyright (C) 1991, 1992, 2002 Linus Torvalds
+ */
+
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/capability.h>
+#include <linux/blkdev.h>
+#include <linux/file.h>
+#include <linux/quotaops.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <linux/writeback.h>
+#include <linux/hash.h>
+#include <linux/suspend.h>
+#include <linux/buffer_head.h>
+#include <linux/task_io_accounting_ops.h>
+#include <linux/bio.h>
+#include <linux/notifier.h>
+#include <linux/cpu.h>
+#include <linux/bitops.h>
+#include <linux/mpage.h>
+#include <linux/bit_spinlock.h>
+#include "snapshot.h"
+
+static int quiet_error(struct buffer_head *bh)
+{
+ if (printk_ratelimit())
+ return 0;
+ return 1;
+}
+
+
+static void buffer_io_error(struct buffer_head *bh)
+{
+ char b[BDEVNAME_SIZE];
+ printk(KERN_ERR "Buffer I/O error on device %s, logical block %llu\n",
+ bdevname(bh->b_bdev, b),
+ (unsigned long long)bh->b_blocknr);
+}
+
+/*
+ * I/O completion handler for ext4_read_full_page() - pages
+ * which come unlocked at the end of I/O.
+ */
+static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
+{
+ unsigned long flags;
+ struct buffer_head *first;
+ struct buffer_head *tmp;
+ struct page *page;
+ int page_uptodate = 1;
+
+ BUG_ON(!buffer_async_read(bh));
+
+ page = bh->b_page;
+ if (uptodate) {
+ set_buffer_uptodate(bh);
+ } else {
+ clear_buffer_uptodate(bh);
+ if (!quiet_error(bh))
+ buffer_io_error(bh);
+ SetPageError(page);
+ }
+
+ /*
+ * Be _very_ careful from here on. Bad things can happen if
+ * two buffer heads end IO at almost the same time and both
+ * decide that the page is now completely done.
+ */
+ first = page_buffers(page);
+ local_irq_save(flags);
+ bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
+ clear_buffer_async_read(bh);
+ unlock_buffer(bh);
+ tmp = bh;
+ do {
+ if (!buffer_uptodate(tmp))
+ page_uptodate = 0;
+ if (buffer_async_read(tmp)) {
+ BUG_ON(!buffer_locked(tmp));
+ goto still_busy;
+ }
+ tmp = tmp->b_this_page;
+ } while (tmp != bh);
+ bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
+ local_irq_restore(flags);
+
+ /*
+ * If none of the buffers had errors and they are all
+ * uptodate then we can set the page uptodate.
+ */
+ if (page_uptodate && !PageError(page))
+ SetPageUptodate(page);
+ unlock_page(page);
+ return;
+
+still_busy:
+ bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
+ local_irq_restore(flags);
+ return;
+}
+
+/*
+ * If a page's buffers are under async readin (end_buffer_async_read
+ * completion) then there is a possibility that another thread of
+ * control could lock one of the buffers after it has completed
+ * but while some of the other buffers have not completed. This
+ * locked buffer would confuse end_buffer_async_read() into not unlocking
+ * the page. So the absence of BH_Async_Read tells end_buffer_async_read()
+ * that this buffer is not under async I/O.
+ *
+ * The page comes unlocked when it has no locked buffer_async buffers
+ * left.
+ *
+ * PageLocked prevents anyone starting new async I/O reads any of
+ * the buffers.
+ *
+ * PageWriteback is used to prevent simultaneous writeout of the same
+ * page.
+ *
+ * PageLocked prevents anyone from starting writeback of a page which is
+ * under read I/O (PageWriteback is only ever set against a locked page).
+ */
+static void mark_buffer_async_read(struct buffer_head *bh)
+{
+ bh->b_end_io = end_buffer_async_read;
+ set_buffer_async_read(bh);
+}
+
+/*
+ * Generic "read page" function for block devices that have the normal
+ * get_block functionality. This is most of the block device filesystems.
+ * Reads the page asynchronously --- the unlock_buffer() and
+ * set/clear_buffer_uptodate() functions propagate buffer state into the
+ * page struct once IO has completed.
+ */
+int ext4_read_full_page(struct page *page, get_block_t *get_block)
+{
+ struct inode *inode = page->mapping->host;
+ sector_t iblock, lblock;
+ struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
+ unsigned int blocksize;
+ int nr, i;
+ int fully_mapped = 1;
+
+ BUG_ON(!PageLocked(page));
+ blocksize = 1 << inode->i_blkbits;
+ if (!page_has_buffers(page))
+ create_empty_buffers(page, blocksize, 0);
+ head = page_buffers(page);
+
+ iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
+ bh = head;
+ nr = 0;
+ i = 0;
+
+ do {
+ if (buffer_uptodate(bh))
+ continue;
+
+ if (!buffer_mapped(bh)) {
+ int err = 0;
+
+ fully_mapped = 0;
+ if (iblock < lblock) {
+ WARN_ON(bh->b_size != blocksize);
+ err = get_block(inode, iblock, bh, 0);
+ if (err)
+ SetPageError(page);
+ }
+ if (!buffer_mapped(bh)) {
+ zero_user(page, i * blocksize, blocksize);
+ if (!err)
+ set_buffer_uptodate(bh);
+ continue;
+ }
+ /*
+ * get_block() might have updated the buffer
+ * synchronously
+ */
+ if (buffer_uptodate(bh))
+ continue;
+ }
+ arr[nr++] = bh;
+ } while (i++, iblock++, (bh = bh->b_this_page) != head);
+
+ if (fully_mapped)
+ SetPageMappedToDisk(page);
+
+ if (!nr) {
+ /*
+ * All buffers are uptodate - we can set the page uptodate
+ * as well. But not if get_block() returned an error.
+ */
+ if (!PageError(page))
+ SetPageUptodate(page);
+ unlock_page(page);
+ return 0;
+ }
+
+ /* Stage two: lock the buffers */
+ for (i = 0; i < nr; i++) {
+ bh = arr[i];
+ lock_buffer(bh);
+ mark_buffer_async_read(bh);
+ }
+
+ /*
+ * Stage 3: start the IO. Check for uptodateness
+ * inside the buffer lock in case another process reading
+ * the underlying blockdev brought it uptodate (the sct fix).
+ */
+ for (i = 0; i < nr; i++) {
+ bh = arr[i];
+ if (buffer_uptodate(bh))
+ end_buffer_async_read(bh, 1);
+ else
+ submit_bh(READ, bh);
+ }
+ return 0;
+}
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
new file mode 100644
index 0000000..201ef20
--- /dev/null
+++ b/fs/ext4/snapshot_ctl.c
@@ -0,0 +1,22 @@
+/*
+ * linux/fs/ext4/snapshot_ctl.c
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshots control functions.
+ */
+
+#include <linux/statfs.h>
+#include "ext4_jbd2.h"
+#include "snapshot.h"
+#define ext4_snapshot_reset_bitmap_cache(sb, init) 0
+
+/*
+ * Snapshot constructor/destructor
+ */
diff --git a/fs/ext4/snapshot_debug.c b/fs/ext4/snapshot_debug.c
new file mode 100644
index 0000000..e69de29
diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
new file mode 100644
index 0000000..e69de29
diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
new file mode 100644
index 0000000..2de017a
--- /dev/null
+++ b/fs/ext4/snapshot_inode.c
@@ -0,0 +1,42 @@
+/*
+ * linux/fs/ext4/snapshot_inode.c
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshots inode functions.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/time.h>
+#include <linux/jbd2.h>
+#include <linux/highuid.h>
+#include <linux/pagemap.h>
+#include <linux/quotaops.h>
+#include <linux/string.h>
+#include <linux/buffer_head.h>
+#include <linux/writeback.h>
+#include <linux/pagevec.h>
+#include <linux/mpage.h>
+#include <linux/namei.h>
+#include <linux/uio.h>
+#include <linux/bio.h>
+#include <linux/workqueue.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include "ext4_jbd2.h"
+#include "xattr.h"
+#include "acl.h"
+#include "ext4_extents.h"
+
+#include <trace/events/ext4.h>
+#include "snapshot.h"
+#ifdef CONFIG_EXT4_DEBUG
+#endif
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index cb22783..61e9173 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -48,6 +48,7 @@
#include "xattr.h"
#include "acl.h"
#include "mballoc.h"
+#include "snapshot.h"

#define CREATE_TRACE_POINTS
#include <trace/events/ext4.h>
@@ -2625,6 +2626,24 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
return 0;
}
}
+ /* Enforce snapshots requirements: */
+ if (EXT4_SNAPSHOTS(sb)) {
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb,
+ EXT4_FEATURE_INCOMPAT_META_BG|
+ EXT4_FEATURE_INCOMPAT_64BIT)) {
+ ext4_msg(sb, KERN_ERR,
+ "has_snapshot feature cannot be mixed with "
+ "features: meta_bg, 64bit");
+ return 0;
+ }
+ if (EXT4_TEST_FLAGS(sb, EXT4_FLAGS_IS_SNAPSHOT)) {
+ ext4_msg(sb, KERN_ERR,
+ "A snapshot image must be mounted read-only. "
+ "If this is an exported snapshot image, you "
+ "must run fsck -xy to make it writable.");
+ return 0;
+ }
+ }
return 1;
}

@@ -3235,6 +3254,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);

+ /* Enforce snapshots blocksize == pagesize */
+ if (EXT4_SNAPSHOTS(sb) && blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR,
+ "snapshots require that filesystem blocksize "
+ "(%d) be equal to system page size (%lu)",
+ blocksize, PAGE_SIZE);
+ goto failed_mount;
+ }
+
if (blocksize < EXT4_MIN_BLOCK_SIZE ||
blocksize > EXT4_MAX_BLOCK_SIZE) {
ext4_msg(sb, KERN_ERR,
@@ -3592,6 +3620,15 @@ no_journal:
goto failed_mount_wq;
}

+ /* Enforce journal ordered mode with snapshots */
+ if (EXT4_SNAPSHOTS(sb) && !(sb->s_flags & MS_RDONLY) &&
+ (!EXT4_SB(sb)->s_journal ||
+ test_opt(sb, DATA_FLAGS) != EXT4_MOUNT_ORDERED_DATA)) {
+ ext4_msg(sb, KERN_ERR,
+ "snapshots require journal ordered mode");
+ goto failed_mount4;
+ }
+
/*
* The jbd2_journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
@@ -4959,10 +4996,15 @@ static int __init ext4_init_fs(void)
err = register_filesystem(&ext4_fs_type);
if (err)
goto out;
+ err = init_ext4_snapshot();
+ if (err)
+ goto out_fs;

ext4_li_info = NULL;
mutex_init(&ext4_li_mtx);
return 0;
+out_fs:
+ unregister_filesystem(&ext4_fs_type);
out:
unregister_as_ext2();
unregister_as_ext3();
@@ -4986,6 +5028,7 @@ out7:

static void __exit ext4_exit_fs(void)
{
+ exit_ext4_snapshot();
ext4_destroy_lazyinit_thread();
unregister_as_ext2();
unregister_as_ext3();
--
1.7.4.1


2011-06-07 15:09:01

by Amir G.

Subject: [PATCH v1 02/36] ext4: snapshot debugging support

From: Amir Goldstein <[email protected]>

Control snapshot debug level via debugfs entry /ext4/snapshot-debug
and induce delay tests via debugfs entries /ext4/test-XXX-delay-msec.
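
For reviewers who want to check the arithmetic behind the progress-based
delay tests, here is a minimal userspace sketch of the intended calculation,
as described by the comment above snapshot_test_delay_progress(). It is a
model only, not the kernel macro: the function name progress_delay_msec()
and the use of plain 64-bit division are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model (not kernel code): how many milliseconds to sleep when
 * progress advances from 'from' to 'to' out of 'max' blocks, so that the
 * sleeps add up to roughly 'delay_msec' over 100% of progress.
 */
static uint64_t progress_delay_msec(uint16_t delay_msec, uint64_t from,
				    uint64_t to, uint64_t max)
{
	uint64_t blocks_per_ms;

	/* test disabled, or range too small to spread the delay over */
	if (delay_msec == 0 || max <= delay_msec)
		return 0;
	/* ignore inconsistent progress reports */
	if (from > to || to > max)
		return 0;

	/* at least 1, because max > delay_msec was checked above */
	blocks_per_ms = max / delay_msec;
	assert(blocks_per_ms > 0);

	/* number of 'blocks_per_ms' boundaries crossed by this step */
	return to / blocks_per_ms - from / blocks_per_ms;
}
```

With a 100 msec delay over 1000 blocks, one step from 0 to 1000 sleeps the
whole 100 msec, while a step from 5 to 25 crosses two 10-block boundaries
and sleeps 2 msec.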


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/Makefile | 1 +
fs/ext4/mballoc.c | 3 +
fs/ext4/snapshot.h | 9 ++++
fs/ext4/snapshot_debug.c | 100 +++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot_debug.h | 105 ++++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 218 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index a471c2e..1d947ef 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -13,3 +13,4 @@ ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot.o snapshot_ctl.o
ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_inode.o snapshot_buffer.o
+ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_debug.o
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5a930d6..54ea8c8 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2658,10 +2658,13 @@ static void __init ext4_create_debugfs_entry(void)
S_IRUGO | S_IWUSR,
debugfs_dir,
&mb_enable_debug);
+ if (debugfs_dir)
+ ext4_snapshot_create_debugfs_entry(debugfs_dir);
}

static void ext4_remove_debugfs_entry(void)
{
+ ext4_snapshot_remove_debugfs_entry();
debugfs_remove(debugfs_debug);
debugfs_remove(debugfs_dir);
}
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 8a60ae1..d0c985b 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -18,6 +18,7 @@
#include <linux/version.h>
#include <linux/delay.h>
#include "ext4.h"
+#include "snapshot_debug.h"


/*
@@ -109,6 +110,14 @@
static inline void snapshot_size_extend(struct inode *inode,
ext4_fsblk_t blocks)
{
+#ifdef CONFIG_EXT4_DEBUG
+ ext4_fsblk_t old_blocks = SNAPSHOT_PROGRESS(inode);
+ ext4_fsblk_t max_blocks = SNAPSHOT_BLOCKS(inode);
+
+ /* spread the tunable test delay over 100% of snapshot progress */
+ snapshot_test_delay_progress(SNAPTEST_DELETE,
+ old_blocks, blocks, max_blocks);
+#endif
i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
}

diff --git a/fs/ext4/snapshot_debug.c b/fs/ext4/snapshot_debug.c
index e69de29..35f552a 100644
--- a/fs/ext4/snapshot_debug.c
+++ b/fs/ext4/snapshot_debug.c
@@ -0,0 +1,100 @@
+/*
+ * linux/fs/ext4/snapshot_debug.c
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshot debugging.
+ */
+
+
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/debugfs.h>
+#include "snapshot.h"
+
+#if defined(CONFIG_EXT4_FS_SNAPSHOT) && defined(CONFIG_EXT4_DEBUG)
+/*
+ * debugfs tunables
+ */
+
+const char *snapshot_indent = SNAPSHOT_INDENT_STR + SNAPSHOT_INDENT_MAX;
+
+/*
+ * Tunable delay values per snapshot operation for testing of
+ * COW race conditions and master snapshot_mutex lock
+ */
+static const char *snapshot_test_names[SNAPSHOT_TESTS_NUM] = {
+ /* delay completion of snapshot create|take */
+ "test-take-delay-msec",
+ /* delay completion of snapshot shrink|cleanup */
+ "test-delete-delay-msec",
+ /* delay completion of COW operation */
+ "test-cow-delay-msec",
+ /* delay submission of tracked read */
+ "test-read-delay-msec",
+ /* delay completion of COW bitmap operation */
+ "test-bitmap-delay-msec",
+};
+
+#define SNAPSHOT_TEST_NAMES (sizeof(snapshot_test_names) / \
+ sizeof(snapshot_test_names[0]))
+
+u16 snapshot_enable_test[SNAPSHOT_TESTS_NUM] __read_mostly = {0};
+u8 snapshot_enable_debug __read_mostly = 1;
+
+static struct dentry *snapshot_debug;
+static struct dentry *snapshot_version;
+static struct dentry *snapshot_test[SNAPSHOT_TESTS_NUM];
+
+static char snapshot_version_str[] = EXT4_SNAPSHOT_VERSION;
+static struct debugfs_blob_wrapper snapshot_version_blob = {
+ .data = snapshot_version_str,
+ .size = sizeof(snapshot_version_str)
+};
+
+
+/*
+ * ext4_snapshot_create_debugfs_entry - register ext4 snapshot debug hooks
+ * Void function doesn't return error if debug hooks are not registered.
+ */
+void ext4_snapshot_create_debugfs_entry(struct dentry *debugfs_dir)
+{
+ int i;
+
+ BUG_ON(!debugfs_dir);
+ snapshot_debug = debugfs_create_u8("snapshot-debug", S_IRUGO|S_IWUSR,
+ debugfs_dir,
+ &snapshot_enable_debug);
+ snapshot_version = debugfs_create_blob("snapshot-version", S_IRUGO,
+ debugfs_dir,
+ &snapshot_version_blob);
+ for (i = 0; i < SNAPSHOT_TESTS_NUM && i < SNAPSHOT_TEST_NAMES; i++)
+ snapshot_test[i] = debugfs_create_u16(snapshot_test_names[i],
+ S_IRUGO|S_IWUSR,
+ debugfs_dir,
+ &snapshot_enable_test[i]);
+}
+
+/*
+ * ext4_snapshot_remove_debugfs_entry - unregister ext4 snapshot debug hooks
+ * checks if the hooks have been registered before unregistering them.
+ */
+void ext4_snapshot_remove_debugfs_entry(void)
+{
+ int i;
+
+ for (i = 0; i < SNAPSHOT_TESTS_NUM && i < SNAPSHOT_TEST_NAMES; i++)
+ if (snapshot_test[i])
+ debugfs_remove(snapshot_test[i]);
+ if (snapshot_version)
+ debugfs_remove(snapshot_version);
+ if (snapshot_debug)
+ debugfs_remove(snapshot_debug);
+}
+#endif
diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
index e69de29..f893eb1 100644
--- a/fs/ext4/snapshot_debug.h
+++ b/fs/ext4/snapshot_debug.h
@@ -0,0 +1,105 @@
+/*
+ * linux/fs/ext4/snapshot_debug.h
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshot debugging.
+ */
+
+#ifndef _LINUX_EXT4_SNAPSHOT_DEBUG_H
+#define _LINUX_EXT4_SNAPSHOT_DEBUG_H
+
+#if defined(CONFIG_EXT4_FS_SNAPSHOT) && defined(CONFIG_EXT4_DEBUG)
+#include <linux/delay.h>
+
+#define SNAPSHOT_INDENT_MAX 4
+#define SNAPSHOT_INDENT_STR "\t\t\t\t"
+
+#define SNAPTEST_TAKE 0
+#define SNAPTEST_DELETE 1
+#define SNAPTEST_COW 2
+#define SNAPTEST_READ 3
+#define SNAPTEST_BITMAP 4
+#define SNAPSHOT_TESTS_NUM 5
+
+extern const char *snapshot_indent;
+extern u8 snapshot_enable_debug;
+extern u16 snapshot_enable_test[SNAPSHOT_TESTS_NUM];
+extern u8 cow_cache_enabled;
+
+#define snapshot_test_delay(i) \
+ do { \
+ if (snapshot_enable_test[i]) \
+ msleep_interruptible(snapshot_enable_test[i]); \
+ } while (0)
+
+/*
+ * Sleep 1ms every 'blocks_per_ms' blocks, amounting to the total test delay
+ * over 100% of progress (when 'to' reaches 'max'). 'max' is checked to exceed
+ * snapshot_enable_test[i] (msec), so 'blocks_per_ms' is non zero. div_u64()
+ * and div64_u64() return the quotient and leave their arguments unmodified.
+ */
+#define snapshot_test_delay_progress(i, from, to, max) \
+ do { \
+ if (snapshot_enable_test[i] && \
+ (max) > snapshot_enable_test[i] && \
+ (from) <= (to) && (to) <= (max)) { \
+ u64 blocks_per_ms = \
+ div_u64((max), snapshot_enable_test[i]); \
+ u64 x = div64_u64((from), blocks_per_ms); \
+ u64 y = div64_u64((to), blocks_per_ms); \
+ if (y > x) \
+ msleep_interruptible(y - x); \
+ } \
+ } while (0)
+
+#define snapshot_debug_l(n, l, f, a...) \
+ do { \
+ if ((n) <= snapshot_enable_debug && \
+ (l) <= SNAPSHOT_INDENT_MAX) { \
+ printk(KERN_DEBUG "snapshot: %s" f, \
+ snapshot_indent - (l), \
+ ## a); \
+ } \
+ } while (0)
+
+#define snapshot_debug(n, f, a...) snapshot_debug_l(n, 0, f, ## a)
+
+#define snapshot_debug_once(n, f, a...) \
+ do { \
+ static bool __once; \
+ if (!__once) { \
+ snapshot_debug(n, f, ## a); \
+ __once = true; \
+ } \
+ } while (0)
+
+extern void ext4_snapshot_create_debugfs_entry(struct dentry *debugfs_dir);
+extern void ext4_snapshot_remove_debugfs_entry(void);
+
+#else
+#define snapshot_enable_debug (0)
+#define snapshot_test_delay(i)
+#define snapshot_test_delay_progress(i, from, to, max)
+#define snapshot_debug(n, f, a...)
+#define snapshot_debug_l(n, l, f, a...)
+#define snapshot_debug_once(n, f, a...)
+#define ext4_snapshot_create_debugfs_entry(d)
+#define ext4_snapshot_remove_debugfs_entry()
+#endif
+
+
+/* debug levels */
+#define SNAP_ERR 1 /* errors and summary */
+#define SNAP_WARN 2 /* warnings */
+#define SNAP_INFO 3 /* info */
+#define SNAP_DEBUG 4 /* debug */
+#define SNAP_DUMP 5 /* dump snapshot file */
+
+#endif /* _LINUX_EXT4_SNAPSHOT_DEBUG_H */
--
1.7.4.1


2011-06-07 15:09:03

by Amir G.

Subject: [PATCH v1 03/36] ext4: snapshot hooks - inside JBD hooks

From: Amir Goldstein <[email protected]>

Before every metadata buffer write, the journal API is called,
namely, one of the ext4_journal_get_XXX_access() functions.
We use these journal hooks to call the snapshot API, namely
ext4_snapshot_get_XXX_access(), to COW the metadata buffer before
it is modified for the first time.
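
The ordering of the two hooks can be sketched in userspace as follows.
This is a simplified model under stated assumptions, not the kernel code:
struct model_bh and all model_*() functions are hypothetical stand-ins, and
the real snapshot hook copies the block into the snapshot file rather than
setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head; only tracks COW state. */
struct model_bh {
	unsigned long long blocknr;
	int cowed;	/* set once the block was copied to the snapshot */
};

/* Stand-in for jbd2_journal_get_write_access(); always succeeds here. */
static int model_journal_get_write_access(struct model_bh *bh)
{
	(void)bh;
	return 0;
}

/* Stand-in for ext4_snapshot_get_write_access(): COW on first write. */
static int model_snapshot_get_write_access(struct model_bh *bh)
{
	if (!bh->cowed)
		bh->cowed = 1;	/* real code copies the block to the snapshot */
	return 0;
}

/*
 * Mirrors the control flow of __ext4_journal_get_write_access_inode():
 * journal access first, then the snapshot hook unless 'exclude' is set.
 */
static int model_get_write_access(struct model_bh *bh, int exclude)
{
	int err;

	assert(bh != NULL);
	err = model_journal_get_write_access(bh);
	if (!err && !exclude)
		err = model_snapshot_get_write_access(bh);
	return err;
}
```

Calling model_get_write_access(&bh, 0) marks the buffer as COWed, while
passing exclude = 1 (the ext4_journal_get_write_access_exclude() case)
takes journal access only and leaves the buffer untouched by the snapshot
hook.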


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 9 ++++++-
fs/ext4/ext4_jbd2.h | 15 ++++++++++---
fs/ext4/extents.c | 3 +-
fs/ext4/inode.c | 22 ++++++++++++++------
fs/ext4/move_extent.c | 3 +-
fs/ext4/snapshot.h | 51 +++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 88 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 560020d..833969b 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -23,13 +23,16 @@ int __ext4_journal_get_undo_access(const char *where, unsigned int line,
return err;
}

-int __ext4_journal_get_write_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh)
+int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
+ handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, int exclude)
{
int err = 0;

if (ext4_handle_valid(handle)) {
err = jbd2_journal_get_write_access(handle, bh);
+ if (!err && !exclude)
+ err = ext4_snapshot_get_write_access(handle, inode, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
@@ -111,6 +114,8 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,

if (ext4_handle_valid(handle)) {
err = jbd2_journal_get_create_access(handle, bh);
+ if (!err)
+ err = ext4_snapshot_get_create_access(handle, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 3da2092..ca6e135 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -132,9 +132,9 @@ void ext4_journal_abort_handle(const char *caller, unsigned int line,
int __ext4_journal_get_undo_access(const char *where, unsigned int line,
handle_t *handle, struct buffer_head *bh);

-int __ext4_journal_get_write_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh);
-
+int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
+ handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, int exclude);
int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
int is_metadata, struct inode *inode,
struct buffer_head *bh, ext4_fsblk_t blocknr);
@@ -151,8 +151,15 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,

#define ext4_journal_get_undo_access(handle, bh) \
__ext4_journal_get_undo_access(__func__, __LINE__, (handle), (bh))
+#define ext4_journal_get_write_access_exclude(handle, bh) \
+ __ext4_journal_get_write_access_inode(__func__, __LINE__, \
+ (handle), NULL, (bh), 1)
#define ext4_journal_get_write_access(handle, bh) \
- __ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
+ __ext4_journal_get_write_access_inode(__func__, __LINE__, \
+ (handle), NULL, (bh), 0)
+#define ext4_journal_get_write_access_inode(handle, inode, bh) \
+ __ext4_journal_get_write_access_inode(__func__, __LINE__, \
+ (handle), (inode), (bh), 0)
#define ext4_forget(handle, is_metadata, inode, bh, block_nr) \
__ext4_forget(__func__, __LINE__, (handle), (is_metadata), (inode), \
(bh), (block_nr))
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7598224..6f0a711 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -79,7 +79,8 @@ static int ext4_ext_get_access(handle_t *handle, struct inode *inode,
{
if (path->p_bh) {
/* path points to block */
- return ext4_journal_get_write_access(handle, path->p_bh);
+ return ext4_journal_get_write_access_inode(handle,
+ inode, path->p_bh);
}
/* path points to leaf/index in inode body */
/* we use in-core data, no need to protect them */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9dbd806..80e3393 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -874,7 +874,8 @@ static int ext4_splice_branch(handle_t *handle, struct inode *inode,
*/
if (where->bh) {
BUFFER_TRACE(where->bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, where->bh);
+ err = ext4_journal_get_write_access_inode(handle, inode,
+ where->bh);
if (err)
goto err_out;
}
@@ -4149,7 +4150,8 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
goto out_err;
if (bh) {
BUFFER_TRACE(bh, "retaking write access");
- err = ext4_journal_get_write_access(handle, bh);
+ err = ext4_journal_get_write_access_inode(handle,
+ inode, bh);
if (unlikely(err))
goto out_err;
}
@@ -4200,7 +4202,8 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,

if (this_bh) { /* For indirect block */
BUFFER_TRACE(this_bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, this_bh);
+ err = ext4_journal_get_write_access_inode(handle, inode,
+ this_bh);
/* Important: if we can't update the indirect pointers
* to the blocks, we can't free them. */
if (err)
@@ -4363,8 +4366,8 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
* pointed to by an indirect block: journal it
*/
BUFFER_TRACE(parent_bh, "get_write_access");
- if (!ext4_journal_get_write_access(handle,
- parent_bh)){
+ if (!ext4_journal_get_write_access_inode(
+ handle, inode, parent_bh)){
*p = 0;
BUFFER_TRACE(parent_bh,
"call ext4_handle_dirty_metadata");
@@ -4741,9 +4744,14 @@ has_buffer:

int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
{
- /* We have all inode data except xattrs in memory here. */
- return __ext4_get_inode_loc(inode, iloc,
+ int in_mem = (!EXT4_SNAPSHOTS(inode->i_sb) &&
!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
+
+ /*
+ * We have all inode's data except xattrs in memory here,
+ * but we must always read-in the entire inode block for COW.
+ */
+ return __ext4_get_inode_loc(inode, iloc, in_mem);
}

void ext4_set_inode_flags(struct inode *inode)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index b9f3e78..ad5409a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -421,7 +421,8 @@ mext_insert_extents(handle_t *handle, struct inode *orig_inode,

if (depth) {
/* Register to journal */
- ret = ext4_journal_get_write_access(handle, orig_path->p_bh);
+ ret = ext4_journal_get_write_access_inode(handle,
+ orig_inode, orig_path->p_bh);
if (ret)
return ret;
}
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index d0c985b..54241b9 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -145,6 +145,57 @@ static inline int EXT4_SNAPSHOTS(struct super_block *sb)
* Block access functions
*/

+/*
+ * get_write_access() is called before writing to a metadata block.
+ * If @inode is not NULL, then this is an inode's indirect block;
+ * otherwise, this is a file system global metadata block.
+ *
+ * Return values:
+ * = 0 - block was COWed or doesn't need to be COWed
+ * < 0 - error
+ */
+static inline int ext4_snapshot_get_write_access(handle_t *handle,
+ struct inode *inode, struct buffer_head *bh)
+{
+ struct super_block *sb;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (!EXT4_SNAPSHOTS(sb))
+ return 0;
+
+ return ext4_snapshot_cow(handle, inode, bh->b_blocknr, bh, 1);
+}
+
+/*
+ * get_create_access() is called after allocating a new metadata block
+ *
+ * Return values:
+ * = 0 - block was COWed or doesn't need to be COWed
+ * < 0 - error
+ */
+static inline int ext4_snapshot_get_create_access(handle_t *handle,
+ struct buffer_head *bh)
+{
+ struct super_block *sb;
+ int err;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (!EXT4_SNAPSHOTS(sb))
+ return 0;
+
+ /* Should block be COWed? */
+ err = ext4_snapshot_cow(handle, NULL, bh->b_blocknr, bh, 0);
+ /*
+ * A new block shouldn't need to be COWed if get_delete_access() was
+ * called for all deleted blocks. However, it may need to be COWed
+ * if fsck was run and if it had freed some blocks without moving them
+ * to snapshot. In the latter case, -EIO will be returned.
+ */
+ if (err > 0)
+ err = -EIO;
+ return err;
+}
+


/* snapshot_ctl.c */
--
1.7.4.1


2011-06-07 15:09:06

by Amir G.

Subject: [PATCH v1 04/36] ext4: snapshot hooks - block bitmap access

From: Amir Goldstein <[email protected]>

The API ext4_handle_get_bitmap_access() is used instead of
ext4_journal_get_write_access(), before modifying a block bitmap
while allocating or deleting blocks. The bitmap access API is
used to initialize the COW bitmap for that group.
The old ext4_journal_get_undo_access() API was removed because it
is not being used in the code.
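
The two-step initialization needed with flex_bg can be illustrated with a
small userspace model. Everything named model_* below is hypothetical, and
the model assumes that testing or COWing any block has the side effect of
initializing the COW bitmap of the group containing that block, which is
how I read the comment in ext4_snapshot_get_bitmap_access().

```c
#include <assert.h>

#define MODEL_GROUPS 4
#define MODEL_BLOCKS_PER_GROUP 8ULL

/* Hypothetical per-group state: is the COW bitmap initialized? */
static int model_cow_bitmap_ready[MODEL_GROUPS];

static unsigned int model_group_of(unsigned long long block)
{
	return (unsigned int)(block / MODEL_BLOCKS_PER_GROUP);
}

/*
 * Stand-in for ext4_snapshot_cow(): touching a block initializes the
 * COW bitmap of the group that contains it (assumption of this model).
 */
static int model_snapshot_cow(unsigned long long block)
{
	unsigned int group = model_group_of(block);

	if (group >= MODEL_GROUPS)
		return -1;
	model_cow_bitmap_ready[group] = 1;
	return 0;
}

/* Mirrors the structure of ext4_snapshot_get_bitmap_access(). */
static int model_get_bitmap_access(int flex_bg, unsigned int group,
				   unsigned long long bitmap_block)
{
	if (flex_bg) {
		/* 1. init the COW bitmap of the group being described */
		int err = model_snapshot_cow(group * MODEL_BLOCKS_PER_GROUP);

		if (err < 0)
			return err;
	}
	/* 2. COW the bitmap block itself, possibly in another group */
	return model_snapshot_cow(bitmap_block);
}
```

With flex_bg, a bitmap for group 2 that is stored in block 1 (group 0)
ends up initializing the COW bitmaps of both group 2 and group 0, while
the other groups stay untouched.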


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 10 +++++++---
fs/ext4/ext4_jbd2.h | 10 ++++++----
fs/ext4/mballoc.c | 7 ++++---
fs/ext4/snapshot.h | 31 +++++++++++++++++++++++++++++++
4 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 833969b..c44c362 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -9,13 +9,17 @@

#include <trace/events/ext4.h>

-int __ext4_journal_get_undo_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh)
+int __ext4_handle_get_bitmap_access(const char *where, unsigned int line,
+ handle_t *handle, struct super_block *sb,
+ ext4_group_t group, struct buffer_head *bh)
{
int err = 0;

if (ext4_handle_valid(handle)) {
- err = jbd2_journal_get_undo_access(handle, bh);
+ err = jbd2_journal_get_write_access(handle, bh);
+ if (!err)
+ err = ext4_snapshot_get_bitmap_access(handle, sb,
+ group, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index ca6e135..be3b8b3 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -129,8 +129,9 @@ void ext4_journal_abort_handle(const char *caller, unsigned int line,
const char *err_fn,
struct buffer_head *bh, handle_t *handle, int err);

-int __ext4_journal_get_undo_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh);
+int __ext4_handle_get_bitmap_access(const char *where, unsigned int line,
+ handle_t *handle, struct super_block *sb,
+ ext4_group_t group, struct buffer_head *bh);

int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
handle_t *handle, struct inode *inode,
@@ -149,8 +150,9 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
int __ext4_handle_dirty_super(const char *where, unsigned int line,
handle_t *handle, struct super_block *sb);

-#define ext4_journal_get_undo_access(handle, bh) \
- __ext4_journal_get_undo_access(__func__, __LINE__, (handle), (bh))
+#define ext4_handle_get_bitmap_access(handle, sb, group, bh) \
+ __ext4_handle_get_bitmap_access(__func__, __LINE__, \
+ (handle), (sb), (group), (bh))
#define ext4_journal_get_write_access_exclude(handle, bh) \
__ext4_journal_get_write_access_inode(__func__, __LINE__, \
(handle), NULL, (bh), 1)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 54ea8c8..6b400f2 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2749,7 +2749,8 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
if (!bitmap_bh)
goto out_err;

- err = ext4_journal_get_write_access(handle, bitmap_bh);
+ err = ext4_handle_get_bitmap_access(handle, sb, ac->ac_b_ex.fe_group,
+ bitmap_bh);
if (err)
goto out_err;

@@ -4549,7 +4550,7 @@ do_more:
}

BUFFER_TRACE(bitmap_bh, "getting write access");
- err = ext4_journal_get_write_access(handle, bitmap_bh);
+ err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
if (err)
goto error_return;

@@ -4698,7 +4699,7 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
}

BUFFER_TRACE(bitmap_bh, "getting write access");
- err = ext4_journal_get_write_access(handle, bitmap_bh);
+ err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
if (err)
goto error_return;

diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 54241b9..008f4a9 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -196,6 +196,37 @@ static inline int ext4_snapshot_get_create_access(handle_t *handle,
return err;
}

+/*
+ * get_bitmap_access() is called before modifying a block bitmap.
+ * This call initializes the COW bitmap for @group.
+ *
+ * Return values:
+ * = 0 - COW bitmap is initialized
+ * < 0 - error
+ */
+static inline int ext4_snapshot_get_bitmap_access(handle_t *handle,
+ struct super_block *sb, ext4_group_t group,
+ struct buffer_head *bh)
+{
+ if (!EXT4_SNAPSHOTS(sb))
+ return 0;
+ /*
+ * With flex_bg, block bitmap may reside in a different group than
+ * the group it describes, so we need to init both COW bitmaps:
+ * 1. init the COW bitmap for @group by testing
+ * if the first block in the group should be COWed
+ */
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
+ int err = ext4_snapshot_cow(handle, NULL,
+ ext4_group_first_block_no(sb, group),
+ NULL, 0);
+ if (err < 0)
+ return err;
+ }
+ /* 2. COW the block bitmap itself, which may be in another group */
+ return ext4_snapshot_cow(handle, NULL, bh->b_blocknr, bh, 1);
+}
+


/* snapshot_ctl.c */
--
1.7.4.1


2011-06-07 15:09:08

by Amir G.

Subject: [PATCH v1 05/36] ext4: snapshot hooks - delete blocks

From: Amir Goldstein <[email protected]>

Before deleting file blocks in ext4_free_blocks(),
we call the snapshot API ext4_snapshot_get_delete_access(),
to optionally move the blocks to the snapshot file instead of
freeing them.
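
The return value contract (> 0: blocks were moved to the snapshot, 0: blocks
may be freed, < 0: error) leads to a "skip or free" loop, which can be
sketched in userspace as below. This is a simplified model, not the mballoc
code: the model_* names and arrays are hypothetical, and the model assumes
that on return *pcount holds the length of the run that shares the same
answer as the first block.

```c
#include <assert.h>

#define MODEL_NBLOCKS 16

/* Hypothetical state: 1 if the active snapshot still needs this block. */
static int model_in_snapshot[MODEL_NBLOCKS];
static int model_freed[MODEL_NBLOCKS];
static int model_moved[MODEL_NBLOCKS];

/*
 * Stand-in for ext4_snapshot_get_delete_access().
 * On return, *pcount is the length of the run starting at 'block' that
 * shares the same answer. Returns > 0 if that run was moved to the
 * snapshot, 0 if it may be freed, < 0 on error.
 */
static int model_get_delete_access(int block, int *pcount)
{
	int n = 0;
	int moved;

	if (block < 0 || *pcount <= 0 || block + *pcount > MODEL_NBLOCKS)
		return -1;
	moved = model_in_snapshot[block];
	while (n < *pcount && model_in_snapshot[block + n] == moved) {
		if (moved)
			model_moved[block + n] = 1;
		n++;
	}
	*pcount = n;
	return moved ? n : 0;
}

/* Mirrors the "skip moved blocks, free the rest" loop of the patch. */
static int model_free_blocks(int block, int count)
{
	while (count > 0) {
		int maxblocks = count;
		int ret = model_get_delete_access(block, &maxblocks);
		int i;

		if (ret < 0)
			return ret;
		if (ret == 0)
			for (i = 0; i < maxblocks; i++)
				model_freed[block + i] = 1;
		/* moved blocks are skipped, freed blocks are done */
		block += maxblocks;
		count -= maxblocks;
	}
	return 0;
}
```

Freeing blocks 0..5 when blocks 2 and 3 are still needed by the snapshot
frees 0, 1, 4 and 5 and moves 2 and 3 to the snapshot instead.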


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 10 +++++++---
fs/ext4/mballoc.c | 30 +++++++++++++++++++++++++++---
fs/ext4/snapshot.h | 26 ++++++++++++++++++++++++++
3 files changed, 60 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 756848f..b910dcb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1731,9 +1731,13 @@ extern int ext4_mb_reserve_blocks(struct super_block *, int);
extern void ext4_discard_preallocations(struct inode *);
extern int __init ext4_init_mballoc(void);
extern void ext4_exit_mballoc(void);
-extern void ext4_free_blocks(handle_t *handle, struct inode *inode,
- struct buffer_head *bh, ext4_fsblk_t block,
- unsigned long count, int flags);
+extern void __ext4_free_blocks(const char *where, unsigned int line,
+ handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, ext4_fsblk_t block,
+ unsigned long count, int flags);
+#define ext4_free_blocks(handle, inode, bh, block, count, flags) \
+ __ext4_free_blocks(__func__, __LINE__ , (handle), (inode), (bh), \
+ (block), (count), (flags))
extern int ext4_mb_add_groupinfo(struct super_block *sb,
ext4_group_t i, struct ext4_group_desc *desc);
extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6b400f2..f878449 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4451,9 +4451,9 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
* @count: number of blocks to count
* @metadata: Are these metadata blocks
*/
-void ext4_free_blocks(handle_t *handle, struct inode *inode,
- struct buffer_head *bh, ext4_fsblk_t block,
- unsigned long count, int flags)
+void __ext4_free_blocks(const char *where, unsigned int line, handle_t *handle,
+ struct inode *inode, struct buffer_head *bh,
+ ext4_fsblk_t block, unsigned long count, int flags)
{
struct buffer_head *bitmap_bh = NULL;
struct super_block *sb = inode->i_sb;
@@ -4467,6 +4467,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
struct ext4_buddy e4b;
int err = 0;
int ret;
+ int maxblocks;

if (bh) {
if (block)
@@ -4549,6 +4550,29 @@ do_more:
goto error_return;
}

+ maxblocks = count;
+ ret = ext4_snapshot_get_delete_access(handle, inode,
+ block, &maxblocks);
+ if (ret < 0) {
+ ext4_journal_abort_handle(where, line, __func__,
+ NULL, handle, ret);
+ err = ret;
+ goto error_return;
+ }
+ if (ret > 0) {
+ /* 'ret' blocks were moved to snapshot - skip them */
+ block += maxblocks;
+ count -= maxblocks;
+ count += overflow;
+ cond_resched();
+ if (count > 0)
+ goto do_more;
+ /* no more blocks to free/move to snapshot */
+ ext4_mark_super_dirty(sb);
+ goto error_return;
+ }
+ overflow += count - maxblocks;
+ count = maxblocks;
BUFFER_TRACE(bitmap_bh, "getting write access");
err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
if (err)
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 008f4a9..504dfd5 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -227,6 +227,32 @@ static inline int ext4_snapshot_get_bitmap_access(handle_t *handle,
return ext4_snapshot_cow(handle, NULL, bh->b_blocknr, bh, 1);
}

+/*
+ * get_delete_access() - move blocks to snapshot or approve to free them
+ * @handle: JBD handle
+ * @inode: owner of blocks if known (or NULL otherwise)
+ * @block: address of start @block
+ * @pcount: pointer to no. of blocks about to move or approve
+ *
+ * Called from ext4_free_blocks() before deleting blocks with
+ * i_data_sem held
+ *
+ * Return values:
+ * > 0 - blocks were moved to snapshot and may not be freed
+ * = 0 - blocks may be freed
+ * < 0 - error
+ */
+static inline int ext4_snapshot_get_delete_access(handle_t *handle,
+ struct inode *inode, ext4_fsblk_t block, int *pcount)
+{
+ struct super_block *sb;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (!EXT4_SNAPSHOTS(sb))
+ return 0;
+
+ return ext4_snapshot_move(handle, inode, block, pcount, 1);
+}


/* snapshot_ctl.c */
--
1.7.4.1


2011-06-07 15:09:11

by Amir G.

Subject: [PATCH v1 06/36] ext4: snapshot hooks - move data blocks

From: Amir Goldstein <[email protected]>

Before every regular file data buffer write, the function
ext4_get_block() is called to map the buffer to disk. We add a
new function ext4_get_block_mow() (move-on-write), which is called
when we want to snapshot the blocks. We use this hook to call the
snapshot API ext4_snapshot_get_move_access(), to optionally move the
block to the snapshot file.
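
Which inodes take the move-on-write path is decided by
ext4_snapshot_should_move_data(), added by this patch. The userspace sketch
below restates that decision with plain booleans; struct model_inode and
its field names are hypothetical and only stand in for the kernel checks
(EXT4_SNAPSHOTS(), EXT4_JOURNAL(), the EXT4_INODE_EXTENTS flag and
ext4_should_journal_data()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the inode properties the kernel helper tests. */
struct model_inode {
	bool fs_has_snapshots;	/* EXT4_SNAPSHOTS(inode->i_sb) */
	bool has_journal;	/* EXT4_JOURNAL(inode) != NULL */
	bool uses_extents;	/* EXT4_INODE_EXTENTS flag is set */
	bool journals_data;	/* ext4_should_journal_data(inode) */
};

/* Returns true if data blocks of this inode should be moved-on-write. */
static bool model_should_move_data(const struct model_inode *inode)
{
	assert(inode != NULL);
	if (!inode->fs_has_snapshots)
		return false;
	if (!inode->has_journal)
		return false;
	/* extent mapped files are not handled by this patch */
	if (inode->uses_extents)
		return false;
	/* journaled data is already COWed as metadata */
	if (inode->journals_data)
		return false;
	return true;
}
```

Only an indirect-mapped file on a journaled file system with snapshots and
without data journaling returns true; flipping any single property turns
move-on-write off for that inode.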


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 15 +++-
fs/ext4/ext4_jbd2.h | 17 ++++
fs/ext4/inode.c | 242 +++++++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/mballoc.c | 23 +++++
fs/ext4/snapshot.h | 31 +++++++
5 files changed, 311 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b910dcb..a5bc3ab 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -143,9 +143,10 @@ struct ext4_allocation_request {
#define EXT4_MAP_UNWRITTEN (1 << BH_Unwritten)
#define EXT4_MAP_BOUNDARY (1 << BH_Boundary)
#define EXT4_MAP_UNINIT (1 << BH_Uninit)
+#define EXT4_MAP_REMAP (1 << BH_Remap)
#define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
- EXT4_MAP_UNINIT)
+ EXT4_MAP_UNINIT | EXT4_MAP_REMAP)

struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
@@ -512,6 +513,12 @@ struct ext4_new_group_data {
/* Convert extent to initialized after IO complete */
#define EXT4_GET_BLOCKS_IO_CONVERT_EXT (EXT4_GET_BLOCKS_CONVERT|\
EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
+ /* Look up whether the mapped block is used by a snapshot.
+ * If so and EXT4_GET_BLOCKS_CREATE is set, move it to the snapshot
+ * and allocate a new block for the new data.
+ * If EXT4_GET_BLOCKS_CREATE is not set, return the EXT4_MAP_REMAP flag.
+ */
+#define EXT4_GET_BLOCKS_MOVE_ON_WRITE 0x0100

/*
* Flags used by ext4_free_blocks
@@ -2130,10 +2137,16 @@ extern int ext4_bio_write_page(struct ext4_io_submit *io,
enum ext4_state_bits {
BH_Uninit /* blocks are allocated but uninitialized on disk */
= BH_JBDPrivateStart,
+ BH_Remap, /* Data block needs to be remapped;
+ * used by snapshot to do move-on-write
+ */
+ BH_Partial_Write, /* Buffer should be uptodate before write */
};

BUFFER_FNS(Uninit, uninit)
TAS_BUFFER_FNS(Uninit, uninit)
+BUFFER_FNS(Remap, remap)
+BUFFER_FNS(Partial_Write, partial_write)

/*
* Add new method to test wether block and inode bitmaps are properly
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index be3b8b3..46dc1ce 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -360,5 +360,22 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
}

#ifdef CONFIG_EXT4_FS_SNAPSHOT
+/*
+ * check if @inode data blocks should be moved-on-write
+ */
+static inline int ext4_snapshot_should_move_data(struct inode *inode)
+{
+ if (!EXT4_SNAPSHOTS(inode->i_sb))
+ return 0;
+ if (EXT4_JOURNAL(inode) == NULL)
+ return 0;
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ return 0;
+ /* when a data block is journaled, it is already COWed as metadata */
+ if (ext4_should_journal_data(inode))
+ return 0;
+ return 1;
+}
+
#endif
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 80e3393..2d7e540 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -78,7 +78,8 @@ static int noalloc_get_block_write(struct inode *inode, sector_t iblock,
static int ext4_set_bh_endio(struct buffer_head *bh, struct inode *inode);
static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate);
static int __ext4_journalled_writepage(struct page *page, unsigned int len);
-static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
+static int ext4_bh_delay_or_unwritten_or_remap(handle_t *handle,
+ struct buffer_head *bh);

/*
* Test whether an inode is a fast symlink.
@@ -988,6 +989,51 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,

partial = ext4_get_branch(inode, depth, offsets, chain, &err);

+ err = 0;
+ if (!partial && (flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE)) {
+ BUG_ON(!ext4_snapshot_should_move_data(inode));
+ first_block = le32_to_cpu(chain[depth - 1].key);
+ if (!(flags & EXT4_GET_BLOCKS_CREATE)) {
+ /*
+ * First call from ext4_map_blocks():
+ * test if first_block should be moved to snapshot?
+ */
+ err = ext4_snapshot_get_move_access(handle, inode,
+ first_block,
+ &map->m_len, 0);
+ if (err < 0) {
+ /* cleanup the whole chain and exit */
+ partial = chain + depth - 1;
+ goto cleanup;
+ }
+ if (err > 0) {
+ /*
+ * Return EXT4_MAP_REMAP via map->m_flags
+ * to tell ext4_map_blocks() that the
+ * found block should be moved to snapshot.
+ */
+ map->m_flags |= EXT4_MAP_REMAP;
+ }
+ /*
+ * Set max. blocks to map to max. blocks, which
+ * ext4_snapshot_get_move_access() allows us to handle
+ * (move or not move) in one ext4_map_blocks() call.
+ */
+ err = 0;
+ } else if (map->m_flags & EXT4_MAP_REMAP &&
+ map->m_pblk == first_block) {
+ /*
+ * Second call from ext4_map_blocks():
+ * If mapped block hasn't changed, we can rely on the
+ * cached result from the first call.
+ */
+ err = 1;
+ }
+ }
+ if (err)
+ /* do not map found block - it should be moved to snapshot */
+ partial = chain + depth - 1;
+
/* Simplest case - block found, no allocation needed */
if (!partial) {
first_block = le32_to_cpu(chain[depth - 1].key);
@@ -1022,8 +1068,12 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
* Next look up the indirect map to count the totoal number of
* direct blocks to allocate for this branch.
*/
- count = ext4_blks_to_allocate(partial, indirect_blks,
- map->m_len, blocks_to_boundary);
+ if (map->m_flags & EXT4_MAP_REMAP) {
+ BUG_ON(indirect_blks != 0);
+ count = map->m_len;
+ } else
+ count = ext4_blks_to_allocate(partial, indirect_blks,
+ map->m_len, blocks_to_boundary);
/*
* Block out ext4_truncate while we alter the tree
*/
@@ -1031,6 +1081,23 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
&count, goal,
offsets + (partial - chain), partial);

+ if (map->m_flags & EXT4_MAP_REMAP) {
+ map->m_len = count;
+ /* move old block to snapshot */
+ err = ext4_snapshot_get_move_access(handle, inode,
+ le32_to_cpu(*(partial->p)),
+ &map->m_len, 1);
+ if (err <= 0) {
+ /* failed to move to snapshot - abort! */
+ err = err ? : -EIO;
+ ext4_journal_abort_handle(__func__, __LINE__,
+ "ext4_snapshot_get_move_access", NULL,
+ handle, err);
+ goto cleanup;
+ }
+ /* block moved to snapshot - continue to splice new block */
+ err = 0;
+ }
/*
* The ext4_splice_branch call will free and forget any buffers
* on the new chain if there is a failure, but that risks using
@@ -1046,7 +1113,8 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,

map->m_flags |= EXT4_MAP_NEW;

- ext4_update_inode_fsync_trans(handle, inode, 1);
+ if (!IS_COWING(handle))
+ ext4_update_inode_fsync_trans(handle, inode, 1);
got_it:
map->m_flags |= EXT4_MAP_MAPPED;
map->m_pblk = le32_to_cpu(chain[depth-1].key);
@@ -1294,7 +1362,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, 0);
} else {
- retval = ext4_ind_map_blocks(handle, inode, map, 0);
+ retval = ext4_ind_map_blocks(handle, inode, map,
+ flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
}
up_read((&EXT4_I(inode)->i_data_sem));

@@ -1315,7 +1384,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* ext4_ext_get_block() returns th create = 0
* with buffer head unmapped.
*/
- if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED)
+ if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED &&
+ !(map->m_flags & EXT4_MAP_REMAP))
return retval;

/*
@@ -1378,6 +1448,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);

up_write((&EXT4_I(inode)->i_data_sem));
+ /* Clear EXT4_MAP_REMAP, it is not needed any more. */
+ map->m_flags &= ~EXT4_MAP_REMAP;
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
int ret = check_block_validity(inode, map);
if (ret != 0)
@@ -1386,6 +1458,41 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
return retval;
}

+/*
+ * Block may need to be moved to snapshot and we need to writeback part of the
+ * existing block data to the new block, so make sure the buffer and page are
+ * uptodate before moving the existing block to snapshot.
+ */
+static int ext4_partial_write_begin(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh)
+{
+ struct ext4_map_blocks map;
+ int ret;
+
+ BUG_ON(!buffer_partial_write(bh));
+ BUG_ON(!bh->b_page || !PageLocked(bh->b_page));
+ map.m_lblk = iblock;
+ map.m_len = 1;
+
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret <= 0)
+ return ret;
+
+ if (!buffer_uptodate(bh) && !buffer_unwritten(bh)) {
+ /* map existing block for read */
+ map_bh(bh, inode->i_sb, map.m_pblk);
+ ll_rw_block(READ, 1, &bh);
+ wait_on_buffer(bh);
+ /* clear existing block mapping */
+ clear_buffer_mapped(bh);
+ if (!buffer_uptodate(bh))
+ return -EIO;
+ }
+ /* prevent zero out of page with BH_New flag in block_write_begin() */
+ SetPageUptodate(bh->b_page);
+ return 0;
+}
+
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096

@@ -1408,11 +1515,18 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- return ret;
+ goto out;
}
started = 1;
}

+ if ((flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE) &&
+ buffer_partial_write(bh)) {
+ /* Read existing block data before moving it to snapshot */
+ ret = ext4_partial_write_begin(inode, iblock, bh);
+ if (ret < 0)
+ goto out;
+ }
ret = ext4_map_blocks(handle, inode, &map, flags);
if (ret > 0) {
map_bh(bh, inode->i_sb, map.m_pblk);
@@ -1420,11 +1534,30 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
ret = 0;
}
+out:
if (started)
ext4_journal_stop(handle);
+ /*
+ * BH_Partial_Write flags are only used to pass
+ * hints to this function and should be cleared on exit.
+ */
+ clear_buffer_partial_write(bh);
return ret;
}

+/*
+ * ext4_get_block_mow is used when a block may need to be moved to snapshot.
+ */
+int ext4_get_block_mow(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int flags = create ? EXT4_GET_BLOCKS_CREATE : 0;
+
+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE;
+ return _ext4_get_block(inode, iblock, bh, flags);
+}
+
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{
@@ -1603,6 +1736,45 @@ static void ext4_truncate_failed_write(struct inode *inode)
ext4_truncate(inode);
}

+/*
+ * Prepare for snapshot.
+ * Clear mapped flag of buffers,
+ * Set partial write flag of buffers in non-delayed-mow case.
+ */
+static void ext4_snapshot_write_begin(struct inode *inode,
+ struct page *page, unsigned len, int delay)
+{
+ struct buffer_head *bh = NULL;
+ /*
+ * XXX: We can also check ext4_snapshot_has_active() here and we don't
+ * need to unmap the buffers if there is no active snapshot, but the
+ * result must be valid throughout the writepage() operation and to
+ * guarantee this we have to know that the transaction is not restarted.
+ * Can we count on that?
+ */
+ if (!ext4_snapshot_should_move_data(inode))
+ return;
+
+ if (!page_has_buffers(page))
+ create_empty_buffers(page, inode->i_sb->s_blocksize, 0);
+ /* snapshots only work when blocksize == pagesize */
+ bh = page_buffers(page);
+ /*
+ * make sure that get_block() is called even if the buffer is
+ * mapped, but not if it is already a part of any transaction.
+ * in data=ordered, the only mode supported by ext4, all dirty
+ * data buffers are flushed on snapshot take via freeze_fs()
+ * API.
+ */
+ if (!buffer_jbd(bh) && !buffer_delay(bh)) {
+ clear_buffer_mapped(bh);
+ /* explicitly request move-on-write */
+ if (!delay && len < PAGE_CACHE_SIZE)
+ /* read block before moving it to snapshot */
+ set_buffer_partial_write(bh);
+ }
+}
+
static int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
static int ext4_write_begin(struct file *file, struct address_space *mapping,
@@ -1645,11 +1817,13 @@ retry:
goto out;
}
*pagep = page;
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ ext4_snapshot_write_begin(inode, page, len, 0);

if (ext4_should_dioread_nolock(inode))
ret = __block_write_begin(page, pos, len, ext4_get_block_write);
else
- ret = __block_write_begin(page, pos, len, ext4_get_block);
+ ret = __block_write_begin(page, pos, len, ext4_get_block_mow);

if (!ret && ext4_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
@@ -2117,6 +2291,10 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
clear_buffer_delay(bh);
bh->b_blocknr = pblock;
}
+ if (buffer_remap(bh)) {
+ clear_buffer_remap(bh);
+ bh->b_blocknr = pblock;
+ }
if (buffer_unwritten(bh) ||
buffer_mapped(bh))
BUG_ON(bh->b_blocknr != pblock);
@@ -2126,7 +2304,8 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
}

/* skip page if block allocation undone */
- if (buffer_delay(bh) || buffer_unwritten(bh))
+ if (buffer_delay(bh) || buffer_unwritten(bh) ||
+ buffer_remap(bh))
skip_page = 1;
bh = bh->b_this_page;
block_start += bh->b_size;
@@ -2244,7 +2423,8 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
if ((mpd->b_size == 0) ||
((mpd->b_state & (1 << BH_Mapped)) &&
!(mpd->b_state & (1 << BH_Delay)) &&
- !(mpd->b_state & (1 << BH_Unwritten))))
+ !(mpd->b_state & (1 << BH_Unwritten)) &&
+ !(mpd->b_state & (1 << BH_Remap))))
goto submit_io;

handle = ext4_journal_current_handle();
@@ -2275,6 +2455,9 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
if (mpd->b_state & (1 << BH_Delay))
get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
+ if (mpd->b_state & (1 << BH_Remap))
+ get_blocks_flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE |
+ EXT4_GET_BLOCKS_DELALLOC_RESERVE;

blks = ext4_map_blocks(handle, mpd->inode, &map, get_blocks_flags);
if (blks < 0) {
@@ -2360,7 +2543,7 @@ submit_io:
}

#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \
- (1 << BH_Delay) | (1 << BH_Unwritten))
+ (1 << BH_Delay) | (1 << BH_Unwritten) | (1 << BH_Remap))

/*
* mpage_add_bh_to_extent - try to add one more block to extent of blocks
@@ -2437,9 +2620,11 @@ flush_it:
return;
}

-static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh)
+static int ext4_bh_delay_or_unwritten_or_remap(handle_t *handle,
+ struct buffer_head *bh)
{
- return (buffer_delay(bh) || buffer_unwritten(bh)) && buffer_dirty(bh);
+ return ((buffer_delay(bh) || buffer_unwritten(bh)) &&
+ buffer_dirty(bh)) || buffer_remap(bh);
}

/*
@@ -2459,6 +2644,8 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
{
struct ext4_map_blocks map;
int ret = 0;
+ handle_t *handle = ext4_journal_current_handle();
+ int flags = 0;
sector_t invalid_block = ~((sector_t) 0xffff);

if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
@@ -2470,12 +2657,15 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
map.m_lblk = iblock;
map.m_len = 1;

+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE;
+
/*
* first, we need to know whether the block is allocated already
* preallocated blocks are unmapped but should treated
* the same as allocated blocks.
*/
- ret = ext4_map_blocks(NULL, inode, &map, 0);
+ ret = ext4_map_blocks(handle, inode, &map, flags);
if (ret < 0)
return ret;
if (ret == 0) {
@@ -2495,6 +2685,11 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
return 0;
}

+ if (map.m_flags & EXT4_MAP_REMAP) {
+ ret = ext4_da_reserve_space(inode, iblock);
+ if (ret < 0)
+ return ret;
+ }
map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;

@@ -2662,7 +2857,7 @@ static int ext4_writepage(struct page *page,
}
page_bufs = page_buffers(page);
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
- ext4_bh_delay_or_unwritten)) {
+ ext4_bh_delay_or_unwritten_or_remap)) {
/*
* We don't want to do block allocation, so redirty
* the page and return. We may reach here when we do
@@ -2832,7 +3027,8 @@ static int write_cache_pages_da(struct address_space *mapping,
* Otherwise we won't make progress
* with the page in ext4_writepage
*/
- if (ext4_bh_delay_or_unwritten(NULL, bh)) {
+ if (ext4_bh_delay_or_unwritten_or_remap(
+ NULL, bh)) {
mpage_add_bh_to_extent(mpd, logical,
bh->b_size,
bh->b_state);
@@ -3146,6 +3342,8 @@ retry:
goto out;
}
*pagep = page;
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ ext4_snapshot_write_begin(inode, page, len, 1);

ret = __block_write_begin(page, pos, len, ext4_da_get_block_prep);
if (ret < 0) {
@@ -3979,6 +4177,18 @@ int ext4_block_truncate_page(handle_t *handle,
goto unlock;
}

+ /* check if block needs to be moved to snapshot before zeroing */
+ if (ext4_snapshot_should_move_data(inode)) {
+ err = ext4_get_block_mow(inode, iblock, bh, 1);
+ if (err)
+ goto unlock;
+ if (buffer_new(bh)) {
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
+ clear_buffer_new(bh);
+ }
+ }
+
if (ext4_should_journal_data(inode)) {
BUFFER_TRACE(bh, "get write access");
err = ext4_journal_get_write_access(handle, bh);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f878449..4ff3079 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3179,6 +3179,29 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa, *cpa = NULL;
ext4_fsblk_t goal_block;

+ /*
+ * All inode preallocations allocated before the time when the
+ * active snapshot is taken need to be discarded, otherwise blocks
+ * may be used by both a regular file and the snapshot file that we
+ * are taking in the below case.
+ *
+ * Case: A user takes a snapshot when an inode has a preallocation
+ * 12/512, of which 12/64 has been used by the inode. Here 12 is the
+ * logical block number. After the snapshot is taken, a user issues
+ * a write request on the 12th block, then an allocation on 12 is
+ * needed and the allocator will use blocks from the preallocations. As
+ * a result, the event above happens.
+ *
+ *
+ * For now, all preallocations are discarded.
+ *
+ * Please refer to code and comments about preallocation in
+ * mballoc.c for more information.
+ */
+ if (ext4_snapshot_active(EXT4_SB(ac->ac_inode->i_sb)) &&
+ !ext4_snapshot_mow_in_tid(ac->ac_inode)) {
+ ext4_discard_preallocations(ac->ac_inode);
+ }
/* only data can be preallocated */
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
return 0;
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 504dfd5..71edd71 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -228,6 +228,37 @@ static inline int ext4_snapshot_get_bitmap_access(handle_t *handle,
}

/*
+ * get_move_access() - move block to snapshot
+ * @handle: JBD handle
+ * @inode: owner of @block
+ * @block: address of @block
+ * @pcount: pointer to no. of blocks about to move or approve
+ * @move: if false, only test if blocks need to be moved
+ *
+ * Called from ext4_ind_map_blocks() before overwriting a data block, when the
+ * buffer_move_on_write() flag is set. Specifically, only data blocks of
+ * regular files are moved. Directory blocks are COWed on get_write_access().
+ * Snapshots and excluded files blocks are never moved-on-write.
+ * If @move is true, then down_write(&i_data_sem) is held.
+ *
+ * Return values:
+ * > 0 - @blocks a) were moved for @move = 1;
+ * b) need to be moved for @move = 0.
+ * = 0 - blocks don't need to be moved.
+ * < 0 - error
+ */
+static inline int ext4_snapshot_get_move_access(handle_t *handle,
+ struct inode *inode,
+ ext4_fsblk_t block,
+ int *pcount, int move)
+{
+ if (!EXT4_SNAPSHOTS(inode->i_sb))
+ return 0;
+
+ return ext4_snapshot_move(handle, inode, block, pcount, move);
+}
+
+/*
* get_delete_access() - move blocks to snapshot or approve to free them
* @handle: JBD handle
* @inode: owner of blocks if known (or NULL otherwise)
--
1.7.4.1


2011-06-07 15:09:13

by Amir G.

Subject: [PATCH v1 07/36] ext4: snapshot hooks - direct I/O

From: Amir Goldstein <[email protected]>

With indirect-mapped files, a direct I/O write is not allowed to
initialize holes, so that stale data is not exposed.
With snapshots, a direct I/O write is not allowed to do move-on-write
for the same reason; such writes fall back to buffered I/O.
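
The fallback decision added to ext4_map_blocks() in the diff below can be
modelled roughly as follows. This is a minimal sketch under stated
assumptions: the TOY_* flag values and toy_dio_lookup() are hypothetical
and do not match the kernel's real flag values. Only the control flow
mirrors the patch: a mapped block that needs move-on-write is reported as
unmapped to a PRE_IO caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
#define TOY_MAP_MAPPED        0x1  /* block is mapped */
#define TOY_MAP_REMAP         0x2  /* block needs move-on-write */
#define TOY_GET_BLOCKS_PRE_IO 0x4  /* request comes from the DIO write path */

/*
 * retval:  number of blocks found by the lookup (> 0 means mapped)
 * m_flags: in/out mapping flags
 * flags:   request flags
 * Returns the number of blocks the caller may write directly; 0 means
 * the block is reported as unmapped, so the caller falls back to
 * buffered I/O.
 */
static int toy_dio_lookup(int retval, unsigned *m_flags, unsigned flags)
{
    assert(m_flags != NULL);
    if (retval > 0 && (*m_flags & TOY_MAP_REMAP) &&
        (flags & TOY_GET_BLOCKS_PRE_IO)) {
        *m_flags &= ~TOY_MAP_MAPPED;  /* hide the mapping from DIO */
        return 0;
    }
    return retval;
}
```

Without the PRE_IO request flag, or when no move-on-write is pending, the
lookup result is passed through unchanged.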


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2d7e540..1cb94d2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1373,6 +1373,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
return ret;
}

+ if (retval > 0 && (map->m_flags & EXT4_MAP_REMAP) &&
+ (flags & EXT4_GET_BLOCKS_PRE_IO)) {
+ /*
+ * If mow is needed on the requested block and
+ * request comes from async-direct-io-write path,
+ * we return an unmapped buffer to fall back to buffered I/O.
+ */
+ map->m_flags &= ~EXT4_MAP_MAPPED;
+ return 0;
+ }
/* If it is only a block(s) look up */
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
return retval;
@@ -3654,6 +3664,29 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
}

/*
+ * ext4_get_block_dio_write is used when preparing for a DIO write
+ * to indirect mapped files with snapshots.
+ */
+int ext4_get_block_dio_write(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int flags = EXT4_GET_BLOCKS_CREATE;
+
+ /*
+ * DIO_SKIP_HOLES may ask to map direct I/O write with create=0,
+ * but we know this is a write, so we need to check if block
+ * needs to be moved to snapshot and fall back to buffered I/O.
+ * ext4_map_blocks() will return an unmapped buffer if block
+ * is not allocated or if it needs to be moved to snapshot.
+ */
+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE|
+ EXT4_GET_BLOCKS_PRE_IO;
+
+ return _ext4_get_block(inode, iblock, bh, flags);
+}
+
+/*
* O_DIRECT for ext3 (or indirect map) based files
*
* If the O_DIRECT write will extend the file then add this inode to the
@@ -3708,6 +3741,16 @@ retry:
ret = blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
+ /*
+ * snapshots code gets here for DIO write
+ * to ind mapped files or outside i_size
+ * of extent mapped files and for DIO read
+ * to all files.
+ * XXX: isn't it possible to expose stale data
+ * on DIO read to newly allocated ind map
+ * blocks or newly MOWed blocks?
+ */
+ (rw == WRITE) ? ext4_get_block_dio_write :
ext4_get_block, NULL);

if (unlikely((rw & WRITE) && ret < 0)) {
@@ -3769,10 +3812,13 @@ out:
static int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
+ int flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
ext4_debug("ext4_get_block_write: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE;
+ return _ext4_get_block(inode, iblock, bh_result, flags);
}

static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
--
1.7.4.1


2011-06-07 15:09:16

by Amir G.

Subject: [PATCH v1 08/36] ext4: snapshot hooks - move extent file data blocks

From: Amir Goldstein <[email protected]>

Extent-mapped file data is moved into the snapshot in ext4_ext_map_blocks().
If only part of an extent is to be moved, the extent is split. Fragmentation
is light because of delayed move-on-write.
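
To illustrate the splitting described above, here is a minimal user-space
sketch. It is not the kernel's ext4_split_extent(): struct toy_extent and
toy_split() are hypothetical names. Moving a sub-range out of an extent
leaves up to three pieces: an untouched head, the moved middle (which
receives newly allocated physical blocks), and an untouched tail.

```c
#include <assert.h>
#include <stddef.h>

struct toy_extent {
    unsigned lblk;  /* first logical block */
    unsigned pblk;  /* first physical block */
    unsigned len;   /* number of blocks */
};

/*
 * Split `ex` around the logical range [lblk, lblk + len), remapping that
 * range to `new_pblk`. Writes up to three extents to `out` and returns
 * how many were written, or -1 if the range does not lie inside `ex`.
 */
static int toy_split(const struct toy_extent *ex, unsigned lblk,
                     unsigned len, unsigned new_pblk,
                     struct toy_extent out[3])
{
    assert(ex != NULL && out != NULL);
    if (len == 0 || lblk < ex->lblk)
        return -1;

    unsigned off = lblk - ex->lblk;  /* offset of the range within ex */
    if (off >= ex->len || len > ex->len - off)
        return -1;

    int n = 0;
    if (off > 0)  /* head keeps its old physical blocks */
        out[n++] = (struct toy_extent){ ex->lblk, ex->pblk, off };

    /* middle: old blocks go to the snapshot, the file gets new_pblk */
    out[n++] = (struct toy_extent){ lblk, new_pblk, len };

    if (off + len < ex->len)  /* tail keeps its old physical blocks */
        out[n++] = (struct toy_extent){ lblk + len, ex->pblk + off + len,
                                        ex->len - off - len };
    return n;
}
```

When the moved range covers the whole extent, only one piece results and
no additional fragmentation is introduced.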


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.h | 2 -
fs/ext4/extents.c | 151 +++++++++++++++++++++++++++++++++++++++++++++------
fs/ext4/inode.c | 3 +-
3 files changed, 136 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 46dc1ce..1dfd439 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -369,8 +369,6 @@ static inline int ext4_snapshot_should_move_data(struct inode *inode)
return 0;
if (EXT4_JOURNAL(inode) == NULL)
return 0;
- if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- return 0;
/* when a data block is journaled, it is already COWed as metadata */
if (ext4_should_journal_data(inode))
return 0;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6f0a711..234a043 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1260,11 +1260,10 @@ static int ext4_ext_search_left(struct inode *inode,
return 0;
}

- if (unlikely(*logical < (le32_to_cpu(ex->ee_block) + ee_len))) {
- EXT4_ERROR_INODE(inode,
- "logical %d < ee_block %d + ee_len %d!",
- *logical, le32_to_cpu(ex->ee_block), ee_len);
- return -EIO;
+ if (*logical < (le32_to_cpu(ex->ee_block) + ee_len)) {
+ *logical -= 1;
+ *phys = ext4_ext_pblock(ex) + *logical;
+ return 0;
}

*logical = le32_to_cpu(ex->ee_block) + ee_len - 1;
@@ -1328,11 +1327,10 @@ static int ext4_ext_search_right(struct inode *inode,
return 0;
}

- if (unlikely(*logical < (le32_to_cpu(ex->ee_block) + ee_len))) {
- EXT4_ERROR_INODE(inode,
- "logical %d < ee_block %d + ee_len %d!",
- *logical, le32_to_cpu(ex->ee_block), ee_len);
- return -EIO;
+ if (*logical < (le32_to_cpu(ex->ee_block) + ee_len)) {
+ *logical += 1;
+ *phys = ext4_ext_pblock(ex) + *logical;
+ return 0;
}

if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
@@ -3139,6 +3137,63 @@ out2:
}

/*
+ * Move oldblocks to snapshot and newblocks to the file.
+ */
+static int ext4_ext_move_to_snapshot(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map,
+ struct ext4_ext_path *path,
+ ext4_fsblk_t oldblock,
+ ext4_fsblk_t newblock)
+{
+ struct ext4_extent *ex;
+ int err, depth, len;
+
+ len = map->m_len;
+ err = ext4_snapshot_get_move_access(handle, inode,
+ oldblock, &map->m_len, 1);
+ if (err <= 0 || map->m_len != len) {
+ /* failed to move to snapshot - abort! */
+ err = err ? : -EIO;
+ ext4_journal_abort_handle(__func__, __LINE__,
+ "ext4_snapshot_get_move_access", NULL,
+ handle, err);
+ } else {
+ /*
+ * Move to snapshot successfully.
+ */
+ err = ext4_split_extent(handle, inode, path, map, 0,
+ EXT4_GET_BLOCKS_PRE_IO);
+ if (err < 0)
+ goto out;
+
+ /* extent tree may be changed. */
+ depth = ext_depth(inode);
+ ext4_ext_drop_refs(path);
+ path = ext4_ext_find_extent(inode, map->m_lblk, path);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ goto out;
+ }
+
+ /* just verify splitting. */
+ ex = path[depth].p_ext;
+ BUG_ON(le32_to_cpu(ex->ee_block) != map->m_lblk ||
+ ext4_ext_get_actual_len(ex) != map->m_len);
+
+ err = ext4_ext_get_access(handle, inode, path + depth);
+ if (!err) {
+ /* splice new blocks to the inode*/
+ ext4_ext_store_pblock(ex, newblock);
+ ext4_ext_try_to_merge(inode, path, ex);
+ err = ext4_ext_dirty(handle, inode,
+ path + depth);
+ }
+ }
+
+out:
+ return err;
+}
+/*
* Block allocation/map/preallocation routine for extents based files
*
*
@@ -3160,7 +3215,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
struct ext4_ext_path *path = NULL;
- struct ext4_extent newex, *ex;
+ struct ext4_extent newex, *ex = NULL;
+ ext4_fsblk_t oldblock = 0;
ext4_fsblk_t newblock = 0;
int err = 0, depth, ret;
unsigned int allocated = 0;
@@ -3190,7 +3246,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* number of remaining blocks in the extent */
allocated = ext4_ext_get_actual_len(&newex) -
(map->m_lblk - le32_to_cpu(newex.ee_block));
- goto out;
+ goto found;
}
}

@@ -3241,7 +3297,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (!ext4_ext_is_uninitialized(ex)) {
ext4_ext_put_in_cache(inode, ee_block,
ee_len, ee_start);
- goto out;
+ goto found;
}
ret = ext4_ext_handle_uninitialized_extents(handle,
inode, map, path, flags, allocated,
@@ -3262,6 +3318,51 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
goto out2;
}
+
+ /*
+ * two cases:
+ * 1. the requested block is found.
+ * a. If EXT4_GET_BLOCKS_CREATE is not set, we will test
+ * if MOW is needed.
+ * b. If EXT4_GET_BLOCKS_CREATE is set, MOW will be done
+ * if MOW is needed.
+ *
+ * 2. the requested block is not found, EXT4_GET_BLOCKS_CREATE
+ * must be set and MOW must not be needed.
+ */
+found:
+ if (newblock && (flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE)) {
+ BUG_ON(!ext4_snapshot_should_move_data(inode));
+ /*
+ * Should move 1 block to snapshot?
+ */
+ allocated = min(map->m_len, allocated);
+ err = ext4_snapshot_get_move_access(handle, inode, newblock,
+ &allocated, 0);
+ map->m_len = allocated;
+ if (err > 0) {
+ map->m_flags |= EXT4_MAP_REMAP;
+ err = 0;
+ oldblock = newblock;
+ } else if (err < 0)
+ goto out2;
+ }
+
+ if (!(flags & EXT4_GET_BLOCKS_CREATE))
+ goto out;
+
+ map->m_flags &= ~EXT4_MAP_REMAP;
+ if ((path == NULL) && (flags & EXT4_GET_BLOCKS_CREATE)) {
+ /* find extent for this block */
+ path = ext4_ext_find_extent(inode, map->m_lblk, NULL);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ path = NULL;
+ goto out2;
+ }
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+ }
/*
* Okay, we need to do block allocation.
*/
@@ -3271,7 +3372,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
err = ext4_ext_search_left(inode, path, &ar.lleft, &ar.pleft);
if (err)
goto out2;
- ar.lright = map->m_lblk;
+ ar.lright = map->m_lblk + allocated;
err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright);
if (err)
goto out2;
@@ -3292,7 +3393,11 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* Check if we can really insert (m_lblk)::(m_lblk + m_len) extent */
newex.ee_block = cpu_to_le32(map->m_lblk);
newex.ee_len = cpu_to_le16(map->m_len);
- err = ext4_ext_check_overlap(inode, &newex, path);
+ if (oldblock) {
+ /* Overlap checking is not needed for MOW case. */
+ err = 0;
+ } else
+ err = ext4_ext_check_overlap(inode, &newex, path);
if (err)
allocated = ext4_ext_get_actual_len(&newex);
else
@@ -3343,7 +3448,14 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (err)
goto out2;

- err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
+ if (oldblock) {
+ map->m_len = ar.len;
+ BUG_ON(!(flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE));
+ err = ext4_ext_move_to_snapshot(handle, inode, map, path,
+ oldblock, newblock);
+ } else
+ err = ext4_ext_insert_extent(handle, inode,
+ path, &newex, flags);
if (err) {
/* free data blocks we just allocated */
/* not a good idea to call discard here directly,
@@ -3372,7 +3484,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an uninitialized extent.
*/
- if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
+ if (IS_COWING(handle)) {
+ /*
+ * snapshot does not support fdatasync and fsync
+ * and there is no need to cache extent
+ */
+ } else if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
ext4_ext_put_in_cache(inode, map->m_lblk, allocated, newblock);
ext4_update_inode_fsync_trans(handle, inode, 1);
} else
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1cb94d2..1f1ba2b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1360,7 +1360,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
*/
down_read((&EXT4_I(inode)->i_data_sem));
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
- retval = ext4_ext_map_blocks(handle, inode, map, 0);
+ retval = ext4_ext_map_blocks(handle, inode, map,
+ flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
} else {
retval = ext4_ind_map_blocks(handle, inode, map,
flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
--
1.7.4.1


2011-06-07 15:09:19

by Amir G.

Subject: [PATCH v1 09/36] ext4: snapshot file

From: Amir Goldstein <[email protected]>

Ext4 snapshot implementation as a file inside the file system.
Snapshot files are marked with the snapfile flag and have special
read-only address space ops.
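
The flag arithmetic in the ext4.h hunk below can be checked with a small
stand-alone fragment. The constant values are copied from the patch; the
OLD_/NEW_ names are labels made up here for the mask before and after the
change, and toy_get_flags() is a hypothetical user-space analogue of the
EXT4_INODE_FLAGS_FNS() accessor that works on a plain unsigned long
instead of an inode.

```c
#include <assert.h>

#define EXT4_SNAPFILE_FL     0x01000000UL /* snapshot file (from the patch) */
#define EXT4_INODE_SNAPFILE  24           /* bit number of the same flag */
#define OLD_FL_USER_VISIBLE  0x004BDFFFUL /* mask before the patch */
#define NEW_FL_USER_VISIBLE  0x014BDFFFUL /* mask after the patch */

/* Extract `count` bits starting at bit `offset` from `word`. */
static unsigned long toy_get_flags(unsigned long word, int offset, int count)
{
    /* keep the shift well defined even where long is 32 bits wide */
    assert(offset >= 0 && count > 0 && count < 32 && offset + count <= 32);
    return (word >> offset) & ((1UL << count) - 1);
}
```

The new user-visible mask differs from the old one only by the snapfile
bit, and the bit number in the enum agrees with the flag value, which is
what CHECK_FLAG_VALUE(SNAPFILE) verifies in the patch.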


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 70 ++++++++++++++-
fs/ext4/ext4_jbd2.h | 2 +
fs/ext4/ialloc.c | 8 ++-
fs/ext4/inode.c | 29 ++++++
fs/ext4/snapshot.h | 106 ++++++++++++++++++++++
fs/ext4/snapshot_ctl.c | 227 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/super.c | 9 ++
7 files changed, 446 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a5bc3ab..7f96ba5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -348,17 +348,23 @@ struct flex_groups {
#define EXT4_EXTENTS_FL 0x00080000 /* Inode uses extents */
#define EXT4_EA_INODE_FL 0x00200000 /* Inode used for large EA */
#define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
+/* snapshot persistent flags */
+#define EXT4_SNAPFILE_FL 0x01000000 /* snapshot file */
+#define EXT4_SNAPFILE_DELETED_FL 0x04000000 /* snapshot is deleted */
+#define EXT4_SNAPFILE_SHRUNK_FL 0x08000000 /* snapshot was shrunk */
+/* end of snapshot flags */
#define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */

-#define EXT4_FL_USER_VISIBLE 0x004BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE 0x004B80FF /* User modifiable flags */
+
+#define EXT4_FL_USER_VISIBLE 0x014BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE 0x014B80FF /* User modifiable flags */

/* Flags that should be inherited by new inodes from their parent. */
#define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
EXT4_SYNC_FL | EXT4_IMMUTABLE_FL | EXT4_APPEND_FL |\
EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
- EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
+ EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL | EXT4_SNAPFILE_FL)

/* Flags that are appropriate for regular files (all but dir-specific ones). */
#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
@@ -405,6 +411,9 @@ enum {
EXT4_INODE_EXTENTS = 19, /* Inode uses extents */
EXT4_INODE_EA_INODE = 21, /* Inode used for large EA */
EXT4_INODE_EOFBLOCKS = 22, /* Blocks allocated beyond EOF */
+ EXT4_INODE_SNAPFILE = 24, /* Snapshot file/dir */
+ EXT4_INODE_SNAPFILE_DELETED = 26, /* Snapshot is deleted */
+ EXT4_INODE_SNAPFILE_SHRUNK = 27, /* Snapshot was shrunk */
EXT4_INODE_RESERVED = 31, /* reserved for ext4 lib */
};

@@ -451,6 +460,9 @@ static inline void ext4_check_flag_values(void)
CHECK_FLAG_VALUE(EXTENTS);
CHECK_FLAG_VALUE(EA_INODE);
CHECK_FLAG_VALUE(EOFBLOCKS);
+ CHECK_FLAG_VALUE(SNAPFILE);
+ CHECK_FLAG_VALUE(SNAPFILE_DELETED);
+ CHECK_FLAG_VALUE(SNAPFILE_SHRUNK);
CHECK_FLAG_VALUE(RESERVED);
}

@@ -790,6 +802,14 @@ struct ext4_inode_info {
struct list_head i_orphan; /* unlinked but open inodes */

/*
+ * In-memory snapshot list overrides i_orphan to link snapshot inodes,
+ * but unlike the real orphan list, the next snapshot inode number
+ * is stored in i_next_snapshot_ino and not in i_dtime
+ */
+#define i_snaplist i_orphan
+ __u32 i_next_snapshot_ino;
+
+ /*
* i_disksize keeps track of what the inode size is ON DISK, not
* in memory. During truncate, i_size is set to the new size by
* the VFS prior to calling ext4_truncate(), but the filesystem won't
@@ -1145,6 +1165,8 @@ struct ext4_sb_info {
u32 s_max_batch_time;
u32 s_min_batch_time;
struct block_device *journal_bdev;
+ struct mutex s_snapshot_mutex; /* protects 2 fields below: */
+ struct inode *s_active_snapshot; /* [ s_snapshot_mutex ] */
#ifdef CONFIG_JBD2_DEBUG
struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */
wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */
@@ -1261,6 +1283,24 @@ enum {
EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_DELALLOC_RESERVED, /* blks already reserved for delalloc */
+ EXT4_STATE_LAST
+};
+
+/*
+ * Snapshot dynamic state flags (starting at offset EXT4_STATE_LAST)
+ * These flags are read by GETSNAPFLAGS ioctl and interpreted by the lssnap
+ * utility. Do not change these values.
+ */
+enum {
+ EXT4_SNAPSTATE_LIST = 0, /* snapshot is on list (S) */
+ EXT4_SNAPSTATE_ENABLED = 1, /* snapshot is enabled (n) */
+ EXT4_SNAPSTATE_ACTIVE = 2, /* snapshot is active (a) */
+ EXT4_SNAPSTATE_INUSE = 3, /* snapshot is in-use (p) */
+ EXT4_SNAPSTATE_DELETED = 4, /* snapshot is deleted (s) */
+ EXT4_SNAPSTATE_SHRUNK = 5, /* snapshot was shrunk (h) */
+ EXT4_SNAPSTATE_OPEN = 6, /* snapshot is mounted (o) */
+ EXT4_SNAPSTATE_TAGGED = 7, /* snapshot is tagged (t) */
+ EXT4_SNAPSTATE_LAST
};

#define EXT4_INODE_BIT_FNS(name, field, offset) \
@@ -1277,9 +1317,19 @@ static inline void ext4_clear_inode_##name(struct inode *inode, int bit) \
clear_bit(bit + (offset), &EXT4_I(inode)->i_##field); \
}

+#define EXT4_INODE_FLAGS_FNS(name, field, offset, count) \
+static inline int ext4_get_##name##_flags(struct inode *inode) \
+{ \
+ return (EXT4_I(inode)->i_##field >> (offset)) & \
+ ((1UL << (count)) - 1); \
+} \
+
EXT4_INODE_BIT_FNS(flag, flags, 0)
#if (BITS_PER_LONG < 64)
EXT4_INODE_BIT_FNS(state, state_flags, 0)
+EXT4_INODE_BIT_FNS(snapstate, state_flags, EXT4_STATE_LAST)
+EXT4_INODE_FLAGS_FNS(snapstate, state_flags, EXT4_STATE_LAST, \
+ EXT4_SNAPSTATE_LAST)

static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
{
@@ -1287,6 +1337,9 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
}
#else
EXT4_INODE_BIT_FNS(state, flags, 32)
+EXT4_INODE_BIT_FNS(snapstate, flags, 32 + EXT4_STATE_LAST)
+EXT4_INODE_FLAGS_FNS(snapstate, flags, 32 + EXT4_STATE_LAST, \
+ EXT4_SNAPSTATE_LAST)

static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
{
@@ -1301,6 +1354,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#endif

#define NEXT_ORPHAN(inode) EXT4_I(inode)->i_dtime
+#define NEXT_SNAPSHOT(inode) (EXT4_I(inode)->i_next_snapshot_ino)

/*
* Codes for operating systems
@@ -1783,6 +1837,10 @@ extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
+
+/* snapshot_inode.c */
+extern int ext4_snapshot_readpage(struct file *file, struct page *page);
+
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
@@ -2006,6 +2064,12 @@ struct ext4_group_info {
void *bb_bitmap;
#endif
struct rw_semaphore alloc_sem;
+ /*
+ * bg_cow_bitmap is reset to zero on mount time and on every snapshot
+ * take and initialized lazily on first block group write access.
+ * bg_cow_bitmap is protected by sb_bgl_lock().
+ */
+ unsigned long bg_cow_bitmap; /* COW bitmap cache */
ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
* regions, index is order.
* bb_counters[3] = 5 means
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 1dfd439..4d57fcb 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -369,6 +369,8 @@ static inline int ext4_snapshot_should_move_data(struct inode *inode)
return 0;
if (EXT4_JOURNAL(inode) == NULL)
return 0;
+ if (ext4_snapshot_excluded(inode))
+ return 0;
/* when a data block is journaled, it is already COWed as metadata */
if (ext4_should_journal_data(inode))
return 0;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 40ca5bc..b0e5749 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1049,8 +1049,12 @@ got:
goto fail_free_drop;

if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
- /* set extent flag only for directory, file and normal symlink*/
- if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) {
+ /*
+ * Set extent flag only for non-snapshot file, directory
+ * and normal symlink
+ */
+ if ((S_ISREG(mode) && !ext4_snapshot_file(inode)) ||
+ S_ISDIR(mode) || S_ISLNK(mode)) {
ext4_set_inode_flag(inode, EXT4_INODE_EXTENTS);
ext4_ext_tree_init(handle, inode);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1f1ba2b..0468ef2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4139,9 +4139,38 @@ static const struct address_space_operations ext4_da_aops = {
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
+static int ext4_no_writepage(struct page *page,
+ struct writeback_control *wbc)
+{
+ unlock_page(page);
+ return -EIO;
+}
+
+/*
+ * Snapshot file page operations:
+ * always readpage (by page) with buffer tracked read.
+ * user cannot writepage or direct_IO to a snapshot file.
+ *
+ * snapshot file pages are written to disk after a COW operation in "ordered"
+ * mode and are never changed after that again, so there is no data corruption
+ * risk when using "ordered" mode on snapshot files.
+ * some snapshot data pages are written to disk by sync_dirty_buffer(), namely
+ * the snapshot COW bitmaps and a few initial blocks copied on snapshot_take().
+ */
+static const struct address_space_operations ext4_snapfile_aops = {
+ .readpage = ext4_readpage,
+ .readpages = ext4_readpages,
+ .writepage = ext4_no_writepage,
+ .bmap = ext4_bmap,
+ .invalidatepage = ext4_invalidatepage,
+ .releasepage = ext4_releasepage,
+};

void ext4_set_aops(struct inode *inode)
{
+ if (ext4_snapshot_file(inode))
+ inode->i_mapping->a_ops = &ext4_snapfile_aops;
+ else
if (ext4_should_order_data(inode) &&
test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 71edd71..19e3416 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -288,6 +288,14 @@ static inline int ext4_snapshot_get_delete_access(handle_t *handle,

/* snapshot_ctl.c */

+/*
+ * Snapshot constructor/destructor
+ */
+extern int ext4_snapshot_load(struct super_block *sb,
+ struct ext4_super_block *es, int read_only);
+extern int ext4_snapshot_update(struct super_block *sb, int cleanup,
+ int read_only);
+extern void ext4_snapshot_destroy(struct super_block *sb);

static inline int init_ext4_snapshot(void)
{
@@ -299,7 +307,105 @@ static inline void exit_ext4_snapshot(void)
}


+/* tests if @inode is a snapshot file */
+static inline int ext4_snapshot_file(struct inode *inode)
+{
+ if (!S_ISREG(inode->i_mode))
+ /* a snapshots directory */
+ return 0;
+ return ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE);
+}
+
+/* tests if @inode is on the on-disk snapshot list */
+static inline int ext4_snapshot_list(struct inode *inode)
+{
+ return ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_LIST);
+}
+
+/*
+ * ext4_snapshot_excluded():
+ * Checks if the file should be excluded from snapshot.
+ *
+ * Returns 0 for normal file.
+ * Returns > 0 for 'excluded' file.
+ * Returns < 0 for 'ignored' file (stronger than 'excluded').
+ *
+ * Excluded and ignored file blocks are not moved to snapshot.
+ * Ignored file metadata blocks are not COWed to snapshot.
+ * Excluded file metadata blocks are zeroed in the snapshot file.
+ * XXX: Excluded files code is experimental,
+ * but ignored files code isn't.
+ */
+static inline int ext4_snapshot_excluded(struct inode *inode)
+{
+ /* directory blocks and global filesystem blocks cannot be 'excluded' */
+ if (!inode || !S_ISREG(inode->i_mode))
+ return 0;
+ /* snapshot files are 'ignored' */
+ if (ext4_snapshot_file(inode))
+ return -1;
+ return 0;
+}
+
+/* tests if the file system has an active snapshot */
+static inline int ext4_snapshot_active(struct ext4_sb_info *sbi)
+{
+ if (unlikely((sbi)->s_active_snapshot))
+ return 1;
+ return 0;
+}

+/*
+ * tests if the file system has an active snapshot and returns its inode.
+ * active snapshot is only changed under journal_lock_updates(),
+ * so it is safe to use the returned inode during a transaction.
+ */
+static inline struct inode *ext4_snapshot_has_active(struct super_block *sb)
+{
+ return EXT4_SB(sb)->s_active_snapshot;
+}
+
+/*
+ * tests if @inode is the current active snapshot.
+ * active snapshot is only changed under journal_lock_updates(),
+ * so the test result never changes during a transaction.
+ */
+static inline int ext4_snapshot_is_active(struct inode *inode)
+{
+ return (inode == EXT4_SB(inode->i_sb)->s_active_snapshot);
+}
+
+
+#define SNAPSHOT_TRANSACTION_ID(sb) \
+ ((EXT4_I(EXT4_SB(sb)->s_active_snapshot))->i_datasync_tid)
+
+/**
+ * set transaction ID for active snapshot
+ *
+ * this function is called after freeze_super() returns but before
+ * calling unfreeze_super() to record the tid at time when a snapshot is
+ * taken.
+ */
+static inline void ext4_snapshot_set_tid(struct super_block *sb)
+{
+ BUG_ON(!ext4_snapshot_active(EXT4_SB(sb)));
+ SNAPSHOT_TRANSACTION_ID(sb) =
+ EXT4_SB(sb)->s_journal->j_transaction_sequence;
+}
+
+/* get transaction ID of active snapshot */
+static inline tid_t ext4_snapshot_get_tid(struct super_block *sb)
+{
+ BUG_ON(!ext4_snapshot_active(EXT4_SB(sb)));
+ return SNAPSHOT_TRANSACTION_ID(sb);
+}
+
+/* test if there is a mow that is in or before current transaction */
+static inline int ext4_snapshot_mow_in_tid(struct inode *inode)
+{
+ return tid_geq(EXT4_I(inode)->i_datasync_tid,
+ ext4_snapshot_get_tid(inode->i_sb));
+}


#else /* CONFIG_EXT4_FS_SNAPSHOT */
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 201ef20..1abda77 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -15,8 +15,235 @@
#include <linux/statfs.h>
#include "ext4_jbd2.h"
#include "snapshot.h"
+
+/*
+ * General snapshot locking semantics:
+ *
+ * The snapshot_mutex:
+ * -------------------
+ * The majority of the code in the snapshot_{ctl,debug}.c files is called from
+ * very few entry points in the code:
+ * 1. {init,exit}_ext4_fs() - calls {init,exit}_ext4_snapshot() under BGL.
+ * 2. ext4_{fill,put}_super() - calls ext4_snapshot_{load,destroy}() under
+ * VFS sb_lock, while f/s is not accessible to users.
+ * 3. ext4_ioctl() - only place that takes snapshot_mutex (after i_mutex)
+ * and only entry point to snapshot control functions below.
+ *
+ * From the rules above it follows that all fields accessed inside
+ * snapshot_{ctl,debug}.c are protected by one of the following:
+ * - snapshot_mutex during snapshot control operations.
+ * - VFS sb_lock during f/s mount/umount time.
+ * - Big kernel lock during module init time.
+ * Needless to say, either of the above is sufficient.
+ * So if a field is accessed only inside snapshot_*.c it should be safe.
+ *
+ * The transaction handle:
+ * -----------------------
+ * Snapshot COW code (in snapshot.c) is called from block access hooks during a
+ * transaction (with a transaction handle). This guarantees safe read access to
+ * s_active_snapshot, without taking snapshot_mutex, because the latter is only
+ * changed under journal_lock_updates() (while no transaction handles exist).
+ *
+ * The transaction handle is a per task struct, so there is no need to protect
+ * fields on that struct (i.e. h_cowing, h_cow_*).
+ */
+
+/*
+ * ext4_snapshot_set_active - set the current active snapshot
+ * First, if current active snapshot exists, it is deactivated.
+ * Then, if @inode is not NULL, the active snapshot is set to @inode.
+ *
+ * Called from ext4_snapshot_take() and ext4_snapshot_update() under
+ * journal_lock_updates() and snapshot_mutex.
+ * Called from ext4_snapshot_{load,destroy}() under sb_lock.
+ *
+ * Returns 0 on success and <0 on error.
+ */
+static int ext4_snapshot_set_active(struct super_block *sb,
+ struct inode *inode)
+{
+ struct inode *old = EXT4_SB(sb)->s_active_snapshot;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (old == inode)
+ return 0;
+
+ /* add new active snapshot reference */
+ if (inode && !igrab(inode))
+ return -EIO;
+
+ /* point of no return - replace old with new snapshot */
+ if (old) {
+ ext4_clear_inode_snapstate(old, EXT4_SNAPSTATE_ACTIVE);
+ snapshot_debug(1, "snapshot (%u) deactivated\n",
+ old->i_generation);
+ /* remove old active snapshot reference */
+ iput(old);
+ }
+ if (inode) {
+ /*
+ * Set up the jbd2_inode - we are about to file_inode soon...
+ */
+ if (!ei->jinode) {
+ struct jbd2_inode *jinode;
+ jinode = jbd2_alloc_inode(GFP_KERNEL);
+
+ spin_lock(&inode->i_lock);
+ if (!ei->jinode) {
+ if (!jinode) {
+ spin_unlock(&inode->i_lock);
+ return -ENOMEM;
+ }
+ ei->jinode = jinode;
+ jbd2_journal_init_jbd_inode(ei->jinode, inode);
+ jinode = NULL;
+ }
+ spin_unlock(&inode->i_lock);
+ if (unlikely(jinode != NULL))
+ jbd2_free_inode(jinode);
+ }
+ /* ACTIVE implies LIST */
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_LIST);
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_ACTIVE);
+ snapshot_debug(1, "snapshot (%u) activated\n",
+ inode->i_generation);
+ }
+ EXT4_SB(sb)->s_active_snapshot = inode;
+
+ return 0;
+}
#define ext4_snapshot_reset_bitmap_cache(sb, init) 0

/*
* Snapshot constructor/destructor
*/
+/*
+ * ext4_snapshot_load - load the on-disk snapshot list to memory.
+ * Start with last (or active) snapshot and continue to older snapshots.
+ * If snapshot load fails before active snapshot, force read-only mount.
+ * If snapshot load fails after active snapshot, allow read-write mount.
+ * Called from ext4_fill_super() under sb_lock during mount time.
+ *
+ * Return values:
+ * = 0 - on-disk snapshot list is empty or active snapshot loaded
+ * < 0 - error loading active snapshot
+ */
+int ext4_snapshot_load(struct super_block *sb, struct ext4_super_block *es,
+ int read_only)
+{
+ __u32 active_ino = le32_to_cpu(es->s_snapshot_inum);
+ __u32 load_ino = le32_to_cpu(es->s_snapshot_list);
+ int err = 0, num = 0, snapshot_id = 0;
+ int has_active = 0;
+
+
+ if (!load_ino && active_ino) {
+ /* snapshots list is empty and active snapshot exists */
+ if (!read_only)
+ /* reset list head to active snapshot */
+ es->s_snapshot_list = es->s_snapshot_inum;
+ /* try to load active snapshot */
+ load_ino = le32_to_cpu(es->s_snapshot_inum);
+ }
+
+ while (load_ino) {
+ struct inode *inode;
+
+ inode = ext4_orphan_get(sb, load_ino);
+ if (IS_ERR(inode)) {
+ err = PTR_ERR(inode);
+ } else if (!ext4_snapshot_file(inode)) {
+ iput(inode);
+ err = -EIO;
+ }
+
+ if (err && num == 0 && load_ino != active_ino) {
+ /* failed to load last non-active snapshot */
+ if (!read_only)
+ /* reset list head to active snapshot */
+ es->s_snapshot_list = es->s_snapshot_inum;
+ snapshot_debug(1, "warning: failed to load "
+ "last snapshot (%u) - trying to load "
+ "active snapshot (%u).\n",
+ load_ino, active_ino);
+ /* try to load active snapshot */
+ load_ino = active_ino;
+ err = 0;
+ continue;
+ }
+
+ if (err)
+ break;
+
+ snapshot_id = inode->i_generation;
+ snapshot_debug(1, "snapshot (%d) loaded\n",
+ snapshot_id);
+ num++;
+
+ if (!has_active && load_ino == active_ino) {
+ /* active snapshot was loaded */
+ err = ext4_snapshot_set_active(sb, inode);
+ if (err)
+ break;
+ has_active = 1;
+ }
+
+ iput(inode);
+ break;
+ }
+
+ if (err) {
+ /* failed to load active snapshot */
+ snapshot_debug(1, "warning: failed to load "
+ "snapshot (ino=%u) - "
+ "forcing read-only mount!\n",
+ load_ino);
+ /* force read-only mount */
+ return read_only ? 0 : err;
+ }
+
+ if (num > 0) {
+ err = ext4_snapshot_update(sb, 0, read_only);
+ snapshot_debug(1, "%d snapshots loaded\n", num);
+ }
+ return err;
+}
+
+/*
+ * ext4_snapshot_destroy() releases the in-memory snapshot list
+ * Called from ext4_put_super() under sb_lock during umount time.
+ * This function cannot fail.
+ */
+void ext4_snapshot_destroy(struct super_block *sb)
+{
+ /* deactivate in-memory active snapshot - cannot fail */
+ (void) ext4_snapshot_set_active(sb, NULL);
+}
+
+/*
+ * ext4_snapshot_update - iterate snapshot list and update snapshots status.
+ * @sb: handle to file system super block.
+ * @cleanup: if true, shrink/merge/cleanup all snapshots marked for deletion.
+ * @read_only: if true, don't remove snapshot after failed take.
+ *
+ * Called from ext4_ioctl() under snapshot_mutex.
+ * Called from snapshot_load() under sb_lock with @cleanup=0.
+ * Returns 0 on success and <0 on error.
+ */
+int ext4_snapshot_update(struct super_block *sb, int cleanup, int read_only)
+{
+ struct inode *active_snapshot = ext4_snapshot_has_active(sb);
+ int err = 0;
+
+ BUG_ON(read_only && cleanup);
+ if (active_snapshot) {
+ /* ACTIVE implies LIST */
+ ext4_set_inode_snapstate(active_snapshot,
+ EXT4_SNAPSTATE_LIST);
+ ext4_set_inode_snapstate(active_snapshot,
+ EXT4_SNAPSTATE_ACTIVE);
+ }
+
+
+ return err;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 61e9173..7655010 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -761,6 +761,8 @@ static void ext4_put_super(struct super_block *sb)
destroy_workqueue(sbi->dio_unwritten_wq);

lock_super(sb);
+ if (EXT4_SNAPSHOTS(sb))
+ ext4_snapshot_destroy(sb);
if (sb->s_dirt)
ext4_commit_super(sb, 1);

@@ -3521,6 +3523,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

sb->s_root = NULL;

+ mutex_init(&sbi->s_snapshot_mutex);
+ sbi->s_active_snapshot = NULL;
+
needs_recovery = (es->s_last_orphan != 0 ||
EXT4_HAS_INCOMPAT_FEATURE(sb,
EXT4_FEATURE_INCOMPAT_RECOVER));
@@ -3727,6 +3732,10 @@ no_journal:
goto failed_mount4;
};

+ if (EXT4_SNAPSHOTS(sb) &&
+ ext4_snapshot_load(sb, es, sb->s_flags & MS_RDONLY))
+ /* XXX: how can we fail and force read-only at this point? */
+ ext4_error(sb, "load snapshot failed\n");
EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
ext4_orphan_cleanup(sb, es);
EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
--
1.7.4.1
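
The snapstate accessor generated by EXT4_INODE_FLAGS_FNS() above is a plain
shift-and-mask over the inode state word. The following is a minimal userspace
sketch of that extraction, not the kernel code itself. The bit positions and
the one-letter codes are taken from the EXT4_SNAPSTATE_* enum comments in this
patch; the `offset` parameter stands in for EXT4_STATE_LAST (whose value is not
shown here), and `snapstate_to_string()` is a hypothetical helper that only
illustrates how a tool such as lssnap might render the flags.

```c
/* Bit positions copied from the EXT4_SNAPSTATE_* enum in this patch. */
enum {
	SNAPSTATE_LIST = 0,	/* (S) */
	SNAPSTATE_ENABLED = 1,	/* (n) */
	SNAPSTATE_ACTIVE = 2,	/* (a) */
	SNAPSTATE_INUSE = 3,	/* (p) */
	SNAPSTATE_DELETED = 4,	/* (s) */
	SNAPSTATE_SHRUNK = 5,	/* (h) */
	SNAPSTATE_OPEN = 6,	/* (o) */
	SNAPSTATE_TAGGED = 7,	/* (t) */
	SNAPSTATE_LAST
};

/*
 * Model of ext4_get_snapstate_flags(): shift the state word right by
 * 'offset' bits and keep only the low SNAPSTATE_LAST bits.
 */
static unsigned long get_snapstate_flags(unsigned long state_flags,
					 unsigned int offset)
{
	return (state_flags >> offset) & ((1UL << SNAPSTATE_LAST) - 1);
}

/*
 * Hypothetical helper (not part of the patch): render the extracted flags
 * as one letter per set bit and '-' per clear bit.
 */
static void snapstate_to_string(unsigned long flags,
				char out[SNAPSTATE_LAST + 1])
{
	static const char letters[SNAPSTATE_LAST] = {
		'S', 'n', 'a', 'p', 's', 'h', 'o', 't'
	};
	int i;

	for (i = 0; i < SNAPSTATE_LAST; i++)
		out[i] = (flags & (1UL << i)) ? letters[i] : '-';
	out[SNAPSTATE_LAST] = '\0';
}
```

Keeping the snapstate bits contiguous directly above the regular state bits is
what lets a single shift and mask return all of them at once.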


2011-06-07 15:09:25

by Amir G.

Subject: [PATCH v1 11/36] ext4: snapshot file - permissions

From: Amir Goldstein <[email protected]>

Enforce snapshot file permissions.
Write, truncate and unlink of snapshot inodes are not allowed.


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/file.c | 7 +++++++
fs/ext4/inode.c | 7 +++++++
fs/ext4/namei.c | 8 ++++++++
3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 60b3b19..f31e58e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -168,6 +168,13 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
struct path path;
char buf[64], *cp;

+ if (ext4_snapshot_file(inode) &&
+ (filp->f_flags & O_ACCMODE) != O_RDONLY)
+ /*
+ * allow only read-only access to snapshot files
+ */
+ return -EPERM;
+
if (unlikely(!(sbi->s_mount_flags & EXT4_MF_MNTDIR_SAMPLED) &&
!(sb->s_flags & MS_RDONLY))) {
sbi->s_mount_flags |= EXT4_MF_MNTDIR_SAMPLED;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f44f7d3..b210b33 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4727,6 +4727,13 @@ void ext4_truncate(struct inode *inode)

trace_ext4_truncate_enter(inode);

+ /* prevent truncate of files on snapshot list */
+ if (ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "snapshot (%u) cannot be truncated!\n",
+ inode->i_generation);
+ return;
+ }
+
if (!ext4_can_truncate(inode))
return;

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 93196b6..41df36f 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2225,6 +2225,14 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
inode->i_ino, inode->i_nlink);
inode->i_nlink = 1;
}
+ /* prevent unlink of files on snapshot list */
+ if (inode->i_nlink == 1 &&
+ ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "snapshot (%u) cannot be unlinked!\n",
+ inode->i_generation);
+ retval = -EPERM;
+ goto end_unlink;
+ }
retval = ext4_delete_entry(handle, dir, de, bh);
if (retval)
goto end_unlink;
--
1.7.4.1
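
The open-time check in ext4_file_open() above reduces to a test on the access
mode bits of the open flags. A minimal userspace sketch of that test follows.
The function name `snapshot_open_check()` and the `is_snapshot_file` parameter
are assumptions made for illustration; in the patch the predicate is
ext4_snapshot_file(inode).

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Sketch of the snapshot open policy: any access mode other than
 * O_RDONLY is refused with -EPERM when the file is a snapshot file.
 * Flags outside O_ACCMODE (O_CREAT, O_TRUNC, ...) are masked off first.
 */
static int snapshot_open_check(int is_snapshot_file, int open_flags)
{
	if (is_snapshot_file && (open_flags & O_ACCMODE) != O_RDONLY)
		return -EPERM;
	return 0;
}
```

Masking with O_ACCMODE before comparing matters because O_RDONLY is usually
defined as 0, so testing individual bits would not distinguish it.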


2011-06-07 15:09:27

by Amir G.

Subject: [PATCH v1 12/36] ext4: snapshot file - store on disk

From: Amir Goldstein <[email protected]>

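A snapshot inode is stored differently in memory and on disk.
When a snapshot inode is loaded or stored, some of its flags and fields
are converted. For example, the on-disk i_disk_version field holds the
inode number of the next snapshot on the list, and the in-memory i_size
is set to 0 (snapshot disabled) on load.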


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 33 +++++++++++++++++++++++++++------
1 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b210b33..33692fd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5183,6 +5183,17 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
*/
for (block = 0; block < EXT4_N_BLOCKS; block++)
ei->i_data[block] = raw_inode->i_block[block];
+ /* snapshot on-disk list is stored in snapshot inode on-disk version */
+ if (ext4_snapshot_file(inode)) {
+ ei->i_next_snapshot_ino =
+ le32_to_cpu(raw_inode->i_disk_version);
+ /*
+ * snapshot volume size is stored in i_disksize.
+ * in-memory i_size of snapshot files is set to 0 (disabled).
+ * enabling a snapshot is setting i_size to i_disksize.
+ */
+ inode->i_size = 0;
+ }
INIT_LIST_HEAD(&ei->i_orphan);

/*
@@ -5447,12 +5458,22 @@ static int ext4_do_update_inode(handle_t *handle,
for (block = 0; block < EXT4_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];

- raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
- if (ei->i_extra_isize) {
- if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
- raw_inode->i_version_hi =
- cpu_to_le32(inode->i_version >> 32);
- raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
+ if (ext4_snapshot_file(inode)) {
+ /*
+ * Snapshot on-disk list overrides snapshot on-disk version.
+ * Snapshot files are not writable and have a fixed version.
+ */
+ raw_inode->i_disk_version =
+ cpu_to_le32(ei->i_next_snapshot_ino);
+ } else {
+ raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+ if (ei->i_extra_isize) {
+ if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
+ raw_inode->i_version_hi =
+ cpu_to_le32(inode->i_version >> 32);
+ raw_inode->i_extra_isize =
+ cpu_to_le16(ei->i_extra_isize);
+ }
}

BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
--
1.7.4.1
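
The conversion done in ext4_iget() and ext4_do_update_inode() above can be
modelled in userspace as follows. This is only a sketch under stated
assumptions: `struct disk_inode` and `struct mem_inode` are simplified
stand-ins for the raw and in-memory ext4 inodes, and the helper names are
hypothetical. What it keeps from the patch is the rule itself: for snapshot
files the little-endian on-disk version field carries the next snapshot inode
number, and i_size is reset to 0 on load.

```c
#include <stdint.h>

/* Simplified stand-in for the on-disk inode: one little-endian field. */
struct disk_inode {
	uint8_t i_disk_version[4];
};

/* Simplified stand-in for the in-memory inode. */
struct mem_inode {
	int is_snapshot;		/* models ext4_snapshot_file() */
	uint32_t next_snapshot_ino;	/* models i_next_snapshot_ino */
	uint64_t i_version;
	uint64_t i_size;
};

static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = v & 0xff;
	p[1] = (v >> 8) & 0xff;
	p[2] = (v >> 16) & 0xff;
	p[3] = (v >> 24) & 0xff;
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Store: snapshot files write the list link instead of the version. */
static void store_inode(const struct mem_inode *mi, struct disk_inode *di)
{
	if (mi->is_snapshot)
		put_le32(di->i_disk_version, mi->next_snapshot_ino);
	else
		put_le32(di->i_disk_version, (uint32_t)mi->i_version);
}

/* Load: snapshot files read the list link back and start out disabled. */
static void load_inode(struct mem_inode *mi, const struct disk_inode *di)
{
	if (mi->is_snapshot) {
		mi->next_snapshot_ino = get_le32(di->i_disk_version);
		mi->i_size = 0;
	}
}
```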


2011-06-07 15:09:22

by Amir G.

Subject: [PATCH v1 10/36] ext4: snapshot file - read through to block device

From: Amir Goldstein <[email protected]>

When a page of the active snapshot is read, ext4_snapshot_get_block()
is called to map the page to a disk block. If the page is not mapped
in the snapshot file, a direct mapping to the block device is returned.


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 3 +-
fs/ext4/snapshot_inode.c | 244 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 245 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0468ef2..f44f7d3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4158,8 +4158,7 @@ static int ext4_no_writepage(struct page *page,
* the snapshot COW bitmaps and a few initial blocks copied on snapshot_take().
*/
static const struct address_space_operations ext4_snapfile_aops = {
- .readpage = ext4_readpage,
- .readpages = ext4_readpages,
+ .readpage = ext4_snapshot_readpage,
.writepage = ext4_no_writepage,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
index 2de017a..74b455d 100644
--- a/fs/ext4/snapshot_inode.c
+++ b/fs/ext4/snapshot_inode.c
@@ -38,5 +38,249 @@

#include <trace/events/ext4.h>
#include "snapshot.h"
+/*
+ * ext4_snapshot_get_block_access() - called from ext4_snapshot_read_through()
+ * on snapshot file access.
+ * return value <0 indicates access not granted
+ * return value 0 indicates snapshot inode read through access
+ * in which case 'prev_snapshot' is pointed to the previous snapshot
+ * on the list or set to NULL to indicate read through to block device.
+ */
+static int ext4_snapshot_get_block_access(struct inode *inode,
+ struct inode **prev_snapshot)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ unsigned long flags = ext4_get_snapstate_flags(inode);
+
+
+ *prev_snapshot = NULL;
+ if (ext4_snapshot_is_active(inode) ||
+ (flags & 1UL<<EXT4_SNAPSTATE_ACTIVE))
+ /* read through from active snapshot to block device */
+ return 0;
+
+ return -EPERM;
+}
+
+#ifdef CONFIG_EXT4_DEBUG
+/*
+ * ext4_snapshot_get_blockdev_access - get read through access to block device.
+ * Sanity test to verify that the read block is allocated and not excluded.
+ * This test has performance penalty and is only called if SNAPTEST_READ
+ * is enabled. An attempt to read through to block device of a non allocated
+ * or excluded block may indicate a corrupted filesystem, corrupted snapshot
+ * or corrupted exclude bitmap. However, it may also be a read-ahead, which
+ * was not explicitly requested by the user, so be sure to disable read-ahead
+ * on block device (blockdev --setra 0 <bdev>) before enabling SNAPTEST_READ.
+ *
+ * Return values:
+ * = 0 - block is allocated and not excluded
+ * < 0 - error (or block is not allocated or excluded)
+ */
+static int ext4_snapshot_get_blockdev_access(struct super_block *sb,
+ struct buffer_head *bh)
+{
+ unsigned long block_group = SNAPSHOT_BLOCK_GROUP(bh->b_blocknr);
+ ext4_grpblk_t bit = SNAPSHOT_BLOCK_GROUP_OFFSET(bh->b_blocknr);
+ struct buffer_head *bitmap_bh;
+ int err = 0;
+
+ if (PageReadahead(bh->b_page))
+ return 0;
+
+ bitmap_bh = ext4_read_block_bitmap(sb, block_group);
+ if (!bitmap_bh)
+ return -EIO;
+
+ if (!ext4_test_bit(bit, bitmap_bh->b_data)) {
+ snapshot_debug(2, "warning: attempt to read through to "
+ "non-allocated block [%d/%lu] - read ahead?\n",
+ bit, block_group);
+ brelse(bitmap_bh);
+ return -EIO;
+ }
+
+ brelse(bitmap_bh);
+ return err;
+}
+#endif
+
+/*
+ * ext4_snapshot_read_through - get snapshot image block.
+ * On read of snapshot file, an unmapped block is a peephole to prev snapshot.
+ * On read of active snapshot, an unmapped block is a peephole to the block
+ * device. On first block write, the peephole is filled forever.
+ */
+static int ext4_snapshot_read_through(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result)
+{
+ int err;
+ struct ext4_map_blocks map;
+ struct inode *prev_snapshot;
+
+ map.m_lblk = iblock;
+ map.m_pblk = 0;
+ map.m_len = bh_result->b_size >> inode->i_blkbits;
+
+ prev_snapshot = NULL;
+ /* request snapshot file read access */
+ err = ext4_snapshot_get_block_access(inode, &prev_snapshot);
+ if (err < 0)
+ return err;
+ err = ext4_map_blocks(NULL, inode, &map, 0);
+ snapshot_debug(4, "ext4_snapshot_read_through(%lld): block = "
+ "(%lld), err = %d\n prev_snapshot = %u",
+ (long long)iblock, map.m_pblk, err,
+ prev_snapshot ? prev_snapshot->i_generation : 0);
+ if (err < 0)
+ return err;
+ if (!err)
+ /* hole in active snapshot - read through to block device */
+ return 0;
+
+ map_bh(bh_result, inode->i_sb, map.m_pblk);
+ bh_result->b_state = (bh_result->b_state & ~EXT4_MAP_FLAGS) |
+ map.m_flags;
+
+ return 0;
+}
+
+/*
+ * Check if @block is a bitmap block of @group.
+ * if @block is found to be a block/inode/exclude bitmap block, the return
+ * value is one of the non-zero values below:
+ */
+#define BLOCK_BITMAP 1
+#define INODE_BITMAP 2
+#define EXCLUDE_BITMAP 3
+
+static int ext4_snapshot_is_group_bitmap(struct super_block *sb,
+ ext4_fsblk_t block, ext4_group_t group)
+{
+ struct ext4_group_desc *gdp;
+ struct ext4_group_info *grp;
+ ext4_fsblk_t bitmap_blk;
+
+ gdp = ext4_get_group_desc(sb, group, NULL);
+ grp = ext4_get_group_info(sb, group);
+ if (!gdp || !grp)
+ return 0;
+
+ bitmap_blk = ext4_block_bitmap(sb, gdp);
+ if (bitmap_blk == block)
+ return BLOCK_BITMAP;
+ bitmap_blk = ext4_inode_bitmap(sb, gdp);
+ if (bitmap_blk == block)
+ return INODE_BITMAP;
+ bitmap_blk = ext4_exclude_bitmap(sb, gdp);
+ if (bitmap_blk == block)
+ return EXCLUDE_BITMAP;
+ return 0;
+}
+
+/*
+ * Check if @block is a bitmap block of any block group.
+ * if @block is found to be a bitmap block, @bitmap_group is set to the
+ * block group described by the bitmap block.
+ */
+static int ext4_snapshot_is_bitmap(struct super_block *sb,
+ ext4_fsblk_t block, ext4_group_t *bitmap_group)
+{
+ ext4_group_t group = SNAPSHOT_BLOCK_GROUP(block);
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
+ int flex_groups = ext4_flex_bg_size(EXT4_SB(sb));
+ int i, is_bitmap = 0;
+
+ /*
+ * When block is in the first block group of a flex group, we need to
+ * check all group desc of the flex group.
+ * The exclude bitmap can potentially be allocated from any group, if
+ * exclude inode was added not on mkfs. The worst case if we fail to
+ * identify a block as an exclude bitmap is that fsck sanity check of
+ * snapshots will fail, because exclude bitmap inside snapshot will
+ * not be all zeros.
+ */
+ if (!EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG) ||
+ (group % flex_groups) != 0)
+ flex_groups = 1;
+
+ for (i = 0; i < flex_groups && group < ngroups; i++, group++) {
+ is_bitmap = ext4_snapshot_is_group_bitmap(sb, block, group);
+ if (is_bitmap)
+ break;
+ }
+ *bitmap_group = group;
+ return is_bitmap;
+}
+
+static int ext4_snapshot_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ ext4_group_t block_group;
+ int err, is_bitmap;
+
+ BUG_ON(create != 0);
+ BUG_ON(buffer_tracked_read(bh_result));
+
+ err = ext4_snapshot_read_through(inode, SNAPSHOT_IBLOCK(iblock),
+ bh_result);
+ snapshot_debug(4, "ext4_snapshot_get_block(%lld): block = (%lld), "
+ "err = %d\n",
+ (long long)iblock, buffer_mapped(bh_result) ?
+ (long long)bh_result->b_blocknr : 0, err);
+ if (err < 0)
+ return err;
+
+ if (!buffer_tracked_read(bh_result))
+ return 0;
+
+ /* Check for read through to block or exclude bitmap block */
+ is_bitmap = ext4_snapshot_is_bitmap(inode->i_sb, bh_result->b_blocknr,
+ &block_group);
+ if (is_bitmap == BLOCK_BITMAP) {
+ /* copy fixed block bitmap directly to page buffer */
+ cancel_buffer_tracked_read(bh_result);
+ /* cancel_buffer_tracked_read() clears mapped flag */
+ set_buffer_mapped(bh_result);
+ snapshot_debug(2, "fixing snapshot block bitmap #%u\n",
+ block_group);
+ /*
+ * XXX: if we return unmapped buffer, the page will be zeroed
+ * but if we return mapped to block device and uptodate buffer
+ * next readpage may read directly from block device without
+ * fixing block bitmap. This only affects fsck of snapshots.
+ */
+ return ext4_snapshot_read_block_bitmap(inode->i_sb,
+ block_group, bh_result);
+ } else if (is_bitmap == EXCLUDE_BITMAP) {
+ /* return unmapped buffer to zero out page */
+ cancel_buffer_tracked_read(bh_result);
+ /* cancel_buffer_tracked_read() clears mapped flag */
+ snapshot_debug(2, "zeroing snapshot exclude bitmap #%u\n",
+ block_group);
+ return 0;
+ }
+
#ifdef CONFIG_EXT4_DEBUG
+ snapshot_debug(3, "started tracked read: block = [%llu/%llu]\n",
+ SNAPSHOT_BLOCK_TUPLE(bh_result->b_blocknr));
+ if (snapshot_enable_test[SNAPTEST_READ]) {
+ err = ext4_snapshot_get_blockdev_access(inode->i_sb,
+ bh_result);
+ if (err) {
+ /* read through access denied */
+ cancel_buffer_tracked_read(bh_result);
+ return err;
+ }
+ /* sleep 1 tunable delay unit */
+ snapshot_test_delay(SNAPTEST_READ);
+ }
#endif
+ return 0;
+}
+
+int ext4_snapshot_readpage(struct file *file, struct page *page)
+{
+ /* do read I/O with buffer heads to enable tracked reads */
+ return ext4_read_full_page(page, ext4_snapshot_get_block);
+}
--
1.7.4.1
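
The read-through idea described in this patch can be illustrated with a small
userspace model. It is a sketch only, and it simplifies on purpose: the
snapshot is an array of block mappings where 0 means "hole", and a hole is
resolved to the same block number on the block device (the "direct mapping"
of the commit message, ignoring the SNAPSHOT_IBLOCK() offset used in the real
code). The names `struct snapshot_model` and `snapshot_resolve()` are
hypothetical.

```c
#include <stdint.h>

#define MODEL_NBLOCKS 8

/* Toy snapshot file: map[n] is the copied block, or 0 for a hole. */
struct snapshot_model {
	uint32_t map[MODEL_NBLOCKS];
};

/*
 * Resolve a logical block of the snapshot image to the block to read.
 * A mapped block is served from the snapshot file; a hole is a peephole
 * to the block device, so the same block number is returned.
 * Returns -1 if lblk is outside the model.
 */
static int64_t snapshot_resolve(const struct snapshot_model *snap,
				uint32_t lblk)
{
	if (lblk >= MODEL_NBLOCKS)
		return -1;
	if (snap->map[lblk] != 0)
		return snap->map[lblk];	/* block was copied to the snapshot */
	return lblk;			/* read through to block device */
}
```

Once a block has been copied into the snapshot on first write, the hole is
filled and later reads no longer reach the block device for that block.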


2011-06-07 15:09:31

by Amir G.

Subject: [PATCH v1 13/36] ext4: snapshot file - increase maximum file size limit to 16TB

From: Amir Goldstein <[email protected]>

Files larger than 2TB use the Ext4 huge_file flag to store i_blocks
in units of file system blocks, so the upper limit on the actual size
of a snapshot file is increased from 512*2^32 = 2TB to 4K*2^32 = 16TB,
which is also the upper limit on file system size.
To map 2^32 logical blocks, 4 triple indirect blocks are used instead
of just one. The extra 3 triple indirect blocks are stored in place
of direct blocks, which are not in use by snapshot files.
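
The arithmetic behind these limits can be checked with a short sketch. It
assumes 4-byte block pointers in indirect blocks (so a block of size B holds
B/4 pointers) and a 32-bit block count; the function names are hypothetical
and exist only for this illustration.

```c
#include <stdint.h>

/* Maximum size in bytes when 2^32 units of 'unit_size' bytes are counted. */
static uint64_t max_bytes(uint64_t unit_size)
{
	return unit_size << 32;
}

/* Logical blocks mapped by one triple indirect block. */
static uint64_t tind_blocks(uint32_t block_size)
{
	uint64_t ptrs = block_size / 4;	/* 4-byte block pointers */

	return ptrs * ptrs * ptrs;
}

/* Triple indirect blocks needed to map 2^32 logical blocks. */
static uint32_t tind_needed(uint32_t block_size)
{
	uint64_t total = 1ULL << 32;
	uint64_t per_tind = tind_blocks(block_size);

	return (uint32_t)((total + per_tind - 1) / per_tind);
}
```

With 4K blocks each triple indirect block maps 1024^3 = 2^30 blocks, so four
are needed; with 8K blocks one maps 2^33 blocks, which already suffices.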


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 13 +++++++++++++
fs/ext4/file.c | 3 ++-
fs/ext4/inode.c | 43 +++++++++++++++++++++++++++++++++++++++++--
fs/ext4/super.c | 3 +++
4 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7f96ba5..81e6add 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -320,6 +320,19 @@ struct flex_groups {
#define EXT4_DIND_BLOCK (EXT4_IND_BLOCK + 1)
#define EXT4_TIND_BLOCK (EXT4_DIND_BLOCK + 1)
#define EXT4_N_BLOCKS (EXT4_TIND_BLOCK + 1)
+/*
+ * Snapshot files have different indirection mapping that can map up to 2^32
+ * logical blocks, so they can cover the mapped filesystem block address space.
+ * Ext4 must use either 4K or 8K blocks (depending on PAGE_SIZE).
+ * With 8K blocks, 1 triple indirect block maps 2^33 logical blocks.
+ * With 4K blocks (the system default), each triple indirect block maps 2^30
+ * logical blocks, so 4 triple indirect blocks map 2^32 logical blocks.
+ * Snapshot files in small filesystems (<= 4G), use only 1 double indirect
+ * block to map the entire filesystem.
+ */
+#define EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS 3
+#define EXT4_SNAPSHOT_N_BLOCKS (EXT4_TIND_BLOCK + 1 + \
+ EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS)

/*
* Inode flags
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f31e58e..0ebd3e7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,7 +228,8 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int origin)
struct inode *inode = file->f_mapping->host;
loff_t maxbytes;

- if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+ if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ !ext4_snapshot_file(inode))
maxbytes = EXT4_SB(inode->i_sb)->s_bitmap_maxbytes;
else
maxbytes = inode->i_sb->s_maxbytes;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 33692fd..e64cf64 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -335,6 +335,7 @@ static int ext4_block_to_path(struct inode *inode,
double_blocks = (1 << (ptrs_bits * 2));
int n = 0;
int final = 0;
+ int tind;

if (i_block < direct_blocks) {
offsets[n++] = i_block;
@@ -354,6 +355,18 @@ static int ext4_block_to_path(struct inode *inode,
offsets[n++] = (i_block >> ptrs_bits) & (ptrs - 1);
offsets[n++] = i_block & (ptrs - 1);
final = ptrs;
+ } else if (ext4_snapshot_file(inode) &&
+ (i_block >> (ptrs_bits * 3)) <
+ EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS + 1) {
+ tind = i_block >> (ptrs_bits * 3);
+ BUG_ON(tind == 0);
+ /* use up to 4 triple indirect blocks to map 2^32 blocks */
+ i_block -= (tind << (ptrs_bits * 3));
+ offsets[n++] = (EXT4_TIND_BLOCK + tind) % EXT4_NDIR_BLOCKS;
+ offsets[n++] = i_block >> (ptrs_bits * 2);
+ offsets[n++] = (i_block >> ptrs_bits) & (ptrs - 1);
+ offsets[n++] = i_block & (ptrs - 1);
+ final = ptrs;
} else {
ext4_warning(inode->i_sb, "block %lu > max in inode %lu",
i_block + direct_blocks +
@@ -4841,6 +4854,10 @@ do_indirects:
/* Kill the remaining (whole) subtrees */
switch (offsets[0]) {
default:
+ if (ext4_snapshot_file(inode) &&
+ offsets[0] < EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS)
+ /* Freeing snapshot extra tind branches */
+ break;
nr = i_data[EXT4_IND_BLOCK];
if (nr) {
ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 1);
@@ -4862,6 +4879,19 @@ do_indirects:
;
}

+ if (ext4_snapshot_file(inode)) {
+ int i;
+
+ /* Kill the remaining snapshot file triple indirect trees */
+ for (i = 0; i < EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS; i++) {
+ nr = i_data[i];
+ if (!nr)
+ continue;
+ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 3);
+ i_data[i] = 0;
+ }
+ }
+
out_unlock:
up_write(&ei->i_data_sem);
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
@@ -5096,7 +5126,8 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
struct super_block *sb = inode->i_sb;

if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
- EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) {
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE) ||
+ ext4_snapshot_file(inode)) {
/* we are using combined 48 bit field */
i_blocks = ((u64)le16_to_cpu(raw_inode->i_blocks_high)) << 32 |
le32_to_cpu(raw_inode->i_blocks_lo);
@@ -5335,7 +5366,9 @@ static int ext4_inode_blocks_set(handle_t *handle,
ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
return 0;
}
- if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_HUGE_FILE))
+ /* snapshot files may be represented as huge files */
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_HUGE_FILE) &&
+ !ext4_snapshot_file(inode))
return -EFBIG;

if (i_blocks <= 0xffffffffffffULL) {
@@ -5625,6 +5658,12 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
}

if (attr->ia_valid & ATTR_SIZE) {
+ /* prevent size modification of snapshot files */
+ if (ext4_snapshot_file(inode) && attr->ia_size != 0) {
+ snapshot_debug(1, "snapshot file (%lu) can only be "
+ "truncated to 0!\n", inode->i_ino);
+ return -EPERM;
+ }
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7655010..dbe5651 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3302,6 +3302,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
EXT4_FEATURE_RO_COMPAT_HUGE_FILE);
sbi->s_bitmap_maxbytes = ext4_max_bitmap_size(sb->s_blocksize_bits,
has_huge_files);
+ if (EXT4_SNAPSHOTS(sb))
+ /* Snapshot files are huge files */
+ has_huge_files = 1;
sb->s_maxbytes = ext4_max_size(sb->s_blocksize_bits, has_huge_files);

if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV) {
--
1.7.4.1


2011-06-07 15:09:34

by Amir G.

Subject: [PATCH v1 14/36] ext4: snapshot block operations

From: Amir Goldstein <[email protected]>

Core API of the special snapshot file block operations.
The @create argument of ext4_getblk() is reinterpreted as a snapshot
block command argument. The old argument values 0 (read) and 1 (create)
preserve the original behavior of the function. The bit field h_cowing
in the current transaction handle is used to prevent COW recursion.
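
A minimal user-space sketch of how the command values compose. The COW,
MOVE and SYNC bits and the SNAPMAP_* definitions are copied from this
patch; EXT4_GET_BLOCKS_CREATE = 0x0001 is assumed from fs/ext4/ext4.h,
and the is_*() predicates are hypothetical stand-ins for the
SNAPMAP_IS*() macros without the unlikely() annotation.

```c
#include <assert.h>

/* CREATE is assumed to be 0x0001; the other bits come from the patch. */
#define EXT4_GET_BLOCKS_CREATE 0x0001
#define EXT4_GET_BLOCKS_COW    0x0200
#define EXT4_GET_BLOCKS_MOVE   0x0400
#define EXT4_GET_BLOCKS_SYNC   0x0800

#define SNAPMAP_READ   0
#define SNAPMAP_WRITE  EXT4_GET_BLOCKS_CREATE
#define SNAPMAP_COW    (SNAPMAP_WRITE | EXT4_GET_BLOCKS_COW)
#define SNAPMAP_MOVE   (SNAPMAP_WRITE | EXT4_GET_BLOCKS_MOVE)
#define SNAPMAP_BITMAP (SNAPMAP_COW | EXT4_GET_BLOCKS_SYNC)

/* Hypothetical predicates mirroring SNAPMAP_ISCOW/ISMOVE/ISSYNC. */
static int is_cow(int cmd)  { return (cmd & EXT4_GET_BLOCKS_COW) != 0; }
static int is_move(int cmd) { return (cmd & EXT4_GET_BLOCKS_MOVE) != 0; }
static int is_sync(int cmd) { return (cmd & EXT4_GET_BLOCKS_SYNC) != 0; }

/* Every non-read command still carries the original "create" meaning. */
static int is_create(int cmd)
{
	assert(cmd >= 0);
	return (cmd & EXT4_GET_BLOCKS_CREATE) != 0;
}
```

Because SNAPMAP_READ is 0 and SNAPMAP_WRITE is 1, existing callers that
pass create = 0 or create = 1 see no change in behavior; only the higher
bits select the snapshot-specific paths.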


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 10 ++++++++++
fs/ext4/ext4_jbd2.h | 3 +++
fs/ext4/inode.c | 4 ++--
fs/ext4/snapshot.c | 43 +++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot.h | 37 +++++++++++++++++++++++++++++++++++++
5 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 81e6add..5564111 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -544,6 +544,16 @@ struct ext4_new_group_data {
* if EXT4_GET_BLOCKS_CREATE is not set, return REMAP flags.
*/
#define EXT4_GET_BLOCKS_MOVE_ON_WRITE 0x0100
+/*
+ * snapshot_map_blocks() flags passed to ext4_map_blocks() for mapping
+ * blocks to snapshot.
+ */
+ /* handle COW race conditions */
+#define EXT4_GET_BLOCKS_COW 0x200
+ /* allocate only indirect blocks */
+#define EXT4_GET_BLOCKS_MOVE 0x400
+ /* bypass journal and sync allocated indirect blocks directly to disk */
+#define EXT4_GET_BLOCKS_SYNC 0x800

/*
* Flags used by ext4_free_blocks
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 4d57fcb..4af0bb5 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -173,6 +173,9 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define ext4_handle_dirty_super(handle, sb) \
__ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))

+#define trace_cow_add(handle, name, num)
+#define trace_cow_inc(handle, name)
+
handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks);
int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e64cf64..410bc8b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1603,8 +1603,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,

map.m_lblk = block;
map.m_len = 1;
- err = ext4_map_blocks(handle, inode, &map,
- create ? EXT4_GET_BLOCKS_CREATE : 0);
+ /* passing SNAPMAP flags on create argument */
+ err = ext4_map_blocks(handle, inode, &map, create);

if (err < 0)
*errp = err;
diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index e8db8ca..ef84551 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -16,3 +16,46 @@
#include "snapshot.h"
#include "ext4.h"
#include "mballoc.h"
+
+#define snapshot_debug_hl(n, f, a...) snapshot_debug_l(n, handle ? \
+ IS_COWING(handle) : 0, f, ## a)
+
+/*
+ * ext4_snapshot_map_blocks() - helper function for
+ * ext4_snapshot_test_and_cow(). Test if blocks are mapped in snapshot file.
+ * If @block is not mapped and if @cmd is non zero, try to allocate @maxblocks.
+ * Also used by ext4_snapshot_create() to pre-allocate snapshot blocks.
+ *
+ * Return values:
+ * > 0 - no. of mapped blocks in snapshot file
+ * = 0 - @block is not mapped in snapshot file
+ * < 0 - error
+ */
+int ext4_snapshot_map_blocks(handle_t *handle, struct inode *inode,
+ ext4_snapblk_t block, unsigned long maxblocks,
+ ext4_fsblk_t *mapped, int cmd)
+{
+ int err;
+ struct ext4_map_blocks map;
+
+ map.m_lblk = SNAPSHOT_IBLOCK(block);
+ map.m_len = maxblocks;
+
+ err = ext4_map_blocks(handle, inode, &map, cmd);
+ /*
+ * ext4_map_blocks() returns the number of blocks
+ * mapped, or 0 in case of a hole.
+ */
+ if (mapped && err > 0)
+ *mapped = map.m_pblk;
+
+ snapshot_debug_hl(4, "snapshot (%u) map_blocks "
+ "[%lld/%lld] = [%lld/%lld] "
+ "cmd=%d, maxblocks=%lu, mapped=%d\n",
+ inode->i_generation,
+ SNAPSHOT_BLOCK_TUPLE(block),
+ SNAPSHOT_BLOCK_TUPLE(map.m_pblk),
+ cmd, maxblocks, err);
+ return err;
+}
+
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 19e3416..ea87a5a 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -137,6 +137,43 @@ static inline int EXT4_SNAPSHOTS(struct super_block *sb)
EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT);
}

+/*
+ * snapshot_map_blocks() command flags passed to ext4_map_blocks() on its
+ * @flags argument. The higher bits are used for mapping snapshot blocks.
+ */
+/* original meaning - only check if blocks are mapped */
+#define SNAPMAP_READ 0
+/* original meaning - allocate missing blocks and indirect blocks */
+#define SNAPMAP_WRITE EXT4_GET_BLOCKS_CREATE
+/* creating COWed block */
+#define SNAPMAP_COW (SNAPMAP_WRITE|EXT4_GET_BLOCKS_COW)
+/* moving blocks to snapshot */
+#define SNAPMAP_MOVE (SNAPMAP_WRITE|EXT4_GET_BLOCKS_MOVE)
+ /* creating COW bitmap - handle COW races and bypass journal */
+#define SNAPMAP_BITMAP (SNAPMAP_COW|EXT4_GET_BLOCKS_SYNC)
+
+/* test special cases when mapping snapshot blocks */
+#define SNAPMAP_ISCOW(cmd) (unlikely((cmd) & EXT4_GET_BLOCKS_COW))
+#define SNAPMAP_ISMOVE(cmd) (unlikely((cmd) & EXT4_GET_BLOCKS_MOVE))
+#define SNAPMAP_ISSYNC(cmd) (unlikely((cmd) & EXT4_GET_BLOCKS_SYNC))
+
+#define IS_COWING(handle) (unlikely((handle)->h_cowing))
+
+/* snapshot.c */
+
+/* helper functions for ext4_snapshot_create() */
+extern int ext4_snapshot_map_blocks(handle_t *handle, struct inode *inode,
+ ext4_snapblk_t block,
+ unsigned long maxblocks,
+ ext4_fsblk_t *mapped, int cmd);
+/* helper function for ext4_snapshot_take() */
+extern void ext4_snapshot_copy_buffer(struct buffer_head *sbh,
+ struct buffer_head *bh,
+ const char *mask);
+/* helper function for ext4_snapshot_get_block() */
+extern int ext4_snapshot_read_block_bitmap(struct super_block *sb,
+ unsigned int block_group, struct buffer_head *bitmap_bh);
+
#define ext4_snapshot_cow(handle, inode, block, bh, cow) 0

#define ext4_snapshot_move(handle, inode, block, pcount, move) (0)
--
1.7.4.1


2011-06-07 15:09:37

by Amir G.

Subject: [PATCH v1 15/36] ext4: snapshot block operation - copy blocks to snapshot

From: Amir Goldstein <[email protected]>

Implementation of copying blocks into a snapshot file.
This mechanism is used to copy metadata blocks to the active snapshot
before they are modified (copy-on-write).
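
The copy-on-write idea can be illustrated with a toy in-memory model.
This is a sketch under stated assumptions, not the kernel code: struct
toy_fs and its fields are hypothetical, in_use_at_snap[] stands in for
the COW bitmap, and cowing stands in for handle->h_cowing. Unlike
ext4_snapshot_test_and_cow(), the toy returns 1 when it made a copy.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   16

/* Hypothetical toy filesystem; every name here is for illustration. */
struct toy_fs {
	char live[NBLOCKS][BLKSZ];    /* current filesystem blocks */
	char snap[NBLOCKS][BLKSZ];    /* blocks preserved in the snapshot */
	int  in_use_at_snap[NBLOCKS]; /* stand-in for the COW bitmap */
	int  snap_mapped[NBLOCKS];    /* block already copied to snapshot? */
	int  cowing;                  /* stand-in for handle->h_cowing */
};

/* Returns 1 if a copy was made, 0 if none was needed, -1 on bad input. */
static int toy_test_and_cow(struct toy_fs *fs, int blk)
{
	if (blk < 0 || blk >= NBLOCKS)
		return -1;
	if (fs->cowing)
		return 0; /* update made by the COW itself: no recursion */
	if (!fs->in_use_at_snap[blk])
		return 0; /* block was free when the snapshot was taken */
	if (fs->snap_mapped[blk])
		return 0; /* original contents are already preserved */

	fs->cowing = 1;
	memcpy(fs->snap[blk], fs->live[blk], BLKSZ);
	fs->snap_mapped[blk] = 1;
	fs->cowing = 0;
	return 1;
}

/* Every write to a block goes through the test-and-COW hook first. */
static int toy_write(struct toy_fs *fs, int blk, const char *data)
{
	int ret = toy_test_and_cow(fs, blk);

	if (ret < 0)
		return ret;
	strncpy(fs->live[blk], data, BLKSZ - 1);
	fs->live[blk][BLKSZ - 1] = '\0';
	return ret;
}
```

Only the first write to a block that was in use at snapshot time pays
for a copy; later writes to the same block find it already mapped in the
snapshot and proceed directly.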


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 3 +
fs/ext4/inode.c | 40 +++++++-
fs/ext4/mballoc.c | 18 ++++
fs/ext4/resize.c | 10 ++-
fs/ext4/snapshot.c | 269 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot.h | 12 ++-
6 files changed, 346 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5564111..7d66f92 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -109,6 +109,8 @@ typedef unsigned int ext4_group_t;
/* We are doing stream allocation */
#define EXT4_MB_STREAM_ALLOC 0x0800

+/* allocate blocks for active snapshot */
+#define EXT4_MB_HINT_COWING 0x02000

struct ext4_allocation_request {
/* target inode for block we're allocating */
@@ -1825,6 +1827,7 @@ extern void __ext4_free_blocks(const char *where, unsigned int line,
extern int ext4_mb_add_groupinfo(struct super_block *sb,
ext4_group_t i, struct ext4_group_desc *desc);
extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
+extern int ext4_mb_test_bit_range(int bit, void *addr, int *pcount);

/* inode.c */
struct buffer_head *ext4_getblk(handle_t *, struct inode *,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 410bc8b..cdc1752 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -699,8 +699,17 @@ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
ar.goal = goal;
ar.len = target;
ar.logical = iblock;
- if (S_ISREG(inode->i_mode))
- /* enable in-core preallocation only for regular files */
+ if (IS_COWING(handle)) {
+ /*
+ * This hint is used to tell the allocator not to fail
+ * on quota limits and allow allocation from blocks which
+ * are reserved for snapshots.
+ * Failing allocation during COW operations would result
+ * in I/O error, which is not desirable.
+ */
+ ar.flags = EXT4_MB_HINT_COWING;
+ } else if (S_ISREG(inode->i_mode) && !ext4_snapshot_file(inode))
+ /* Enable preallocation only for non-snapshot regular files */
ar.flags = EXT4_MB_HINT_DATA;

current_block = ext4_mb_new_blocks(handle, &ar, err);
@@ -1362,6 +1371,21 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
int retval;
+ int cowing = 0;
+
+ if (handle && IS_COWING(handle)) {
+ /*
+ * locking order for locks validator:
+ * inode (VFS operation) -> active snapshot (COW operation)
+ *
+ * The i_data_sem lock is nested during COW operation, but
+ * the active snapshot i_data_sem write lock is not taken
+ * otherwise, because snapshot file has read-only aops and
+ * because truncate/unlink of active snapshot is not permitted.
+ */
+ BUG_ON(!ext4_snapshot_is_active(inode));
+ cowing = 1;
+ }

map->m_flags = 0;
ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
@@ -1371,7 +1395,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* Try to see if we can get the block without requesting a new
* file system block.
*/
- down_read((&EXT4_I(inode)->i_data_sem));
+ down_read_nested((&EXT4_I(inode)->i_data_sem), cowing);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map,
flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
@@ -1430,7 +1454,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* the write lock of i_data_sem, and call get_blocks()
* with create == 1 flag.
*/
- down_write((&EXT4_I(inode)->i_data_sem));
+ down_write_nested((&EXT4_I(inode)->i_data_sem), cowing);

/*
* if the caller is from delayed allocation writeout path
@@ -1621,6 +1645,14 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
J_ASSERT(create != 0);
J_ASSERT(handle != NULL);

+ if (SNAPMAP_ISCOW(create)) {
+ /* COWing block or creating COW bitmap */
+ lock_buffer(bh);
+ clear_buffer_uptodate(bh);
+ /* flag locked buffer and return */
+ *errp = 1;
+ return bh;
+ }
/*
* Now that we do not always journal data, we should
* keep in mind whether this should always journal the
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4ff3079..6e4d960 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -420,6 +420,24 @@ static inline int mb_find_next_bit(void *addr, int max, int start)
return ret;
}

+/*
+ * Find the run of identical bits (all set or all clear) that starts at @bit,
+ * looking at no more than *pcount bits. Return 1 for set bits and 0 for
+ * clear bits. Set *pcount to the number of bits in the run.
+ */
+int ext4_mb_test_bit_range(int bit, void *addr, int *pcount)
+{
+ int i, ret;
+
+ ret = mb_test_bit(bit, addr);
+ if (ret)
+ i = mb_find_next_zero_bit(addr, bit + *pcount, bit);
+ else
+ i = mb_find_next_bit(addr, bit + *pcount, bit);
+ *pcount = i - bit;
+ return ret ? 1 : 0;
+}
+
static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
{
char *bb;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index ebff8a1..91f5473 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -673,7 +673,15 @@ static void update_backups(struct super_block *sb,
(err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA)))
break;

- bh = sb_getblk(sb, group * bpg + blk_off);
+ if (ext4_snapshot_has_active(sb))
+ /*
+ * test_and_cow() expects an uptodate buffer.
+ * Read the buffer here to suppress the
+ * "non uptodate buffer" warning.
+ */
+ bh = sb_bread(sb, group * bpg + blk_off);
+ else
+ bh = sb_getblk(sb, group * bpg + blk_off);
if (!bh) {
err = -EIO;
break;
diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index ef84551..fc91ca4 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -59,3 +59,272 @@ int ext4_snapshot_map_blocks(handle_t *handle, struct inode *inode,
return err;
}

+/*
+ * COW helper functions
+ */
+
+/*
+ * copy buffer @bh to (locked) snapshot buffer @sbh and mark it uptodate
+ */
+static inline void
+__ext4_snapshot_copy_buffer(struct buffer_head *sbh,
+ struct buffer_head *bh)
+{
+ memcpy(sbh->b_data, bh->b_data, SNAPSHOT_BLOCK_SIZE);
+ set_buffer_uptodate(sbh);
+}
+
+/*
+ * ext4_snapshot_complete_cow()
+ * Unlock a newly COWed snapshot buffer and complete the COW operation.
+ * Optionally, sync the buffer to disk or add it to the current transaction
+ * as dirty data.
+ */
+static inline int
+ext4_snapshot_complete_cow(handle_t *handle, struct inode *snapshot,
+ struct buffer_head *sbh, struct buffer_head *bh, int sync)
+{
+ int err = 0;
+
+ unlock_buffer(sbh);
+ err = ext4_jbd2_file_inode(handle, snapshot);
+ if (err)
+ goto out;
+ mark_buffer_dirty(sbh);
+ if (sync)
+ sync_dirty_buffer(sbh);
+out:
+ return err;
+}
+
+/*
+ * ext4_snapshot_copy_buffer_cow()
+ * helper function for ext4_snapshot_test_and_cow()
+ * copy COWed buffer to new allocated (locked) snapshot buffer
+ * and complete the COW operation
+ */
+static inline int
+ext4_snapshot_copy_buffer_cow(handle_t *handle, struct inode *snapshot,
+ struct buffer_head *sbh,
+ struct buffer_head *bh)
+{
+ __ext4_snapshot_copy_buffer(sbh, bh);
+ return ext4_snapshot_complete_cow(handle, snapshot, sbh, bh, 0);
+}
+
+/*
+ * ext4_snapshot_copy_buffer()
+ * helper function for ext4_snapshot_take()
+ * used for initializing pre-allocated snapshot blocks
+ * copy buffer to snapshot buffer and sync to disk
+ * 'mask' block bitmap with exclude bitmap before copying to snapshot.
+ */
+void ext4_snapshot_copy_buffer(struct buffer_head *sbh,
+ struct buffer_head *bh, const char *mask)
+{
+ lock_buffer(sbh);
+ __ext4_snapshot_copy_buffer(sbh, bh);
+ unlock_buffer(sbh);
+ mark_buffer_dirty(sbh);
+ sync_dirty_buffer(sbh);
+}
+
+/*
+ * COW functions
+ */
+
+#ifdef CONFIG_EXT4_DEBUG
+static void
+__ext4_snapshot_trace_cow(const char *where, handle_t *handle,
+ struct super_block *sb, struct inode *inode,
+ struct buffer_head *bh, ext4_fsblk_t block,
+ int count, int cmd)
+{
+ unsigned long inode_group = 0;
+ ext4_grpblk_t inode_offset = 0;
+
+ if (inode) {
+ inode_group = (inode->i_ino - 1) /
+ EXT4_INODES_PER_GROUP(sb);
+ inode_offset = (inode->i_ino - 1) %
+ EXT4_INODES_PER_GROUP(sb);
+ }
+ snapshot_debug_hl(4, "%s(i:%d/%ld, b:%lld/%lld) "
+ "count=%d, h_ref=%d, cmd=%d\n",
+ where, inode_offset, inode_group,
+ SNAPSHOT_BLOCK_TUPLE(block),
+ count, handle->h_ref, cmd);
+}
+
+#define ext4_snapshot_trace_cow(where, handle, sb, inode, bh, blk, cnt, cmd) \
+ if (snapshot_enable_debug >= 4) \
+ __ext4_snapshot_trace_cow(where, handle, sb, inode, \
+ bh, block, count, cmd)
+#else
+#define ext4_snapshot_trace_cow(where, handle, sb, inode, bh, blk, cnt, cmd)
+#endif
+/*
+ * Begin COW or move operation.
+ * No locks needed here, because @handle is a per-task struct.
+ */
+static inline void ext4_snapshot_cow_begin(handle_t *handle)
+{
+ snapshot_debug_hl(4, "{\n");
+ handle->h_cowing = 1;
+}
+
+/*
+ * End COW or move operation.
+ * No locks needed here, because @handle is a per-task struct.
+ */
+static inline void ext4_snapshot_cow_end(const char *where,
+ handle_t *handle, ext4_fsblk_t block, int err)
+{
+ handle->h_cowing = 0;
+ snapshot_debug_hl(4, "} = %d\n", err);
+ snapshot_debug_hl(4, ".\n");
+ if (err < 0)
+ snapshot_debug(1, "%s(b:%lld/%lld) failed!"
+ " h_ref=%d, err=%d\n", where,
+ SNAPSHOT_BLOCK_TUPLE(block),
+ handle->h_ref, err);
+}
+
+/*
+ * ext4_snapshot_test_and_cow - COW metadata block
+ * @where: name of caller function
+ * @handle: JBD handle
+ * @inode: owner of blocks (NULL for global metadata blocks)
+ * @block: address of metadata block
+ * @bh: buffer head of metadata block
+ * @cow: if false, return 1 if block needs to be COWed
+ *
+ * Return values:
+ * = 1 - @block needs to be COWed
+ * = 0 - @block was COWed or doesn't need to be COWed
+ * < 0 - error
+ */
+int ext4_snapshot_test_and_cow(const char *where, handle_t *handle,
+ struct inode *inode, ext4_fsblk_t block,
+ struct buffer_head *bh, int cow)
+{
+ struct super_block *sb = handle->h_transaction->t_journal->j_private;
+ struct inode *active_snapshot = ext4_snapshot_has_active(sb);
+ struct buffer_head *sbh = NULL;
+ ext4_fsblk_t blk = 0;
+ int err = 0, clear = 0, count = 1;
+
+ if (!active_snapshot)
+ /* no active snapshot - no need to COW */
+ return 0;
+
+ ext4_snapshot_trace_cow(where, handle, sb, inode, bh, block, 1, cow);
+
+ if (IS_COWING(handle)) {
+ /* avoid recursion on active snapshot updates */
+ WARN_ON(inode && inode != active_snapshot);
+ snapshot_debug_hl(4, "active snapshot update - "
+ "skip block cow!\n");
+ return 0;
+ } else if (inode == active_snapshot) {
+ /* active snapshot may only be modified during COW */
+ snapshot_debug_hl(4, "active snapshot access denied!\n");
+ return -EPERM;
+ }
+
+ /* BEGIN COWing */
+ ext4_snapshot_cow_begin(handle);
+
+ if (inode)
+ clear = ext4_snapshot_excluded(inode);
+ if (clear < 0) {
+ /*
+ * excluded file block access - don't COW and
+ * mark block in exclude bitmap
+ */
+ snapshot_debug_hl(4, "file (%lu) excluded from snapshot - "
+ "mark block (%lld) in exclude bitmap\n",
+ inode->i_ino, block);
+ cow = 0;
+ }
+
+ if (clear < 0)
+ goto cowed;
+ if (!err) {
+ trace_cow_inc(handle, ok_bitmap);
+ goto cowed;
+ }
+
+ /* block is in use by snapshot - check if it is mapped */
+ err = ext4_snapshot_map_blocks(handle, active_snapshot, block, 1, &blk,
+ SNAPMAP_READ);
+ if (err < 0)
+ goto out;
+ if (err > 0) {
+ sbh = sb_find_get_block(sb, blk);
+ trace_cow_inc(handle, ok_mapped);
+ err = 0;
+ goto test_pending_cow;
+ }
+
+ /* block needs to be COWed */
+ err = 1;
+ if (!cow)
+ /* don't COW - we were just checking */
+ goto out;
+
+ err = -EIO;
+ /* make sure we hold an uptodate source buffer */
+ if (!bh || !buffer_mapped(bh))
+ goto out;
+ if (!buffer_uptodate(bh)) {
+ snapshot_debug(1, "warning: non uptodate buffer (%lld)"
+ " needs to be copied to active snapshot!\n",
+ block);
+ ll_rw_block(READ, 1, &bh);
+ wait_on_buffer(bh);
+ if (!buffer_uptodate(bh))
+ goto out;
+ }
+
+ /* try to allocate snapshot block to make a backup copy */
+ sbh = ext4_getblk(handle, active_snapshot, SNAPSHOT_IBLOCK(block),
+ SNAPMAP_COW, &err);
+ if (!sbh)
+ goto out;
+
+ blk = sbh->b_blocknr;
+ if (!err) {
+ /*
+ * we didn't allocate this block -
+ * another COWing task must have allocated it
+ */
+ trace_cow_inc(handle, ok_mapped);
+ goto test_pending_cow;
+ }
+
+ /*
+ * we allocated this block -
+ * copy block data to snapshot and complete COW operation
+ */
+ err = ext4_snapshot_copy_buffer_cow(handle, active_snapshot,
+ sbh, bh);
+ if (err)
+ goto out;
+ snapshot_debug(3, "block [%lld/%lld] of snapshot (%u) "
+ "mapped to block [%lld/%lld]\n",
+ SNAPSHOT_BLOCK_TUPLE(block),
+ active_snapshot->i_generation,
+ SNAPSHOT_BLOCK_TUPLE(sbh->b_blocknr));
+
+ trace_cow_inc(handle, copied);
+test_pending_cow:
+
+cowed:
+out:
+ brelse(sbh);
+ /* END COWing */
+ ext4_snapshot_cow_end(where, handle, block, err);
+ return err;
+}
+
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index ea87a5a..90cb33e 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -174,7 +174,17 @@ extern void ext4_snapshot_copy_buffer(struct buffer_head *sbh,
extern int ext4_snapshot_read_block_bitmap(struct super_block *sb,
unsigned int block_group, struct buffer_head *bitmap_bh);

-#define ext4_snapshot_cow(handle, inode, block, bh, cow) 0
+extern int ext4_snapshot_test_and_cow(const char *where,
+ handle_t *handle, struct inode *inode,
+ ext4_fsblk_t block, struct buffer_head *bh, int cow);
+
+/*
+ * test if a metadata block should be COWed
+ * and if it should, copy the block to the active snapshot
+ */
+#define ext4_snapshot_cow(handle, inode, block, bh, cow) \
+ ext4_snapshot_test_and_cow(__func__, handle, inode, \
+ block, bh, cow)

#define ext4_snapshot_move(handle, inode, block, pcount, move) (0)

--
1.7.4.1
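
The behavior of the ext4_mb_test_bit_range() helper added to mballoc.c
in patch 15 can be approximated in user space as follows. This is a
sketch, assuming a little-endian bit order within each byte;
toy_test_bit() and toy_test_bit_range() are hypothetical names, and the
kernel version uses mb_find_next_bit() and mb_find_next_zero_bit()
instead of the simple loop.

```c
#include <assert.h>

/* Test bit @bit in @map, least significant bit of each byte first. */
static int toy_test_bit(int bit, const unsigned char *map)
{
	return (map[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Measure the run of identical bits starting at @bit, examining at most
 * *pcount bits. Returns 1 if the run consists of set bits, 0 if clear,
 * and stores the run length in *pcount.
 */
static int toy_test_bit_range(int bit, const unsigned char *map, int *pcount)
{
	int want, end, i;

	assert(bit >= 0 && *pcount > 0);
	want = toy_test_bit(bit, map);
	end = bit + *pcount;
	for (i = bit; i < end && toy_test_bit(i, map) == want; i++)
		;
	*pcount = i - bit;
	return want;
}
```

The caller passes the maximum number of bits it cares about in *pcount
and gets back the length of the run, so a single call answers both "is
this block in use?" and "how many of the following blocks share that
state?".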


2011-06-07 15:09:39

by Amir G.

Subject: [PATCH v1 16/36] ext4: snapshot block operation - move blocks to snapshot

From: Amir Goldstein <[email protected]>

Implementation of moving blocks into a snapshot file.
The move block command maps already allocated blocks to the snapshot file,
allocating only the indirect blocks when needed.
This mechanism is used to move-on-write data blocks to the snapshot.
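
Move-on-write differs from copy-on-write in that no data is copied: the
old physical block is handed over to the snapshot and the writer gets a
new block. The toy model below sketches this under stated assumptions;
struct toy_mow, its trivial allocator, and the quota counters are
hypothetical and only mirror the accounting described in this patch.

```c
#include <assert.h>

#define NPHYS 16

/* Hypothetical toy state; every name here is for illustration only. */
struct toy_mow {
	int  snap_map[NPHYS];       /* snapshot block -> physical, 0 = hole */
	int  in_use_at_snap[NPHYS]; /* stand-in for the COW bitmap */
	int  next_free;             /* trivial allocator: next unused block */
	long file_quota;            /* blocks charged to the file owner */
	long snap_quota;            /* blocks charged to the snapshot owner */
};

/*
 * Rewrite the data held in physical block @phys.
 * Returns the physical block that the new data should be written to.
 */
static int toy_rewrite(struct toy_mow *m, int phys)
{
	assert(phys > 0 && phys < NPHYS);
	if (!m->in_use_at_snap[phys])
		return phys; /* not part of the snapshot: write in place */

	/* move: snapshot block @phys maps to the same physical block */
	m->snap_map[phys] = phys;
	m->snap_quota++; /* snapshot owner is charged for the block */
	m->file_quota--; /* file owner is no longer charged for it */

	/* the writer gets a fresh block, charged like any allocation */
	assert(m->next_free < NPHYS);
	m->file_quota++;
	return m->next_free++;
}
```

The snapshot keeps the original data at its original location, which is
why only indirect blocks ever need to be allocated for the snapshot file
on this path.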


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 62 +++++++++++++++++++-------
fs/ext4/snapshot.c | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot.h | 12 +++++-
3 files changed, 176 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cdc1752..1558a7b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -676,6 +676,11 @@ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
new_blocks[index++] = current_block++;
count--;
}
+ if (blks == 0 && target == 0) {
+ /* mapping data blocks */
+ *err = 0;
+ return 0;
+ }
if (count > 0) {
/*
* save the new block number
@@ -777,10 +782,10 @@ failed_out:
* ext4_alloc_block() (normally -ENOSPC). Otherwise we set the chain
* as described above and return 0.
*/
-static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
- ext4_lblk_t iblock, int indirect_blks,
- int *blks, ext4_fsblk_t goal,
- ext4_lblk_t *offsets, Indirect *branch)
+static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t iblock, int indirect_blks,
+ int *blks, ext4_fsblk_t goal,
+ int *offsets, Indirect *branch, int flags)
{
int blocksize = inode->i_sb->s_blocksize;
int i, n = 0;
@@ -790,6 +795,22 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
ext4_fsblk_t new_blocks[4];
ext4_fsblk_t current_block;

+ if (SNAPMAP_ISMOVE(flags)) {
+ /* mapping snapshot block to block device block */
+ current_block = SNAPSHOT_BLOCK(iblock);
+ num = 0;
+ if (indirect_blks > 0) {
+ /* allocating only indirect blocks */
+ ext4_alloc_blocks(handle, inode, iblock, goal,
+ indirect_blks, 0, new_blocks, &err);
+ if (err)
+ return err;
+ }
+ /* charge snapshot file owner for moved blocks */
+ dquot_alloc_block_nofail(inode, *blks);
+ num = *blks;
+ new_blocks[indirect_blks] = current_block;
+ } else
num = ext4_alloc_blocks(handle, inode, iblock, goal, indirect_blks,
*blks, new_blocks, &err);
if (err)
@@ -861,8 +882,11 @@ failed:
}
for (i = n+1; i < indirect_blks; i++)
ext4_free_blocks(handle, inode, NULL, new_blocks[i], 1, 0);
-
- ext4_free_blocks(handle, inode, NULL, new_blocks[i], num, 0);
+ if (SNAPMAP_ISMOVE(flags) && num > 0)
+ /* don't charge snapshot file owner if move failed */
+ dquot_free_block(inode, num);
+ else if (num > 0)
+ ext4_free_blocks(handle, inode, NULL, new_blocks[i], num, 0);

return err;
}
@@ -882,9 +906,8 @@ failed:
* inode (->i_blocks, etc.). In case of success we end up with the full
* chain to new block and return 0.
*/
-static int ext4_splice_branch(handle_t *handle, struct inode *inode,
- ext4_lblk_t block, Indirect *where, int num,
- int blks)
+static int ext4_splice_branch_cow(handle_t *handle, struct inode *inode,
+ long block, Indirect *where, int num, int blks, int flags)
{
int i;
int err = 0;
@@ -951,8 +974,12 @@ err_out:
ext4_free_blocks(handle, inode, where[i].bh, 0, 1,
EXT4_FREE_BLOCKS_FORGET);
}
- ext4_free_blocks(handle, inode, NULL, le32_to_cpu(where[num].key),
- blks, 0);
+ if (SNAPMAP_ISMOVE(flags))
+ /* don't charge snapshot file owner if move failed */
+ dquot_free_block(inode, blks);
+ else
+ ext4_free_blocks(handle, inode, NULL,
+ le32_to_cpu(where[num].key), blks, 0);

return err;
}
@@ -1099,9 +1126,11 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
/*
* Block out ext4_truncate while we alter the tree
*/
- err = ext4_alloc_branch(handle, inode, map->m_lblk, indirect_blks,
- &count, goal,
- offsets + (partial - chain), partial);
+ err = ext4_alloc_branch_cow(handle, inode, map->m_lblk, indirect_blks,
+ &count, goal, offsets + (partial - chain),
+ partial, flags);
+ if (err)
+ goto cleanup;

if (map->m_flags & EXT4_MAP_REMAP) {
map->m_len = count;
@@ -1127,9 +1156,8 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
* credits cannot be returned. Can we handle this somehow? We
* may need to return -EAGAIN upwards in the worst case. --sct
*/
- if (!err)
- err = ext4_splice_branch(handle, inode, map->m_lblk,
- partial, indirect_blks, count);
+ err = ext4_splice_branch_cow(handle, inode, map->m_lblk, partial,
+ indirect_blks, count, flags);
if (err)
goto cleanup;

diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index fc91ca4..adeb0b6 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -328,3 +328,123 @@ out:
return err;
}

+/*
+ * ext4_snapshot_test_and_move - move blocks to active snapshot
+ * @where: name of caller function
+ * @handle: JBD handle
+ * @inode: owner of blocks (NULL for global metadata blocks)
+ * @block: address of first block to move
+ * @maxblocks: max. blocks to move
+ * @move: if false, only test if @block needs to be moved
+ *
+ * Return values:
+ * > 0 - blocks a) were moved to snapshot for @move = 1;
+ * b) need to be moved for @move = 0
+ * = 0 - blocks don't need to be moved
+ * < 0 - error
+ */
+int ext4_snapshot_test_and_move(const char *where, handle_t *handle,
+ struct inode *inode, ext4_fsblk_t block, int *maxblocks, int move)
+{
+ struct super_block *sb = handle->h_transaction->t_journal->j_private;
+ struct inode *active_snapshot = ext4_snapshot_has_active(sb);
+ ext4_fsblk_t blk = 0;
+ int err = 0, count = *maxblocks;
+ int moved_blks = 0;
+ int excluded = 0;
+
+ if (!active_snapshot)
+ /* no active snapshot - no need to move */
+ return 0;
+
+ ext4_snapshot_trace_cow(where, handle, sb, inode, NULL, block, count,
+ move);
+
+ BUG_ON(IS_COWING(handle) || inode == active_snapshot);
+
+ /* BEGIN moving */
+ ext4_snapshot_cow_begin(handle);
+
+ if (inode)
+ excluded = ext4_snapshot_excluded(inode);
+ if (excluded) {
+ /* don't move excluded file block to snapshot */
+ snapshot_debug_hl(4, "file (%lu) excluded from snapshot\n",
+ inode->i_ino);
+ move = 0;
+ }
+
+ if (excluded)
+ goto out;
+ if (!err) {
+ /* block not in COW bitmap - no need to move */
+ trace_cow_add(handle, ok_bitmap, count);
+ goto out;
+ }
+
+#ifdef CONFIG_EXT4_DEBUG
+ if (inode == NULL &&
+ !(EXT4_I(active_snapshot)->i_flags & EXT4_UNRM_FL)) {
+ /*
+ * This is ext4_group_extend() "freeing" the blocks that
+ * were added to the block group. These blocks should not be
+ * moved to snapshot, unless the snapshot is marked with the
+ * UNRM flag for large snapshot creation test.
+ */
+ trace_cow_add(handle, ok_bitmap, count);
+ err = 0;
+ goto out;
+ }
+#endif
+
+ /* @count blocks are in use by snapshot - check if @block is mapped */
+ err = ext4_snapshot_map_blocks(handle, active_snapshot, block, count,
+ &blk, SNAPMAP_READ);
+ if (err < 0)
+ goto out;
+ if (err > 0) {
+ /* blocks already mapped in snapshot - no need to move */
+ count = err;
+ trace_cow_add(handle, ok_mapped, count);
+ err = 0;
+ goto out;
+ }
+
+ /* @count blocks need to be moved */
+ err = count;
+ if (!move)
+ /* don't move - we were just checking */
+ goto out;
+
+ /* try to move @count blocks from inode to snapshot.
+ * @count blocks may cross block boundary.
+ * TODO: if moving fails after some blocks have been moved,
+ * maybe we need a blockbitmap fsck.
+ */
+ blk = block;
+ while (count) {
+ err = ext4_snapshot_map_blocks(handle, active_snapshot, blk,
+ count, NULL, SNAPMAP_MOVE);
+ if (err <= 0)
+ goto out;
+ moved_blks += err;
+ blk += err;
+ count -= err;
+ }
+ count = moved_blks;
+ err = moved_blks;
+ /*
+ * User should no longer be charged for these blocks.
+ * Snapshot file owner was charged for these blocks
+ * when they were mapped to snapshot file.
+ */
+ if (inode)
+ dquot_free_block(inode, count);
+ trace_cow_add(handle, moved, count);
+out:
+ /* END moving */
+ ext4_snapshot_cow_end(where, handle, block, err);
+ *maxblocks = count;
+ return err;
+}
+
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 90cb33e..fc5dbec 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -186,7 +186,17 @@ extern int ext4_snapshot_test_and_cow(const char *where,
ext4_snapshot_test_and_cow(__func__, handle, inode, \
block, bh, cow)

-#define ext4_snapshot_move(handle, inode, block, pcount, move) (0)
+extern int ext4_snapshot_test_and_move(const char *where,
+ handle_t *handle, struct inode *inode,
+ ext4_fsblk_t block, int *pcount, int move);
+
+/*
+ * test if blocks should be moved to snapshot
+ * and if they should, try to move them to the active snapshot
+ */
+#define ext4_snapshot_move(handle, inode, block, pcount, move) \
+ ext4_snapshot_test_and_move(__func__, handle, inode, \
+ block, pcount, move)

/*
* Block access functions
--
1.7.4.1


2011-06-07 15:09:42

by Amir G.

Subject: [PATCH v1 17/36] ext4: snapshot block operation - copy block bitmap to snapshot

From: Amir Goldstein <[email protected]>

The snapshot copy of the file system block bitmap is called the COW
bitmap and it is used to check if a block was allocated at the time
that the snapshot was taken.
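
For illustration only (this is not code from the patch), the masking and
lookup can be sketched in user space as below. The names cow_bitmap_init()
and cow_bitmap_test() and the host-order 32-bit word layout are assumptions
made for the example; the kernel code operates on buffer heads and tests
bit ranges with ext4_mb_test_bit_range().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical user-space sketch: build a COW bitmap from a block bitmap.
 * If an exclude mask is given, its bits are cleared from the copy;
 * otherwise the block bitmap is copied verbatim.
 */
static void cow_bitmap_init(uint32_t *dst, const uint32_t *src,
			    const uint32_t *mask, size_t nwords)
{
	size_t i;

	if (!mask) {
		memcpy(dst, src, nwords * sizeof(*dst));
		return;
	}
	for (i = 0; i < nwords; i++)
		dst[i] = src[i] & ~mask[i];
}

/*
 * Return 1 if the bit for block offset @bit is set in the COW bitmap,
 * i.e. the block was allocated when the snapshot was taken; 0 otherwise.
 */
static int cow_bitmap_test(const uint32_t *cow, size_t bit)
{
	return (cow[bit / 32] >> (bit % 32)) & 1;
}
```

A block whose bit is clear in the COW bitmap needs no copy or move before
it is modified; a block whose bit is set is "in use" by the snapshot.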


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/snapshot.c | 250 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/snapshot_ctl.c | 19 ++++-
2 files changed, 264 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index adeb0b6..9fb5c2f 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -75,6 +75,27 @@ __ext4_snapshot_copy_buffer(struct buffer_head *sbh,
}

/*
+ * use @mask to clear exclude bitmap bits from block bitmap
+ * when creating COW bitmap and mark snapshot buffer @sbh uptodate
+ */
+static inline void
+__ext4_snapshot_copy_bitmap(struct buffer_head *sbh,
+ char *dst, const char *src, const char *mask)
+{
+ const u32 *ps = (const u32 *)src, *pm = (const u32 *)mask;
+ u32 *pd = (u32 *)dst;
+ int i;
+
+ if (mask) {
+ for (i = 0; i < SNAPSHOT_ADDR_PER_BLOCK; i++)
+ *pd++ = *ps++ & ~*pm++;
+ } else
+ memcpy(dst, src, SNAPSHOT_BLOCK_SIZE);
+
+ set_buffer_uptodate(sbh);
+}
+
+/*
* ext4_snapshot_complete_cow()
* Unlock a newly COWed snapshot buffer and complete the COW operation.
* Optionally, sync the buffer to disk or add it to the current transaction
@@ -123,13 +144,228 @@ void ext4_snapshot_copy_buffer(struct buffer_head *sbh,
struct buffer_head *bh, const char *mask)
{
lock_buffer(sbh);
- __ext4_snapshot_copy_buffer(sbh, bh);
+ if (mask)
+ __ext4_snapshot_copy_bitmap(sbh,
+ sbh->b_data, bh->b_data, mask);
+ else
+ __ext4_snapshot_copy_buffer(sbh, bh);
unlock_buffer(sbh);
mark_buffer_dirty(sbh);
sync_dirty_buffer(sbh);
}

/*
+ * COW bitmap functions
+ */
+
+/*
+ * ext4_snapshot_init_cow_bitmap() initializes a newly allocated (locked) COW
+ * bitmap buffer on the first block group access after snapshot take.
+ * COW bitmap is created by masking the block bitmap with exclude bitmap.
+ */
+static int
+ext4_snapshot_init_cow_bitmap(struct super_block *sb,
+ unsigned int block_group, struct buffer_head *cow_bh)
+{
+ struct buffer_head *bitmap_bh;
+ char *dst, *src, *mask = NULL;
+
+ bitmap_bh = ext4_read_block_bitmap(sb, block_group);
+ if (!bitmap_bh)
+ return -EIO;
+
+ src = bitmap_bh->b_data;
+ /*
+ * Another COWing task may be changing this block bitmap
+ * (allocating active snapshot blocks) while we are trying
+ * to copy it. At this point we are guaranteed that the only
+ * changes to block bitmap are the new active snapshot blocks,
+ * because before allocating/freeing any other blocks a task
+ * must first get_write_access() on the bitmap and get here.
+ */
+ ext4_lock_group(sb, block_group);
+
+ /*
+ * in the path coming from ext4_snapshot_read_block_bitmap(),
+ * cow_bh is a user page buffer so it has to be kmapped.
+ */
+ dst = kmap_atomic(cow_bh->b_page, KM_USER0);
+ __ext4_snapshot_copy_bitmap(cow_bh, dst, src, mask);
+ kunmap_atomic(dst, KM_USER0);
+
+ ext4_unlock_group(sb, block_group);
+
+ brelse(bitmap_bh);
+ return 0;
+}
+
+/*
+ * ext4_snapshot_read_block_bitmap()
+ * helper function for ext4_snapshot_get_block()
+ * used for fixing the block bitmap user page buffer when
+ * reading through to block device.
+ */
+int ext4_snapshot_read_block_bitmap(struct super_block *sb,
+ unsigned int block_group, struct buffer_head *bitmap_bh)
+{
+ int err;
+
+ lock_buffer(bitmap_bh);
+ err = ext4_snapshot_init_cow_bitmap(sb, block_group, bitmap_bh);
+ unlock_buffer(bitmap_bh);
+ return err;
+}
+
+/*
+ * ext4_snapshot_read_cow_bitmap - read COW bitmap from active snapshot
+ * @handle: JBD handle
+ * @snapshot: active snapshot
+ * @block_group: block group
+ *
+ * Reads the COW bitmap block (i.e., the active snapshot copy of block bitmap).
+ * Creates the COW bitmap on first access to @block_group after snapshot take.
+ * COW bitmap cache is non-persistent, so no need to mark the group descriptor
+ * block dirty. COW bitmap races are handled internally, so no locks are
+ * required when calling this function, only a valid @handle.
+ *
+ * Return COW bitmap buffer on success or NULL in case of failure.
+ */
+static struct buffer_head *
+ext4_snapshot_read_cow_bitmap(handle_t *handle, struct inode *snapshot,
+ unsigned int block_group)
+{
+ struct super_block *sb = snapshot->i_sb;
+ struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+ struct ext4_group_desc *desc;
+ struct buffer_head *cow_bh;
+ ext4_fsblk_t bitmap_blk;
+ ext4_fsblk_t cow_bitmap_blk;
+ int err = 0;
+
+ desc = ext4_get_group_desc(sb, block_group, NULL);
+ if (!desc)
+ return NULL;
+
+ bitmap_blk = ext4_block_bitmap(sb, desc);
+
+ ext4_lock_group(sb, block_group);
+ cow_bitmap_blk = grp->bg_cow_bitmap;
+ ext4_unlock_group(sb, block_group);
+ if (cow_bitmap_blk)
+ return sb_bread(sb, cow_bitmap_blk);
+
+ /*
+ * Try to read cow bitmap block from snapshot file. If COW bitmap
+ * is not yet allocated, create the new COW bitmap block.
+ */
+ cow_bh = ext4_bread(handle, snapshot, SNAPSHOT_IBLOCK(bitmap_blk),
+ SNAPMAP_READ, &err);
+ if (cow_bh)
+ goto out;
+
+ /* allocate snapshot block for COW bitmap */
+ cow_bh = ext4_getblk(handle, snapshot, SNAPSHOT_IBLOCK(bitmap_blk),
+ SNAPMAP_BITMAP, &err);
+ if (!cow_bh)
+ goto out;
+ if (!err) {
+ /*
+ * err should be 1 to indicate newly allocated (locked) buffer.
+ * if err is 0, it means that someone mapped this block
+ * before us, while we are updating the COW bitmap cache.
+ * the pending COW bitmap code should prevent that.
+ */
+ WARN_ON(1);
+ err = -EIO;
+ goto out;
+ }
+
+ err = ext4_snapshot_init_cow_bitmap(sb, block_group, cow_bh);
+ if (err)
+ goto out;
+ /*
+ * complete pending COW operation. no need to wait for tracked reads
+ * of block bitmap, because it is copied directly to page buffer by
+ * ext4_snapshot_read_block_bitmap()
+ */
+ err = ext4_snapshot_complete_cow(handle, snapshot, cow_bh, NULL, 1);
+ if (err)
+ goto out;
+
+ trace_cow_inc(handle, bitmaps);
+out:
+ if (!err && cow_bh) {
+ /* initialized COW bitmap block */
+ cow_bitmap_blk = cow_bh->b_blocknr;
+ snapshot_debug(3, "COW bitmap #%u of snapshot (%u) "
+ "mapped to block [%lld/%lld]\n",
+ block_group, snapshot->i_generation,
+ SNAPSHOT_BLOCK_TUPLE(cow_bitmap_blk));
+ } else {
+ /* uninitialized COW bitmap block */
+ cow_bitmap_blk = 0;
+ snapshot_debug(1, "failed to read COW bitmap #%u of snapshot "
+ "(%u)\n", block_group, snapshot->i_generation);
+ brelse(cow_bh);
+ cow_bh = NULL;
+ }
+
+ /* update or reset COW bitmap cache */
+ ext4_lock_group(sb, block_group);
+ grp->bg_cow_bitmap = cow_bitmap_blk;
+ ext4_unlock_group(sb, block_group);
+
+ return cow_bh;
+}
+
+/*
+ * ext4_snapshot_test_cow_bitmap - test if blocks are in use by snapshot
+ * @handle: JBD handle
+ * @snapshot: active snapshot
+ * @block: address of block
+ * @maxblocks: max no. of blocks to be tested
+ * @excluded: if not NULL, blocks belong to this excluded inode
+ *
+ * If the block bit is set in the COW bitmap, then it was allocated at the time
+ * that the active snapshot was taken and is therefore "in use" by the snapshot.
+ *
+ * Return values:
+ * > 0 - blocks are in use by snapshot
+ * = 0 - blocks are not in use by snapshot
+ * < 0 - error
+ */
+static int
+ext4_snapshot_test_cow_bitmap(handle_t *handle, struct inode *snapshot,
+ ext4_fsblk_t block, int *maxblocks, struct inode *excluded)
+{
+ struct buffer_head *cow_bh;
+ unsigned long block_group = SNAPSHOT_BLOCK_GROUP(block);
+ ext4_grpblk_t bit = SNAPSHOT_BLOCK_GROUP_OFFSET(block);
+ ext4_fsblk_t snapshot_blocks = SNAPSHOT_BLOCKS(snapshot);
+ int ret;
+
+ if (block >= snapshot_blocks)
+ /*
+ * Block is not in use by snapshot because it is past the
+ * last f/s block at the time that the snapshot was taken.
+ * (suggests that f/s was resized after snapshot take)
+ */
+ return 0;
+
+ cow_bh = ext4_snapshot_read_cow_bitmap(handle, snapshot, block_group);
+ if (!cow_bh)
+ return -EIO;
+ /*
+ * if the bit is set in the COW bitmap,
+ * then the block is in use by snapshot
+ */
+
+ ret = ext4_mb_test_bit_range(bit, cow_bh->b_data, maxblocks);
+
+ brelse(cow_bh);
+ return ret;
+}
+/*
* COW functions
*/

@@ -248,8 +484,11 @@ int ext4_snapshot_test_and_cow(const char *where, handle_t *handle,
cow = 0;
}

- if (clear < 0)
- goto cowed;
+ /* get the COW bitmap and test if blocks are in use by snapshot */
+ err = ext4_snapshot_test_cow_bitmap(handle, active_snapshot,
+ block, &count, clear < 0 ? inode : NULL);
+ if (err < 0)
+ goto out;
if (!err) {
trace_cow_inc(handle, ok_bitmap);
goto cowed;
@@ -374,7 +613,10 @@ int ext4_snapshot_test_and_move(const char *where, handle_t *handle,
move = 0;
}

- if (excluded)
+ /* get the COW bitmap and test if blocks are in use by snapshot */
+ err = ext4_snapshot_test_cow_bitmap(handle, active_snapshot,
+ block, &count, excluded ? inode : NULL);
+ if (err < 0)
goto out;
if (!err) {
/* block not in COW bitmap - no need to move */
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 1abda77..810cb21 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -112,7 +112,24 @@ static int ext4_snapshot_set_active(struct super_block *sb,

return 0;
}
-#define ext4_snapshot_reset_bitmap_cache(sb, init) 0
+/*
+ * ext4_snapshot_reset_bitmap_cache():
+ *
+ * Resets the COW/exclude bitmap cache for all block groups.
+ *
+ * Called from snapshot_take() under journal_lock_updates().
+ */
+static void ext4_snapshot_reset_bitmap_cache(struct super_block *sb)
+{
+ struct ext4_group_info *grp;
+ int i;
+
+ for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ grp = ext4_get_group_info(sb, i);
+ grp->bg_cow_bitmap = 0;
+ cond_resched();
+ }
+}

/*
* Snapshot constructor/destructor
--
1.7.4.1


2011-06-07 15:09:45

by Amir G.

Subject: [PATCH v1 18/36] ext4: snapshot control

From: Amir Goldstein <[email protected]>

Snapshot control with chsnap/lssnap.
Take/delete snapshot with chsnap +/-S.
Enable/disable snapshot with chsnap +/-n.
Show snapshot status with lssnap.
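
For illustration only (this is not code from the patch), the state checks
behind the enable, disable and delete requests can be modelled in user
space as below. The SNAP_* bits and the snap_*() helpers are hypothetical
names for the example; the patch itself keeps this state in the
EXT4_SNAPSTATE_* and EXT4_INODE_SNAPFILE_* flags.

```c
#include <errno.h>

/* Hypothetical state bits, loosely modelled on the snapshot state flags. */
enum {
	SNAP_LIST    = 1 << 0,	/* on the snapshots list */
	SNAP_ENABLED = 1 << 1,	/* enabled for user read access */
	SNAP_OPEN    = 1 << 2,	/* open by user (possibly mounted) */
	SNAP_DELETED = 1 << 3,	/* marked for deletion */
};

/* Enable request: only a listed, non-deleted snapshot may be enabled. */
static int snap_enable(unsigned int *state)
{
	if (!(*state & SNAP_LIST))
		return -EINVAL;
	if (*state & SNAP_DELETED)
		return -EPERM;
	*state |= SNAP_ENABLED;
	return 0;
}

/* Disable request: refused while the snapshot is open by a user. */
static int snap_disable(unsigned int *state)
{
	if (!(*state & SNAP_LIST))
		return -EINVAL;
	if (*state & SNAP_OPEN)
		return -EPERM;
	*state &= ~SNAP_ENABLED;
	return 0;
}

/* Delete request: refused while enabled; only marks the snapshot. */
static int snap_delete(unsigned int *state)
{
	if (!(*state & SNAP_LIST))
		return -EINVAL;
	if (*state & SNAP_ENABLED)
		return -EPERM;
	*state |= SNAP_DELETED;
	return 0;
}
```

In this model a snapshot has to be disabled before it can be deleted, and
a deleted snapshot cannot be enabled again.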


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/ioctl.c | 117 ++++++++++
fs/ext4/snapshot.h | 8 +
fs/ext4/snapshot_ctl.c | 593 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 720 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7d66f92..e76faae 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -584,6 +584,8 @@ struct ext4_new_group_data {
/* note ioctl 10 reserved for an early version of the FIEMAP ioctl */
/* note ioctl 11 reserved for filesystem-independent FIEMAP ioctl */
#define EXT4_IOC_ALLOC_DA_BLKS _IO('f', 12)
+#define EXT4_IOC_GETSNAPFLAGS _IOR('f', 13, long)
+#define EXT4_IOC_SETSNAPFLAGS _IOW('f', 14, long)
#define EXT4_IOC_MOVE_EXT _IOWR('f', 15, struct move_extent)

#if defined(__KERNEL__) && defined(CONFIG_COMPAT)
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a8b1254..1ed6f50 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -83,6 +83,21 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (!capable(CAP_SYS_RESOURCE))
goto flags_out;
}
+
+ /*
+ * The SNAPFILE flag can only be changed on directories by
+ * the relevant capability.
+ * It can only be inherited by regular files.
+ */
+ if ((flags ^ oldflags) & EXT4_SNAPFILE_FL) {
+ if (!S_ISDIR(inode->i_mode)) {
+ err = -ENOTDIR;
+ goto flags_out;
+ }
+ if (!capable(CAP_SYS_RESOURCE))
+ goto flags_out;
+ }
+
if (oldflags & EXT4_EXTENTS_FL) {
/* We don't support clearning extent flags */
if (!(flags & EXT4_EXTENTS_FL)) {
@@ -139,6 +154,102 @@ flags_out:
mnt_drop_write(filp->f_path.mnt);
return err;
}
+ case EXT4_IOC_GETSNAPFLAGS:
+ if (!EXT4_SNAPSHOTS(inode->i_sb))
+ return -EOPNOTSUPP;
+
+ ext4_snapshot_get_flags(inode, filp);
+ flags = ext4_get_snapstate_flags(inode);
+ return put_user(flags, (int __user *) arg);
+
+ case EXT4_IOC_SETSNAPFLAGS: {
+ handle_t *handle = NULL;
+ struct ext4_iloc iloc;
+ unsigned int oldflags;
+ int err;
+
+ if (!EXT4_SNAPSHOTS(inode->i_sb))
+ return -EOPNOTSUPP;
+
+ if (!is_owner_or_cap(inode))
+ return -EACCES;
+
+ if (get_user(flags, (int __user *) arg))
+ return -EFAULT;
+
+ /*
+ * Snapshot file state flags can only be changed by
+ * the relevant capability and under snapshot_mutex lock.
+ */
+ if (!ext4_snapshot_file(inode) ||
+ !capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+
+ err = mnt_want_write(filp->f_path.mnt);
+ if (err)
+ return err;
+
+ /* update snapshot 'open' flag under i_mutex */
+ mutex_lock(&inode->i_mutex);
+ ext4_snapshot_get_flags(inode, filp);
+ oldflags = ext4_get_snapstate_flags(inode);
+
+ /*
+ * snapshot_mutex should be held throughout the trio
+ * snapshot_{set_flags,take,update}(). It must be taken
+ * before starting the transaction, otherwise
+ * journal_lock_updates() inside snapshot_take()
+ * can deadlock:
+ * A: journal_start()
+ * A: snapshot_mutex_lock()
+ * B: journal_start()
+ * B: snapshot_mutex_lock() (waiting for A)
+ * A: journal_stop()
+ * A: snapshot_take() ->
+ * A: journal_lock_updates() (waiting for B)
+ */
+ mutex_lock(&EXT4_SB(inode->i_sb)->s_snapshot_mutex);
+
+ handle = ext4_journal_start(inode, 1);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto snapflags_out;
+ }
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ if (err)
+ goto snapflags_err;
+
+ err = ext4_snapshot_set_flags(handle, inode, flags);
+ if (err)
+ goto snapflags_err;
+
+ err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+snapflags_err:
+ ext4_journal_stop(handle);
+ if (err)
+ goto snapflags_out;
+
+ if (!(oldflags & 1UL<<EXT4_SNAPSTATE_LIST) &&
+ (flags & 1UL<<EXT4_SNAPSTATE_LIST))
+ /* setting list flag - take snapshot */
+ err = ext4_snapshot_take(inode);
+snapflags_out:
+ if ((oldflags|flags) & 1UL<<EXT4_SNAPSTATE_LIST) {
+ /* if clearing list flag, cleanup snapshot list */
+ int ret;
+
+ /* update/cleanup snapshots list even if take failed */
+ ret = ext4_snapshot_update(inode->i_sb,
+ !(flags & 1UL<<EXT4_SNAPSTATE_LIST), 0);
+ if (!err)
+ err = ret;
+ }
+
+ mutex_unlock(&EXT4_SB(inode->i_sb)->s_snapshot_mutex);
+ mutex_unlock(&inode->i_mutex);
+ mnt_drop_write(filp->f_path.mnt);
+ return err;
+ }
case EXT4_IOC_GETVERSION:
case EXT4_IOC_GETVERSION_OLD:
return put_user(inode->i_generation, (int __user *) arg);
@@ -210,6 +321,8 @@ setversion_out:

if (get_user(n_blocks_count, (__u32 __user *)arg))
return -EFAULT;
+ /* avoid snapshot_take() in the middle of group_extend() */
+ mutex_lock(&EXT4_SB(sb)->s_snapshot_mutex);

err = mnt_want_write(filp->f_path.mnt);
if (err)
@@ -223,6 +336,7 @@ setversion_out:
}
if (err == 0)
err = err2;
+ mutex_unlock(&EXT4_SB(sb)->s_snapshot_mutex);
mnt_drop_write(filp->f_path.mnt);

return err;
@@ -285,6 +399,8 @@ mext_out:
if (err)
return err;

+ /* avoid snapshot_take() in the middle of group_add() */
+ mutex_lock(&EXT4_SB(sb)->s_snapshot_mutex);
err = ext4_group_add(sb, &input);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
@@ -293,6 +409,7 @@ mext_out:
}
if (err == 0)
err = err2;
+ mutex_unlock(&EXT4_SB(sb)->s_snapshot_mutex);
mnt_drop_write(filp->f_path.mnt);

return err;
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index fc5dbec..007fec0 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -346,6 +346,14 @@ static inline int ext4_snapshot_get_delete_access(handle_t *handle,
/* snapshot_ctl.c */

/*
+ * Snapshot control functions
+ */
+extern void ext4_snapshot_get_flags(struct inode *inode, struct file *filp);
+extern int ext4_snapshot_set_flags(handle_t *handle, struct inode *inode,
+ unsigned int flags);
+extern int ext4_snapshot_take(struct inode *inode);
+
+/*
* Snapshot constructor/destructor
*/
extern int ext4_snapshot_load(struct super_block *sb,
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 810cb21..f2dbef4 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -132,6 +132,576 @@ static void ext4_snapshot_reset_bitmap_cache(struct super_block *sb)
}

/*
+ * Snapshot control functions
+ *
+ * Snapshot files are controlled by changing snapshot flags with chattr and
+ * moving the snapshot file through the stages of its life cycle:
+ *
+ * 1. Creating a snapshot file
+ * The snapfile flag is changed for directories only (chattr +x), so
+ * snapshot files must be created inside a snapshots directory.
+ * They inherit the flag at birth and they die with it.
+ * This helps to avoid various race conditions when changing
+ * regular files to snapshots and back.
+ * Snapshot files are assigned with read-only address space operations, so
+ * they are not writable for users.
+ *
+ * 2. Taking a snapshot
+ * An empty snapshot file becomes the active snapshot after it is added to the
+ * head of the snapshots list by setting its snapshot list flag (chattr -X +S).
+ * snapshot_create() verifies that the file is empty and pre-allocates some
+ * blocks during the ioctl transaction. snapshot_take() locks journal updates
+ * and copies some file system blocks to the pre-allocated blocks and then adds
+ * the snapshot file to the on-disk list and sets it as the active snapshot.
+ *
+ * 3. Mounting a snapshot
+ * A snapshot on the list can be enabled for user read access by setting the
+ * enabled flag (chattr -X +n) and disabled by clearing the enabled flag.
+ * An enabled snapshot can be attached to a loop device and mounted as a
+ * read-only ext2 filesystem.
+ *
+ * 4. Deleting a snapshot
+ * A non-mounted and disabled snapshot may be marked for removal from the
+ * snapshots list by requesting to clear its snapshot list flag (chattr -X -S).
+ * The process of removing a snapshot from the list varies according to the
+ * dependencies between the snapshot and older snapshots on the list:
+ * - if all older snapshots are deleted, the snapshot is removed from the list.
+ * - if some older snapshots are enabled, snapshot_shrink() is called to free
+ * unused blocks, but the snapshot remains on the list.
+ * - if all older snapshots are disabled, snapshot_merge() is called to move
+ * used blocks to an older snapshot and the snapshot is removed from the list.
+ *
+ * 5. Unlinking a snapshot file
+ * When a snapshot file is no longer (or never was) on the snapshots list, it
+ * may be unlinked. Snapshots on the list are protected from user unlink and
+ * truncate operations.
+ *
+ * 6. Discarding all snapshots
+ * An irregular way to abruptly end the lives of all snapshots on the list is by
+ * detaching the snapshot list head using the command: tune2fs -O ^has_snapshot.
+ * This action is applicable on an un-mounted ext4 filesystem. After mounting
+ * the filesystem, the discarded snapshot files will not be loaded, they will
+ * not have the snapshot list flag and therefore, may be unlinked.
+ */
+static int ext4_snapshot_enable(struct inode *inode);
+static int ext4_snapshot_disable(struct inode *inode);
+static int ext4_snapshot_create(struct inode *inode);
+static int ext4_snapshot_delete(struct inode *inode);
+
+/*
+ * ext4_snapshot_get_flags() checks snapshot state
+ * Called from ext4_ioctl() under i_mutex
+ */
+void ext4_snapshot_get_flags(struct inode *inode, struct file *filp)
+{
+ unsigned int open_count = filp->f_path.dentry->d_count;
+
+ /*
+ * 1 count for ioctl (lsattr)
+ * greater count means the snapshot is open by user (mounted?)
+ * We rely on d_count because snapshot shouldn't have hard links.
+ */
+ if (ext4_snapshot_list(inode) && open_count > 1)
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_OPEN);
+ else
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_OPEN);
+ /* copy persistent flags to dynamic state flags */
+ if (ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_DELETED))
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_DELETED);
+ else
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_DELETED);
+ if (ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_SHRUNK))
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_SHRUNK);
+ else
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_SHRUNK);
+}
+
+/*
+ * ext4_snapshot_set_flags() monitors snapshot state changes
+ * Called from ext4_ioctl() under i_mutex and snapshot_mutex
+ */
+int ext4_snapshot_set_flags(handle_t *handle, struct inode *inode,
+ unsigned int flags)
+{
+ unsigned int oldflags = ext4_get_snapstate_flags(inode);
+ int err = 0;
+
+ if ((flags ^ oldflags) & 1UL<<EXT4_SNAPSTATE_ENABLED) {
+ /* enabled/disabled the snapshot during transaction */
+ if (flags & 1UL<<EXT4_SNAPSTATE_ENABLED)
+ err = ext4_snapshot_enable(inode);
+ else
+ err = ext4_snapshot_disable(inode);
+ }
+ if (err)
+ goto out;
+
+ if ((flags ^ oldflags) & 1UL<<EXT4_SNAPSTATE_LIST) {
+ /* add/delete to snapshots list during transaction */
+ if (flags & 1UL<<EXT4_SNAPSTATE_LIST)
+ err = ext4_snapshot_create(inode);
+ else
+ err = ext4_snapshot_delete(inode);
+ }
+ if (err)
+ goto out;
+
+out:
+ /*
+ * retake reserve inode write from ext4_ioctl() and mark inode
+ * dirty
+ */
+ if (!err)
+ err = ext4_mark_inode_dirty(handle, inode);
+ return err;
+}
+
+/*
+ * If we have fewer than nblocks credits,
+ * extend transaction by at most EXT4_MAX_TRANS_DATA.
+ * If that fails, restart the transaction &
+ * regain write access for the inode block.
+ */
+int __extend_or_restart_transaction(const char *where,
+ handle_t *handle, struct inode *inode, int nblocks)
+{
+ int err;
+
+ if (ext4_handle_has_enough_credits(handle, nblocks))
+ return 0;
+
+ if (nblocks < EXT4_MAX_TRANS_DATA)
+ nblocks = EXT4_MAX_TRANS_DATA;
+
+ err = __ext4_journal_extend(where, handle, nblocks);
+ if (err < 0)
+ return err;
+ if (err) {
+ if (inode) {
+ /* lazy way to do mark_iloc_dirty() */
+ err = ext4_mark_inode_dirty(handle, inode);
+ if (err)
+ return err;
+ }
+ err = __ext4_journal_restart(where, handle, nblocks);
+ if (err)
+ return err;
+ if (inode)
+ /* lazy way to do reserve_inode_write() */
+ err = ext4_mark_inode_dirty(handle, inode);
+ }
+
+ return err;
+}
+
+#define extend_or_restart_transaction(handle, nblocks) \
+ __extend_or_restart_transaction(__func__, (handle), NULL, (nblocks))
+#define extend_or_restart_transaction_inode(handle, inode, nblocks) \
+ __extend_or_restart_transaction(__func__, (handle), (inode), (nblocks))
+
+
+static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
+ unsigned long ino,
+ struct ext4_iloc *iloc)
+{
+ ext4_fsblk_t block;
+ struct ext4_group_desc *desc;
+ int inodes_per_block, inode_offset;
+
+ iloc->bh = NULL;
+ iloc->offset = 0;
+ iloc->block_group = 0;
+
+ if (!ext4_valid_inum(sb, ino))
+ return 0;
+
+ iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+ desc = ext4_get_group_desc(sb, iloc->block_group, NULL);
+ if (!desc)
+ return 0;
+
+ /*
+ * Figure out the offset within the block group inode table
+ */
+ inodes_per_block = (EXT4_BLOCK_SIZE(sb) / EXT4_INODE_SIZE(sb));
+ inode_offset = ((ino - 1) %
+ EXT4_INODES_PER_GROUP(sb));
+ block = ext4_inode_table(sb, desc) + (inode_offset / inodes_per_block);
+ iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
+ return block;
+}
+
+/*
+ * ext4_snapshot_create() initializes a snapshot file
+ * and adds it to the list of snapshots
+ * Called under i_mutex and snapshot_mutex
+ */
+static int ext4_snapshot_create(struct inode *inode)
+{
+ handle_t *handle;
+ struct super_block *sb = inode->i_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct inode *active_snapshot = ext4_snapshot_has_active(sb);
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ int i, err, ret;
+ ext4_fsblk_t snapshot_blocks = ext4_blocks_count(sbi->s_es);
+ if (active_snapshot) {
+ snapshot_debug(1, "failed to add snapshot because active "
+ "snapshot (%u) has to be deleted first\n",
+ active_snapshot->i_generation);
+ return -EINVAL;
+ }
+
+ /* prevent take of unlinked snapshot file */
+ if (!inode->i_nlink) {
+ snapshot_debug(1, "failed to create snapshot file (ino=%lu) "
+ "because it has 0 nlink count\n",
+ inode->i_ino);
+ return -EINVAL;
+ }
+
+ /* prevent recycling of old snapshot files */
+ if (ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_DELETED)) {
+ snapshot_debug(1, "deleted snapshot file (ino=%lu) cannot "
+ "be reused - it may be unlinked\n",
+ inode->i_ino);
+ return -EINVAL;
+ }
+
+ /* verify that no inode blocks are allocated */
+ for (i = 0; i < EXT4_N_BLOCKS; i++) {
+ if (ei->i_data[i])
+ break;
+ }
+ /* Don't need i_size_read because we hold i_mutex */
+ if (i != EXT4_N_BLOCKS ||
+ inode->i_size > 0 || ei->i_disksize > 0) {
+ snapshot_debug(1, "failed to create snapshot file (ino=%lu) "
+ "because it is not empty (i_data[%d]=%u, "
+ "i_size=%lld, i_disksize=%lld)\n",
+ inode->i_ino, i, ei->i_data[i],
+ inode->i_size, ei->i_disksize);
+ return -EINVAL;
+ }
+
+ /*
+ * Take a reference to the small transaction that started in
+ * ext4_ioctl(). We will extend or restart this transaction as we go
+ * along. journal_start(n > 1) would not have increased the buffer
+ * credits.
+ */
+ handle = ext4_journal_start(inode, 1);
+
+ err = extend_or_restart_transaction_inode(handle, inode, 2);
+ if (err)
+ goto out_handle;
+
+ /* record the new snapshot ID in the snapshot inode generation field */
+ inode->i_generation = le32_to_cpu(sbi->s_es->s_snapshot_id) + 1;
+ if (inode->i_generation == 0)
+ /* 0 is not a valid snapshot id */
+ inode->i_generation = 1;
+
+ /* record the file system size in the snapshot inode disksize field */
+ SNAPSHOT_SET_BLOCKS(inode, snapshot_blocks);
+
+ lock_super(sb);
+ err = ext4_journal_get_write_access(handle, sbi->s_sbh);
+ sbi->s_es->s_snapshot_list = cpu_to_le32(inode->i_ino);
+ if (!err)
+ err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
+ unlock_super(sb);
+ if (err)
+ goto out_handle;
+
+ err = ext4_mark_inode_dirty(handle, inode);
+ if (err)
+ goto out_handle;
+
+ snapshot_debug(1, "snapshot (%u) created\n", inode->i_generation);
+ err = 0;
+out_handle:
+ ret = ext4_journal_stop(handle);
+ if (!err)
+ err = ret;
+ return err;
+}
+
+
+/*
+ * ext4_snapshot_take() makes a new snapshot file
+ * into the active snapshot
+ *
+ * this function calls journal_lock_updates()
+ * and should not be called during a journal transaction
+ * Called from ext4_ioctl() under i_mutex and snapshot_mutex
+ */
+int ext4_snapshot_take(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_super_block *es = NULL;
+ struct buffer_head *es_bh = NULL;
+ struct buffer_head *sbh = NULL;
+ int err = -EIO;
+
+ if (!sbi->s_sbh)
+ goto out_err;
+ else if (sbi->s_sbh->b_blocknr != 0) {
+ snapshot_debug(1, "warning: unexpected super block at block "
+ "(%lld:%d)!\n", (long long)sbi->s_sbh->b_blocknr,
+ (int)((char *)sbi->s_es - (char *)sbi->s_sbh->b_data));
+ } else if (sbi->s_es->s_magic != cpu_to_le16(EXT4_SUPER_MAGIC)) {
+ snapshot_debug(1, "warning: super block of snapshot (%u) is "
+ "broken!\n", inode->i_generation);
+ } else
+ es_bh = ext4_getblk(NULL, inode, SNAPSHOT_IBLOCK(0),
+ SNAPMAP_READ, &err);
+
+ if (!es_bh || es_bh->b_blocknr == 0) {
+ snapshot_debug(1, "warning: super block of snapshot (%u) not "
+ "allocated\n", inode->i_generation);
+ goto out_err;
+ } else {
+ snapshot_debug(4, "super block of snapshot (%u) mapped to "
+ "block (%lld)\n", inode->i_generation,
+ (long long)es_bh->b_blocknr);
+ es = (struct ext4_super_block *)(es_bh->b_data +
+ ((char *)sbi->s_es -
+ sbi->s_sbh->b_data));
+ }
+
+ err = -EIO;
+
+ /*
+ * flush journal to disk and clear the RECOVER flag
+ * before taking the snapshot
+ */
+ freeze_super(sb);
+ lock_super(sb);
+
+#ifdef CONFIG_EXT4_DEBUG
+ if (snapshot_enable_test[SNAPTEST_TAKE]) {
+ snapshot_debug(1, "taking snapshot (%u) ...\n",
+ inode->i_generation);
+ /* sleep 1 tunable delay unit */
+ snapshot_test_delay(SNAPTEST_TAKE);
+ }
+#endif
+
+
+ /* reset i_size and invalidate page cache */
+ SNAPSHOT_SET_DISABLED(inode);
+ /* reset COW bitmap cache */
+ ext4_snapshot_reset_bitmap_cache(sb);
+ /* set as in-memory active snapshot */
+ err = ext4_snapshot_set_active(sb, inode);
+ if (err)
+ goto out_unlockfs;
+
+ /* set as on-disk active snapshot */
+
+ sbi->s_es->s_snapshot_id =
+ cpu_to_le32(le32_to_cpu(sbi->s_es->s_snapshot_id) + 1);
+ if (sbi->s_es->s_snapshot_id == 0)
+ /* 0 is not a valid snapshot id */
+ sbi->s_es->s_snapshot_id = cpu_to_le32(1);
+ sbi->s_es->s_snapshot_inum = cpu_to_le32(inode->i_ino);
+ ext4_snapshot_set_tid(sb);
+
+ err = 0;
+out_unlockfs:
+ unlock_super(sb);
+ thaw_super(sb);
+
+ if (err)
+ goto out_err;
+
+ snapshot_debug(1, "snapshot (%u) has been taken\n",
+ inode->i_generation);
+
+out_err:
+ brelse(es_bh);
+ brelse(sbh);
+ return err;
+}
+
+/*
+ * ext4_snapshot_enable() enables snapshot mount
+ * sets the enabled flag and i_size to allow loop device mount
+ * Called under i_mutex and snapshot_mutex
+ */
+static int ext4_snapshot_enable(struct inode *inode)
+{
+ if (!ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "ext4_snapshot_enable() called with "
+ "snapshot file (ino=%lu) not on list\n",
+ inode->i_ino);
+ return -EINVAL;
+ }
+
+ if (ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_DELETED)) {
+ snapshot_debug(1, "enable of deleted snapshot (%u) "
+ "is not permitted\n",
+ inode->i_generation);
+ return -EPERM;
+ }
+
+ /*
+ * set i_size to block device size to enable loop device mount
+ */
+ SNAPSHOT_SET_ENABLED(inode);
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_ENABLED);
+
+ /* Don't need i_size_read because we hold i_mutex */
+ snapshot_debug(4, "setting snapshot (%u) i_size to (%lld)\n",
+ inode->i_generation, inode->i_size);
+ snapshot_debug(1, "snapshot (%u) enabled\n", inode->i_generation);
+ return 0;
+}
+
+/*
+ * ext4_snapshot_disable() disables snapshot mount
+ * Called under i_mutex and snapshot_mutex
+ */
+static int ext4_snapshot_disable(struct inode *inode)
+{
+ if (!ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "ext4_snapshot_disable() called with "
+ "snapshot file (ino=%lu) not on list\n",
+ inode->i_ino);
+ return -EINVAL;
+ }
+
+ if (ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_OPEN)) {
+ snapshot_debug(1, "disable of mounted snapshot (%u) "
+ "is not permitted\n",
+ inode->i_generation);
+ return -EPERM;
+ }
+
+ /* reset i_size and invalidate page cache */
+ SNAPSHOT_SET_DISABLED(inode);
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_ENABLED);
+
+ /* Don't need i_size_read because we hold i_mutex */
+ snapshot_debug(4, "setting snapshot (%u) i_size to (%lld)\n",
+ inode->i_generation, inode->i_size);
+ snapshot_debug(1, "snapshot (%u) disabled\n", inode->i_generation);
+ return 0;
+}
+
+/*
+ * ext4_snapshot_delete() marks snapshot for deletion
+ * Called under i_mutex and snapshot_mutex
+ */
+static int ext4_snapshot_delete(struct inode *inode)
+{
+ if (!ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "ext4_snapshot_delete() called with "
+ "snapshot file (ino=%lu) not on list\n",
+ inode->i_ino);
+ return -EINVAL;
+ }
+
+ if (ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_ENABLED)) {
+ snapshot_debug(1, "delete of enabled snapshot (%u) "
+ "is not permitted\n",
+ inode->i_generation);
+ return -EPERM;
+ }
+
+ /* mark deleted for later cleanup to finish the job */
+ ext4_set_inode_flag(inode, EXT4_INODE_SNAPFILE_DELETED);
+ snapshot_debug(1, "snapshot (%u) marked for deletion\n",
+ inode->i_generation);
+ return 0;
+}
+
+/*
+ * ext4_snapshot_remove - removes a snapshot from the list
+ * @inode: snapshot inode
+ *
+ * Removes the snapshot inode from the in-memory and on-disk snapshots list
+ * and truncates the snapshot inode.
+ * Called from ext4_snapshot_update/cleanup/merge() under snapshot_mutex.
+ * Returns 0 on success and <0 on error.
+ */
+static int ext4_snapshot_remove(struct inode *inode)
+{
+ handle_t *handle;
+ struct ext4_sb_info *sbi;
+ int err = 0, ret;
+
+ /* elevate ref count until final cleanup */
+ if (!igrab(inode))
+ return -EIO;
+
+ if (ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_ACTIVE) ||
+ ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_ENABLED) ||
+ ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_INUSE)) {
+ snapshot_debug(1, "ext4_snapshot_remove() called with active/"
+ "enabled/in-use snapshot file (ino=%lu)\n",
+ inode->i_ino);
+ err = -EINVAL;
+ goto out_err;
+ }
+
+ /* start large truncate transaction that will be extended/restarted */
+ handle = ext4_journal_start(inode, EXT4_MAX_TRANS_DATA);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto out_err;
+ }
+ sbi = EXT4_SB(inode->i_sb);
+
+
+ err = extend_or_restart_transaction_inode(handle, inode, 2);
+ if (err)
+ goto out_handle;
+
+ lock_super(inode->i_sb);
+ err = ext4_journal_get_write_access(handle, sbi->s_sbh);
+ sbi->s_es->s_snapshot_list = 0;
+ if (!err)
+ err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
+ unlock_super(inode->i_sb);
+ if (err)
+ goto out_handle;
+ /*
+ * At this point, this snapshot is empty and not on the snapshots list.
+ * As long as it was on the list it had to have the LIST flag to prevent
+ * truncate/unlink. Now that it is removed from the list, the LIST flag
+ * and other snapshot status flags should be cleared. It will still
+ * have the SNAPFILE and SNAPFILE_DELETED persistent flags to indicate
+ * this is a deleted snapshot that should not be recycled.
+ */
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_LIST);
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_ENABLED);
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_ACTIVE);
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_INUSE);
+
+out_handle:
+ ret = ext4_journal_stop(handle);
+ if (!err)
+ err = ret;
+ if (err)
+ goto out_err;
+
+ snapshot_debug(1, "snapshot (%u) deleted\n", inode->i_generation);
+
+ err = 0;
+out_err:
+ /* drop final ref count - taken on entry to this function */
+ iput(inode);
+ if (err) {
+ snapshot_debug(1, "failed to delete snapshot (%u)\n",
+ inode->i_generation);
+ }
+ return err;
+}
+
+/*
* Snapshot constructor/destructor
*/
/*
@@ -250,6 +820,8 @@ void ext4_snapshot_destroy(struct super_block *sb)
int ext4_snapshot_update(struct super_block *sb, int cleanup, int read_only)
{
struct inode *active_snapshot = ext4_snapshot_has_active(sb);
+ struct inode *used_by = NULL; /* last non-deleted snapshot found */
+ int deleted;
int err = 0;

BUG_ON(read_only && cleanup);
@@ -262,5 +834,26 @@ int ext4_snapshot_update(struct super_block *sb, int cleanup, int read_only)
}


+ if (!active_snapshot || !cleanup || used_by)
+ return 0;
+
+ /* if all snapshots are deleted - deactivate active snapshot */
+ deleted = ext4_test_inode_flag(active_snapshot,
+ EXT4_INODE_SNAPFILE_DELETED);
+ if (deleted && igrab(active_snapshot)) {
+ /* lock journal updates before deactivating snapshot */
+ freeze_super(sb);
+ lock_super(sb);
+ /* deactivate in-memory active snapshot - cannot fail */
+ (void) ext4_snapshot_set_active(sb, NULL);
+ /* clear on-disk active snapshot */
+ EXT4_SB(sb)->s_es->s_snapshot_inum = 0;
+ unlock_super(sb);
+ thaw_super(sb);
+ /* remove unused deleted active snapshot */
+ err = ext4_snapshot_remove(active_snapshot);
+ /* drop the refcount to 0 */
+ iput(active_snapshot);
+ }
return err;
}
--
1.7.4.1


2011-06-07 15:09:50

by Amir G.

Subject: [PATCH v1 19/36] ext4: snapshot control - init new snapshot

From: Amir Goldstein <[email protected]>

On snapshot create, a few special blocks (i.e., the super block and
group descriptors) are pre-allocated and on snapshot take, they are
copied under journal_lock_updates(). This is done to avoid the
recursion that would be caused by COWing these blocks after the
snapshot becomes active.
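
Besides the super block and group descriptors, ext4_snapshot_create()
in the diff below also pre-allocates the [d,t]ind blocks that map the
snapshot file. A minimal userspace sketch of that count follows. The
function name and the addr_bits parameter are illustrative only; the
value 10 used in the comments assumes 4K blocks with 4-byte block
addresses, which this patch does not state.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the nind computation in ext4_snapshot_create().
 * addr_bits stands in for SNAPSHOT_ADDR_PER_BLOCK_BITS (assumed to be
 * 10 for 4K blocks; that value is an assumption, not from the patch).
 */
static int snapshot_nind(uint64_t snapshot_blocks, unsigned addr_bits)
{
	uint64_t double_blocks;
	int nind = 1;	/* one double indirect block is always needed */

	assert(addr_bits >= 1 && addr_bits <= 20);
	double_blocks = 1ULL << (2 * addr_bits);

	/* blocks beyond the dind range need triple indirect blocks */
	if (snapshot_blocks > double_blocks)
		nind += (int)((snapshot_blocks - double_blocks) >>
			      (3 * addr_bits)) + 1;
	return nind;
}
```

With addr_bits = 10, a file system of up to 2^20 blocks needs only the
double indirect block, and one of 2^32 blocks needs that block plus
four triple indirect blocks.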


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/snapshot_ctl.c | 308 ++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 308 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index f2dbef4..9d915a9 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -299,6 +299,48 @@ int __extend_or_restart_transaction(const char *where,
#define extend_or_restart_transaction_inode(handle, inode, nblocks) \
__extend_or_restart_transaction(__func__, (handle), (inode), (nblocks))

+/*
+ * helper function for snapshot_create().
+ * places pre-allocated [d,t]ind blocks in position
+ * after they have been allocated as direct blocks.
+ */
+static inline int ext4_snapshot_shift_blocks(struct ext4_inode_info *ei,
+ int from, int to, int count)
+{
+ int i, err = -EIO;
+
+ /* move from direct blocks range */
+ BUG_ON(from < 0 || from + count > EXT4_NDIR_BLOCKS);
+ /* to indirect blocks range */
+ BUG_ON(to < EXT4_NDIR_BLOCKS || to + count > EXT4_SNAPSHOT_N_BLOCKS);
+
+ /*
+ * i_data_sem is held whenever allocating or freeing inode
+ * blocks.
+ */
+ down_write(&ei->i_data_sem);
+
+ /*
+ * verify that 'from' blocks are allocated
+ * and that 'to' blocks are not allocated.
+ */
+ for (i = 0; i < count; i++)
+ if (!ei->i_data[from+i] ||
+ ei->i_data[(to+i)%EXT4_N_BLOCKS])
+ goto out;
+
+ /*
+ * shift 'count' blocks from position 'from' to 'to'
+ */
+ for (i = 0; i < count; i++) {
+ ei->i_data[(to+i)%EXT4_N_BLOCKS] = ei->i_data[from+i];
+ ei->i_data[from+i] = 0;
+ }
+ err = 0;
+out:
+ up_write(&ei->i_data_sem);
+ return err;
+}

static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
unsigned long ino,
@@ -344,6 +386,13 @@ static int ext4_snapshot_create(struct inode *inode)
struct inode *active_snapshot = ext4_snapshot_has_active(sb);
struct ext4_inode_info *ei = EXT4_I(inode);
int i, err, ret;
+ int count, nind;
+ const long double_blocks = (1 << (2 * SNAPSHOT_ADDR_PER_BLOCK_BITS));
+ struct buffer_head *bh = NULL;
+ struct ext4_group_desc *desc;
+ unsigned long ino;
+ struct ext4_iloc iloc;
+ ext4_fsblk_t bmap_blk = 0, imap_blk = 0, inode_blk = 0;
ext4_fsblk_t snapshot_blocks = ext4_blocks_count(sbi->s_es);
if (active_snapshot) {
snapshot_debug(1, "failed to add snapshot because active "
@@ -418,6 +467,140 @@ static int ext4_snapshot_create(struct inode *inode)
if (err)
goto out_handle;

+ /* small filesystems can be mapped with just 1 double indirect block */
+ nind = 1;
+ if (snapshot_blocks > double_blocks)
+ /* add up to 4 triple indirect blocks to map 2^32 blocks */
+ nind += ((snapshot_blocks - double_blocks) >>
+ (3 * SNAPSHOT_ADDR_PER_BLOCK_BITS)) + 1;
+ if (nind > 2 + EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS) {
+ snapshot_debug(1, "need too many [d,t]ind blocks (%d) "
+ "for snapshot (%u)\n",
+ nind, inode->i_generation);
+ err = -EFBIG;
+ goto out_handle;
+ }
+
+ err = extend_or_restart_transaction_inode(handle, inode,
+ nind * EXT4_DATA_TRANS_BLOCKS(sb));
+ if (err)
+ goto out_handle;
+
+ /* pre-allocate and zero out [d,t]ind blocks */
+ for (i = 0; i < nind; i++) {
+ brelse(bh);
+ bh = ext4_getblk(handle, inode, i, SNAPMAP_WRITE, &err);
+ if (!bh)
+ break;
+ /* zero out indirect block and journal as dirty metadata */
+ err = ext4_journal_get_write_access(handle, bh);
+ if (err)
+ break;
+ lock_buffer(bh);
+ memset(bh->b_data, 0, bh->b_size);
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+ err = ext4_handle_dirty_metadata(handle, NULL, bh);
+ if (err)
+ break;
+ }
+ brelse(bh);
+ if (!bh || err) {
+ snapshot_debug(1, "failed to initiate [d,t]ind block (%d) "
+ "for snapshot (%u)\n",
+ i, inode->i_generation);
+ goto out_handle;
+ }
+ /* place pre-allocated [d,t]ind blocks in position */
+ err = ext4_snapshot_shift_blocks(ei, 0, EXT4_DIND_BLOCK, nind);
+ if (err) {
+ snapshot_debug(1, "failed to move pre-allocated [d,t]ind blocks"
+ " for snapshot (%u)\n",
+ inode->i_generation);
+ goto out_handle;
+ }
+
+ /* allocate super block and group descriptors for snapshot */
+ count = sbi->s_gdb_count + 1;
+ err = count;
+ for (i = 0; err > 0 && i < count; i += err) {
+ err = extend_or_restart_transaction_inode(handle, inode,
+ EXT4_DATA_TRANS_BLOCKS(sb));
+ if (err)
+ goto out_handle;
+ err = ext4_snapshot_map_blocks(handle, inode, i, count - i,
+ NULL, SNAPMAP_WRITE);
+ }
+ if (err <= 0) {
+ snapshot_debug(1, "failed to allocate super block and %d "
+ "group descriptor blocks for snapshot (%u)\n",
+ count - 1, inode->i_generation);
+ if (err)
+ err = -EIO;
+ goto out_handle;
+ }
+
+ ino = inode->i_ino;
+ /*
+ * pre-allocate the following blocks in the new snapshot:
+ * - block and inode bitmap blocks of ino's block group
+ * - inode table block that contains ino
+ */
+ err = extend_or_restart_transaction_inode(handle, inode,
+ 3 * EXT4_DATA_TRANS_BLOCKS(sb));
+ if (err)
+ goto out_handle;
+
+ inode_blk = ext4_get_inode_block(sb, ino, &iloc);
+
+ bmap_blk = 0;
+ imap_blk = 0;
+ desc = ext4_get_group_desc(sb, iloc.block_group, NULL);
+ if (!desc)
+ goto next_snapshot;
+
+ bmap_blk = ext4_block_bitmap(sb, desc);
+ imap_blk = ext4_inode_bitmap(sb, desc);
+ if (!bmap_blk || !imap_blk)
+ goto next_snapshot;
+
+ count = 1;
+ if (imap_blk == bmap_blk + 1)
+ count++;
+ if ((count > 1) && (inode_blk == imap_blk + 1))
+ count++;
+ /* try to allocate all blocks at once */
+ err = ext4_snapshot_map_blocks(handle, inode,
+ bmap_blk, count,
+ NULL, SNAPMAP_WRITE);
+ count = err;
+ /* allocate remaining blocks one by one */
+ if (err > 0 && count < 2)
+ err = ext4_snapshot_map_blocks(handle, inode,
+ imap_blk, 1,
+ NULL,
+ SNAPMAP_WRITE);
+ if (err > 0 && count < 3)
+ err = ext4_snapshot_map_blocks(handle, inode,
+ inode_blk, 1,
+ NULL,
+ SNAPMAP_WRITE);
+next_snapshot:
+ if (!bmap_blk || !imap_blk || !inode_blk || err < 0) {
+#ifdef CONFIG_EXT4_DEBUG
+ ext4_fsblk_t blk0 = iloc.block_group *
+ EXT4_BLOCKS_PER_GROUP(sb);
+ snapshot_debug(1, "failed to allocate block/inode bitmap "
+ "or inode table block of inode (%lu) "
+ "(%llu,%llu,%llu/%u) for snapshot (%u)\n",
+ ino, bmap_blk - blk0,
+ imap_blk - blk0, inode_blk - blk0,
+ iloc.block_group, inode->i_generation);
+#endif
+ if (!err)
+ err = -EIO;
+ goto out_handle;
+ }
snapshot_debug(1, "snapshot (%u) created\n", inode->i_generation);
err = 0;
out_handle:
@@ -427,6 +610,68 @@ out_handle:
return err;
}

+/*
+ * ext4_snapshot_copy_block() - copy block to new snapshot
+ * @snapshot: new snapshot to copy block to
+ * @bh: source buffer to be copied
+ * @mask: if not NULL, mask buffer data before copying to snapshot
+ * (used to mask block bitmap with exclude bitmap)
+ * @name: name of copied block to print
+ * @idx: index of copied block to print
+ *
+ * Called from ext4_snapshot_take() under journal_lock_updates()
+ * Returns snapshot buffer on success, NULL on error
+ */
+static struct buffer_head *ext4_snapshot_copy_block(struct inode *snapshot,
+ struct buffer_head *bh, const char *mask,
+ const char *name, unsigned long idx)
+{
+ struct buffer_head *sbh = NULL;
+ int err;
+
+ if (!bh)
+ return NULL;
+
+ sbh = ext4_getblk(NULL, snapshot,
+ SNAPSHOT_IBLOCK(bh->b_blocknr),
+ SNAPMAP_READ, &err);
+
+ if (!sbh || sbh->b_blocknr == bh->b_blocknr) {
+ snapshot_debug(1, "failed to copy %s (%lu) "
+ "block [%llu/%llu] to snapshot (%u)\n",
+ name, idx,
+ SNAPSHOT_BLOCK_TUPLE(bh->b_blocknr),
+ snapshot->i_generation);
+ brelse(sbh);
+ return NULL;
+ }
+
+ ext4_snapshot_copy_buffer(sbh, bh, mask);
+
+ snapshot_debug(4, "copied %s (%lu) block [%llu/%llu] "
+ "to snapshot (%u)\n",
+ name, idx,
+ SNAPSHOT_BLOCK_TUPLE(bh->b_blocknr),
+ snapshot->i_generation);
+ return sbh;
+}
+
+/*
+ * List of blocks which are copied to snapshot for every special inode.
+ * Keep block bitmap first and inode table block last in the list.
+ */
+enum copy_inode_block {
+ COPY_BLOCK_BITMAP,
+ COPY_INODE_BITMAP,
+ COPY_INODE_TABLE,
+ COPY_INODE_BLOCKS_NUM
+};
+
+static char *copy_inode_block_name[COPY_INODE_BLOCKS_NUM] = {
+ "block bitmap",
+ "inode bitmap",
+ "inode table"
+};

/*
* ext4_snapshot_take() makes a new snapshot file
@@ -443,6 +688,12 @@ int ext4_snapshot_take(struct inode *inode)
struct ext4_super_block *es = NULL;
struct buffer_head *es_bh = NULL;
struct buffer_head *sbh = NULL;
+ struct buffer_head *bhs[COPY_INODE_BLOCKS_NUM] = { NULL };
+ const char *mask = NULL;
+ struct inode *curr_inode;
+ struct ext4_iloc iloc;
+ struct ext4_group_desc *desc;
+ int i;
int err = -EIO;

if (!sbi->s_sbh)
@@ -489,6 +740,61 @@ int ext4_snapshot_take(struct inode *inode)
}
#endif

+ /*
+ * copy group descriptors to snapshot
+ */
+ for (i = 0; i < sbi->s_gdb_count; i++) {
+ brelse(sbh);
+ sbh = ext4_snapshot_copy_block(inode,
+ sbi->s_group_desc[i], NULL,
+ "GDT", i);
+ if (!sbh)
+ goto out_unlockfs;
+ }
+
+ curr_inode = inode;
+ /*
+ * copy the following blocks to the new snapshot:
+ * - block and inode bitmap blocks of curr_inode block group
+ * - inode table block that contains curr_inode
+ */
+ iloc.block_group = 0;
+ err = ext4_get_inode_loc(curr_inode, &iloc);
+ brelse(bhs[COPY_INODE_TABLE]);
+ bhs[COPY_INODE_TABLE] = iloc.bh;
+ desc = ext4_get_group_desc(sb, iloc.block_group, NULL);
+ if (err || !desc) {
+ snapshot_debug(1, "failed to read inode and bitmap blocks "
+ "of inode (%lu)\n", curr_inode->i_ino);
+ err = err ? : -EIO;
+ goto out_unlockfs;
+ }
+ brelse(bhs[COPY_BLOCK_BITMAP]);
+ bhs[COPY_BLOCK_BITMAP] = sb_bread(sb,
+ ext4_block_bitmap(sb, desc));
+ brelse(bhs[COPY_INODE_BITMAP]);
+ bhs[COPY_INODE_BITMAP] = sb_bread(sb,
+ ext4_inode_bitmap(sb, desc));
+ err = -EIO;
+ for (i = 0; i < COPY_INODE_BLOCKS_NUM; i++) {
+ brelse(sbh);
+ sbh = ext4_snapshot_copy_block(inode, bhs[i], mask,
+ copy_inode_block_name[i], curr_inode->i_ino);
+ if (!sbh)
+ goto out_unlockfs;
+ mask = NULL;
+ }
+
+ /*
+ * copy super block to snapshot and fix it
+ */
+ lock_buffer(es_bh);
+ memcpy(es_bh->b_data, sbi->s_sbh->b_data, sb->s_blocksize);
+ set_buffer_uptodate(es_bh);
+ unlock_buffer(es_bh);
+ mark_buffer_dirty(es_bh);
+ sync_dirty_buffer(es_bh);
+

/* reset i_size and invalidate page cache */
SNAPSHOT_SET_DISABLED(inode);
@@ -523,6 +829,8 @@ out_unlockfs:
out_err:
brelse(es_bh);
brelse(sbh);
+ for (i = 0; i < COPY_INODE_BLOCKS_NUM; i++)
+ brelse(bhs[i]);
return err;
}

--
1.7.4.1


2011-06-07 15:09:52

by Amir G.

Subject: [PATCH v1 20/36] ext4: snapshot control - fix new snapshot

From: Amir Goldstein <[email protected]>

On snapshot take, after copying the pre-allocated blocks, some are
fixed to make the snapshot image appear as a valid Ext4 file system.
The has_snapshot flag is cleared from the super block, as is the
last_snapshot field, and all snapshot inodes are cleared
(to appear as empty inodes).
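
The inode fix described above can be sketched in userspace as follows.
This is only an illustration of the idea: struct toy_raw_inode, its
field names and the TOY_* constants are hypothetical stand-ins and not
the real ext4 on-disk layout or flag values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_N_BLOCKS    15          /* hypothetical block map length */
#define TOY_SNAPFILE_FL 0x01000000u /* hypothetical flag bit */

/* hypothetical, simplified stand-in for the on-disk inode */
struct toy_raw_inode {
	uint64_t size;
	uint64_t sectors;	/* allocated space in 512-byte units */
	uint32_t flags;
	uint32_t block[TOY_N_BLOCKS];
};

/*
 * Make the snapshot's copy of a snapshot inode look like an empty
 * regular file. Returns the number of file system blocks detached,
 * which the caller adds to the free blocks count of the snapshot's
 * super block copy.
 */
static uint64_t toy_fix_snapshot_inode(struct toy_raw_inode *raw,
				       unsigned blkbits)
{
	uint64_t detached;

	assert(raw != NULL && blkbits >= 9);
	/* convert 512-byte sectors to file system blocks */
	detached = raw->sectors >> (blkbits - 9);

	raw->size = 0;
	raw->sectors = 0;
	raw->flags &= ~TOY_SNAPFILE_FL;
	memset(raw->block, 0, sizeof(raw->block));
	return detached;
}
```

The returned count plays the role of excluded_blocks in the diff below:
it compensates the free blocks count for blocks that are excluded from
the snapshot's block bitmap.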


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/inode.c | 4 +-
fs/ext4/snapshot_ctl.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e76faae..198d7d4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1839,6 +1839,8 @@ struct buffer_head *ext4_bread(handle_t *, struct inode *,
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);

+extern blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei);
extern struct inode *ext4_iget(struct super_block *, unsigned long);
extern int ext4_write_inode(struct inode *, struct writeback_control *);
extern int ext4_setattr(struct dentry *, struct iattr *);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1558a7b..d703a55 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5178,8 +5178,8 @@ void ext4_get_inode_flags(struct ext4_inode_info *ei)
} while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl);
}

-static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
- struct ext4_inode_info *ei)
+blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei)
{
blkcnt_t i_blocks ;
struct inode *inode = &(ei->vfs_inode);
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 9d915a9..360581d 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -393,6 +393,7 @@ static int ext4_snapshot_create(struct inode *inode)
unsigned long ino;
struct ext4_iloc iloc;
ext4_fsblk_t bmap_blk = 0, imap_blk = 0, inode_blk = 0;
+ ext4_fsblk_t prev_inode_blk = 0;
ext4_fsblk_t snapshot_blocks = ext4_blocks_count(sbi->s_es);
if (active_snapshot) {
snapshot_debug(1, "failed to add snapshot because active "
@@ -540,7 +541,9 @@ static int ext4_snapshot_create(struct inode *inode)
goto out_handle;
}

- ino = inode->i_ino;
+ /* start with root inode and continue with snapshot list */
+ ino = EXT4_ROOT_INO;
+alloc_inode_blocks:
/*
* pre-allocate the following blocks in the new snapshot:
* - block and inode bitmap blocks of ino's block group
@@ -553,6 +556,11 @@ static int ext4_snapshot_create(struct inode *inode)

inode_blk = ext4_get_inode_block(sb, ino, &iloc);

+ if (!inode_blk || inode_blk == prev_inode_blk)
+ goto next_snapshot;
+
+ /* not same inode and bitmap blocks as prev snapshot */
+ prev_inode_blk = inode_blk;
bmap_blk = 0;
imap_blk = 0;
desc = ext4_get_group_desc(sb, iloc.block_group, NULL);
@@ -601,6 +609,10 @@ next_snapshot:
err = -EIO;
goto out_handle;
}
+ if (ino == EXT4_ROOT_INO) {
+ ino = inode->i_ino;
+ goto alloc_inode_blocks;
+ }
snapshot_debug(1, "snapshot (%u) created\n", inode->i_generation);
err = 0;
out_handle:
@@ -693,6 +705,10 @@ int ext4_snapshot_take(struct inode *inode)
struct inode *curr_inode;
struct ext4_iloc iloc;
struct ext4_group_desc *desc;
+ ext4_fsblk_t prev_inode_blk = 0;
+ struct ext4_inode *raw_inode;
+ blkcnt_t excluded_blocks = 0;
+ int fixing = 0;
int i;
int err = -EIO;

@@ -752,7 +768,9 @@ int ext4_snapshot_take(struct inode *inode)
goto out_unlockfs;
}

- curr_inode = inode;
+ /* start with root inode and continue with snapshot list */
+ curr_inode = sb->s_root->d_inode;
+copy_inode_blocks:
/*
* copy the following blocks to the new snapshot:
* - block and inode bitmap blocks of curr_inode block group
@@ -769,6 +787,11 @@ int ext4_snapshot_take(struct inode *inode)
err = err ? : -EIO;
goto out_unlockfs;
}
+ if (fixing)
+ goto fix_inode_copy;
+ if (iloc.bh->b_blocknr == prev_inode_blk)
+ goto next_inode;
+ prev_inode_blk = iloc.bh->b_blocknr;
brelse(bhs[COPY_BLOCK_BITMAP]);
bhs[COPY_BLOCK_BITMAP] = sb_bread(sb,
ext4_block_bitmap(sb, desc));
@@ -784,12 +807,59 @@ int ext4_snapshot_take(struct inode *inode)
goto out_unlockfs;
mask = NULL;
}
+ /* this is the copy pass */
+ goto next_inode;
+fix_inode_copy:
+ /* this is the fixing pass */
+ /* get snapshot copy of raw inode */
+ brelse(sbh);
+ sbh = ext4_getblk(NULL, inode,
+ SNAPSHOT_IBLOCK(iloc.bh->b_blocknr),
+ SNAPMAP_READ, &err);
+ if (!sbh)
+ goto out_unlockfs;
+ iloc.bh = sbh;
+ raw_inode = ext4_raw_inode(&iloc);
+ /*
+ * Snapshot inode blocks are excluded from COW bitmap,
+ * so they appear to be not allocated in the snapshot's
+ * block bitmap. If we want the snapshot image to pass
+ * fsck with no errors, we need to detach those blocks
+ * from the copy of the snapshot inode, so we fix the
+ * snapshot inodes to appear as empty regular files.
+ */
+ excluded_blocks += ext4_inode_blocks(raw_inode,
+ EXT4_I(curr_inode)) >>
+ (curr_inode->i_blkbits - 9);
+ lock_buffer(sbh);
+ ext4_isize_set(raw_inode, 0);
+ raw_inode->i_blocks_lo = 0;
+ raw_inode->i_blocks_high = 0;
+ raw_inode->i_flags &= cpu_to_le32(~EXT4_SNAPFILE_FL);
+ memset(raw_inode->i_block, 0, sizeof(raw_inode->i_block));
+ unlock_buffer(sbh);
+ mark_buffer_dirty(sbh);
+ sync_dirty_buffer(sbh);
+
+next_inode:
+ if (curr_inode->i_ino == EXT4_ROOT_INO) {
+ curr_inode = inode;
+ goto copy_inode_blocks;
+ }

/*
* copy super block to snapshot and fix it
*/
lock_buffer(es_bh);
memcpy(es_bh->b_data, sbi->s_sbh->b_data, sb->s_blocksize);
+ /* set the IS_SNAPSHOT flag to signal fsck this is a snapshot */
+ es->s_flags |= cpu_to_le32(EXT4_FLAGS_IS_SNAPSHOT);
+ /* reset snapshots list in snapshot's super block copy */
+ es->s_snapshot_inum = 0;
+ es->s_snapshot_list = 0;
+ /* fix free blocks count after clearing old snapshot inode blocks */
+ ext4_free_blocks_count_set(es, ext4_free_blocks_count(es) +
+ excluded_blocks);
set_buffer_uptodate(es_bh);
unlock_buffer(es_bh);
mark_buffer_dirty(es_bh);
--
1.7.4.1


2011-06-07 15:09:55

by Amir G.

Subject: [PATCH v1 21/36] ext4: snapshot control - reserve disk space for snapshot

From: Amir Goldstein <[email protected]>

Ensure there is enough disk space for future use by the snapshot file.
Reserve disk space on snapshot take based on file system overhead
size, number of directories and number of blocks/inodes in use.
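
The estimate computed in ext4_snapshot_take() below can be sketched in
userspace as follows. The function and parameter names are illustrative
only, and deriving the addresses-per-block shift as block_size_bits - 2
assumes 4-byte block addresses, which the patch does not spell out.

```c
#include <assert.h>
#include <stdint.h>

#define AVG_DIR_RECORD_SIZE_BITS 6	/* 64 bytes, as in the patch */

/*
 * Sketch of the snapshot_r_blocks estimate:
 *   2 blocks per (addresses per block) fs blocks
 *   + 1 block per directory
 *   + 1 block per fs overhead block
 *   + 1 block per (dir records per block) inodes in use
 */
static uint64_t snapshot_reserved_blocks(uint64_t fs_blocks,
					 uint64_t dirs,
					 uint64_t overhead,
					 uint64_t used_inodes,
					 unsigned block_size_bits)
{
	unsigned addr_bits, dir_bits;

	assert(block_size_bits >= 10 && block_size_bits <= 16);
	addr_bits = block_size_bits - 2;	/* assumes 4-byte addresses */
	dir_bits = block_size_bits - AVG_DIR_RECORD_SIZE_BITS;

	return 2 * (fs_blocks >> addr_bits) + dirs + overhead +
	       (used_inodes >> dir_bits);
}
```

For example, a 4G file system with 4K blocks (2^20 blocks), 1000
directories, 5000 overhead blocks and 64000 inodes in use reserves
2048 + 1000 + 5000 + 1000 = 9048 blocks under these assumptions.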


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/balloc.c | 25 +++++++++++++++++++++++++
fs/ext4/ext4.h | 2 ++
fs/ext4/mballoc.c | 6 ++++++
fs/ext4/snapshot_ctl.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/super.c | 16 +++++++++++++++-
5 files changed, 92 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 8f1803f..1c140e4 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -372,6 +372,8 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
static int ext4_has_free_blocks(struct ext4_sb_info *sbi, s64 nblocks)
{
s64 free_blocks, dirty_blocks, root_blocks;
+ ext4_fsblk_t snapshot_r_blocks;
+ handle_t *handle = journal_current_handle();
struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;

@@ -379,6 +381,29 @@ static int ext4_has_free_blocks(struct ext4_sb_info *sbi, s64 nblocks)
dirty_blocks = percpu_counter_read_positive(dbc);
root_blocks = ext4_r_blocks_count(sbi->s_es);

+ if (ext4_snapshot_active(sbi)) {
+ if (unlikely(free_blocks < (nblocks + dirty_blocks)))
+ /* sorry, but we're really out of space */
+ return 0;
+ if (handle && unlikely(IS_COWING(handle)))
+ /* any available space may be used by COWing task */
+ return 1;
+ /* reserve blocks for active snapshot */
+ snapshot_r_blocks =
+ le64_to_cpu(sbi->s_es->s_snapshot_r_blocks_count);
+ /*
+ * The last snapshot_r_blocks are reserved for active snapshot
+ * and may not be allocated even by root.
+ */
+ if (free_blocks < (nblocks + dirty_blocks + snapshot_r_blocks))
+ return 0;
+ /*
+ * Mortal users must reserve blocks for both snapshot and
+ * root user.
+ */
+ root_blocks += snapshot_r_blocks;
+ }
+
if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
EXT4_FREEBLOCKS_WATERMARK) {
free_blocks = percpu_counter_sum_positive(fbc);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 198d7d4..8d82125 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1963,6 +1963,8 @@ extern __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 group,
struct ext4_group_desc *gdp);
extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
struct ext4_group_desc *gdp);
+struct kstatfs;
+extern int ext4_statfs_sb(struct super_block *sb, struct kstatfs *buf);

static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
{
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 6e4d960..899c12c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4296,10 +4296,16 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
return 0;
}
reserv_blks = ar->len;
+ if (unlikely(ar->flags & EXT4_MB_HINT_COWING)) {
+ /* don't fail when allocating blocks for COW */
+ dquot_alloc_block_nofail(ar->inode, ar->len);
+ goto nofail;
+ }
while (ar->len && dquot_alloc_block(ar->inode, ar->len)) {
ar->flags |= EXT4_MB_HINT_NOPREALLOC;
ar->len--;
}
+nofail:
inquota = ar->len;
if (ar->len == 0) {
*errp = -EDQUOT;
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 360581d..a610025 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -711,6 +711,8 @@ int ext4_snapshot_take(struct inode *inode)
int fixing = 0;
int i;
int err = -EIO;
+ u64 snapshot_r_blocks;
+ struct kstatfs statfs;

if (!sbi->s_sbh)
goto out_err;
@@ -739,6 +741,47 @@ int ext4_snapshot_take(struct inode *inode)
}

err = -EIO;
+ /* update fs statistics to calculate snapshot reserved space */
+ if (ext4_statfs_sb(sb, &statfs)) {
+ snapshot_debug(1, "failed to statfs before snapshot (%u) "
+ "take\n", inode->i_generation);
+ goto out_err;
+ }
+ /*
+ * Estimate maximum disk space for snapshot file metadata based on:
+ * 1 indirect block per 1K fs blocks (to map moved data blocks)
+ * +1 data block per 1K fs blocks (to copy indirect blocks)
+ * +1 data block per fs meta block (to copy meta blocks)
+ * +1 data block per directory (to copy small directory index blocks)
+ * +1 data block per X inodes (to copy large directory index blocks)
+ *
+ * We estimate no. of dir blocks from no. of allocated inodes, assuming
+ * an avg. dir record size of 64 bytes. This assumption can break in
+ * 2 cases:
+ * 1. long file names (in avg.)
+ * 2. large no. of hard links (many dir records for the same inode)
+ *
+ * Underestimation can lead to potential ENOSPC during COW, which
+ * will trigger an ext4_error(). Hopefully, error behavior is set to
+ * remount-ro, so snapshot will not be corrupted.
+ *
+ * XXX: reserved space may be too small in data journaling mode,
+ * which is currently not supported.
+ */
+#define AVG_DIR_RECORD_SIZE_BITS 6 /* 64 bytes */
+#define AVG_INODES_PER_DIR_BLOCK \
+ (SNAPSHOT_BLOCK_SIZE_BITS - AVG_DIR_RECORD_SIZE_BITS)
+ snapshot_r_blocks = 2 * (statfs.f_blocks >>
+ SNAPSHOT_ADDR_PER_BLOCK_BITS) +
+ statfs.f_spare[0] + statfs.f_spare[1] +
+ ((statfs.f_files - statfs.f_ffree) >>
+ AVG_INODES_PER_DIR_BLOCK);
+
+ /* verify enough free space before taking the snapshot */
+ if (statfs.f_bfree < snapshot_r_blocks) {
+ err = -ENOSPC;
+ goto out_err;
+ }

/*
* flush journal to disk and clear the RECOVER flag
@@ -876,6 +919,7 @@ next_inode:
goto out_unlockfs;

/* set as on-disk active snapshot */
+ sbi->s_es->s_snapshot_r_blocks_count = cpu_to_le64(snapshot_r_blocks);

sbi->s_es->s_snapshot_id =
cpu_to_le32(le32_to_cpu(sbi->s_es->s_snapshot_id) + 1);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index dbe5651..a7be485 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4515,7 +4515,11 @@ restore_opts:

static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
{
- struct super_block *sb = dentry->d_sb;
+ return ext4_statfs_sb(dentry->d_sb, buf);
+}
+
+int ext4_statfs_sb(struct super_block *sb, struct kstatfs *buf)
+{
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
u64 fsid;
@@ -4567,6 +4571,16 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
if (buf->f_bfree < ext4_r_blocks_count(es))
buf->f_bavail = 0;
+ if (ext4_snapshot_active(sbi)) {
+ if (buf->f_bfree < ext4_r_blocks_count(es) +
+ le64_to_cpu(es->s_snapshot_r_blocks_count))
+ buf->f_bavail = 0;
+ else
+ buf->f_bavail -=
+ le64_to_cpu(es->s_snapshot_r_blocks_count);
+ }
+ buf->f_spare[0] = percpu_counter_sum_positive(&sbi->s_dirs_counter);
+ buf->f_spare[1] = sbi->s_overhead_last;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
buf->f_namelen = EXT4_NAME_LEN;
--
1.7.4.1


2011-06-07 15:09:58

by Amir G.

Subject: [PATCH v1 22/36] ext4: snapshot journaled - increase transaction credits

From: Amir Goldstein <[email protected]>

Snapshot operations are journaled as part of the running transaction.
The amount of requested credits is multiplied by a factor to ensure
that enough buffer credits are reserved in the running transaction.
The new field h_base_credits stores the original credits request and
the new field h_user_credits counts the number of credits used by
non-COW operations. They are especially useful when extending a large
transaction that did not use the extra COW credits it requested.
In this case, only the missing extra credits are requested.
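
The credit arithmetic can be checked with a small userspace sketch.
The constants are copied from the ext4_jbd2.h hunk below with the
EXT4_ prefix dropped. missing_credits() is a hypothetical helper that
only illustrates the "request the missing credits" idea; it is not a
function from this series.

```c
#include <assert.h>

/* values copied from the ext4_jbd2.h hunk of this patch */
enum {
	WRITE_CREDITS      = 1,
	ALLOC_CREDITS      = 3,
	COW_BITMAP_CREDITS = 3 * ALLOC_CREDITS,			   /* 9 */
	COW_BLOCK_CREDITS  = 3 * ALLOC_CREDITS + 2 * WRITE_CREDITS, /* 11 */
	COW_CREDITS        = COW_BLOCK_CREDITS + COW_BITMAP_CREDITS, /* 20 */
	SNAPSHOT_CREDITS   = 3 * WRITE_CREDITS			   /* 3 */
};

/* mirrors EXT4_SNAPSHOT_TRANS_BLOCKS(n): 21n + 3 */
static int snapshot_trans_blocks(int n)
{
	assert(n >= 0);
	return n * (1 + COW_CREDITS) + SNAPSHOT_CREDITS;
}

/* mirrors EXT4_SNAPSHOT_START_TRANS_BLOCKS(n): 21n + 6 */
static int snapshot_start_trans_blocks(int n)
{
	assert(n >= 0);
	return n * (1 + COW_CREDITS) + 2 * SNAPSHOT_CREDITS;
}

/*
 * Hypothetical helper: credits still missing when a handle that
 * already holds buffer_credits wants room for n more user blocks.
 */
static int missing_credits(int buffer_credits, int n)
{
	int wanted = snapshot_trans_blocks(n);

	return buffer_credits >= wanted ? 0 : wanted - buffer_credits;
}
```

A request for 8 user blocks therefore starts with 174 buffer credits,
and a handle that still holds at least 171 credits needs no extension
for 8 more user blocks under this model.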


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 21 +++++++
fs/ext4/ext4_jbd2.h | 159 ++++++++++++++++++++++++++++++++++++++++++++++-----
fs/ext4/resize.c | 2 +-
fs/ext4/snapshot.c | 12 ++++
fs/ext4/super.c | 38 ++++++++++++-
5 files changed, 214 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index c44c362..015f727 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -131,6 +131,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
handle_t *handle, struct inode *inode,
struct buffer_head *bh)
{
+ struct super_block *sb;
int err = 0;

if (ext4_handle_valid(handle)) {
@@ -138,6 +139,26 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ if (err)
+ return err;
+ sb = handle->h_transaction->t_journal->j_private;
+ if (EXT4_SNAPSHOTS(sb) && !IS_COWING(handle)) {
+ struct journal_head *jh = bh2jh(bh);
+ jbd_lock_bh_state(bh);
+ /*
+ * buffer_credits was decremented when buffer was
+ * modified for the first time in the current
+ * transaction, which may have been during a COW
+ * operation. We decrement user_credits and mark
+ * b_modified = 2, on the first time that the buffer
+ * is modified not during a COW operation (!h_cowing).
+ */
+ if (jh->b_modified == 1) {
+ jh->b_modified = 2;
+ handle->h_user_credits--;
+ }
+ jbd_unlock_bh_state(bh);
+ }
} else {
if (inode)
mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 4af0bb5..2b0e1bd 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -83,6 +83,62 @@
* one block, plus two quota updates. Quota allocations are not
* needed. */

+/* on block write we have to journal the block itself */
+#define EXT4_WRITE_CREDITS 1
+/* on snapshot block alloc we have to journal block group bitmap, exclude
+ bitmap and gdb */
+#define EXT4_ALLOC_CREDITS 3
+/* number of credits for COW bitmap operation (allocated blocks are not
+ journalled): alloc(dind+ind+cow) = 9 */
+#define EXT4_COW_BITMAP_CREDITS (3*EXT4_ALLOC_CREDITS)
+/* number of credits for other block COW operations:
+ alloc(dind+ind+cow)+write(dind+ind) = 11 */
+#define EXT4_COW_BLOCK_CREDITS (3*EXT4_ALLOC_CREDITS+2*EXT4_WRITE_CREDITS)
+/* number of credits for the first COW operation in the block group, which
+ * is not the first group in a flex group (alloc 2 dind blocks):
+ 9+11 = 20 */
+#define EXT4_COW_CREDITS (EXT4_COW_BLOCK_CREDITS + \
+ EXT4_COW_BITMAP_CREDITS)
+/* number of credits for snapshot operations counted once per transaction:
+ write(sb+inode+tind) = 3 */
+#define EXT4_SNAPSHOT_CREDITS (3*EXT4_WRITE_CREDITS)
+/*
+ * in total, for N COW operations, we may have to journal 20N+3 blocks,
+ * and we also want to reserve 20+3 credits for the last COW operation,
+ * so we add 20(N-1)+3+(20+3) to the requested N buffer credits
+ * and request 21N+6 buffer credits.
+ * that's a lot of extra credits and much more than needed for the common
+ * case, but what can we do?
+ *
+ * we are going to need a bigger journal to accommodate the
+ * extra snapshot credits.
+ * mke2fs -j uses the following default formula for fs-size above 1G:
+ * journal-size = MIN(128M, fs-size/32)
+ * mke2fs -j -J big uses the following formula:
+ * journal-size = MIN(3G, fs-size/32)
+ */
+#define EXT4_SNAPSHOT_TRANS_BLOCKS(n) \
+ ((n)*(1+EXT4_COW_CREDITS)+EXT4_SNAPSHOT_CREDITS)
+#define EXT4_SNAPSHOT_START_TRANS_BLOCKS(n) \
+ ((n)*(1+EXT4_COW_CREDITS)+2*EXT4_SNAPSHOT_CREDITS)
+
+/*
+ * check for sufficient buffer and COW credits
+ */
+#define EXT4_SNAPSHOT_HAS_TRANS_BLOCKS(handle, n) \
+ ((handle)->h_buffer_credits >= EXT4_SNAPSHOT_TRANS_BLOCKS(n) && \
+ (handle)->h_user_credits >= (n))
+
+#define EXT4_RESERVE_COW_CREDITS (EXT4_COW_CREDITS + \
+ EXT4_SNAPSHOT_CREDITS)
+
+/*
+ * Ext4 is not designed for filesystems under 4G with journal size < 128M
+ * Recommended journal size is 3G (created with 'mke2fs -j -J big')
+ */
+#define EXT4_MIN_JOURNAL_BLOCKS 32768U
+#define EXT4_BIG_JOURNAL_BLOCKS (24*EXT4_MIN_JOURNAL_BLOCKS)
+
#define EXT4_RESERVE_TRANS_BLOCKS 12U

#define EXT4_INDEX_EXTRA_TRANS_BLOCKS 8
@@ -176,7 +232,19 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define trace_cow_add(handle, name, num)
#define trace_cow_inc(handle, name)

-handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks);
+#define ext4_journal_trace(n, caller, handle, nblocks)
+
+handle_t *__ext4_journal_start(const char *where,
+ struct super_block *sb, int nblocks);
+
+#define ext4_journal_start_sb(sb, nblocks) \
+ __ext4_journal_start(__func__, \
+ (sb), (nblocks))
+
+#define ext4_journal_start(inode, nblocks) \
+ __ext4_journal_start(__func__, \
+ (inode)->i_sb, (nblocks))
+
int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);

#define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
@@ -212,16 +280,20 @@ static inline int ext4_handle_is_aborted(handle_t *handle)

static inline int ext4_handle_has_enough_credits(handle_t *handle, int needed)
{
- if (ext4_handle_valid(handle) && handle->h_buffer_credits < needed)
+ struct super_block *sb;
+
+ if (!ext4_handle_valid(handle))
+ return 1;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (EXT4_SNAPSHOTS(sb))
+ return EXT4_SNAPSHOT_HAS_TRANS_BLOCKS(handle, needed);
+ /* sb has no snapshot feature */
+ if (handle->h_buffer_credits < needed)
return 0;
return 1;
}

-static inline handle_t *ext4_journal_start(struct inode *inode, int nblocks)
-{
- return ext4_journal_start_sb(inode->i_sb, nblocks);
-}
-
#define ext4_journal_stop(handle) \
__ext4_journal_stop(__func__, __LINE__, (handle))

@@ -230,20 +302,77 @@ static inline handle_t *ext4_journal_current_handle(void)
return journal_current_handle();
}

-static inline int ext4_journal_extend(handle_t *handle, int nblocks)
+/*
+ * Ext4 wrapper for journal_extend()
+ * When transaction runs out of buffer credits it is possible to try and
+ * extend the buffer credits without restarting the transaction.
+ * Ext4 wrapper for journal_start() has increased the user requested buffer
+ * credits to include the extra credits for COW operations.
+ * This wrapper checks the remaining user credits and how many COW credits
+ * are missing and then tries to extend the transaction.
+ */
+static inline int __ext4_journal_extend(const char *where,
+ handle_t *handle, int nblocks)
{
- if (ext4_handle_valid(handle))
- return jbd2_journal_extend(handle, nblocks);
- return 0;
+ int credits = 0;
+ int err = 0;
+ struct super_block *sb;
+
+ if (!ext4_handle_valid((handle_t *)handle))
+ return 0;
+
+ credits = nblocks;
+ sb = handle->h_transaction->t_journal->j_private;
+ if (EXT4_SNAPSHOTS(sb)) {
+ /* extend transaction to valid buffer/user credits ratio */
+ credits = EXT4_SNAPSHOT_TRANS_BLOCKS(handle->h_user_credits +
+ nblocks) - handle->h_buffer_credits;
+ }
+ if (credits > 0)
+ err = jbd2_journal_extend((handle_t *)handle, credits);
+ if (EXT4_SNAPSHOTS(sb) && !err) {
+ /* update base/user credits for future extends */
+ handle->h_base_credits += nblocks;
+ handle->h_user_credits += nblocks;
+ ext4_journal_trace(SNAP_WARN, where, handle, nblocks);
+ }
+ return err;
}

-static inline int ext4_journal_restart(handle_t *handle, int nblocks)
+/*
+ * Ext4 wrapper for journal_restart()
+ * When transaction runs out of buffer credits and cannot be extended,
+ * the alternative is to restart it (start a new transaction).
+ * This wrapper increases the user requested buffer credits to include the
+ * extra credits for COW operations.
+ */
+static inline int __ext4_journal_restart(const char *where,
+ handle_t *handle, int nblocks)
{
- if (ext4_handle_valid(handle))
- return jbd2_journal_restart(handle, nblocks);
- return 0;
+ int err = 0;
+ int credits = 0;
+ struct super_block *sb;
+
+ if (!ext4_handle_valid((handle_t *)handle))
+ return 0;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ credits = EXT4_SNAPSHOTS(sb) ?
+ EXT4_SNAPSHOT_START_TRANS_BLOCKS(nblocks) : nblocks;
+ err = jbd2_journal_restart((handle_t *)handle, credits);
+ if (EXT4_SNAPSHOTS(sb) && !err) {
+ handle->h_base_credits = nblocks;
+ handle->h_user_credits = nblocks;
+ ext4_journal_trace(SNAP_WARN, where, handle, nblocks);
+ }
+ return err;
}

+#define ext4_journal_extend(handle, nblocks) \
+ __ext4_journal_extend(__func__, (handle), (nblocks))
+
+#define ext4_journal_restart(handle, nblocks) \
+ __ext4_journal_restart(__func__, (handle), (nblocks))
static inline int ext4_journal_blocks_per_page(struct inode *inode)
{
if (EXT4_JOURNAL(inode) != NULL)
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 91f5473..d341a5c 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -668,7 +668,7 @@ static void update_backups(struct super_block *sb,

/* Out of journal space, and can't get more - abort - so sad */
if (ext4_handle_valid(handle) &&
- handle->h_buffer_credits == 0 &&
+ !ext4_handle_has_enough_credits(handle, 1) &&
ext4_journal_extend(handle, EXT4_MAX_TRANS_DATA) &&
(err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA)))
break;
diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index 9fb5c2f..e86dc42 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -405,6 +405,18 @@ __ext4_snapshot_trace_cow(const char *where, handle_t *handle,
*/
static inline void ext4_snapshot_cow_begin(handle_t *handle)
{
+ if (!ext4_handle_has_enough_credits(handle, 1)) {
+ /*
+ * The test above is based on lower limit heuristics of
+ * user_credits/buffer_credits, which is not always accurate,
+ * so it is possible that there is no bug here, just another
+ * false alarm.
+ */
+ snapshot_debug_hl(1, "warning: insufficient buffer/user "
+ "credits (%d/%d) for COW operation?\n",
+ handle->h_buffer_credits,
+ handle->h_user_credits);
+ }
snapshot_debug_hl(4, "{\n");
handle->h_cowing = 1;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a7be485..0d996be 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -264,8 +264,10 @@ static void ext4_put_nojournal(handle_t *handle)
* ext4 prevents a new handle from being started by s_frozen, which
* is in an upper layer.
*/
-handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
+handle_t *__ext4_journal_start(const char *where,
+ struct super_block *sb, int nblocks)
{
+ int credits;
journal_t *journal;
handle_t *handle;

@@ -296,7 +298,18 @@ handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
ext4_abort(sb, "Detected aborted journal");
return ERR_PTR(-EROFS);
}
- return jbd2_journal_start(journal, nblocks);
+
+ credits = EXT4_SNAPSHOTS(sb) ?
+ EXT4_SNAPSHOT_START_TRANS_BLOCKS(nblocks) : nblocks;
+ handle = jbd2_journal_start(journal, credits);
+ if (EXT4_SNAPSHOTS(sb) && !IS_ERR(handle)) {
+ if (handle->h_ref == 1) {
+ handle->h_base_credits = nblocks;
+ handle->h_user_credits = nblocks;
+ }
+ ext4_journal_trace(SNAP_WARN, where, handle, nblocks);
+ }
+ return handle;
}

/*
@@ -3874,6 +3887,27 @@ static journal_t *ext4_get_journal(struct super_block *sb,
return NULL;
}

+ if (EXT4_SNAPSHOTS(sb) &&
+ (journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb)) <
+ EXT4_MIN_JOURNAL_BLOCKS) {
+ ext4_msg(sb, KERN_ERR,
+ "journal is too small (%lld < %u) for snapshots",
+ journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb),
+ EXT4_MIN_JOURNAL_BLOCKS);
+ iput(journal_inode);
+ return NULL;
+ }
+
+ if (EXT4_SNAPSHOTS(sb) &&
+ (journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb)) <
+ EXT4_BIG_JOURNAL_BLOCKS) {
+ snapshot_debug(1, "warning: journal is not big enough "
+ "(%lld < %u) - this might affect concurrent "
+ "filesystem writers performance!\n",
+ journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb),
+ EXT4_BIG_JOURNAL_BLOCKS);
+ }
+
journal = jbd2_journal_init_inode(journal_inode);
if (!journal) {
ext4_msg(sb, KERN_ERR, "Could not load journal inode");
--
1.7.4.1


2011-06-07 15:10:03

by Amir G.

Subject: [PATCH v1 24/36] ext4: snapshot journaled - bypass to save credits

From: Amir Goldstein <[email protected]>

Don't journal COW bitmap indirect blocks to save journal credits.
On very few COW operations (i.e., first block group access after
snapshot take), there may be up to 3 extra blocks allocated for the
active snapshot (i.e., COW bitmap block and up to 2 indirect blocks).
Taking these 2 indirect blocks into account on every COW operation
would further increase the transaction's COW credits factor.
Instead, we choose to pay a small performance penalty on these few
COW bitmap operations and wait until they are synced to disk.
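
To make the trade-off concrete, here is a minimal user-space sketch, not
the patch itself. The names toy_handle and toy_write_indirect are
hypothetical. The bypass branch only stands in for the
mark_buffer_dirty() + sync_dirty_buffer() pair used in the diff below,
and the journaled branch stands in for spending one buffer credit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle; not the kernel's handle_t. */
struct toy_handle {
	int buffer_credits;	/* credits left in the transaction */
	int sync_writes;	/* number of synchronous writes issued */
};

/*
 * Write one newly allocated indirect block.
 * With bypass_journal set, the block is written synchronously and no
 * credit is spent. Otherwise one buffer credit is consumed.
 * Returns 0 on success, -1 if the handle is out of credits.
 */
static int toy_write_indirect(struct toy_handle *h, bool bypass_journal)
{
	assert(h != NULL);

	if (bypass_journal) {
		/* models mark_buffer_dirty() + sync_dirty_buffer() */
		h->sync_writes++;
		return 0;
	}
	if (h->buffer_credits <= 0)
		return -1;
	h->buffer_credits--;
	return 0;
}
```

In this model the bypass path always succeeds and leaves the credit count
untouched, at the cost of one blocking write per call.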


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 31 +++++++++++++++++++++++++++----
1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d703a55..de40993 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -835,7 +835,8 @@ static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
branch[n].bh = bh;
lock_buffer(bh);
BUFFER_TRACE(bh, "call get_create_access");
- err = ext4_journal_get_create_access(handle, bh);
+ if (!SNAPMAP_ISSYNC(flags))
+ err = ext4_journal_get_create_access(handle, bh);
if (err) {
/* Don't brelse(bh) here; it's done in
* ext4_journal_forget() below */
@@ -862,7 +863,21 @@ static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
unlock_buffer(bh);

BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, inode, bh);
+ /*
+ * When accessing a block group for the first time, the
+ * block bitmap is the first block to be copied to the
+ * snapshot. We don't want to reserve journal credits for
+ * the indirect blocks that map the bitmap copy (the COW
+ * bitmap), so instead of writing through the journal, we
+ * sync the indirect blocks directly to disk. Of course,
+ * this is not good for performance but it only happens once
+ * per snapshot/blockgroup.
+ */
+ if (SNAPMAP_ISSYNC(flags)) {
+ mark_buffer_dirty(bh);
+ sync_dirty_buffer(bh);
+ } else
+ err = ext4_handle_dirty_metadata(handle, inode, bh);
if (err)
goto failed;
}
@@ -871,6 +886,9 @@ static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
failed:
/* Allocation failed, free what we already allocated */
ext4_free_blocks(handle, inode, NULL, new_blocks[0], 1, 0);
+ /* If we bypassed journal, we don't need to forget any block */
+ if (SNAPMAP_ISSYNC(flags))
+ n = 1;
for (i = 1; i <= n ; i++) {
/*
* branch[i].bh is newly allocated, so there is no
@@ -966,13 +984,18 @@ static int ext4_splice_branch_cow(handle_t *handle, struct inode *inode,

err_out:
for (i = 1; i <= num; i++) {
+ int forget = EXT4_FREE_BLOCKS_FORGET;
+
+ /* If we bypassed journal, we don't need to forget */
+ if (SNAPMAP_ISSYNC(flags))
+ forget = 0;
+
/*
* branch[i].bh is newly allocated, so there is no
* need to revoke the block, which is why we don't
* need to set EXT4_FREE_BLOCKS_METADATA.
*/
- ext4_free_blocks(handle, inode, where[i].bh, 0, 1,
- EXT4_FREE_BLOCKS_FORGET);
+ ext4_free_blocks(handle, inode, where[i].bh, 0, 1, forget);
}
if (SNAPMAP_ISMOVE(flags))
/* don't charge snapshot file owner if move failed */
--
1.7.4.1


2011-06-07 15:10:00

by Amir G.

Subject: [PATCH v1 23/36] ext4: snapshot journaled - implement journal_release_buffer()

From: Amir Goldstein <[email protected]>

The API journal_release_buffer() is called to cancel a previous call
to journal_get_write_access() and to recall the used buffer credit.
Current implementation of journal_release_buffer() in JBD is empty,
since no buffer credits are used until the buffer is marked dirty.
However, since the resulting snapshot COW operation cannot be undone,
we try to extend the current transaction to compensate for the used
credits of the extra COW operation, so we don't run out of buffer
credits too soon.
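
The compensation can be illustrated with a small arithmetic sketch. It
assumes the constants from patch 22 (EXT4_COW_CREDITS = 20,
EXT4_SNAPSHOT_CREDITS = 3) and mirrors the shortfall computation in
__ext4_journal_extend(); the function names here are hypothetical and
nothing in it touches JBD2.

```c
#include <assert.h>

/* Values taken from the ext4_jbd2.h hunk in patch 22. */
enum {
	COW_CREDITS = 20,	/* EXT4_COW_CREDITS */
	SNAPSHOT_CREDITS = 3,	/* EXT4_SNAPSHOT_CREDITS */
};

/* Models EXT4_SNAPSHOT_TRANS_BLOCKS(n): 21n + 3 */
static int snapshot_trans_blocks(int n)
{
	assert(n >= 0);
	return n * (1 + COW_CREDITS) + SNAPSHOT_CREDITS;
}

/* Models EXT4_SNAPSHOT_START_TRANS_BLOCKS(n): 21n + 6 */
static int snapshot_start_trans_blocks(int n)
{
	assert(n >= 0);
	return n * (1 + COW_CREDITS) + 2 * SNAPSHOT_CREDITS;
}

/*
 * Number of extra buffer credits an extend request would ask for,
 * given the handle's current buffer/user credits. Zero means the
 * handle already satisfies the buffer/user credits ratio.
 */
static int extend_shortfall(int buffer_credits, int user_credits,
			    int nblocks)
{
	int want = snapshot_trans_blocks(user_credits + nblocks) -
		   buffer_credits;

	return want > 0 ? want : 0;
}
```

With nblocks = 0, as in the release_buffer path, a handle holding 1 user
credit and 10 buffer credits would ask for 14 more, while one still
holding the 27 credits it started with would ask for none.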


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 39 +++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 11 +++++------
fs/ext4/ialloc.c | 10 ++++++++--
fs/ext4/xattr.c | 4 +++-
4 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 015f727..e8287f4 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -127,6 +127,45 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
return err;
}

+int __ext4_handle_release_buffer(const char *where, handle_t *handle,
+ struct buffer_head *bh)
+{
+ struct super_block *sb;
+ int err = 0;
+
+ if (!ext4_handle_valid(handle))
+ return 0;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (!EXT4_SNAPSHOTS(sb) || IS_COWING(handle))
+ goto out;
+
+ /*
+ * Trying to cancel a previous call to get_write_access(), which may
+ * have resulted in a single COW operation. We don't need to add
+ * user credits, but if COW credits are too low we will try to
+ * extend the transaction to compensate for the buffer credits used
+ * by the extra COW operation.
+ */
+ err = ext4_journal_extend(handle, 0);
+ if (err > 0) {
+ /* well, we can't say we didn't try - now let's hope
+ * we have enough buffer credits to spare */
+ snapshot_debug(handle->h_buffer_credits < EXT4_MAX_TRANS_DATA
+ ? 1 : 2,
+ "%s: warning: couldn't extend transaction "
+ "from %s (credits=%d/%d)\n", __func__,
+ where, handle->h_buffer_credits,
+ handle->h_user_credits);
+ err = 0;
+ }
+ ext4_journal_trace(SNAP_WARN, where, handle, -1);
+out:
+ if (!err)
+ jbd2_journal_release_buffer(handle, bh);
+ return err;
+}
+
int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
handle_t *handle, struct inode *inode,
struct buffer_head *bh)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 2b0e1bd..cee6f2a 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -264,12 +264,11 @@ static inline void ext4_handle_sync(handle_t *handle)
handle->h_sync = 1;
}

-static inline void ext4_handle_release_buffer(handle_t *handle,
- struct buffer_head *bh)
-{
- if (ext4_handle_valid(handle))
- jbd2_journal_release_buffer(handle, bh);
-}
+int __ext4_handle_release_buffer(const char *where, handle_t *handle,
+ struct buffer_head *bh);
+
+#define ext4_handle_release_buffer(handle, bh) \
+ __ext4_handle_release_buffer(__func__, (handle), (bh))

static inline int ext4_handle_is_aborted(handle_t *handle)
{
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index b0e5749..fdb6b12 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -917,8 +917,14 @@ repeat_in_this_group:
goto got;
}
/* we lost it */
- ext4_handle_release_buffer(handle, inode_bitmap_bh);
- ext4_handle_release_buffer(handle, group_desc_bh);
+ err = ext4_handle_release_buffer(handle,
+ inode_bitmap_bh);
+ if (err)
+ goto fail;
+ err = ext4_handle_release_buffer(handle,
+ group_desc_bh);
+ if (err)
+ goto fail;

if (++ino < EXT4_INODES_PER_GROUP(sb))
goto repeat_in_this_group;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index b545ca1..83f5f9d 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -735,7 +735,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
int offset = (char *)s->here - bs->bh->b_data;

unlock_buffer(bs->bh);
- ext4_handle_release_buffer(handle, bs->bh);
+ error = ext4_handle_release_buffer(handle, bs->bh);
+ if (error)
+ goto cleanup;
if (ce) {
mb_cache_entry_release(ce);
ce = NULL;
--
1.7.4.1


2011-06-07 15:10:06

by Amir G.

Subject: [PATCH v1 25/36] ext4: snapshot journaled - cache last COW tid in journal_head

From: Amir Goldstein <[email protected]>

Cache last COW transaction id in buffer's journal_head.
The cache suppresses COW tests until the transaction is committed.
By default, the running transaction is committed every 5 seconds
which implies an average COW cache expiry of 2.5 seconds.
Before taking a new snapshot, the journal is flushed to disk
and the current transaction is committed, so the COW cache is
invalidated (as it should be).
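
The idea can be sketched in user space as follows. This is a simplified
model, not the kernel code: toy_journal_head and the toy_* helpers are
hypothetical, locking (jbd_lock_bh_state) is omitted, and the cache is
assumed to start with a tid that never matches a running transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int toy_tid_t;

/* Hypothetical stand-in for journal_head with the new b_cow_tid field. */
struct toy_journal_head {
	toy_tid_t b_cow_tid;	/* tid of the last transaction that COWed */
};

/* True if the buffer was already COWed in the running transaction. */
static bool toy_test_cowed(const struct toy_journal_head *jh,
			   toy_tid_t running_tid)
{
	return jh != NULL && jh->b_cow_tid == running_tid;
}

/* Remember that the buffer was COWed in the running transaction. */
static void toy_mark_cowed(struct toy_journal_head *jh,
			   toy_tid_t running_tid)
{
	if (jh != NULL)
		jh->b_cow_tid = running_tid;
}

/*
 * Returns true when the (expensive) COW test actually ran, and counts
 * it in *cow_tests. A cache hit returns false without testing.
 */
static bool toy_test_and_cow(struct toy_journal_head *jh,
			     toy_tid_t running_tid, int *cow_tests)
{
	assert(cow_tests != NULL);

	if (toy_test_cowed(jh, running_tid))
		return false;
	(*cow_tests)++;
	toy_mark_cowed(jh, running_tid);
	return true;
}
```

No explicit invalidation is needed in this model: once a new transaction
starts, its tid no longer matches the cached value, so the next access
runs the full COW test again.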


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/snapshot.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot.h | 2 +
fs/ext4/snapshot_debug.c | 7 +++
3 files changed, 102 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index e86dc42..2724381 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -400,6 +400,90 @@ __ext4_snapshot_trace_cow(const char *where, handle_t *handle,
#define ext4_snapshot_trace_cow(where, handle, sb, inode, bh, blk, cnt, cmd)
#endif
/*
+ * The last transaction ID during which the buffer has been COWed is stored in
+ * the b_cow_tid field of the journal_head struct. If we know that the buffer
+ * was COWed during the current transaction, we don't need to COW it again.
+ * [jbd_lock_bh_state()]
+ */
+
+void init_ext4_snapshot_cow_cache(void)
+{
+#ifdef CONFIG_EXT4_DEBUG
+ cow_cache_enabled = 1;
+#endif
+}
+
+#ifdef CONFIG_EXT4_DEBUG
+#define cow_cache_enabled() (cow_cache_enabled)
+#else
+#define cow_cache_enabled() (1)
+#endif
+
+#define test_cow_tid(jh, handle) \
+ ((jh)->b_cow_tid == (handle)->h_transaction->t_tid)
+#define set_cow_tid(jh, handle) \
+ ((jh)->b_cow_tid = (handle)->h_transaction->t_tid)
+
+/*
+ * Journal COW cache functions.
+ * a block can only be COWed once per snapshot,
+ * so a block can only be COWed once per transaction,
+ * so a buffer that was COWed in the current transaction,
+ * doesn't need to be COWed.
+ *
+ * Return values:
+ * 1 - block was COWed in current transaction
+ * 0 - block wasn't COWed in current transaction
+ */
+static int
+ext4_snapshot_test_cowed(handle_t *handle, struct buffer_head *bh)
+{
+ struct journal_head *jh;
+
+ if (!cow_cache_enabled())
+ return 0;
+
+ /* check the COW tid in the journal head */
+ if (bh && buffer_jbd(bh)) {
+ jbd_lock_bh_state(bh);
+ jh = bh2jh(bh);
+ if (jh && !test_cow_tid(jh, handle))
+ jh = NULL;
+ jbd_unlock_bh_state(bh);
+ if (jh)
+ /*
+ * Block was already COWed in the running transaction,
+ * so we don't need to COW it again.
+ */
+ return 1;
+ }
+ return 0;
+}
+
+static void
+ext4_snapshot_mark_cowed(handle_t *handle, struct buffer_head *bh)
+{
+ struct journal_head *jh;
+
+ if (!cow_cache_enabled())
+ return;
+
+ if (bh && buffer_jbd(bh)) {
+ jbd_lock_bh_state(bh);
+ jh = bh2jh(bh);
+ if (jh && !test_cow_tid(jh, handle))
+ /*
+ * this is the first time this block was COWed
+ * in the running transaction.
+ * update the COW tid in the journal head
+ * to mark that this block doesn't need to be COWed.
+ */
+ set_cow_tid(jh, handle);
+ jbd_unlock_bh_state(bh);
+ }
+}
+
+/*
* Begin COW or move operation.
* No locks needed here, because @handle is a per-task struct.
*/
@@ -479,6 +563,13 @@ int ext4_snapshot_test_and_cow(const char *where, handle_t *handle,
snapshot_debug_hl(4, "active snapshot access denied!\n");
return -EPERM;
}
+ /* check if the buffer was COWed in the current transaction */
+ if (ext4_snapshot_test_cowed(handle, bh)) {
+ snapshot_debug_hl(4, "buffer found in COW cache - "
+ "skip block cow!\n");
+ trace_cow_inc(handle, ok_jh);
+ return 0;
+ }

/* BEGIN COWing */
ext4_snapshot_cow_begin(handle);
@@ -572,6 +663,8 @@ int ext4_snapshot_test_and_cow(const char *where, handle_t *handle,
test_pending_cow:

cowed:
+ /* mark the buffer COWed in the current transaction */
+ ext4_snapshot_mark_cowed(handle, bh);
out:
brelse(sbh);
/* END COWing */
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 007fec0..44bac96 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -342,6 +342,7 @@ static inline int ext4_snapshot_get_delete_access(handle_t *handle,
return ext4_snapshot_move(handle, inode, block, pcount, 1);
}

+extern void init_ext4_snapshot_cow_cache(void);

/* snapshot_ctl.c */

@@ -364,6 +365,7 @@ extern void ext4_snapshot_destroy(struct super_block *sb);

static inline int init_ext4_snapshot(void)
{
+ init_ext4_snapshot_cow_cache();
return 0;
}

diff --git a/fs/ext4/snapshot_debug.c b/fs/ext4/snapshot_debug.c
index 35f552a..265ee2c 100644
--- a/fs/ext4/snapshot_debug.c
+++ b/fs/ext4/snapshot_debug.c
@@ -47,10 +47,12 @@ static const char *snapshot_test_names[SNAPSHOT_TESTS_NUM] = {

u16 snapshot_enable_test[SNAPSHOT_TESTS_NUM] __read_mostly = {0};
u8 snapshot_enable_debug __read_mostly = 1;
+u8 cow_cache_enabled __read_mostly = 1;

static struct dentry *snapshot_debug;
static struct dentry *snapshot_version;
static struct dentry *snapshot_test[SNAPSHOT_TESTS_NUM];
+static struct dentry *cow_cache;

static char snapshot_version_str[] = EXT4_SNAPSHOT_VERSION;
static struct debugfs_blob_wrapper snapshot_version_blob = {
@@ -79,6 +81,9 @@ void ext4_snapshot_create_debugfs_entry(struct dentry *debugfs_dir)
S_IRUGO|S_IWUSR,
debugfs_dir,
&snapshot_enable_test[i]);
+ cow_cache = debugfs_create_u8("cow-cache", S_IRUGO|S_IWUSR,
+ debugfs_dir,
+ &cow_cache_enabled);
}

/*
@@ -89,6 +94,8 @@ void ext4_snapshot_remove_debugfs_entry(void)
{
int i;

+ if (cow_cache)
+ debugfs_remove(cow_cache);
for (i = 0; i < SNAPSHOT_TESTS_NUM && i < SNAPSHOT_TEST_NAMES; i++)
if (snapshot_test[i])
debugfs_remove(snapshot_test[i]);
--
1.7.4.1


2011-06-07 15:10:08

by Amir G.

Subject: [PATCH v1 26/36] ext4: snapshot journaled - trace COW/buffer credits

From: Amir Goldstein <[email protected]>

Extra debug prints to trace snapshot usage of buffer credits.


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 26 ++++++++++++++++
fs/ext4/super.c | 2 +
3 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index e8287f4..eb88564 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -23,6 +23,7 @@ int __ext4_handle_get_bitmap_access(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
}
return err;
}
@@ -40,6 +41,7 @@ int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
}
return err;
}
@@ -91,6 +93,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
return err;
}
return 0;
@@ -108,6 +111,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
"error %d when attempting revoke", err);
}
BUFFER_TRACE(bh, "exit");
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
return err;
}

@@ -123,6 +127,7 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, -1);
}
return err;
}
@@ -198,6 +203,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
}
jbd_unlock_bh_state(bh);
}
+ ext4_journal_trace(SNAP_DEBUG, where, handle, -1);
} else {
if (inode)
mark_buffer_dirty_inode(bh, inode);
@@ -236,3 +242,77 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
sb->s_dirt = 1;
return err;
}
+
+#ifdef CONFIG_JBD2_DEBUG
+static void ext4_journal_cow_stats(int n, handle_t *handle)
+{
+ snapshot_debug(n, "COW stats: moved/copied=%d/%d, "
+ "mapped/bitmap/cached=%d/%d/%d, "
+ "bitmaps/cleared=%d/%d\n", handle->h_cow_moved,
+ handle->h_cow_copied, handle->h_cow_ok_mapped,
+ handle->h_cow_ok_bitmap, handle->h_cow_ok_jh,
+ handle->h_cow_bitmaps, handle->h_cow_excluded);
+}
+#else
+#define ext4_journal_cow_stats(n, handle)
+#endif
+
+#ifdef CONFIG_EXT4_DEBUG
+void __ext4_journal_trace(int n, const char *fn, const char *caller,
+ handle_t *handle, int nblocks)
+{
+ int active_snapshot;
+ int upper;
+ int lower;
+ int final;
+ struct super_block *sb;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (!EXT4_SNAPSHOTS(sb))
+ return;
+
+ active_snapshot = ext4_snapshot_active(EXT4_SB(sb));
+ upper = EXT4_SNAPSHOT_START_TRANS_BLOCKS(handle->h_base_credits);
+ lower = EXT4_SNAPSHOT_TRANS_BLOCKS(handle->h_user_credits);
+ final = (nblocks == 0 && handle->h_ref == 1 &&
+ !IS_COWING(handle));
+
+ switch (snapshot_enable_debug) {
+ case SNAP_INFO:
+ /* trace final journal_stop if any credits have been used */
+ if (final && (handle->h_buffer_credits < upper ||
+ handle->h_user_credits < handle->h_base_credits))
+ break;
+ case SNAP_WARN:
+ /*
+ * trace if buffer credits are too low - lower limit is only
+ * valid if there is an active snapshot and not during COW
+ */
+ if (handle->h_buffer_credits < lower &&
+ active_snapshot && !IS_COWING(handle))
+ break;
+ case SNAP_ERR:
+ /* trace if user credits are too low */
+ if (handle->h_user_credits < 0)
+ break;
+ case 0:
+ /* no trace */
+ return;
+
+ case SNAP_DEBUG:
+ default:
+ /* trace all calls */
+ break;
+ }
+
+ snapshot_debug_l(n, IS_COWING(handle), "%s(%d): credits=%d,"
+ " limit=%d/%d, user=%d/%d, ref=%d, caller=%s\n",
+ fn, nblocks, handle->h_buffer_credits, lower, upper,
+ handle->h_user_credits, handle->h_base_credits,
+ handle->h_ref, caller);
+ if (!final)
+ return;
+
+ ext4_journal_cow_stats(n, handle);
+}
+#endif
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index cee6f2a..41951da 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -229,10 +229,36 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define ext4_handle_dirty_super(handle, sb) \
__ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))

+/*
+ * macros for ext4 to update transaction COW statistics.
+ * if the kernel was compiled without CONFIG_JBD2_DEBUG
+ * then the h_cow_* fields are not allocated in handle objects.
+ */
+#ifdef CONFIG_JBD2_DEBUG
+#define trace_cow_add(handle, name, num) \
+ (handle)->h_cow_##name += (num)
+#define trace_cow_inc(handle, name) \
+ (handle)->h_cow_##name++
+
+#else
#define trace_cow_add(handle, name, num)
#define trace_cow_inc(handle, name)

+#endif
+#ifdef CONFIG_EXT4_DEBUG
+void __ext4_journal_trace(int debug, const char *fn, const char *caller,
+ handle_t *handle, int nblocks);
+
+#define ext4_journal_trace(n, caller, handle, nblocks) \
+ do { \
+ if ((n) <= snapshot_enable_debug) \
+ __ext4_journal_trace((n), __func__, (caller), \
+ (handle), (nblocks)); \
+ } while (0)
+
+#else
#define ext4_journal_trace(n, caller, handle, nblocks)
+#endif

handle_t *__ext4_journal_start(const char *where,
struct super_block *sb, int nblocks);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0d996be..fc8bfda 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -324,6 +324,8 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
int err;
int rc;

+ ext4_journal_trace(SNAP_WARN, where, handle, 0);
+
if (!ext4_handle_valid(handle)) {
ext4_put_nojournal(handle);
return 0;
--
1.7.4.1


2011-06-07 15:10:12

by Amir G.

Subject: [PATCH v1 27/36] ext4: snapshot list support

From: Amir Goldstein <[email protected]>

Implementation of multiple incremental snapshots.
Snapshot inodes are chained on a list starting at the super block,
both on-disk and in-memory, similar to the orphan inodes.
Unlink and truncate of snapshot inodes on the list is not allowed,
so an inode can never be chained on both orphan and snapshot lists.
We make use of this fact to overload the in-memory inode field
ext4_inode_info.i_orphan for the chaining of snapshots.
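
The on-disk chain can be pictured with a toy model. Everything here is
hypothetical (toy_fs, toy_list_add, toy_list_del): the head lives in a
super block field, each inode stores the inode number of the next
snapshot, and 0 terminates the list. Unlike the patch, which finds the
previous entry through the in-memory list, this sketch simply walks the
chain.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_INODES 16

/* Hypothetical model of the on-disk fields involved. */
struct toy_fs {
	unsigned int s_snapshot_list;			/* head, 0 = empty */
	unsigned int next_snapshot[TOY_MAX_INODES];	/* per-inode link */
};

/* Insert inode at the head of the chain. */
static void toy_list_add(struct toy_fs *fs, unsigned int ino)
{
	assert(fs != NULL && ino > 0 && ino < TOY_MAX_INODES);

	fs->next_snapshot[ino] = fs->s_snapshot_list;
	fs->s_snapshot_list = ino;
}

/*
 * Remove inode from the chain by redirecting whichever link points at
 * it (the super block head or the previous inode's next field).
 * Returns 0 on success, -1 if the inode is not on the list.
 */
static int toy_list_del(struct toy_fs *fs, unsigned int ino)
{
	unsigned int *link;

	assert(fs != NULL && ino > 0 && ino < TOY_MAX_INODES);

	link = &fs->s_snapshot_list;
	while (*link != 0 && *link != ino)
		link = &fs->next_snapshot[*link];
	if (*link == 0)
		return -1;
	*link = fs->next_snapshot[ino];
	fs->next_snapshot[ino] = 0;
	return 0;
}
```

Adding inodes 3, 5 and 7 in that order yields the chain 7 -> 5 -> 3, so
the head always names the most recently added snapshot.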


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/snapshot_ctl.c | 329 ++++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/super.c | 1 +
3 files changed, 307 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8d82125..ea1f38a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1194,6 +1194,7 @@ struct ext4_sb_info {
struct block_device *journal_bdev;
struct mutex s_snapshot_mutex; /* protects 2 fields below: */
struct inode *s_active_snapshot; /* [ s_snapshot_mutex ] */
+ struct list_head s_snapshot_list; /* [ s_snapshot_mutex ] */
#ifdef CONFIG_JBD2_DEBUG
struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */
wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index a610025..298405a 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -132,6 +132,168 @@ static void ext4_snapshot_reset_bitmap_cache(struct super_block *sb)
}

/*
+ * A modified version of ext4_orphan_add(), used to add a snapshot inode
+ * to the head of the on-disk and in-memory lists.
+ * in-memory i_orphan list field is overloaded, because inodes on snapshots
+ * list cannot be unlinked nor truncated.
+ */
+static int ext4_inode_list_add(handle_t *handle, struct inode *inode,
+ __u32 *i_next, __le32 *s_last,
+ struct list_head *s_list, const char *name)
+{
+ struct super_block *sb = inode->i_sb;
+ struct ext4_iloc iloc;
+ int err = 0, rc;
+
+ if (!ext4_handle_valid(handle))
+ return 0;
+
+ mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
+ if (!list_empty(&EXT4_I(inode)->i_orphan))
+ goto out_unlock;
+
+ BUFFER_TRACE(EXT4_SB(sb)->s_sbh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, EXT4_SB(sb)->s_sbh);
+ if (err)
+ goto out_unlock;
+
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ if (err)
+ goto out_unlock;
+
+ snapshot_debug(4, "add inode %lu to %s list\n",
+ inode->i_ino, name);
+
+ /* Insert this inode at the head of the on-disk inode list... */
+ *i_next = le32_to_cpu(*s_last);
+ *s_last = cpu_to_le32(inode->i_ino);
+ err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
+ rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
+ if (!err)
+ err = rc;
+
+ /* Only add to the head of the in-memory list if all the
+ * previous operations succeeded. */
+ if (!err)
+ list_add(&EXT4_I(inode)->i_orphan, s_list);
+
+ snapshot_debug(4, "last_%s will point to inode %lu\n",
+ name, inode->i_ino);
+ snapshot_debug(4, "%s inode %lu will point to inode %d\n",
+ name, inode->i_ino, *i_next);
+out_unlock:
+ mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
+ ext4_std_error(inode->i_sb, err);
+ return err;
+}
+
+static int ext4_snapshot_list_add(handle_t *handle, struct inode *inode)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ return ext4_inode_list_add(handle, inode, &NEXT_SNAPSHOT(inode),
+ &sbi->s_es->s_snapshot_list,
+ &sbi->s_snapshot_list, "snapshot");
+}
+
+#define NEXT_INODE_OFFSET (((char *)inode)-((char *)i_next))
+#define NEXT_INODE(i_prev) (*(__u32 *)(((char *)i_prev)-NEXT_INODE_OFFSET))
+
+/*
+ * A modified version of ext4_orphan_del(), used to remove a snapshot inode
+ * from the on-disk and in-memory lists.
+ * in-memory i_orphan list field is overloaded, because inodes on snapshots
+ * list cannot be unlinked nor truncated.
+ */
+static int ext4_inode_list_del(handle_t *handle, struct inode *inode,
+ __u32 *i_next, __le32 *s_last,
+ struct list_head *s_list, const char *name)
+{
+ struct list_head *prev;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi;
+ __u32 ino_next;
+ struct ext4_iloc iloc;
+ int err = 0;
+
+ /* ext4_handle_valid() assumes a valid handle_t pointer */
+ if (handle && !ext4_handle_valid(handle))
+ return 0;
+
+ mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+ if (list_empty(&ei->i_orphan))
+ goto out;
+
+ ino_next = *i_next;
+ prev = ei->i_orphan.prev;
+ sbi = EXT4_SB(inode->i_sb);
+
+ snapshot_debug(4, "remove inode %lu from %s list\n", inode->i_ino,
+ name);
+
+ list_del_init(&ei->i_orphan);
+
+ /* If we're on an error path, we may not have a valid
+ * transaction handle with which to update the orphan list on
+ * disk, but we still need to remove the inode from the linked
+ * list in memory. */
+ if (sbi->s_journal && !handle)
+ goto out;
+
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ if (err)
+ goto out_err;
+
+ if (prev == s_list) {
+ snapshot_debug(4, "last_%s will point to inode %lu\n", name,
+ (long unsigned int)ino_next);
+ BUFFER_TRACE(sbi->s_sbh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, sbi->s_sbh);
+ if (err)
+ goto out_brelse;
+ *s_last = cpu_to_le32(ino_next);
+ err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
+ } else {
+ struct ext4_iloc iloc2;
+ struct inode *i_prev;
+ i_prev = &list_entry(prev, struct ext4_inode_info,
+ i_orphan)->vfs_inode;
+
+ snapshot_debug(4, "%s inode %lu will point to inode %lu\n",
+ name, i_prev->i_ino, (long unsigned int)ino_next);
+ err = ext4_reserve_inode_write(handle, i_prev, &iloc2);
+ if (err)
+ goto out_brelse;
+ NEXT_INODE(i_prev) = ino_next;
+ err = ext4_mark_iloc_dirty(handle, i_prev, &iloc2);
+ }
+ if (err)
+ goto out_brelse;
+ *i_next = 0;
+ err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+
+out_err:
+ ext4_std_error(inode->i_sb, err);
+out:
+ mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+ return err;
+
+out_brelse:
+ brelse(iloc.bh);
+ goto out_err;
+}
+
+static int ext4_snapshot_list_del(handle_t *handle, struct inode *inode)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ return ext4_inode_list_del(handle, inode, &NEXT_SNAPSHOT(inode),
+ &sbi->s_es->s_snapshot_list,
+ &sbi->s_snapshot_list, "snapshot");
+}
+
+
+/*
* Snapshot control functions
*
* Snapshot files are controlled by changing snapshot flags with chattr and
@@ -395,11 +557,18 @@ static int ext4_snapshot_create(struct inode *inode)
ext4_fsblk_t bmap_blk = 0, imap_blk = 0, inode_blk = 0;
ext4_fsblk_t prev_inode_blk = 0;
ext4_fsblk_t snapshot_blocks = ext4_blocks_count(sbi->s_es);
- if (active_snapshot) {
- snapshot_debug(1, "failed to add snapshot because active "
- "snapshot (%u) has to be deleted first\n",
- active_snapshot->i_generation);
- return -EINVAL;
+ struct list_head *l, *list = &sbi->s_snapshot_list;
+
+ if (!list_empty(list)) {
+ struct inode *last_snapshot =
+ &list_first_entry(list, struct ext4_inode_info,
+ i_snaplist)->vfs_inode;
+ if (active_snapshot != last_snapshot) {
+ snapshot_debug(1, "failed to add snapshot because last"
+ " snapshot (%u) is not active\n",
+ last_snapshot->i_generation);
+ return -EINVAL;
+ }
}

/* prevent take of unlinked snapshot file */
@@ -455,14 +624,27 @@ static int ext4_snapshot_create(struct inode *inode)
/* record the file system size in the snapshot inode disksize field */
SNAPSHOT_SET_BLOCKS(inode, snapshot_blocks);

- lock_super(sb);
- err = ext4_journal_get_write_access(handle, sbi->s_sbh);
- sbi->s_es->s_snapshot_list = cpu_to_le32(inode->i_ino);
- if (!err)
- err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
- unlock_super(sb);
- if (err)
+ /* add snapshot list reference */
+ if (!igrab(inode)) {
+ err = -EIO;
+ goto out_handle;
+ }
+ /*
+ * First, the snapshot is added to the in-memory and on-disk list.
+ * At the end of snapshot_take(), it will become the active snapshot
+ * in-memory and on-disk.
+ * Finally, if snapshot_create() or snapshot_take() has failed,
+ * snapshot_update() will remove it from the in-memory and on-disk list.
+ */
+ err = ext4_snapshot_list_add(handle, inode);
+ /* on failure, drop the snapshot list reference taken above */
+ if (err) {
+ snapshot_debug(1, "failed to add snapshot (%u) to list\n",
+ inode->i_generation);
+ iput(inode);
goto out_handle;
+ }
+ l = list->next;

err = ext4_mark_inode_dirty(handle, inode);
if (err)
@@ -609,8 +791,10 @@ next_snapshot:
err = -EIO;
goto out_handle;
}
- if (ino == EXT4_ROOT_INO) {
- ino = inode->i_ino;
+ if (l != list) {
+ ino = list_entry(l, struct ext4_inode_info,
+ i_snaplist)->vfs_inode.i_ino;
+ l = l->next;
goto alloc_inode_blocks;
}
snapshot_debug(1, "snapshot (%u) created\n", inode->i_generation);
@@ -695,6 +879,8 @@ static char *copy_inode_block_name[COPY_INODE_BLOCKS_NUM] = {
*/
int ext4_snapshot_take(struct inode *inode)
{
+ struct list_head *list = &EXT4_SB(inode->i_sb)->s_snapshot_list;
+ struct list_head *l = list->next;
struct super_block *sb = inode->i_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = NULL;
@@ -885,8 +1071,15 @@ fix_inode_copy:
sync_dirty_buffer(sbh);

next_inode:
- if (curr_inode->i_ino == EXT4_ROOT_INO) {
- curr_inode = inode;
+ if (l == list && !fixing) {
+ /* done with copy pass - start fixing pass */
+ l = l->next;
+ fixing = 1;
+ }
+ if (l != list) {
+ curr_inode = &list_entry(l, struct ext4_inode_info,
+ i_snaplist)->vfs_inode;
+ l = l->next;
goto copy_inode_blocks;
}

@@ -1082,14 +1275,11 @@ static int ext4_snapshot_remove(struct inode *inode)
if (err)
goto out_handle;

- lock_super(inode->i_sb);
- err = ext4_journal_get_write_access(handle, sbi->s_sbh);
- sbi->s_es->s_snapshot_list = 0;
- if (!err)
- err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
- unlock_super(inode->i_sb);
+ err = ext4_snapshot_list_del(handle, inode);
if (err)
goto out_handle;
+ /* remove snapshot list reference - taken on snapshot_create() */
+ iput(inode);
/*
* At this point, this snapshot is empty and not on the snapshots list.
* As long as it was on the list it had to have the LIST flag to prevent
@@ -1145,6 +1335,10 @@ int ext4_snapshot_load(struct super_block *sb, struct ext4_super_block *es,
int err = 0, num = 0, snapshot_id = 0;
int has_active = 0;

+ if (!list_empty(&EXT4_SB(sb)->s_snapshot_list)) {
+ snapshot_debug(1, "warning: snapshots already loaded!\n");
+ return -EINVAL;
+ }

if (!load_ino && active_ino) {
/* snapshots list is empty and active snapshot exists */
@@ -1197,8 +1391,10 @@ int ext4_snapshot_load(struct super_block *sb, struct ext4_super_block *es,
has_active = 1;
}

- iput(inode);
- break;
+ list_add_tail(&EXT4_I(inode)->i_snaplist,
+ &EXT4_SB(sb)->s_snapshot_list);
+ load_ino = NEXT_SNAPSHOT(inode);
+ /* keep snapshot list reference */
}

if (err) {
@@ -1225,6 +1421,16 @@ int ext4_snapshot_load(struct super_block *sb, struct ext4_super_block *es,
*/
void ext4_snapshot_destroy(struct super_block *sb)
{
+ struct list_head *l, *n;
+ /* iterate safe because we are deleting from list and freeing the
+ * inodes */
+ list_for_each_safe(l, n, &EXT4_SB(sb)->s_snapshot_list) {
+ struct inode *inode = &list_entry(l, struct ext4_inode_info,
+ i_snaplist)->vfs_inode;
+ list_del_init(&EXT4_I(inode)->i_snaplist);
+ /* remove snapshot list reference */
+ iput(inode);
+ }
/* deactivate in-memory active snapshot - cannot fail */
(void) ext4_snapshot_set_active(sb, NULL);
}
@@ -1244,6 +1450,11 @@ int ext4_snapshot_update(struct super_block *sb, int cleanup, int read_only)
struct inode *active_snapshot = ext4_snapshot_has_active(sb);
struct inode *used_by = NULL; /* last non-deleted snapshot found */
int deleted;
+ struct inode *inode;
+ struct ext4_inode_info *ei;
+ int found_active = 0;
+ int found_enabled = 0;
+ struct list_head *prev;
int err = 0;

BUG_ON(read_only && cleanup);
@@ -1255,6 +1466,76 @@ int ext4_snapshot_update(struct super_block *sb, int cleanup, int read_only)
EXT4_SNAPSTATE_ACTIVE);
}

+ /* iterate safe from oldest snapshot backwards */
+ prev = EXT4_SB(sb)->s_snapshot_list.prev;
+ if (list_empty(prev))
+ return 0;
+
+update_snapshot:
+ ei = list_entry(prev, struct ext4_inode_info, i_snaplist);
+ inode = &ei->vfs_inode;
+ prev = ei->i_snaplist.prev;
+
+ /* all snapshots on the list have the LIST flag */
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_LIST);
+ /* set the 'No_Dump' flag on all snapshots */
+ ext4_set_inode_flag(inode, EXT4_NODUMP_FL);
+
+ /*
+ * snapshots later than active (failed take) should be removed.
+ * no active snapshot means failed first snapshot take.
+ */
+ if (found_active || !active_snapshot) {
+ if (!read_only)
+ err = ext4_snapshot_remove(inode);
+ goto prev_snapshot;
+ }
+
+ deleted = ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_DELETED);
+ if (!deleted && read_only)
+ /* auto enable snapshots on readonly mount */
+ ext4_snapshot_enable(inode);
+
+ /*
+ * after completion of a snapshot management operation,
+ * only the active snapshot can have the ACTIVE flag
+ */
+ if (inode == active_snapshot) {
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_ACTIVE);
+ found_active = 1;
+ deleted = 0;
+ } else
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_ACTIVE);
+
+ if (found_enabled)
+ /* snapshot is in use by an older enabled snapshot */
+ ext4_set_inode_snapstate(inode, EXT4_SNAPSTATE_INUSE);
+ else
+ /* snapshot is not in use by older enabled snapshots */
+ ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_INUSE);
+
+ if (cleanup && deleted && !used_by)
+ /* remove permanently unused deleted snapshot */
+ err = ext4_snapshot_remove(inode);
+
+ if (!deleted) {
+ if (!found_active)
+ /* newer snapshots are potentially used by
+ * this snapshot (when it is enabled) */
+ used_by = inode;
+ if (ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_ENABLED))
+ found_enabled = 1;
+ else
+ SNAPSHOT_SET_DISABLED(inode);
+ } else
+ SNAPSHOT_SET_DISABLED(inode);
+
+prev_snapshot:
+ if (err)
+ return err;
+ /* update prev snapshot */
+ if (prev != &EXT4_SB(sb)->s_snapshot_list)
+ goto update_snapshot;

if (!active_snapshot || !cleanup || used_by)
return 0;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index fc8bfda..a1c4728 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3543,6 +3543,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

mutex_init(&sbi->s_snapshot_mutex);
sbi->s_active_snapshot = NULL;
+ INIT_LIST_HEAD(&sbi->s_snapshot_list); /* snapshot files */

needs_recovery = (es->s_last_orphan != 0 ||
EXT4_HAS_INCOMPAT_FEATURE(sb,
--
1.7.4.1


2011-06-07 15:10:16

by Amir G.

Subject: [PATCH v1 29/36] ext4: snapshot race conditions - concurrent COW bitmap operations

From: Amir Goldstein <[email protected]>

Wait for pending COW bitmap creations to complete.
When concurrent tasks try to COW buffers from the same block group
for the first time, the first task to reset the COW bitmap cache
is elected to create the new COW bitmap block. The rest of the tasks
wait (in an msleep(1) loop) until the COW bitmap cache is up to date.
The COWing task copies the bitmap block into the new COW bitmap block
and updates the COW bitmap cache with the new block number.
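
The election described above can be sketched in userspace as follows. This is
only an illustration, not the patch's code: the struct and function names are
hypothetical, and a C11 compare-and-swap stands in for the
ext4_lock_group()/ext4_unlock_group() pair that the patch uses to protect
bg_cow_bitmap. The three states are the ones listed in the patch comment:
0 (uninitialized), equal to the block bitmap location (COW pending), or any
other value (location of the initialized COW bitmap block).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-group cache; not the kernel structure. */
struct group_sketch {
	uint64_t block_bitmap;		/* location of the group's block bitmap */
	_Atomic uint64_t cow_bitmap;	/* 0, == block_bitmap (pending), or other */
};

enum cow_role { COW_ELECTED, COW_WAIT, COW_READY };

/*
 * One pass of the loop: classify the caller and report the value it saw.
 * Only one caller can win the 0 -> block_bitmap transition, so only one
 * caller is ever told to create the COW bitmap block.
 */
static enum cow_role cow_bitmap_probe(struct group_sketch *g, uint64_t *blk)
{
	uint64_t seen = 0;

	if (atomic_compare_exchange_strong(&g->cow_bitmap, &seen,
					   g->block_bitmap)) {
		*blk = 0;
		return COW_ELECTED;
	}
	/* the exchange failed, so 'seen' now holds the current value */
	*blk = seen;
	return seen == g->block_bitmap ? COW_WAIT : COW_READY;
}

/* The elected task publishes the location of the new COW bitmap block. */
static void cow_bitmap_publish(struct group_sketch *g, uint64_t new_blk)
{
	atomic_store(&g->cow_bitmap, new_blk);
}
```

A caller that gets COW_WAIT would sleep briefly and probe again; a caller that
gets COW_READY can read the COW bitmap block at the returned location.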


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/snapshot.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index 2724381..000e655 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -248,9 +248,48 @@ ext4_snapshot_read_cow_bitmap(handle_t *handle, struct inode *snapshot,

bitmap_blk = ext4_block_bitmap(sb, desc);

- ext4_lock_group(sb, block_group);
- cow_bitmap_blk = grp->bg_cow_bitmap;
- ext4_unlock_group(sb, block_group);
+ /*
+ * Handle concurrent COW bitmap operations.
+ * bg_cow_bitmap has 3 states:
+ * = 0 - uninitialized (after mount and after snapshot take).
+ * = bg_block_bitmap - marks pending COW of block bitmap.
+ * other - location of initialized COW bitmap block.
+ *
+ * The first task to access block group after mount or snapshot take,
+ * will read the uninitialized state, mark pending COW state, initialize
+ * the COW bitmap block and update COW bitmap cache. Other tasks will
+ * busy wait until the COW bitmap cache is in initialized state, before
+ * reading the COW bitmap block.
+ */
+ do {
+ ext4_lock_group(sb, block_group);
+ cow_bitmap_blk = grp->bg_cow_bitmap;
+ if (cow_bitmap_blk == 0)
+ /* mark pending COW of bitmap block */
+ grp->bg_cow_bitmap = bitmap_blk;
+ ext4_unlock_group(sb, block_group);
+
+ if (cow_bitmap_blk == 0) {
+ snapshot_debug(3, "initializing COW bitmap #%u "
+ "of snapshot (%u)...\n",
+ block_group, snapshot->i_generation);
+ /* sleep 1 tunable delay unit */
+ snapshot_test_delay(SNAPTEST_BITMAP);
+ break;
+ }
+ if (cow_bitmap_blk == bitmap_blk) {
+ /* wait for another task to COW bitmap block */
+ snapshot_debug_once(2, "waiting for pending COW "
+ "bitmap #%d...\n", block_group);
+ /*
+ * This is an unlikely event that can happen only once
+ * per block_group/snapshot, so msleep(1) is sufficient
+ * and there is no need for a wait queue.
+ */
+ msleep(1);
+ }
+ /* XXX: Should we fail after N retries? */
+ } while (cow_bitmap_blk == 0 || cow_bitmap_blk == bitmap_blk);
if (cow_bitmap_blk)
return sb_bread(sb, cow_bitmap_blk);

--
1.7.4.1


2011-06-07 15:10:14

by Amir G.

Subject: [PATCH v1 28/36] ext4: snapshot list - read through to previous snapshot

From: Amir Goldstein <[email protected]>

On snapshot page read, the function ext4_get_block() is called
to map the page to a disk block. If the page is not mapped in the
snapshot file, the newer snapshots on the list are checked and the
oldest found mapping is returned. If the page is not mapped in any of
the newer snapshots, a direct mapping to the block device is returned.
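
The lookup order described above can be sketched as a walk from the snapshot
being read towards the newest (active) snapshot. This is a simplified
userspace model, not the patch's code: the struct, the fixed-size map, the use
of 0 to mean "hole", and the identity mapping from logical block to block
device block are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NBLOCKS 8

/* Hypothetical model of a snapshot file's block map. */
struct snap_sketch {
	struct snap_sketch *newer;	/* NULL for the active (newest) snapshot */
	uint64_t map[SKETCH_NBLOCKS];	/* 0 = hole, otherwise a physical block */
};

/*
 * Return the physical block to read for logical block 'lblk' of 'snap'.
 * The first mapping found while walking towards newer snapshots is the
 * oldest one, which is the copy that was saved closest to the time 'snap'
 * was taken.
 */
static uint64_t snap_read_through(const struct snap_sketch *snap,
				  unsigned int lblk)
{
	for (; snap; snap = snap->newer)
		if (snap->map[lblk])
			return snap->map[lblk];
	/* hole in every snapshot: the block was never COWed, read the device */
	return lblk;
}
```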


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/snapshot_inode.c | 74 +++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
index 74b455d..a97411e 100644
--- a/fs/ext4/snapshot_inode.c
+++ b/fs/ext4/snapshot_inode.c
@@ -46,12 +46,62 @@
* in which case 'prev_snapshot' is pointed to the previous snapshot
* on the list or set to NULL to indicate read through to block device.
*/
+/*
+ * In-memory snapshot list manipulation is protected by snapshot_mutex.
+ * In this function we read the in-memory snapshot list without holding
+ * snapshot_mutex, because we don't want to slow down snapshot read performance.
+ * Following is a proof, that even though we don't hold snapshot_mutex here,
+ * reading the list is safe from races with snapshot list delete and add (take).
+ *
+ * Proof of no race with snapshot delete:
+ * --------------------------------------
+ * We get here only when reading from an enabled snapshot or when reading
+ * through from an enabled snapshot to a newer snapshot. Snapshot delete
+ * operation is only allowed for a disabled snapshot, when no older enabled
+ * snapshot exists (i.e., the deleted snapshot is not 'in-use'). Hence,
+ * read through is safe from races with snapshot list delete operations.
+ *
+ * Proof of no race with snapshot take:
+ * ------------------------------------
+ * Snapshot B take is composed of the following steps:
+ * ext4_snapshot_create():
+ * - Add snapshot B to head of list (active_snapshot is A).
+ * - Allocate and copy snapshot B initial blocks.
+ * ext4_snapshot_take():
+ * - Freeze FS
+ * - Clear snapshot A 'active' flag.
+ * - Set snapshot B 'list'+'active' flags.
+ * - Set snapshot B as active snapshot (active_snapshot=B).
+ * - Unfreeze FS
+ *
+ * Note that we do not need to rely on correct order of instructions within
+ * each of the functions above, but we can assume that Freeze FS will provide
+ * a strong barrier between adding B to list and the ops inside snapshot_take.
+ *
+ * When reading from snapshot A during snapshot B take, we have 2 cases:
+ * 1. is_active(A) is tested before setting active_snapshot=B -
+ * read through from A to block device.
+ * 2. is_active(A) is tested after setting active_snapshot=B -
+ * read through from A to B.
+ *
+ * When reading from snapshot B during snapshot B take, we have 2 cases:
+ * 1. B->flags and B->prev are read before adding B to list
+ * AND/OR before setting the 'list'+'active' flags -
+ * access to B denied.
+ * 2. is_active(B) is tested after setting active_snapshot=B
+ * AND/OR after setting the 'list'+'active' flags -
+ * read through from B to block device.
+ */
static int ext4_snapshot_get_block_access(struct inode *inode,
struct inode **prev_snapshot)
{
struct ext4_inode_info *ei = EXT4_I(inode);
unsigned long flags = ext4_get_snapstate_flags(inode);
+ struct list_head *prev = ei->i_snaplist.prev;

+ if (!(flags & 1UL<<EXT4_SNAPSTATE_LIST))
+ /* snapshot not on the list - read/write access denied */
+ return -EPERM;

*prev_snapshot = NULL;
if (ext4_snapshot_is_active(inode) ||
@@ -59,7 +109,23 @@ static int ext4_snapshot_get_block_access(struct inode *inode,
/* read through from active snapshot to block device */
return 0;

- return -EPERM;
+ if (prev == &ei->i_snaplist)
+ /* not on snapshots list? */
+ return -EIO;
+
+ if (prev == &EXT4_SB(inode->i_sb)->s_snapshot_list)
+ /* active snapshot not found on list? */
+ return -EIO;
+
+ /* read through to prev snapshot on the list */
+ ei = list_entry(prev, struct ext4_inode_info, i_snaplist);
+ *prev_snapshot = &ei->vfs_inode;
+
+ if (!ext4_snapshot_file(*prev_snapshot))
+ /* non snapshot file on the list? */
+ return -EIO;
+
+ return 0;
}

#ifdef CONFIG_EXT4_DEBUG
@@ -122,6 +188,7 @@ static int ext4_snapshot_read_through(struct inode *inode, sector_t iblock,
map.m_pblk = 0;
map.m_len = bh_result->b_size >> inode->i_blkbits;

+get_block:
prev_snapshot = NULL;
/* request snapshot file read access */
err = ext4_snapshot_get_block_access(inode, &prev_snapshot);
@@ -134,6 +201,11 @@ static int ext4_snapshot_read_through(struct inode *inode, sector_t iblock,
prev_snapshot ? prev_snapshot->i_generation : 0);
if (err < 0)
return err;
+ if (!err && prev_snapshot) {
+ /* hole in snapshot - check again with prev snapshot */
+ inode = prev_snapshot;
+ goto get_block;
+ }
if (!err)
/* hole in active snapshot - read though to block device */
return 0;
--
1.7.4.1


2011-06-07 15:10:19

by Amir G.

Subject: [PATCH v1 30/36] ext4: snapshot race conditions - concurrent COW operations

From: Amir Goldstein <[email protected]>

Wait for pending COW operations to complete.
When concurrent tasks try to COW the same buffer, the task that takes
the active snapshot i_data_sem is elected as the COWing task.
The COWing task allocates a new snapshot block and creates a buffer
cache entry with ref_count=1 for that new block. It then locks the
new buffer and marks it with the buffer_new flag. The rest of the
tasks wait (in msleep(1) loop), until the buffer_new flag is cleared.
The COWing task copies the source buffer into the 'new' buffer,
unlocks it, clears the buffer_new flag and drops its reference count.
On active snapshot readpage, the buffer cache is checked.
If a 'new' buffer entry is found, the reader task waits until the
buffer_new flag is cleared and then copies the 'new' buffer directly
into the snapshot file page.
The sleep loop method was copied from LVM snapshot code, which does
the same thing to deal with these (rare) races without wait queues.
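
The start/end/wait protocol described above can be sketched in userspace as
follows. The names are hypothetical and C11 atomics stand in for the buffer
head 'new' flag and reference count. The bounded poll count is an assumption
added so that the sketch terminates; the patch itself loops with msleep(1)
and leaves a retry limit as an open question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the snapshot block's buffer head. */
struct buf_sketch {
	atomic_bool pending_new;	/* plays the role of the buffer_new flag */
	atomic_int refcount;		/* plays the role of b_count */
};

/* COWing task: mark the COW as pending and pin the buffer. */
static void start_pending_cow(struct buf_sketch *b)
{
	atomic_store(&b->pending_new, true);
	atomic_fetch_add(&b->refcount, 1);
}

/* COWing task: the copy is done (or was canceled), so unpin the buffer. */
static void end_pending_cow(struct buf_sketch *b)
{
	atomic_store(&b->pending_new, false);
	atomic_fetch_sub(&b->refcount, 1);
}

/*
 * Other tasks: poll until the pending flag clears.
 * Returns true once the COW is complete, false if max_polls ran out.
 */
static bool wait_pending_cow(struct buf_sketch *b, int max_polls)
{
	while (atomic_load(&b->pending_new)) {
		if (max_polls-- <= 0)
			return false;
		/* the patch sleeps here; the sketch just polls again */
	}
	return true;
}
```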


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 26 ++++++++++++++++++
fs/ext4/snapshot.c | 11 ++++++++
fs/ext4/snapshot.h | 64 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot_inode.c | 40 ++++++++++++++++++++++++++++
4 files changed, 141 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index de40993..89a97da 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1049,6 +1049,7 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
int depth;
int count = 0;
ext4_fsblk_t first_block = 0;
+ struct buffer_head *sbh = NULL;

trace_ext4_ind_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
J_ASSERT(!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)));
@@ -1155,6 +1156,25 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
if (err)
goto cleanup;

+ if (SNAPMAP_ISCOW(flags)) {
+ /*
+ * COWing block or creating COW bitmap.
+ * we now have exclusive access to the COW destination block
+ * and we are about to create the snapshot block mapping
+ * and make it public.
+ * grab the buffer cache entry and mark it new
+ * to indicate a pending COW operation.
+ * the refcount for the buffer cache will be released
+ * when the COW operation is either completed or canceled.
+ */
+ sbh = sb_getblk(inode->i_sb, le32_to_cpu(chain[depth-1].key));
+ if (!sbh) {
+ err = -EIO;
+ goto cleanup;
+ }
+ ext4_snapshot_start_pending_cow(sbh);
+ }
+
if (map->m_flags & EXT4_MAP_REMAP) {
map->m_len = count;
/* move old block to snapshot */
@@ -1198,6 +1218,12 @@ got_it:
/* Clean up and exit */
partial = chain + depth - 1; /* the whole chain */
cleanup:
+ /* cancel pending COW operation on failure to alloc snapshot block */
+ if (SNAPMAP_ISCOW(flags)) {
+ if (err < 0 && sbh)
+ ext4_snapshot_end_pending_cow(sbh);
+ brelse(sbh);
+ }
while (partial > chain) {
BUFFER_TRACE(partial->bh, "call brelse");
brelse(partial->bh);
diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index 000e655..bd6a833 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -115,6 +115,8 @@ ext4_snapshot_complete_cow(handle_t *handle, struct inode *snapshot,
if (sync)
sync_dirty_buffer(sbh);
out:
+ /* COW operation is complete */
+ ext4_snapshot_end_pending_cow(sbh);
return err;
}

@@ -688,6 +690,12 @@ int ext4_snapshot_test_and_cow(const char *where, handle_t *handle,
* we allocated this block -
* copy block data to snapshot and complete COW operation
*/
+ snapshot_debug(3, "COWing block [%llu/%llu] of snapshot "
+ "(%u)...\n",
+ SNAPSHOT_BLOCK_TUPLE(block),
+ active_snapshot->i_generation);
+ /* sleep 1 tunable delay unit */
+ snapshot_test_delay(SNAPTEST_COW);
err = ext4_snapshot_copy_buffer_cow(handle, active_snapshot,
sbh, bh);
if (err)
@@ -700,6 +708,9 @@ int ext4_snapshot_test_and_cow(const char *where, handle_t *handle,

trace_cow_inc(handle, copied);
test_pending_cow:
+ if (sbh)
+ /* wait for pending COW to complete */
+ ext4_snapshot_test_pending_cow(sbh, block);

cowed:
/* mark the buffer COWed in the current transaction */
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 44bac96..37f5c2d 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -474,6 +474,70 @@ static inline int ext4_snapshot_mow_in_tid(struct inode *inode)
ext4_snapshot_get_tid(inode->i_sb));
}

+/*
+ * Pending COW functions
+ */
+
+/*
+ * Start pending COW operation from get_blocks_handle()
+ * after allocating snapshot block and before connecting it
+ * to the snapshot inode.
+ */
+static inline void ext4_snapshot_start_pending_cow(struct buffer_head *sbh)
+{
+ /*
+ * setting the 'new' flag on a newly allocated snapshot block buffer
+ * indicates that the COW operation is pending.
+ */
+ set_buffer_new(sbh);
+ /* keep buffer in cache as long as we need to test the 'new' flag */
+ get_bh(sbh);
+}
+
+/*
+ * End pending COW operation started in get_blocks_handle().
+ * Called on failure to connect the new snapshot block to the inode
+ * or on successful completion of the COW operation.
+ */
+static inline void ext4_snapshot_end_pending_cow(struct buffer_head *sbh)
+{
+ /*
+ * clearing the 'new' flag from the snapshot block buffer
+ * indicates that the COW operation is complete.
+ */
+ clear_buffer_new(sbh);
+ /* we no longer need to keep the buffer in cache */
+ put_bh(sbh);
+}
+
+/*
+ * Test for pending COW operation and wait for its completion.
+ */
+static inline void ext4_snapshot_test_pending_cow(struct buffer_head *sbh,
+ sector_t blocknr)
+{
+ while (buffer_new(sbh)) {
+ /* wait for pending COW to complete */
+ snapshot_debug_once(2, "waiting for pending cow: "
+ "block = [%llu/%llu]...\n",
+ SNAPSHOT_BLOCK_TUPLE(blocknr));
+ /*
+ * An unusually long pending COW operation can be caused by
+ * the debugging function snapshot_test_delay(SNAPTEST_COW)
+ * and by waiting for tracked reads to complete.
+ * The new COW buffer is locked during those events, so wait
+ * on the buffer before the short msleep.
+ */
+ wait_on_buffer(sbh);
+ /*
+ * This is an unlikely event that can happen only once per
+ * block/snapshot, so msleep(1) is sufficient and there is
+ * no need for a wait queue.
+ */
+ msleep(1);
+ /* XXX: Should we fail after N retries? */
+ }
+}

#else /* CONFIG_EXT4_FS_SNAPSHOT */

diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
index a97411e..55cac07 100644
--- a/fs/ext4/snapshot_inode.c
+++ b/fs/ext4/snapshot_inode.c
@@ -183,6 +183,7 @@ static int ext4_snapshot_read_through(struct inode *inode, sector_t iblock,
int err;
struct ext4_map_blocks map;
struct inode *prev_snapshot;
+ struct buffer_head *sbh = NULL;

map.m_lblk = iblock;
map.m_pblk = 0;
@@ -214,6 +215,45 @@ get_block:
bh_result->b_state = (bh_result->b_state & ~EXT4_MAP_FLAGS) |
map.m_flags;

+ /*
+ * On read of active snapshot, a mapped block may belong to a non
+ * completed COW operation. Use the buffer cache to test this
+ * condition. if (bh_result->b_blocknr == SNAPSHOT_BLOCK(iblock)),
+ * then this is either read through to block device or moved block.
+ * Either way, it is not a COWed block, so it cannot be pending COW.
+ */
+ if (ext4_snapshot_is_active(inode) &&
+ bh_result->b_blocknr != SNAPSHOT_BLOCK(iblock))
+ sbh = sb_find_get_block(inode->i_sb, bh_result->b_blocknr);
+ if (!sbh)
+ return 0;
+ /* wait for pending COW to complete */
+ ext4_snapshot_test_pending_cow(sbh, SNAPSHOT_BLOCK(iblock));
+ lock_buffer(sbh);
+ if (buffer_uptodate(sbh)) {
+ /*
+ * Avoid disk I/O and copy out snapshot page directly
+ * from block device page when possible.
+ */
+ BUG_ON(!sbh->b_page);
+ BUG_ON(!bh_result->b_page);
+ lock_buffer(bh_result);
+ copy_highpage(bh_result->b_page, sbh->b_page);
+ set_buffer_uptodate(bh_result);
+ unlock_buffer(bh_result);
+ } else if (buffer_dirty(sbh)) {
+ /*
+ * If snapshot data buffer is dirty (just been COWed),
+ * then it is not safe to read it from disk yet.
+ * We shouldn't get here because snapshot data buffer
+ * only becomes dirty during COW and because we waited
+ * for pending COW to complete, which means that a
+ * dirty snapshot data buffer should be uptodate.
+ */
+ WARN_ON(1);
+ }
+ unlock_buffer(sbh);
+ brelse(sbh);
return 0;
}

--
1.7.4.1


2011-06-07 15:10:21

by Amir G.

Subject: [PATCH v1 31/36] ext4: snapshot race conditions - tracked reads

From: Amir Goldstein <[email protected]>

Wait for pending read I/O requests to complete.
When a snapshot file readpage reads through to the block device,
the reading task increments the block tracked readers count.
Upon completion of the async read I/O request of the snapshot page,
the tracked readers count is decremented.
When a task is COWing a block with non-zero tracked readers count,
that task has to wait (in msleep(1) loop), until the block's tracked
readers count drops to zero, before the COW operation is completed.
After a pending COW operation has started, reader tasks have to wait
(again, in msleep(1) loop), until the pending COW operation is
completed, so the COWing task cannot be starved by reader tasks.
The sleep loop method was copied from LVM snapshot code, which does
the same thing to deal with these (rare) races without wait queues.
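
The patch below keeps the tracked readers count in the upper bits of the
buffer's reference count, using a shift of 16. A minimal userspace sketch of
that packing, with hypothetical names and a plain C11 atomic_int in place of
b_count, might look like this. The plain_refs() helper is not in the patch and
is added only to show that ordinary references are left untouched.

```c
#include <assert.h>
#include <stdatomic.h>

#define TRACKED_SHIFT 16	/* same shift value as the patch uses */

/* A tracked reader adds one unit above the ordinary reference bits. */
static void get_tracked_reader(atomic_int *count)
{
	atomic_fetch_add(count, 1 << TRACKED_SHIFT);
}

static void put_tracked_reader(atomic_int *count)
{
	atomic_fetch_sub(count, 1 << TRACKED_SHIFT);
}

/* Number of tracked readers: the bits above the shift. */
static int tracked_readers(atomic_int *count)
{
	return atomic_load(count) >> TRACKED_SHIFT;
}

/* Ordinary references: the low bits (illustration only, not in the patch). */
static int plain_refs(atomic_int *count)
{
	return atomic_load(count) & ((1 << TRACKED_SHIFT) - 1);
}
```

A COWing task would then wait while tracked_readers() is greater than zero
before completing the copy.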


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 6 ++
fs/ext4/snapshot.c | 17 +++++
fs/ext4/snapshot.h | 76 ++++++++++++++++++++++
fs/ext4/snapshot_buffer.c | 155 +++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot_inode.c | 24 +++++++
5 files changed, 278 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ea1f38a..0599fef 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2238,12 +2238,18 @@ enum ext4_state_bits {
* now used by snapshot to do mow
*/
BH_Partial_Write, /* Buffer should be uptodate before write */
+ BH_Tracked_Read, /* Buffer read I/O is being tracked,
+ * to serialize write I/O to block device.
+ * that is, don't write over this block
+ * until I finished reading it.
+ */
};

BUFFER_FNS(Uninit, uninit)
TAS_BUFFER_FNS(Uninit, uninit)
BUFFER_FNS(Remap, remap)
BUFFER_FNS(Partial_Write, partial_write)
+BUFFER_FNS(Tracked_Read, tracked_read)

/*
* Add new method to test wether block and inode bitmaps are properly
diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index bd6a833..a1e4175 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -107,6 +107,23 @@ ext4_snapshot_complete_cow(handle_t *handle, struct inode *snapshot,
{
int err = 0;

+ /* wait for completion of tracked reads before completing COW */
+ while (bh && buffer_tracked_readers_count(bh) > 0) {
+ snapshot_debug_once(2, "waiting for tracked reads: "
+ "block = [%llu/%llu], "
+ "tracked_readers_count = %d...\n",
+ SNAPSHOT_BLOCK_TUPLE(bh->b_blocknr),
+ buffer_tracked_readers_count(bh));
+ /*
+ * Quote from LVM snapshot pending_complete() function:
+ * "Check for conflicting reads. This is extremely improbable,
+ * so msleep(1) is sufficient and there is no need for a wait
+ * queue." (drivers/md/dm-snap.c).
+ */
+ msleep(1);
+ /* XXX: Should we fail after N retries? */
+ }
+
unlock_buffer(sbh);
err = ext4_jbd2_file_inode(handle, snapshot);
if (err)
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 37f5c2d..3282fe7 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -538,6 +538,82 @@ static inline void ext4_snapshot_test_pending_cow(struct buffer_head *sbh,
/* XXX: Should we fail after N retries? */
}
}
+/*
+ * A tracked reader takes 0x10000 reference counts on the block device buffer.
+ * b_count is not likely to reach 0x10000 by get_bh() calls, but even if it
+ * does, that will only affect the result of buffer_tracked_readers_count().
+ * After 0x10000 subsequent calls to get_bh_tracked_reader(), b_count will
+ * overflow, but that requires 0x10000 parallel readers from 0x10000 different
+ * snapshots and very slow disk I/O...
+ */
+#define BH_TRACKED_READERS_COUNT_SHIFT 16
+
+static inline void get_bh_tracked_reader(struct buffer_head *bdev_bh)
+{
+ atomic_add(1<<BH_TRACKED_READERS_COUNT_SHIFT, &bdev_bh->b_count);
+}
+
+static inline void put_bh_tracked_reader(struct buffer_head *bdev_bh)
+{
+ atomic_sub(1<<BH_TRACKED_READERS_COUNT_SHIFT, &bdev_bh->b_count);
+}
+
+static inline int buffer_tracked_readers_count(struct buffer_head *bdev_bh)
+{
+ return atomic_read(&bdev_bh->b_count)>>BH_TRACKED_READERS_COUNT_SHIFT;
+}
+
+/* buffer.c */
+extern int start_buffer_tracked_read(struct buffer_head *bh);
+extern void cancel_buffer_tracked_read(struct buffer_head *bh);
+extern int ext4_read_full_page(struct page *page, get_block_t *get_block);
+
+#ifdef CONFIG_EXT4_DEBUG
+extern void __ext4_trace_bh_count(const char *fn, struct buffer_head *bh);
+#define ext4_trace_bh_count(bh) __ext4_trace_bh_count(__func__, bh)
+#else
+#define ext4_trace_bh_count(bh)
+#define __ext4_trace_bh_count(fn, bh)
+#endif
+
+#define sb_bread(sb, blk) ext4_sb_bread(__func__, sb, blk)
+#define sb_getblk(sb, blk) ext4_sb_getblk(__func__, sb, blk)
+#define sb_find_get_block(sb, blk) ext4_sb_find_get_block(__func__, sb, blk)
+
+static inline struct buffer_head *
+ext4_sb_bread(const char *fn, struct super_block *sb, sector_t block)
+{
+ struct buffer_head *bh;
+
+ bh = __bread(sb->s_bdev, block, sb->s_blocksize);
+ if (bh)
+ __ext4_trace_bh_count(fn, bh);
+ return bh;
+}
+
+static inline struct buffer_head *
+ext4_sb_getblk(const char *fn, struct super_block *sb, sector_t block)
+{
+ struct buffer_head *bh;
+
+ bh = __getblk(sb->s_bdev, block, sb->s_blocksize);
+ if (bh)
+ __ext4_trace_bh_count(fn, bh);
+ return bh;
+}
+
+static inline struct buffer_head *
+ext4_sb_find_get_block(const char *fn, struct super_block *sb, sector_t block)
+{
+ struct buffer_head *bh;
+
+ bh = __find_get_block(sb->s_bdev, block, sb->s_blocksize);
+ if (bh)
+ __ext4_trace_bh_count(fn, bh);
+ return bh;
+}
+
+

#else /* CONFIG_EXT4_FS_SNAPSHOT */

diff --git a/fs/ext4/snapshot_buffer.c b/fs/ext4/snapshot_buffer.c
index acea9a3..387965e 100644
--- a/fs/ext4/snapshot_buffer.c
+++ b/fs/ext4/snapshot_buffer.c
@@ -55,6 +55,156 @@ static void buffer_io_error(struct buffer_head *bh)
}

/*
+ * Tracked read functions.
+ * When reading through an ext4 snapshot file hole to a block device block,
+ * all writes to this block need to wait for completion of the async read.
+ * ext4_snapshot_readpage() always calls ext4_read_full_page() to attach
+ * a buffer head to the page and be aware of tracked reads.
+ * ext4_snapshot_get_block() calls start_buffer_tracked_read() to mark both
+ * snapshot page buffer and block device page buffer.
+ * ext4_snapshot_get_block() calls cancel_buffer_tracked_read() if snapshot
+ * doesn't need to read through to the block device.
+ * ext4_read_full_page() calls submit_buffer_tracked_read() to submit a
+ * tracked async read.
+ * end_buffer_async_read() calls end_buffer_tracked_read() to complete the
+ * tracked read operation.
+ * The only lock needed in all these functions is PageLock on the snapshot page,
+ * which is guaranteed in readpage() and verified in ext4_read_full_page().
+ * The block device page buffer doesn't need any lock because the operations
+ * {get|put}_bh_tracked_reader() are atomic.
+ */
+
+#ifdef CONFIG_EXT4_DEBUG
+/*
+ * trace maximum value of b_count on all fs buffers to see if we are
+ * overflowing to upper word (tracked readers count)
+ */
+void __ext4_trace_bh_count(const char *fn, struct buffer_head *bh)
+{
+ static sector_t blocknr;
+ static int maxcount;
+ static int maxbit = 1;
+ static int maxorder;
+ int count = atomic_read(&bh->b_count) & 0x0000ffff;
+
+ BUG_ON(count < 0);
+ if (count <= maxcount)
+ return;
+ maxcount = count;
+ blocknr = bh->b_blocknr;
+
+ if (count <= maxbit)
+ return;
+ while (count > maxbit) {
+ maxbit <<= 1;
+ maxorder++;
+ }
+
+ snapshot_debug(maxorder > 7 ? 1 : 2,
+ "%s: buffer refcount maxorder = %d, "
+ "maxcount = 0x%08x, block = [%llu/%llu].\n",
+ fn, maxorder, maxcount,
+ SNAPSHOT_BLOCK_TUPLE(blocknr));
+}
+#endif
+
+/*
+ * start buffer tracked read
+ * called from inside get_block()
+ * get tracked reader ref count on buffer cache entry
+ * and set buffer tracked read flag
+ */
+int start_buffer_tracked_read(struct buffer_head *bh)
+{
+ struct buffer_head *bdev_bh;
+
+ BUG_ON(buffer_tracked_read(bh));
+ BUG_ON(!buffer_mapped(bh));
+
+ /* grab the buffer cache entry */
+ bdev_bh = __getblk(bh->b_bdev, bh->b_blocknr, bh->b_size);
+ if (!bdev_bh)
+ return -EIO;
+
+ BUG_ON(bdev_bh == bh);
+ ext4_trace_bh_count(bdev_bh);
+ set_buffer_tracked_read(bh);
+ get_bh_tracked_reader(bdev_bh);
+ put_bh(bdev_bh);
+ return 0;
+}
+
+/*
+ * cancel buffer tracked read
+ * called for tracked read that was started but was not submitted
+ * put tracked reader ref count on buffer cache entry
+ * and clear buffer tracked read flag
+ */
+void cancel_buffer_tracked_read(struct buffer_head *bh)
+{
+ struct buffer_head *bdev_bh;
+
+ BUG_ON(!buffer_tracked_read(bh));
+ BUG_ON(!buffer_mapped(bh));
+
+ /* try to grab the buffer cache entry */
+ bdev_bh = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size);
+ BUG_ON(!bdev_bh || bdev_bh == bh);
+ ext4_trace_bh_count(bdev_bh);
+ clear_buffer_tracked_read(bh);
+ clear_buffer_mapped(bh);
+ put_bh_tracked_reader(bdev_bh);
+ put_bh(bdev_bh);
+}
+
+/*
+ * submit buffer tracked read
+ * save a reference to buffer cache entry and submit I/O
+ */
+static int submit_buffer_tracked_read(struct buffer_head *bh)
+{
+ struct buffer_head *bdev_bh;
+ BUG_ON(!buffer_tracked_read(bh));
+ BUG_ON(!buffer_mapped(bh));
+ /* tracked read doesn't work with multiple buffers per page */
+ BUG_ON(bh->b_this_page != bh);
+
+ /*
+ * Try to grab the buffer cache entry before submitting async read
+ * because we cannot call blocking function __find_get_block()
+ * in interrupt context inside end_buffer_tracked_read().
+ */
+ bdev_bh = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size);
+ BUG_ON(!bdev_bh || bdev_bh == bh);
+ ext4_trace_bh_count(bdev_bh);
+ /* override page buffers list with reference to buffer cache entry */
+ bh->b_this_page = bdev_bh;
+ submit_bh(READ, bh);
+ return 0;
+}
+
+/*
+ * end buffer tracked read
+ * complete submitted tracked read
+ */
+static void end_buffer_tracked_read(struct buffer_head *bh)
+{
+ struct buffer_head *bdev_bh = bh->b_this_page;
+
+ BUG_ON(!buffer_tracked_read(bh));
+ BUG_ON(!bdev_bh || bdev_bh == bh);
+ bh->b_this_page = bh;
+ /*
+ * clear the buffer mapping to make sure
+ * that get_block() will always be called
+ */
+ clear_buffer_mapped(bh);
+ clear_buffer_tracked_read(bh);
+ put_bh_tracked_reader(bdev_bh);
+ put_bh(bdev_bh);
+}
+
+/*
* I/O completion handler for ext4_read_full_page() - pages
* which come unlocked at the end of I/O.
*/
@@ -68,6 +218,9 @@ static void end_buffer_async_read(struct buffer_head *bh, int uptodate)

BUG_ON(!buffer_async_read(bh));

+ if (buffer_tracked_read(bh))
+ end_buffer_tracked_read(bh);
+
page = bh->b_page;
if (uptodate) {
set_buffer_uptodate(bh);
@@ -229,6 +382,8 @@ int ext4_read_full_page(struct page *page, get_block_t *get_block)
*/
for (i = 0; i < nr; i++) {
bh = arr[i];
+ if (buffer_tracked_read(bh))
+ return submit_buffer_tracked_read(bh);
if (buffer_uptodate(bh))
end_buffer_async_read(bh, 1);
else
diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
index 55cac07..f2311e4 100644
--- a/fs/ext4/snapshot_inode.c
+++ b/fs/ext4/snapshot_inode.c
@@ -195,11 +195,35 @@ get_block:
err = ext4_snapshot_get_block_access(inode, &prev_snapshot);
if (err < 0)
return err;
+ if (!prev_snapshot) {
+ /*
+ * Possible read through to block device.
+ * Start tracked read before checking if block is mapped to
+ * avoid race condition with COW that maps the block after
+ * we checked if the block is mapped. If we find that the
+ * block is mapped, we will cancel the tracked read before
+ * returning from this function.
+ */
+ map_bh(bh_result, inode->i_sb, SNAPSHOT_BLOCK(iblock));
+ err = start_buffer_tracked_read(bh_result);
+ if (err < 0) {
+ snapshot_debug(1,
+ "snapshot (%u) failed to start "
+ "tracked read on block (%lld) "
+ "(err=%d)\n", inode->i_generation,
+ (long long)bh_result->b_blocknr, err);
+ return err;
+ }
+ }
err = ext4_map_blocks(NULL, inode, &map, 0);
snapshot_debug(4, "ext4_snapshot_read_through(%lld): block = "
"(%lld), err = %d\n prev_snapshot = %u",
(long long)iblock, map.m_pblk, err,
prev_snapshot ? prev_snapshot->i_generation : 0);
+ /* if it's not a hole - cancel tracked read before we deadlock
+ * on pending COW */
+ if (err && buffer_tracked_read(bh_result))
+ cancel_buffer_tracked_read(bh_result);
if (err < 0)
return err;
if (!err && prev_snapshot) {
--
1.7.4.1
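
The tracked-reader helpers in snapshot.h keep two counters in one atomic word: ordinary references in the low 16 bits and tracked readers above them. Below is a minimal user-space sketch of that packing, assuming C11 atomics in place of the kernel's atomic_t. struct demo_bh and all demo_* names are hypothetical stand-ins for illustration and are not kernel API.

```c
#include <assert.h>
#include <stdatomic.h>

/* Same shift value as BH_TRACKED_READERS_COUNT_SHIFT in the patch. */
#define DEMO_TRACKED_READERS_SHIFT 16

/* Hypothetical stand-in for the b_count field of struct buffer_head. */
struct demo_bh {
	atomic_int b_count;
};

/* Ordinary reference, in the spirit of get_bh(): adds 1 to the low bits. */
static void demo_get_bh(struct demo_bh *bh)
{
	atomic_fetch_add(&bh->b_count, 1);
}

/* Drop an ordinary reference; the low 16 bits must not already be zero. */
static void demo_put_bh(struct demo_bh *bh)
{
	assert((atomic_load(&bh->b_count) & 0xffff) > 0);
	atomic_fetch_sub(&bh->b_count, 1);
}

/* A tracked reader adds 0x10000, leaving the low 16 bits untouched. */
static void demo_get_tracked_reader(struct demo_bh *bh)
{
	atomic_fetch_add(&bh->b_count, 1 << DEMO_TRACKED_READERS_SHIFT);
}

static void demo_put_tracked_reader(struct demo_bh *bh)
{
	atomic_fetch_sub(&bh->b_count, 1 << DEMO_TRACKED_READERS_SHIFT);
}

/* Number of tracked readers: the upper part of the counter. */
static int demo_tracked_readers_count(struct demo_bh *bh)
{
	return atomic_load(&bh->b_count) >> DEMO_TRACKED_READERS_SHIFT;
}

/* Number of ordinary references: the low 16 bits of the counter. */
static int demo_plain_refcount(struct demo_bh *bh)
{
	return atomic_load(&bh->b_count) & 0xffff;
}
```

Because both counters share one word, a single atomic add or subtract is enough on either side, which matches the patch's remark that the block device buffer needs no extra lock for these operations.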


2011-06-07 15:10:24

by Amir G.

Subject: [PATCH v1 32/36] ext4: snapshot exclude - the exclude bitmap

From: Amir Goldstein <[email protected]>

Mark all snapshot blocks as excluded from COW (i.e., mark that
they do not need to be COWed). Excluded blocks appear as not
allocated inside the snapshot image (there are no snapshots of
snapshot files). Excluding snapshot file blocks is essential for
efficient cleanup of deleted snapshot files.
Blocks are excluded by setting their bits in the exclude bitmap.
There is one exclude bitmap block per block group, which is allocated
by mkfs when the exclude_bitmap feature is set. The exclude bitmap
location is stored in the group descriptor.
The exclude_bitmap feature is backward compatible, but online resize
support with exclude_bitmap has not been implemented yet.
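
The effect of the exclude bitmap on a snapshot's COW bitmap can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: bits are numbered least-significant-bit first within each byte, and the demo_* names are hypothetical, not functions from this series.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Build a COW bitmap from a block bitmap, masking out excluded blocks.
 * A bit ends up set only if the block is allocated (set in src) and
 * not excluded (clear in mask). With mask == NULL the block bitmap is
 * copied unchanged.
 */
static void demo_init_cow_bitmap(unsigned char *dst,
				 const unsigned char *src,
				 const unsigned char *mask, size_t len)
{
	size_t i;

	assert(dst && src);
	for (i = 0; i < len; i++)
		dst[i] = mask ? (unsigned char)(src[i] & ~mask[i]) : src[i];
}

/* Test one bit, least-significant bit first within each byte. */
static int demo_test_bit(const unsigned char *bitmap, size_t bit)
{
	return (bitmap[bit / 8] >> (bit % 8)) & 1;
}
```

With a block bitmap byte of 0x0f and an exclude byte of 0x05, the resulting COW byte is 0x0a: blocks 0 and 2 are excluded and therefore look unallocated to the snapshot.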


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/balloc.c | 107 ++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 12 ++++-
fs/ext4/mballoc.c | 69 +++++++++++++++++++++++
fs/ext4/resize.c | 6 ++
fs/ext4/snapshot.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot.h | 28 ++++++++++
fs/ext4/snapshot_ctl.c | 7 +++
fs/ext4/snapshot_inode.c | 11 ++++
fs/ext4/super.c | 35 ++++++++++++-
9 files changed, 408 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1c140e4..b1303f2 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -65,6 +65,11 @@ static int ext4_group_used_meta_blocks(struct super_block *sb,
/* block bitmap, inode bitmap, and inode table blocks */
int used_blocks = sbi->s_itb_per_group + 2;

+ if (EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP))
+ /* exclude bitmap */
+ used_blocks += 1;
+
if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_FLEX_BG)) {
if (!ext4_block_in_group(sb, ext4_block_bitmap(sb, gdp),
block_group))
@@ -157,6 +162,14 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
tmp = ext4_block_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
ext4_set_bit(tmp - start, bh->b_data);
+ if (EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP)) {
+ tmp = ext4_exclude_bitmap(sb, gdp);
+ if (!flex_bg ||
+ ext4_block_in_group(sb, tmp, block_group))
+ ext4_set_bit(tmp - start, bh->b_data);
+ }
+
tmp = ext4_inode_bitmap(sb, gdp);
if (!flex_bg || ext4_block_in_group(sb, tmp, block_group))
ext4_set_bit(tmp - start, bh->b_data);
@@ -361,6 +374,100 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
return bh;
}

+/* Initializes an uninitialized exclude bitmap if given, and returns 0 */
+unsigned ext4_init_exclude_bitmap(struct super_block *sb,
+ struct buffer_head *bh,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
+{
+ if (!bh)
+ /* we can return no. of blocks in exclude bitmap */
+ return 0;
+
+ J_ASSERT_BH(bh, buffer_locked(bh));
+ memset(bh->b_data, 0, sb->s_blocksize);
+ return 0;
+}
+
+/**
+ * ext4_read_exclude_bitmap()
+ * @sb: super block
+ * @block_group: given block group
+ *
+ * Read the exclude bitmap for a given block_group
+ *
+ * Return buffer_head on success or NULL in case of failure.
+ */
+struct buffer_head *
+ext4_read_exclude_bitmap(struct super_block *sb, ext4_group_t block_group)
+{
+ struct ext4_group_desc *desc;
+ struct buffer_head *bh = NULL;
+ ext4_fsblk_t bitmap_blk;
+
+ desc = ext4_get_group_desc(sb, block_group, NULL);
+ if (!desc)
+ return NULL;
+ bitmap_blk = ext4_exclude_bitmap(sb, desc);
+ if (!bitmap_blk)
+ return NULL;
+ bh = sb_getblk(sb, bitmap_blk);
+ if (unlikely(!bh)) {
+ ext4_error(sb, "Cannot read exclude bitmap - "
+ "block_group = %u, exclude_bitmap = %llu",
+ block_group, bitmap_blk);
+ return NULL;
+ }
+
+ if (bitmap_uptodate(bh))
+ return bh;
+
+ lock_buffer(bh);
+ if (bitmap_uptodate(bh)) {
+ unlock_buffer(bh);
+ return bh;
+ }
+
+ ext4_lock_group(sb, block_group);
+ if (desc->bg_flags & cpu_to_le16(EXT4_BG_EXCLUDE_UNINIT)) {
+ ext4_init_exclude_bitmap(sb, bh, block_group, desc);
+ set_bitmap_uptodate(bh);
+ set_buffer_uptodate(bh);
+ ext4_unlock_group(sb, block_group);
+ unlock_buffer(bh);
+ return bh;
+ }
+ ext4_unlock_group(sb, block_group);
+ if (buffer_uptodate(bh)) {
+ /*
+ * if the bitmap is not uninit and bh is uptodate,
+ * the bitmap is also uptodate
+ */
+ set_bitmap_uptodate(bh);
+ unlock_buffer(bh);
+ return bh;
+ }
+ /*
+ * submit the buffer_head for read. We can
+ * safely mark the bitmap as uptodate now.
+ * We do it here so the bitmap uptodate bit
+ * gets set with buffer lock held.
+ */
+ set_bitmap_uptodate(bh);
+ if (bh_submit_read(bh) < 0) {
+ put_bh(bh);
+ ext4_error(sb, "Cannot read exclude bitmap - "
+ "block_group = %u, exclude_bitmap = %llu",
+ block_group, bitmap_blk);
+ return NULL;
+ }
+ /*
+ * file system mounted not to panic on error,
+ * continue with corrupt bitmap
+ */
+ return bh;
+}
+
/**
* ext4_has_free_blocks()
* @sbi: in-core super block structure.
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0599fef..34aaade 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -269,7 +269,8 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_lo;/* Free inodes count */
__le16 bg_used_dirs_count_lo; /* Directories count */
__le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
- __u32 bg_reserved[2]; /* Likely block/inode bitmap checksum */
+ __le32 bg_exclude_bitmap_lo; /* Exclude bitmap block */
+ __u32 bg_reserved[1]; /* Likely block/inode bitmap checksum */
__le16 bg_itable_unused_lo; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
@@ -279,7 +280,8 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
__le16 bg_used_dirs_count_hi; /* Directories count MSB */
__le16 bg_itable_unused_hi; /* Unused inodes count MSB */
- __u32 bg_reserved2[3];
+ __le32 bg_exclude_bitmap_hi; /* Exclude bitmap block MSB */
+ __u32 bg_reserved2[2];
};

/*
@@ -295,6 +297,7 @@ struct flex_groups {
#define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
#define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
+#define EXT4_BG_EXCLUDE_UNINIT 0x0008 /* Exclude bitmap not in use */

/*
* Macro-instructions used to manage group descriptors
@@ -1433,6 +1436,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_COMPAT_EXT_ATTR 0x0008
#define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010
#define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020
+#define EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP 0x0100 /* Has exclude bitmap */

#define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT4_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
@@ -1775,6 +1779,8 @@ extern unsigned ext4_init_block_bitmap(struct super_block *sb,
struct ext4_group_desc *desc);
#define ext4_free_blocks_after_init(sb, group, desc) \
ext4_init_block_bitmap(sb, NULL, group, desc)
+extern struct buffer_head *ext4_read_exclude_bitmap(struct super_block *sb,
+ unsigned int block_group);

/* dir.c */
extern int __ext4_check_dir_entry(const char *, unsigned int, struct inode *,
@@ -1936,6 +1942,8 @@ extern ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
struct ext4_group_desc *bg);
extern ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
struct ext4_group_desc *bg);
+extern ext4_fsblk_t ext4_exclude_bitmap(struct super_block *sb,
+ struct ext4_group_desc *bg);
extern ext4_fsblk_t ext4_inode_table(struct super_block *sb,
struct ext4_group_desc *bg);
extern __u32 ext4_free_blks_count(struct super_block *sb,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 899c12c..50d2c9d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2846,6 +2846,11 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
if (err)
goto out_err;

+ if (EXT4_SNAPSHOTS(sb) && ext4_snapshot_excluded(ac->ac_inode)) {
+ err = ext4_snapshot_exclude_blocks(handle, sb, block, ac->ac_b_ex.fe_len);
+ if (err < 0)
+ goto out_err;
+ }

err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);

@@ -4515,6 +4520,12 @@ void __ext4_free_blocks(const char *where, unsigned int line, handle_t *handle,
int err = 0;
int ret;
int maxblocks;
+ struct buffer_head *exclude_bitmap_bh = NULL;
+ int exclude_bitmap_dirty = 0;
+ /* excluded_block is determined by testing exclude bitmap */
+ int excluded_block = 0;
+ /* excluded_file is an attribute of the inode */
+ int excluded_file = ext4_snapshot_excluded(inode);

if (bh) {
if (block)
@@ -4634,6 +4645,26 @@ do_more:
err = ext4_journal_get_write_access(handle, gd_bh);
if (err)
goto error_return;
+ /*
+ * we may be freeing blocks of snapshot/excluded file
+ * which we would need to clear from exclude bitmap -
+ * read the exclude bitmap and get write access to it;
+ * if either step fails, abort the free with an error
+ */
+ if (EXT4_SNAPSHOTS(sb)) {
+ brelse(exclude_bitmap_bh);
+ exclude_bitmap_bh = ext4_read_exclude_bitmap(sb, block_group);
+ if (!exclude_bitmap_bh) {
+ err = -EIO;
+ goto error_return;
+ }
+ err = ext4_journal_get_write_access_exclude(handle,
+ exclude_bitmap_bh);
+ if (err)
+ goto error_return;
+ exclude_bitmap_dirty = 0;
+ }
+
#ifdef AGGRESSIVE_CHECK
{
int i;
@@ -4641,6 +4672,29 @@ do_more:
BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
}
#endif
+ if (exclude_bitmap_bh) {
+ unsigned long i;
+
+ if (excluded_file)
+ i = mb_find_next_zero_bit(exclude_bitmap_bh->b_data,
+ bit + count, bit) - bit;
+ else
+ i = mb_find_next_bit(exclude_bitmap_bh->b_data,
+ bit + count, bit) - bit;
+ if (i < count) {
+ EXT4_SET_FLAGS(sb, EXT4_FLAGS_FIX_EXCLUDE);
+ ext4_error(sb, "%sexcluded file (ino=%lu)"
+ " block [%lu-%lu/%u, %llu] was %sexcluded!"
+ " - run fsck to fix exclude bitmap.\n",
+ excluded_file ? "" : "non-",
+ inode ? inode->i_ino : 0,
+ bit + i, bit + count,
+ block_group, block + i,
+ excluded_file ? "not " : "");
+ if (!excluded_file)
+ excluded_block = 1;
+ }
+ }
trace_ext4_mballoc_free(sb, inode, block_group, bit, count);

err = ext4_mb_load_buddy(sb, block_group, &e4b);
@@ -4675,6 +4729,14 @@ do_more:
mb_clear_bits(bitmap_bh->b_data, bit, count);
mb_free_blocks(inode, &e4b, bit, count);
}
+ /*
+ * A free block should never be excluded from snapshot, so we
+ * always clear exclude bitmap just to be on the safe side.
+ */
+ if (exclude_bitmap_bh && (excluded_file || excluded_block)) {
+ mb_clear_bits(exclude_bitmap_bh->b_data, bit, count);
+ exclude_bitmap_dirty = 1;
+ }

ret = ext4_free_blks_count(sb, gdp) + count;
ext4_free_blks_set(sb, gdp, ret);
@@ -4694,6 +4756,12 @@ do_more:
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
+ if (exclude_bitmap_bh && exclude_bitmap_dirty) {
+ ret = ext4_handle_dirty_metadata(handle, NULL,
+ exclude_bitmap_bh);
+ if (!err)
+ err = ret;
+ }

/* And the group descriptor block */
BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
@@ -4711,6 +4779,7 @@ do_more:
error_return:
if (freed)
dquot_free_block(inode, freed);
+ brelse(exclude_bitmap_bh);
brelse(bitmap_bh);
ext4_std_error(sb, err);
return;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index d341a5c..741d72c 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -753,6 +753,12 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
gdb_num = input->group / EXT4_DESC_PER_BLOCK(sb);
gdb_off = input->group % EXT4_DESC_PER_BLOCK(sb);

+ if (EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP)) {
+ ext4_warning(sb, "Can't resize filesystem with exclude bitmap");
+ return -ENOTSUPP;
+ }
+
if (gdb_off == 0 && !EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
ext4_warning(sb, "Can't resize non-sparse filesystem further");
diff --git a/fs/ext4/snapshot.c b/fs/ext4/snapshot.c
index a1e4175..58ec349 100644
--- a/fs/ext4/snapshot.c
+++ b/fs/ext4/snapshot.c
@@ -187,6 +187,7 @@ ext4_snapshot_init_cow_bitmap(struct super_block *sb,
unsigned int block_group, struct buffer_head *cow_bh)
{
struct buffer_head *bitmap_bh;
+ struct buffer_head *exclude_bitmap_bh = NULL;
char *dst, *src, *mask = NULL;

bitmap_bh = ext4_read_block_bitmap(sb, block_group);
@@ -194,6 +195,10 @@ ext4_snapshot_init_cow_bitmap(struct super_block *sb,
return -EIO;

src = bitmap_bh->b_data;
+ exclude_bitmap_bh = ext4_read_exclude_bitmap(sb, block_group);
+ if (exclude_bitmap_bh)
+ /* mask block bitmap with exclude bitmap */
+ mask = exclude_bitmap_bh->b_data;
/*
* Another COWing task may be changing this block bitmap
* (allocating active snapshot blocks) while we are trying
@@ -214,6 +219,7 @@ ext4_snapshot_init_cow_bitmap(struct super_block *sb,

ext4_unlock_group(sb, block_group);

+ brelse(exclude_bitmap_bh);
brelse(bitmap_bh);
return 0;
}
@@ -420,10 +426,129 @@ ext4_snapshot_test_cow_bitmap(handle_t *handle, struct inode *snapshot,

ret = ext4_mb_test_bit_range(bit, cow_bh->b_data, maxblocks);

+ if (ret && excluded) {
+ int i, inuse = *maxblocks;
+
+ /*
+ * We should never get here because excluded file blocks should
+ * be excluded from COW bitmap. The blocks will not be COWed
+ * anyway, but this can indicate a messed up exclude bitmap.
+ * Mark that exclude bitmap needs to be fixed and clear blocks
+ * from COW bitmap.
+ */
+ EXT4_SET_FLAGS(excluded->i_sb, EXT4_FLAGS_FIX_EXCLUDE);
+ ext4_warning(excluded->i_sb,
+ "clearing excluded file (ino=%lu) blocks [%d-%d/%lu] "
+ "from COW bitmap! - running fsck to fix exclude bitmap "
+ "is recommended.\n",
+ excluded->i_ino, bit, bit+inuse-1, block_group);
+ for (i = 0; i < inuse; i++)
+ ext4_clear_bit(bit+i, cow_bh->b_data);
+ ret = ext4_jbd2_file_inode(handle, snapshot);
+ mark_buffer_dirty(cow_bh);
+ }
+
brelse(cow_bh);
return ret;
}
/*
+ * ext4_snapshot_test_and_exclude() marks blocks in exclude bitmap
+ * @where: name of caller function
+ * @handle: JBD handle
+ * @sb: super block handle
+ * @block: address of first block to exclude
+ * @count: max. blocks to exclude
+ * @exclude: if false, return -EIO if block needs to be excluded
+ *
+ * Return values:
+ * >= 0 - no. of blocks set in exclude bitmap
+ * < 0 - error
+ */
+int ext4_snapshot_test_and_exclude(const char *where, handle_t *handle,
+ struct super_block *sb, ext4_fsblk_t block, int count,
+ int exclude)
+{
+ struct buffer_head *exclude_bitmap_bh = NULL;
+ struct buffer_head *gdp_bh = NULL;
+ struct ext4_group_desc *gdp = NULL;
+ ext4_group_t block_group = SNAPSHOT_BLOCK_GROUP(block);
+ ext4_grpblk_t bit = SNAPSHOT_BLOCK_GROUP_OFFSET(block);
+ int err = 0, n = 0, excluded = 0, exclude_uninit;
+
+ err = -EIO;
+ gdp = ext4_get_group_desc(sb, block_group, &gdp_bh);
+ if (!gdp)
+ goto out;
+
+ exclude_uninit = gdp->bg_flags & cpu_to_le16(EXT4_BG_EXCLUDE_UNINIT);
+ if (exclude && exclude_uninit) {
+ err = ext4_journal_get_write_access(handle, gdp_bh);
+ if (err)
+ goto out;
+ }
+
+ exclude_bitmap_bh = ext4_read_exclude_bitmap(sb, block_group);
+ if (!exclude_bitmap_bh)
+ return 0;
+
+ if (exclude)
+ err = ext4_journal_get_write_access_exclude(handle,
+ exclude_bitmap_bh);
+ if (err)
+ goto out;
+
+ while (count > 0 && bit < SNAPSHOT_BLOCKS_PER_GROUP) {
+ if (!ext4_set_bit_atomic(sb_bgl_lock(EXT4_SB(sb),
+ block_group),
+ bit, exclude_bitmap_bh->b_data)) {
+ n++;
+ if (!exclude)
+ break;
+ } else if (n) {
+ snapshot_debug(2, "excluded blocks: [%d-%d/%d]\n",
+ bit-n, bit-1, block_group);
+ excluded += n;
+ n = 0;
+ }
+ bit++;
+ count--;
+ }
+
+ if (n && !exclude) {
+ EXT4_SET_FLAGS(sb, EXT4_FLAGS_FIX_EXCLUDE);
+ ext4_error(sb, where,
+ "snapshot file block [%d/%d] not in exclude bitmap! - "
+ "running fsck to fix exclude bitmap is recommended.\n",
+ bit, block_group);
+ err = -EIO;
+ goto out;
+ }
+
+ if (n) {
+ snapshot_debug(2, "excluded blocks: [%d-%d/%d]\n",
+ bit-n, bit-1, block_group);
+ excluded += n;
+ }
+
+ if (exclude && excluded) {
+ err = ext4_handle_dirty_metadata(handle,
+ NULL, exclude_bitmap_bh);
+ if (err)
+ goto out;
+
+ if (exclude_uninit) {
+ gdp->bg_flags &= cpu_to_le16(~EXT4_BG_EXCLUDE_UNINIT);
+ err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);
+ }
+
+ trace_cow_add(handle, excluded, excluded);
+ }
+out:
+ brelse(exclude_bitmap_bh);
+ return err ? err : excluded;
+}
+
+/*
* COW functions
*/

@@ -732,6 +857,13 @@ test_pending_cow:
cowed:
/* mark the buffer COWed in the current transaction */
ext4_snapshot_mark_cowed(handle, bh);
+ if (clear) {
+ /* mark COWed block in exclude bitmap */
+ clear = ext4_snapshot_exclude_blocks(handle, sb,
+ block, 1);
+ if (clear < 0)
+ err = clear;
+ }
out:
brelse(sbh);
/* END COWing */
@@ -854,6 +986,10 @@ int ext4_snapshot_test_and_move(const char *where, handle_t *handle,
*/
if (inode)
dquot_free_block(inode, count);
+ /* mark moved blocks in exclude bitmap */
+ excluded = ext4_snapshot_exclude_blocks(handle, sb, block, count);
+ if (excluded < 0)
+ err = excluded;
trace_cow_add(handle, moved, count);
out:
/* END moving */
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 3282fe7..11fa43d 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -341,6 +341,34 @@ static inline int ext4_snapshot_get_delete_access(handle_t *handle,

return ext4_snapshot_move(handle, inode, block, pcount, 1);
}
+extern int ext4_snapshot_test_and_exclude(const char *where, handle_t *handle,
+ struct super_block *sb, ext4_fsblk_t block, int maxblocks,
+ int exclude);
+
+/*
+ * ext4_snapshot_exclude_blocks() - exclude snapshot blocks
+ *
+ * Called from ext4_snapshot_test_and_{cow,move}() when copying/moving
+ * blocks to active snapshot.
+ *
+ * Return <0 on error.
+ */
+#define ext4_snapshot_exclude_blocks(handle, sb, block, count) \
+ ext4_snapshot_test_and_exclude(__func__, (handle), (sb), \
+ (block), (count), 1)
+
+/*
+ * ext4_snapshot_test_excluded() - test that snapshot blocks are excluded
+ *
+ * Called from ext4_count_branches() and
+ * ext4_count_blocks() under snapshot_mutex.
+ *
+ * Return <0 on error or if snapshot blocks are not excluded.
+ */
+#define ext4_snapshot_test_excluded(inode, block, count) \
+ ext4_snapshot_test_and_exclude(__func__, NULL, (inode)->i_sb, \
+ (block), (count), 0)
+

extern void init_ext4_snapshot_cow_cache(void);

diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 298405a..9e39c04 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -886,6 +886,7 @@ int ext4_snapshot_take(struct inode *inode)
struct ext4_super_block *es = NULL;
struct buffer_head *es_bh = NULL;
struct buffer_head *sbh = NULL;
+ struct buffer_head *exclude_bitmap_bh = NULL;
struct buffer_head *bhs[COPY_INODE_BLOCKS_NUM] = { NULL };
const char *mask = NULL;
struct inode *curr_inode;
@@ -1027,6 +1028,11 @@ copy_inode_blocks:
brelse(bhs[COPY_INODE_BITMAP]);
bhs[COPY_INODE_BITMAP] = sb_bread(sb,
ext4_inode_bitmap(sb, desc));
+ brelse(exclude_bitmap_bh);
+ exclude_bitmap_bh = ext4_read_exclude_bitmap(sb, iloc.block_group);
+ if (exclude_bitmap_bh)
+ /* mask block bitmap with exclude bitmap */
+ mask = exclude_bitmap_bh->b_data;
err = -EIO;
for (i = 0; i < COPY_INODE_BLOCKS_NUM; i++) {
brelse(sbh);
@@ -1134,6 +1140,7 @@ out_unlockfs:
inode->i_generation);

out_err:
+ brelse(exclude_bitmap_bh);
brelse(es_bh);
brelse(sbh);
for (i = 0; i < COPY_INODE_BLOCKS_NUM; i++)
diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
index f2311e4..3d6cb7b 100644
--- a/fs/ext4/snapshot_inode.c
+++ b/fs/ext4/snapshot_inode.c
@@ -149,6 +149,7 @@ static int ext4_snapshot_get_blockdev_access(struct super_block *sb,
unsigned long block_group = SNAPSHOT_BLOCK_GROUP(bh->b_blocknr);
ext4_grpblk_t bit = SNAPSHOT_BLOCK_GROUP_OFFSET(bh->b_blocknr);
struct buffer_head *bitmap_bh;
+ struct buffer_head *exclude_bitmap_bh = NULL;
int err = 0;

if (PageReadahead(bh->b_page))
@@ -166,6 +167,16 @@ static int ext4_snapshot_get_blockdev_access(struct super_block *sb,
return -EIO;
}

+ exclude_bitmap_bh = ext4_read_exclude_bitmap(sb, block_group);
+ if (exclude_bitmap_bh &&
+ ext4_test_bit(bit, exclude_bitmap_bh->b_data)) {
+ snapshot_debug(2, "warning: attempt to read through to "
+ "excluded block [%d/%lu] - read ahead?\n",
+ bit, block_group);
+ err = -EIO;
+ }
+
+ brelse(exclude_bitmap_bh);
brelse(bitmap_bh);
return err;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a1c4728..7f15983 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -126,6 +126,18 @@ ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
(ext4_fsblk_t)le32_to_cpu(bg->bg_inode_bitmap_hi) << 32 : 0);
}

+ext4_fsblk_t ext4_exclude_bitmap(struct super_block *sb,
+ struct ext4_group_desc *bg)
+{
+ if (!EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP))
+ return 0;
+
+ return le32_to_cpu(bg->bg_exclude_bitmap_lo) |
+ (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
+ (ext4_fsblk_t)le32_to_cpu(bg->bg_exclude_bitmap_hi) << 32 : 0);
+}
+
ext4_fsblk_t ext4_inode_table(struct super_block *sb,
struct ext4_group_desc *bg)
{
@@ -2067,6 +2079,7 @@ static int ext4_check_descriptors(struct super_block *sb,
ext4_fsblk_t block_bitmap;
ext4_fsblk_t inode_bitmap;
ext4_fsblk_t inode_table;
+ ext4_fsblk_t exclude_bitmap;
int flexbg_flag = 0;
ext4_group_t i, grp = sbi->s_groups_count;

@@ -2110,10 +2123,23 @@ static int ext4_check_descriptors(struct super_block *sb,
"(block %llu)!", i, inode_table);
return 0;
}
+ if (EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP)) {
+ exclude_bitmap = ext4_exclude_bitmap(sb, gdp);
+ if (exclude_bitmap < first_block ||
+ exclude_bitmap > last_block) {
+ ext4_msg(sb, KERN_ERR,
+ "ext4_check_descriptors: "
+ "Exclude bitmap for group %u "
+ "not in group (block %llu)!",
+ i, exclude_bitmap);
+ return 0;
+ }
+ }
ext4_lock_group(sb, i);
if (!ext4_group_desc_csum_verify(sbi, i, gdp)) {
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
- "Checksum for group %u failed (%u!=%u)",
+ "Checksum for group %u failed (%x!=%x)",
i, le16_to_cpu(ext4_group_desc_csum(sbi, i,
gdp)), le16_to_cpu(gdp->bg_checksum));
if (!(sb->s_flags & MS_RDONLY)) {
@@ -2653,6 +2679,13 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
"features: meta_bg, 64bit");
return 0;
}
+ if (!EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP)) {
+ ext4_msg(sb, KERN_ERR,
+ "exclude_bitmap feature is required "
+ "for snapshots");
+ return 0;
+ }
if (EXT4_TEST_FLAGS(sb, EXT4_FLAGS_IS_SNAPSHOT)) {
ext4_msg(sb, KERN_ERR,
"A snapshot image must be mounted read-only. "
--
1.7.4.1
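
The ext4_exclude_bitmap() helper added to super.c follows the same lo/hi pattern as the other group descriptor accessors. Here is a small user-space sketch of that composition, assuming host byte order (the kernel version converts with le32_to_cpu()) and a 64-byte threshold for 64-bit descriptors. struct demo_group_desc and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed threshold, mirroring EXT4_MIN_DESC_SIZE_64BIT. */
#define DEMO_MIN_DESC_SIZE_64BIT 64

/* Hypothetical reduced group descriptor: only the two new fields. */
struct demo_group_desc {
	uint32_t exclude_bitmap_lo;	/* low 32 bits of the block number */
	uint32_t exclude_bitmap_hi;	/* high 32 bits, 64-bit descriptors only */
};

/*
 * Return the exclude bitmap block number, or 0 when the feature is off.
 * The high half is only consulted when the descriptor is large enough
 * to contain it.
 */
static uint64_t demo_exclude_bitmap(const struct demo_group_desc *gd,
				    unsigned int desc_size, int has_feature)
{
	uint64_t blk;

	assert(gd);
	if (!has_feature)
		return 0;

	blk = gd->exclude_bitmap_lo;
	if (desc_size >= DEMO_MIN_DESC_SIZE_64BIT)
		blk |= (uint64_t)gd->exclude_bitmap_hi << 32;
	return blk;
}
```

Returning 0 for a missing feature lets callers such as ext4_read_exclude_bitmap() treat "no exclude bitmap" and "feature disabled" the same way.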


2011-06-07 15:10:30

by Amir G.

Subject: [PATCH v1 34/36] ext4: snapshot cleanup - shrink deleted snapshots

From: Amir Goldstein <[email protected]>

Free the blocks of deleted snapshots that are not in use by an older
non-deleted snapshot. Shrinking helps reclaim disk space while older
snapshots are still in use (enabled).
We modify the indirect inode truncate helper functions so that the
snapshot cleanup functions can use them to free blocks selectively,
according to a COW bitmap buffer.
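
The selective free can be pictured as a walk over an array of block pointers that skips every entry whose bit is set in the COW bitmap. The following user-space sketch makes two assumptions: bits are numbered least-significant-bit first within each byte, and the actual block release is reduced to zeroing the pointer. The demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test one bit, least-significant bit first within each byte. */
static int demo_test_bit(const unsigned char *bitmap, size_t bit)
{
	return (bitmap[bit / 8] >> (bit % 8)) & 1;
}

/*
 * Clear the block pointers in [first, last) that may be freed.
 * A pointer is kept when its bit (offset from @bit) is set in
 * @cow_bitmap, meaning an older snapshot still uses the block.
 * With cow_bitmap == NULL every non-zero pointer is cleared, which
 * corresponds to an ordinary truncate. Returns the number of
 * pointers cleared.
 */
static int demo_shrink_pointers(uint32_t *first, uint32_t *last,
				const unsigned char *cow_bitmap, size_t bit)
{
	uint32_t *p;
	int freed = 0;

	assert(first <= last);
	for (p = first; p < last; p++) {
		if (*p == 0)
			continue;	/* hole, nothing to free */
		if (cow_bitmap &&
		    demo_test_bit(cow_bitmap, bit + (size_t)(p - first)))
			continue;	/* block used by older snapshot */
		*p = 0;
		freed++;
	}
	return freed;
}
```

Passing a NULL bitmap keeps the original truncate behaviour, which is the same role the NULL arguments play in the ext4_free_data() wrapper macro added by this patch.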


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 10 ++
fs/ext4/inode.c | 51 ++++++++---
fs/ext4/snapshot.h | 5 +
fs/ext4/snapshot_ctl.c | 232 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/snapshot_inode.c | 220 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 504 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6f0f310..92d75c2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1894,6 +1894,16 @@ extern void ext4_free_branches(handle_t *handle, struct inode *inode,
struct buffer_head *parent_bh,
__le32 *first, __le32 *last,
int depth);
+extern void ext4_free_data_cow(handle_t *handle, struct inode *inode,
+ struct buffer_head *this_bh,
+ __le32 *first, __le32 *last,
+ const char *bitmap, int bit,
+ int *pfreed_blocks);
+
+#define ext4_free_data(handle, inode, bh, first, last) \
+ ext4_free_data_cow(handle, inode, bh, first, last, \
+ NULL, 0, NULL)
+
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5199035..ad9463f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4523,11 +4523,15 @@ no_top:
* Return 0 on success, 1 on invalid block range
* and < 0 on fatal error.
*/
-static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
- struct buffer_head *bh,
- ext4_fsblk_t block_to_free,
- unsigned long count, __le32 *first,
- __le32 *last)
+/*
+ * ext4_clear_blocks_cow - Zero a number of block pointers (consult COW bitmap)
+ * @bitmap: COW bitmap to consult when shrinking deleted snapshot
+ * @bit: bit number representing the @first block
+ */
+static int ext4_clear_blocks_cow(handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, ext4_fsblk_t block_to_free,
+ unsigned long count, __le32 *first, __le32 *last,
+ const char *bitmap, int bit)
{
__le32 *p;
int flags = EXT4_FREE_BLOCKS_FORGET | EXT4_FREE_BLOCKS_VALIDATED;
@@ -4567,8 +4571,12 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
}
}

- for (p = first; p < last; p++)
+ for (p = first; p < last; p++) {
+ if (*p && bitmap && ext4_test_bit(bit + (p - first), bitmap))
+ /* don't free block used by older snapshot */
+ continue;
*p = 0;
+ }

ext4_free_blocks(handle, inode, NULL, block_to_free, count, flags);
return 0;
@@ -4596,9 +4604,17 @@ out_err:
* @this_bh will be %NULL if @first and @last point into the inode's direct
* block pointers.
*/
-static void ext4_free_data(handle_t *handle, struct inode *inode,
+/*
+ * ext4_free_data_cow - free a list of data blocks (consult COW bitmap)
+ * @bitmap: COW bitmap to consult when shrinking deleted snapshot
+ * @bit: bit number representing the @first block
+ * @pfreed_blocks: return number of freed blocks
+ */
+void ext4_free_data_cow(handle_t *handle, struct inode *inode,
struct buffer_head *this_bh,
- __le32 *first, __le32 *last)
+ __le32 *first, __le32 *last,
+ const char *bitmap, int bit,
+ int *pfreed_blocks)
{
ext4_fsblk_t block_to_free = 0; /* Starting block # of a run */
unsigned long count = 0; /* Number of blocks in the run */
@@ -4622,6 +4638,11 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,

for (p = first; p < last; p++) {
nr = le32_to_cpu(*p);
+ if (nr && bitmap && ext4_test_bit(bit + (p - first), bitmap))
+ /* don't free block used by older snapshot */
+ nr = 0;
+ if (nr && pfreed_blocks)
+ ++(*pfreed_blocks);
if (nr) {
/* accumulate blocks to free if they're contiguous */
if (count == 0) {
@@ -4631,9 +4652,10 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
} else if (nr == block_to_free + count) {
count++;
} else {
- err = ext4_clear_blocks(handle, inode, this_bh,
- block_to_free, count,
- block_to_free_p, p);
+ err = ext4_clear_blocks_cow(handle, inode,
+ this_bh, block_to_free, count,
+ block_to_free_p, p, bitmap,
+ bit + (block_to_free_p - first));
if (err)
break;
block_to_free = nr;
@@ -4643,9 +4665,10 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
}
}

- if (!err && count > 0)
- err = ext4_clear_blocks(handle, inode, this_bh, block_to_free,
- count, block_to_free_p, p);
+ if (count > 0)
+ err = ext4_clear_blocks_cow(handle, inode, this_bh,
+ block_to_free, count, block_to_free_p, p,
+ bitmap, bit + (block_to_free_p - first));
if (err < 0)
/* fatal error */
return;
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 11fa43d..53d4481 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -401,6 +401,11 @@ static inline void exit_ext4_snapshot(void)
{
}

+/* snapshot_inode.c */
+extern int ext4_snapshot_shrink_blocks(handle_t *handle, struct inode *inode,
+ sector_t iblock, unsigned long maxblocks,
+ struct buffer_head *cow_bh,
+ int shrink, int *pmapped);

/* tests if @inode is a snapshot file */
static inline int ext4_snapshot_file(struct inode *inode)
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 13048f5..710e157 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -1379,6 +1379,221 @@ out_err:
}

/*
+ * ext4_snapshot_shrink_range - free unused blocks from deleted snapshots
+ * @handle: JBD handle for this transaction
+ * @start: latest non-deleted snapshot before deleted snapshots group
+ * @end: first non-deleted snapshot after deleted snapshots group
+ * @iblock: inode offset to first data block to shrink
+ * @maxblocks: inode range of data blocks to shrink
+ * @cow_bh: buffer head to map the COW bitmap block of snapshot @start
+ * if NULL, don't look for COW bitmap block
+ *
+ * Shrinks @maxblocks blocks starting at inode offset @iblock in a group of
+ * subsequent deleted snapshots starting after @start and ending before @end.
+ * Shrinking is done by finding a range of mapped blocks in @start snapshot
+ * or in one of the deleted snapshots, where no other blocks are mapped in the
+ * same range in @start snapshot or in snapshots between them.
+ * The blocks in the found range may be 'in-use' by @start snapshot, so only
+ * blocks which are not set in the COW bitmap are freed.
+ * All mapped blocks of other deleted snapshots in the same range are freed.
+ *
+ * Called from ext4_snapshot_shrink() under snapshot_mutex.
+ * Returns the shrunk blocks range and <0 on error.
+ */
+static int ext4_snapshot_shrink_range(handle_t *handle,
+ struct inode *start, struct inode *end,
+ sector_t iblock, unsigned long maxblocks,
+ struct buffer_head *cow_bh)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(start->i_sb);
+ struct list_head *l;
+ struct inode *inode = start;
+ /* start with @maxblocks range and narrow it down */
+ int err, count = maxblocks;
+ /* @start snapshot blocks should not be freed, only counted */
+ int mapped, shrink = 0;
+
+ /* iterate on (@start <= snapshot < @end) */
+ list_for_each_prev(l, &EXT4_I(start)->i_snaplist) {
+ err = ext4_snapshot_shrink_blocks(handle, inode,
+ iblock, count, cow_bh, shrink, &mapped);
+ if (err < 0)
+ return err;
+
+ /* 0 < new range <= old range */
+ BUG_ON(!err || err > count);
+ count = err;
+ cond_resched();
+
+ /*
+ * shrink mode state transitions:
+ * 1. on @start, shrink is set to 0 ('don't free' mode).
+ * 2. after @start, shrink is incremented until mapped blocks
+ * are found in the shrunk range ('free unused' mode).
+ * 3. after mapped blocks were found, or if cow_bh is NULL,
+ * shrink is set to -1 and decremented until the end of
+ * the deleted snapshots group ('free all' mode).
+ */
+ if (shrink < 0)
+ /* stay in 'free all' mode */
+ shrink--;
+ else if (!cow_bh)
+ /* no COW bitmap - enter 'free all' mode */
+ shrink = -1;
+ else if (mapped)
+ /* found mapped blocks - enter 'free all' mode */
+ shrink = -1;
+ else
+ /* enter/stay in 'free unused' mode */
+ shrink++;
+
+ if (l == &sbi->s_snapshot_list)
+ /* didn't reach @end */
+ return -EINVAL;
+ inode = &list_entry(l, struct ext4_inode_info,
+ i_snaplist)->vfs_inode;
+ if (inode == end)
+ break;
+ /* indicate shrink progress via i_size */
+ SNAPSHOT_SET_PROGRESS(inode, SNAPSHOT_BLOCK(iblock));
+ }
+ return count;
+}
+
+/*
+ * ext4_snapshot_shrink - free unused blocks from deleted snapshot files
+ * @handle: JBD handle for this transaction
+ * @start: latest non-deleted snapshot before deleted snapshots group
+ * @end: first non-deleted snapshot after deleted snapshots group
+ * @need_shrink: no. of deleted snapshots in the group
+ *
+ * Frees all blocks in subsequent deleted snapshots starting after @start and
+ * ending before @end, except for blocks which are 'in-use' by @start snapshot.
+ * (blocks 'in-use' are set in snapshot COW bitmap and not copied to snapshot).
+ * Called from ext4_snapshot_update() under snapshot_mutex.
+ * Returns 0 on success and <0 on error.
+ */
+static int ext4_snapshot_shrink(struct inode *start, struct inode *end,
+ int need_shrink)
+{
+ struct list_head *l;
+ handle_t *handle;
+ struct buffer_head cow_bitmap, *cow_bh = NULL;
+ ext4_fsblk_t block = 1; /* skip super block */
+ struct ext4_sb_info *sbi = EXT4_SB(start->i_sb);
+ /* blocks beyond the size of @start are not in-use by @start */
+ ext4_fsblk_t snapshot_blocks = SNAPSHOT_BLOCKS(start);
+ unsigned long count = ext4_blocks_count(sbi->s_es) - block;
+ long block_group = -1;
+ ext4_fsblk_t bg_boundary = 0;
+ int err, ret;
+
+ snapshot_debug(3, "snapshot (%u-%u) shrink: "
+ "count = 0x%lx, need_shrink = %d\n",
+ start->i_generation, end->i_generation,
+ count, need_shrink);
+
+ /* start large truncate transaction that will be extended/restarted */
+ handle = ext4_journal_start(start, EXT4_MAX_TRANS_DATA);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ while (count > 0) {
+ while (block >= bg_boundary) {
+ /* reset COW bitmap cache */
+ cow_bitmap.b_state = 0;
+ cow_bitmap.b_blocknr = 0;
+ cow_bh = &cow_bitmap;
+ bg_boundary += SNAPSHOT_BLOCKS_PER_GROUP;
+ block_group++;
+ if (block >= snapshot_blocks)
+ /*
+ * Past last snapshot block group - pass NULL
+ * cow_bh to ext4_snapshot_shrink_range().
+ * This will cause snapshots after resize to
+ * shrink to the size of @start snapshot.
+ */
+ cow_bh = NULL;
+ cond_resched();
+ }
+
+ err = extend_or_restart_transaction(handle,
+ EXT4_MAX_TRANS_DATA);
+ if (err)
+ goto out_err;
+
+ err = ext4_snapshot_shrink_range(handle, start, end,
+ SNAPSHOT_IBLOCK(block), count,
+ cow_bh);
+
+ snapshot_debug(3, "snapshot (%u-%u) shrink: "
+ "block = 0x%llx, count = 0x%lx, err = 0x%x\n",
+ start->i_generation, end->i_generation,
+ block, count, err);
+
+ if (buffer_mapped(&cow_bitmap) && buffer_new(&cow_bitmap)) {
+ snapshot_debug(2, "snapshot (%u-%u) shrink: "
+ "block group = %ld/%u, "
+ "COW bitmap = [%llu/%llu]\n",
+ start->i_generation, end->i_generation,
+ block_group, sbi->s_groups_count,
+ SNAPSHOT_BLOCK_TUPLE(cow_bitmap.b_blocknr));
+ clear_buffer_new(&cow_bitmap);
+ }
+
+ if (err <= 0)
+ goto out_err;
+
+ block += err;
+ count -= err;
+ }
+
+ /* marks need_shrink snapshots shrunk */
+ err = extend_or_restart_transaction(handle, need_shrink);
+ if (err)
+ goto out_err;
+
+ /* iterate on (@start < snapshot < @end) */
+ list_for_each_prev(l, &EXT4_I(start)->i_snaplist) {
+ struct inode *inode;
+ struct ext4_iloc iloc;
+
+ if (l == &sbi->s_snapshot_list)
+ break;
+
+ inode = &list_entry(l, struct ext4_inode_info,
+ i_snaplist)->vfs_inode;
+ if (inode == end)
+ break;
+ /* reset i_size that was used as progress indicator */
+ SNAPSHOT_SET_DISABLED(inode);
+ if (ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_DELETED) &&
+ !(ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_SHRUNK) &&
+ ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_ACTIVE))) {
+ /* mark snapshot shrunk */
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ ext4_set_inode_flag(inode, EXT4_INODE_SNAPFILE_SHRUNK);
+ if (!err)
+ ext4_mark_iloc_dirty(handle, inode, &iloc);
+ if (--need_shrink <= 0)
+ break;
+ }
+ }
+
+ err = 0;
+out_err:
+ ret = ext4_journal_stop(handle);
+ if (!err)
+ err = ret;
+ if (need_shrink)
+ snapshot_debug(1, "snapshot (%u-%u) shrink: "
+ "need_shrink=%d(>0!), err=%d\n",
+ start->i_generation, end->i_generation,
+ need_shrink, err);
+ return err;
+}
+
+/*
* ext4_snapshot_cleanup - shrink/merge/remove snapshot marked for deletion
* @inode - inode in question
* @used_by - latest non-deleted snapshot
@@ -1403,6 +1618,23 @@ static int ext4_snapshot_cleanup(struct inode *inode, struct inode *used_by,
/* remove permanently unused deleted snapshot */
return ext4_snapshot_remove(inode);

+ if (deleted) {
+ /* deleted (non-active) snapshot file */
+ if (!ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_SHRUNK))
+ /* deleted snapshot needs shrinking */
+ (*need_shrink)++;
+ return 0;
+ }
+
+ /* non-deleted (or active) snapshot file */
+ if (*need_shrink) {
+ /* pass 1: shrink all deleted snapshots
+ * between 'used_by' and 'inode' */
+ err = ext4_snapshot_shrink(used_by, inode, *need_shrink);
+ if (err)
+ return err;
+ *need_shrink = 0;
+ }
return 0;
}

diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
index 3d6cb7b..391aa92 100644
--- a/fs/ext4/snapshot_inode.c
+++ b/fs/ext4/snapshot_inode.c
@@ -38,6 +38,226 @@

#include <trace/events/ext4.h>
#include "snapshot.h"
+
+/**
+ * ext4_blks_to_skip - count the number of blocks that can be skipped
+ * @inode: inode in question
+ * @i_block: start block number
+ * @maxblocks: max number of data blocks to be skipped
+ * @chain: chain of indirect blocks
+ * @depth: length of chain from inode to data block
+ * @offsets: array of offsets in chain blocks
+ * @k: number of allocated blocks in the chain
+ *
+ * Counts the number of non-allocated data blocks (holes) at offset @i_block.
+ * Called from ext4_snapshot_merge_blocks() and ext4_snapshot_shrink_blocks()
+ * under snapshot_mutex.
+ * Returns the total number of data blocks to be skipped.
+ */
+
+static int ext4_blks_to_skip(struct inode *inode, long i_block,
+ unsigned long maxblocks, Indirect chain[4], int depth,
+ int *offsets, int k)
+{
+ int ptrs = EXT4_ADDR_PER_BLOCK(inode->i_sb);
+ int ptrs_bits = EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb);
+ const long direct_blocks = EXT4_NDIR_BLOCKS,
+ indirect_blocks = ptrs,
+ double_blocks = (1 << (ptrs_bits * 2));
+ /* number of data blocks mapped with a single splice to the chain */
+ int data_ptrs_bits = ptrs_bits * (depth - k - 1);
+ int max_ptrs = maxblocks >> data_ptrs_bits;
+ int final = 0;
+ unsigned long count = 0;
+
+ switch (depth) {
+ case 4: /* triple indirect */
+ i_block -= double_blocks;
+ /* fall through */
+ case 3: /* double indirect */
+ i_block -= indirect_blocks;
+ /* fall through */
+ case 2: /* indirect */
+ i_block -= direct_blocks;
+ final = (k == 0 ? 1 : ptrs);
+ break;
+ case 1: /* direct */
+ final = direct_blocks;
+ break;
+ }
+ /* offset of block from start of splice point */
+ i_block &= ((1 << data_ptrs_bits) - 1);
+
+ count++;
+ while (count <= max_ptrs &&
+ offsets[k] + count < final &&
+ le32_to_cpu(*(chain[k].p + count)) == 0) {
+ count++;
+ }
+ /* number of data blocks mapped by 'count' splice points */
+ count <<= data_ptrs_bits;
+ count -= i_block;
+ return count < maxblocks ? count : maxblocks;
+}
+
+/*
+ * ext4_snapshot_shrink_blocks - free unused blocks from deleted snapshot
+ * @handle: JBD handle for this transaction
+ * @inode: inode we're shrinking
+ * @iblock: inode offset to first data block to shrink
+ * @maxblocks: inode range of data blocks to shrink
+ * @cow_bh: buffer head to map the COW bitmap block
+ * if NULL, don't look for COW bitmap block
+ * @shrink: shrink mode: 0 (don't free), >0 (free unused), <0 (free all)
+ * @pmapped: return no. of mapped blocks or 0 for skipped holes
+ *
+ * Frees @maxblocks blocks starting at offset @iblock in @inode, which are not
+ * 'in-use' by non-deleted snapshots (blocks 'in-use' are set in COW bitmap).
+ * If @shrink is false, just count mapped blocks and look for COW bitmap block.
+ * The first time that a COW bitmap block is found in @inode, whether @inode is
+ * deleted or not, it is stored in @cow_bh and is used in subsequent calls to
+ * this function with other deleted snapshots within the block group boundaries.
+ * Called from ext4_snapshot_shrink_range() under snapshot_mutex.
+ *
+ * Return values:
+ * >= 0 - no. of shrunk blocks (*@pmapped ? mapped blocks : skipped holes)
+ * < 0 - error
+ */
+int ext4_snapshot_shrink_blocks(handle_t *handle, struct inode *inode,
+ sector_t iblock, unsigned long maxblocks,
+ struct buffer_head *cow_bh,
+ int shrink, int *pmapped)
+{
+ int offsets[4];
+ Indirect chain[4], *partial;
+ int err, blocks_to_boundary, depth, count;
+ struct buffer_head *sbh = NULL;
+ struct ext4_group_desc *desc = NULL;
+ ext4_snapblk_t block_bitmap, block = SNAPSHOT_BLOCK(iblock);
+ unsigned long block_group = SNAPSHOT_BLOCK_GROUP(block);
+ int mapped_blocks = 0, freed_blocks = 0;
+ const char *cow_bitmap;
+
+ BUG_ON(shrink &&
+ (!(ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_DELETED)) ||
+ ext4_snapshot_is_active(inode)));
+
+ depth = ext4_block_to_path(inode, iblock, offsets,
+ &blocks_to_boundary);
+ if (depth == 0)
+ return -EIO;
+
+ desc = ext4_get_group_desc(inode->i_sb, block_group, NULL);
+ if (!desc)
+ return -EIO;
+ block_bitmap = ext4_block_bitmap(inode->i_sb, desc);
+ partial = ext4_get_branch(inode, depth, offsets, chain, &err);
+ if (err)
+ return err;
+
+ if (partial) {
+ /* block not mapped (hole) - count the number of holes to
+ * skip */
+ count = ext4_blks_to_skip(inode, iblock, maxblocks, chain,
+ depth, offsets, (partial - chain));
+ snapshot_debug(3, "skipping snapshot (%u) blocks: block=0x%llx"
+ ", count=0x%x\n", inode->i_generation,
+ block, count);
+ goto shrink_indirect_blocks;
+ }
+
+ /* data block mapped - check if data blocks should be freed */
+ partial = chain + depth - 1;
+ /* scan all blocks up to maxblocks/boundary */
+ count = 0;
+ while (count < maxblocks && count <= blocks_to_boundary) {
+ ext4_fsblk_t blk = le32_to_cpu(*(partial->p + count));
+ if (blk && block + count == block_bitmap &&
+ cow_bh && !buffer_mapped(cow_bh)) {
+ /*
+ * 'blk' is the COW bitmap physical block -
+ * store it in cow_bh for subsequent calls
+ * FIXME: for non-first flex_bg group,
+ * we will not find COW bitmap like this.
+ */
+ map_bh(cow_bh, inode->i_sb, blk);
+ set_buffer_new(cow_bh);
+ snapshot_debug(3, "COW bitmap #%lu: snapshot "
+ "(%u), bitmap_blk=(+%lld)\n",
+ block_group, inode->i_generation,
+ SNAPSHOT_BLOCK_GROUP_OFFSET(block_bitmap));
+ }
+ if (blk)
+ /* count mapped blocks in range */
+ mapped_blocks++;
+ else if (shrink >= 0)
+ /*
+ * Unless we are freeing all blocks in range,
+ * we cannot have holes inside mapped range
+ */
+ break;
+ /* count size of range */
+ count++;
+ }
+
+ if (!shrink)
+ goto done_shrinking;
+
+ cow_bitmap = NULL;
+ if (shrink > 0 && cow_bh && buffer_mapped(cow_bh)) {
+ /* we found COW bitmap - consult it when shrinking */
+ sbh = sb_bread(inode->i_sb, cow_bh->b_blocknr);
+ if (!sbh) {
+ err = -EIO;
+ goto cleanup;
+ }
+ cow_bitmap = sbh->b_data;
+ }
+ if (shrink < 0 || cow_bitmap) {
+ int bit = SNAPSHOT_BLOCK_GROUP_OFFSET(block);
+
+ BUG_ON(bit + count > SNAPSHOT_BLOCKS_PER_GROUP);
+ /* free blocks with or without consulting COW bitmap */
+ ext4_free_data_cow(handle, inode, partial->bh,
+ partial->p, partial->p + count,
+ cow_bitmap, bit, &freed_blocks);
+ }
+
+shrink_indirect_blocks:
+ /* check if the indirect block should be freed */
+ if (shrink && partial == chain + depth - 1) {
+ Indirect *ind = partial - 1;
+ __le32 *p = NULL;
+ if (freed_blocks == mapped_blocks &&
+ count > blocks_to_boundary) {
+ for (p = (__le32 *)(partial->bh->b_data);
+ !*p && p < partial->p; p++)
+ ;
+ }
+ if (p == partial->p)
+ /* indirect block maps zero data blocks - free it */
+ ext4_free_branches(handle, inode, ind->bh, ind->p,
+ ind->p+1, 1);
+ }
+
+done_shrinking:
+ snapshot_debug(3, "shrinking snapshot (%u) blocks: shrink=%d, "
+ "block=0x%llx, count=0x%x, mapped=0x%x, freed=0x%x\n",
+ inode->i_generation, shrink, block, count,
+ mapped_blocks, freed_blocks);
+
+ if (pmapped)
+ *pmapped = mapped_blocks;
+ err = count;
+cleanup:
+ while (partial > chain) {
+ brelse(partial->bh);
+ partial--;
+ }
+ brelse(sbh);
+return err;
+}
+
/*
* ext4_snapshot_get_block_access() - called from ext4_snapshot_read_through()
* on snapshot file access.
--
1.7.4.1
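
For readers following the 'shrink mode' comment in
ext4_snapshot_shrink_range(), the transitions can be modelled as a small
pure function. This is only a sketch: next_shrink_mode() is a hypothetical
name, and have_cow_bh stands in for the "cow_bh != NULL" test in the patch.

```c
#include <assert.h>

/*
 * Model of the shrink mode transitions:
 *   0  - 'don't free' (used for @start)
 *   >0 - 'free unused' (consult the COW bitmap)
 *   <0 - 'free all'
 * Returns the mode to use for the next snapshot in the group.
 */
static int next_shrink_mode(int shrink, int have_cow_bh, int mapped)
{
	assert(mapped >= 0);
	if (shrink < 0)
		return shrink - 1;	/* stay in 'free all' mode */
	if (!have_cow_bh)
		return -1;		/* no COW bitmap: 'free all' */
	if (mapped)
		return -1;		/* mapped blocks found: 'free all' */
	return shrink + 1;		/* enter/stay in 'free unused' */
}
```

Once the mode goes negative it never returns to 'free unused' within the
same range, which matches the one-way transition described in the comment.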


2011-06-07 15:10:35

by Amir G.

Subject: [PATCH v1 36/36] ext4: snapshot rocompat - enable rw mount

From: Amir Goldstein <[email protected]>

Enable read-write mount of a filesystem with the has_snapshot feature
only if ext4 was compiled with snapshot support.
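
The intended policy can be sketched in user space as follows. This is an
assumption-laden model, not the kernel function: feature_set_ok_model()
and MODEL_RO_COMPAT_HAS_SNAPSHOT are hypothetical names, the flag value is
illustrative only, and snapshot_support stands in for whether
CONFIG_EXT4_FS_SNAPSHOT was enabled at build time.

```c
#include <assert.h>

/* Illustrative flag value only; not taken from the patch. */
#define MODEL_RO_COMPAT_HAS_SNAPSHOT 0x0080u

/*
 * Returns 1 if the mount may proceed, 0 if it must be refused.
 * A read-only mount is always allowed by this check; a read-write
 * mount of a has_snapshot filesystem needs snapshot support.
 */
static int feature_set_ok_model(unsigned int ro_compat, int readonly,
				int snapshot_support)
{
	assert(readonly == 0 || readonly == 1);
	if (readonly)
		return 1;
	if ((ro_compat & MODEL_RO_COMPAT_HAS_SNAPSHOT) && !snapshot_support)
		return 0;
	return 1;
}
```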


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 10 ++++++++++
2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 92d75c2..633a835 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1483,6 +1483,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_INCOMPAT_FLEX_BG)
#define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
+ EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT| \
EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7f15983..9a81828 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2693,6 +2693,16 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
"must run fsck -xy to make it writable.");
return 0;
}
+ } else if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT)) {
+ /*
+ * We get here when CONFIG_EXT4_FS_SNAPSHOT is not defined
+ * so EXT4_SNAPSHOTS(sb) is defined to (0)
+ */
+ ext4_msg(sb, KERN_ERR,
+ "Filesystem with has_snapshot feature cannot be "
+ "mounted RDWR without CONFIG_EXT4_FS_SNAPSHOT");
+ return 0;
}
return 1;
}
--
1.7.4.1


2011-06-07 15:10:27

by Amir G.

Subject: [PATCH v1 33/36] ext4: snapshot cleanup

From: Amir Goldstein <[email protected]>

Clean up the snapshots list and reclaim unused blocks of deleted
snapshots. The oldest snapshot can be removed from the list and its
blocks can be freed. Non-oldest snapshots have to be shrunk and merged
before they can be removed from the list. All snapshot blocks must be
excluded in order to properly shrink/merge deleted old snapshots.
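
The rule above can be illustrated with a small user-space sketch. It
assumes the snapshots are visited from oldest to newest and ignores the
active/enabled distinctions; struct snap_model, enum cleanup_action and
plan_cleanup() are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

enum cleanup_action { KEEP, REMOVE, SHRINK_AND_MERGE };

/* Minimal model of a snapshot: only the 'deleted' mark matters here. */
struct snap_model {
	int deleted;
};

/*
 * snaps[] is ordered oldest to newest (an assumption of this sketch).
 * A deleted snapshot with no older non-deleted snapshot can be removed
 * right away; any other deleted snapshot must be shrunk and merged first.
 */
static void plan_cleanup(const struct snap_model *snaps, size_t n,
			 enum cleanup_action *plan)
{
	int seen_older_live = 0;
	size_t i;

	assert(n == 0 || (snaps && plan));
	for (i = 0; i < n; i++) {
		if (!snaps[i].deleted) {
			plan[i] = KEEP;
			seen_older_live = 1;
		} else if (!seen_older_live) {
			plan[i] = REMOVE;
		} else {
			plan[i] = SHRINK_AND_MERGE;
		}
	}
}
```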


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 16 ++++++++
fs/ext4/inode.c | 19 ++++++----
fs/ext4/snapshot_ctl.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 118 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 34aaade..6f0f310 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1737,6 +1737,12 @@ struct ext4_features {
struct completion f_kobj_unregister;
};

+typedef struct {
+ __le32 *p;
+ __le32 key;
+ struct buffer_head *bh;
+} Indirect;
+
/*
* Function prototypes
*/
@@ -1878,6 +1884,16 @@ extern void ext4_da_update_reserve_space(struct inode *inode,
/* snapshot_inode.c */
extern int ext4_snapshot_readpage(struct file *file, struct page *page);

+extern int ext4_block_to_path(struct inode *inode,
+ ext4_lblk_t i_block,
+ ext4_lblk_t offsets[4], int *boundary);
+extern Indirect *ext4_get_branch(struct inode *inode, int depth,
+ ext4_lblk_t *offsets,
+ Indirect chain[4], int *err);
+extern void ext4_free_branches(handle_t *handle, struct inode *inode,
+ struct buffer_head *parent_bh,
+ __le32 *first, __le32 *last,
+ int depth);
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 89a97da..5199035 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -176,6 +176,14 @@ int ext4_truncate_restart_trans(handle_t *handle, struct inode *inode,
*/
BUG_ON(EXT4_JOURNAL(inode) == NULL);
jbd_debug(2, "restarting handle %p\n", handle);
+ /*
+ * Snapshot shrink/merge/clean do not take i_data_sem, so we cannot
+ * release it here. Luckily, snapshot files are not writable,
+ * so deadlock with ext4_map_blocks on writepage is impossible.
+ * Snapshot files also don't have preallocations.
+ */
+ if (ext4_snapshot_file(inode))
+ return ext4_journal_restart(handle, nblocks);
up_write(&EXT4_I(inode)->i_data_sem);
ret = ext4_journal_restart(handle, nblocks);
down_write(&EXT4_I(inode)->i_data_sem);
@@ -281,11 +289,6 @@ no_delete:
ext4_clear_inode(inode); /* We must guarantee clearing of inode... */
}

-typedef struct {
- __le32 *p;
- __le32 key;
- struct buffer_head *bh;
-} Indirect;

static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
{
@@ -324,7 +327,7 @@ static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
* get there at all.
*/

-static int ext4_block_to_path(struct inode *inode,
+int ext4_block_to_path(struct inode *inode,
ext4_lblk_t i_block,
ext4_lblk_t offsets[4], int *boundary)
{
@@ -440,7 +443,7 @@ static int __ext4_check_blockref(const char *function, unsigned int line,
* Need to be called with
* down_read(&EXT4_I(inode)->i_data_sem)
*/
-static Indirect *ext4_get_branch(struct inode *inode, int depth,
+Indirect *ext4_get_branch(struct inode *inode, int depth,
ext4_lblk_t *offsets,
Indirect chain[4], int *err)
{
@@ -4679,7 +4682,7 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
* stored as little-endian 32-bit) and updating @inode->i_blocks
* appropriately.
*/
-static void ext4_free_branches(handle_t *handle, struct inode *inode,
+void ext4_free_branches(handle_t *handle, struct inode *inode,
struct buffer_head *parent_bh,
__le32 *first, __le32 *last, int depth)
{
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 9e39c04..13048f5 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -1149,6 +1149,53 @@ out_err:
}

/*
+ * ext4_snapshot_clean() frees snapshot file blocks
+ * before removing snapshot file from snapshots list.
+ * Called from ext4_snapshot_remove() under snapshot_mutex.
+ *
+ * Returns 0 on success and < 0 on error.
+ */
+static int ext4_snapshot_clean(handle_t *handle, struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ int i;
+
+ if (!ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "ext4_snapshot_clean() called with "
+ "snapshot file (ino=%lu) not on list\n",
+ inode->i_ino);
+ return -EINVAL;
+ }
+
+ if (ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_ACTIVE)) {
+ snapshot_debug(1, "clean of active snapshot (%u) "
+ "is not allowed.\n",
+ inode->i_generation);
+ return -EPERM;
+ }
+
+ /*
+ * A very simplified version of ext4_truncate() for snapshot files.
+ * A non-active snapshot file never allocates new blocks and only frees
+ * blocks under snapshot_mutex, so no need to take truncate_mutex here.
+ * No need to add inode to orphan list for post crash truncate, because
+ * snapshot is still on the snapshot list and marked for deletion.
+ * Free DIND branch last, to keep snapshot's super block around longer.
+ */
+ for (i = EXT4_SNAPSHOT_N_BLOCKS - 1; i >= EXT4_DIND_BLOCK; i--) {
+ int depth = (i == EXT4_DIND_BLOCK ? 2 : 3);
+ int j = i%EXT4_N_BLOCKS;
+
+ if (!ei->i_data[j])
+ continue;
+ ext4_free_branches(handle, inode, NULL,
+ ei->i_data+j, ei->i_data+j+1, depth);
+ ei->i_data[j] = 0;
+ }
+ return 0;
+}
+
+/*
* ext4_snapshot_enable() enables snapshot mount
* sets the in-use flag and the active snapshot
* Called under i_mutex and snapshot_mutex
@@ -1277,6 +1324,17 @@ static int ext4_snapshot_remove(struct inode *inode)
}
sbi = EXT4_SB(inode->i_sb);

+ /* free snapshot inode blocks */
+ err = ext4_snapshot_clean(handle, inode);
+ if (err)
+ goto out_handle;
+
+ /* reset i_size and i_disksize and invalidate page cache */
+ SNAPSHOT_SET_REMOVED(inode);
+
+ err = ext4_mark_inode_dirty(handle, inode);
+ if (err)
+ goto out_handle;

err = extend_or_restart_transaction_inode(handle, inode, 2);
if (err)
@@ -1321,6 +1379,34 @@ out_err:
}

/*
+ * ext4_snapshot_cleanup - shrink/merge/remove snapshot marked for deletion
+ * @inode - inode in question
+ * @used_by - latest non-deleted snapshot
+ * @deleted - true if snapshot is marked for deletion and not active
+ * @need_shrink - counter of deleted snapshots to shrink
+ * @need_merge - counter of deleted snapshots to merge
+ *
+ * Deleted snapshot with no older non-deleted snapshot - remove from list
+ * Deleted snapshot with no older enabled snapshot - add to merge count
+ * Deleted snapshot with older enabled snapshot - add to shrink count
+ * Non-deleted snapshot - shrink and merge deleted snapshots group
+ *
+ * Called from ext4_snapshot_update() under snapshot_mutex.
+ * Returns 0 on success and <0 on error.
+ */
+static int ext4_snapshot_cleanup(struct inode *inode, struct inode *used_by,
+ int deleted, int *need_shrink, int *need_merge)
+{
+ int err = 0;
+
+ if (deleted && !used_by)
+ /* remove permanently unused deleted snapshot */
+ return ext4_snapshot_remove(inode);
+
+ return 0;
+}
+
+/*
* Snapshot constructor/destructor
*/
/*
@@ -1462,6 +1548,8 @@ int ext4_snapshot_update(struct super_block *sb, int cleanup, int read_only)
int found_active = 0;
int found_enabled = 0;
struct list_head *prev;
+ int need_shrink = 0;
+ int need_merge = 0;
int err = 0;

BUG_ON(read_only && cleanup);
@@ -1521,9 +1609,9 @@ update_snapshot:
/* snapshot is not in use by older enabled snapshots */
ext4_clear_inode_snapstate(inode, EXT4_SNAPSTATE_INUSE);

- if (cleanup && deleted && !used_by)
- /* remove permanently unused deleted snapshot */
- err = ext4_snapshot_remove(inode);
+ if (cleanup)
+ err = ext4_snapshot_cleanup(inode, used_by, deleted,
+ &need_shrink, &need_merge);

if (!deleted) {
if (!found_active)
--
1.7.4.1


2011-06-07 15:10:32

by Amir G.

Subject: [PATCH v1 35/36] ext4: snapshot cleanup - merge shrunk snapshots

From: Amir Goldstein <[email protected]>

Move blocks of deleted and shrunk snapshots to an older non-deleted
and disabled snapshot. Merging helps remove snapshots from the list
while older snapshots are not currently in use (disabled).
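
Conceptually, the merge moves block pointers from the deleted snapshot
into the older one. The sketch below is a simplified user-space model and
not the kernel implementation: merge_blocks_model() is a hypothetical
name, the arrays stand in for the mapped block pointers of the two
inodes, and it assumes (because shrinking has already run) that the
destination has a hole wherever the source is mapped. The model reports a
conflict instead of overwriting if that assumption does not hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Move every mapped pointer from @src to the same offset in @dst.
 * Returns the number of pointers moved, or -1 if @dst is already
 * mapped at an offset where @src is mapped (hypothetical handling).
 */
static long merge_blocks_model(uint32_t *src, uint32_t *dst, size_t n)
{
	long moved = 0;
	size_t i;

	assert(src && dst);
	for (i = 0; i < n; i++) {
		if (!src[i])
			continue;	/* hole in source, nothing to move */
		if (dst[i])
			return -1;	/* unexpected: both are mapped */
		dst[i] = src[i];
		src[i] = 0;
		moved++;
	}
	return moved;
}
```

After a successful run the source maps nothing in the merged range, which
is why the deleted snapshot can then be removed from the list.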


Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/snapshot.h | 3 +
fs/ext4/snapshot_ctl.c | 113 +++++++++++++++++
fs/ext4/snapshot_inode.c | 307 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 423 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index 53d4481..fafb38d 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -406,6 +406,9 @@ extern int ext4_snapshot_shrink_blocks(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks,
struct buffer_head *cow_bh,
int shrink, int *pmapped);
+extern int ext4_snapshot_merge_blocks(handle_t *handle,
+ struct inode *src, struct inode *dst,
+ sector_t iblock, unsigned long maxblocks);

/* tests if @inode is a snapshot file */
static inline int ext4_snapshot_file(struct inode *inode)
diff --git a/fs/ext4/snapshot_ctl.c b/fs/ext4/snapshot_ctl.c
index 710e157..6c8dc35 100644
--- a/fs/ext4/snapshot_ctl.c
+++ b/fs/ext4/snapshot_ctl.c
@@ -1594,6 +1594,107 @@ out_err:
}

/*
+ * ext4_snapshot_merge - merge deleted snapshots
+ * @handle: JBD handle for this transaction
+ * @start: latest non-deleted snapshot before deleted snapshots group
+ * @end: first non-deleted snapshot after deleted snapshots group
+ * @need_merge: no. of deleted snapshots in the group
+ *
+ * Move all blocks from deleted snapshots group starting after @start and
+ * ending before @end to @start snapshot. All moved blocks are 'in-use' by
+ * @start snapshot, because these deleted snapshots have already been shrunk
+ * (blocks 'in-use' are set in snapshot COW bitmap and not copied to snapshot).
+ * Called from ext4_snapshot_update() under snapshot_mutex.
+ * Returns 0 on success and <0 on error.
+ */
+static int ext4_snapshot_merge(struct inode *start, struct inode *end,
+ int need_merge)
+{
+ struct list_head *l, *n;
+ handle_t *handle = NULL;
+ struct ext4_sb_info *sbi = EXT4_SB(start->i_sb);
+ int err, ret;
+
+ snapshot_debug(3, "snapshot (%u-%u) merge: need_merge=%d\n",
+ start->i_generation, end->i_generation, need_merge);
+
+ /* iterate safe on (@start < snapshot < @end) */
+ list_for_each_prev_safe(l, n, &EXT4_I(start)->i_snaplist) {
+ struct inode *inode = &list_entry(l, struct ext4_inode_info,
+ i_snaplist)->vfs_inode;
+
+ ext4_fsblk_t block = 1; /* skip super block */
+ /* blocks beyond the size of @start are not in-use by @start */
+ int count = SNAPSHOT_BLOCKS(start) - block;
+
+ if (n == &sbi->s_snapshot_list || inode == end ||
+ !(ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_SHRUNK)))
+ break;
+
+ /* start large transaction that will be extended/restarted */
+ handle = ext4_journal_start(inode, EXT4_MAX_TRANS_DATA);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ while (count > 0) {
+ /* we modify one indirect block and the inode itself
+ * for both the source and destination inodes */
+ err = extend_or_restart_transaction(handle, 4);
+ if (err)
+ goto out_err;
+
+ err = ext4_snapshot_merge_blocks(handle, inode, start,
+ SNAPSHOT_IBLOCK(block), count);
+
+ snapshot_debug(3, "snapshot (%u) -> snapshot (%u) "
+ "merge: block = 0x%llx, count = 0x%x, "
+ "err = 0x%x\n", inode->i_generation,
+ start->i_generation, block, count, err);
+
+ if (err <= 0)
+ goto out_err;
+
+ block += err;
+ count -= err;
+ /* indicate merge progress via i_size */
+ SNAPSHOT_SET_PROGRESS(inode, block);
+ cond_resched();
+ }
+
+ /* reset i_size that was used as progress indicator */
+ SNAPSHOT_SET_DISABLED(inode);
+
+ err = ext4_journal_stop(handle);
+ handle = NULL;
+ if (err)
+ goto out_err;
+
+ /* we finished moving all blocks of interest from 'inode'
+ * into 'start' so it is now safe to remove 'inode' from the
+ * snapshots list forever */
+ err = ext4_snapshot_remove(inode);
+ if (err)
+ goto out_err;
+
+ if (--need_merge <= 0)
+ break;
+ }
+
+ err = 0;
+out_err:
+ if (handle) {
+ ret = ext4_journal_stop(handle);
+ if (!err)
+ err = ret;
+ }
+ if (need_merge)
+ snapshot_debug(1, "snapshot (%u-%u) merge: need_merge=%d(>0!), "
+ "err=%d\n", start->i_generation,
+ end->i_generation, need_merge, err);
+ return err;
+}
+
+/*
* ext4_snapshot_cleanup - shrink/merge/remove snapshot marked for deletion
* @inode - inode in question
* @used_by - latest non-deleted snapshot
@@ -1623,6 +1724,10 @@ static int ext4_snapshot_cleanup(struct inode *inode, struct inode *used_by,
if (!ext4_test_inode_flag(inode, EXT4_INODE_SNAPFILE_SHRUNK))
/* deleted snapshot needs shrinking */
(*need_shrink)++;
+ if (!ext4_test_inode_snapstate(inode, EXT4_SNAPSTATE_INUSE))
+ /* temporarily unused deleted
+ * snapshot needs merging */
+ (*need_merge)++;
return 0;
}

@@ -1635,6 +1740,14 @@ static int ext4_snapshot_cleanup(struct inode *inode, struct inode *used_by,
return err;
*need_shrink = 0;
}
+ if (*need_merge) {
+ /* pass 2: merge all shrunk snapshots
+ * between 'used_by' and 'inode' */
+ err = ext4_snapshot_merge(used_by, inode, *need_merge);
+ if (err)
+ return err;
+ *need_merge = 0;
+ }
return 0;
}

diff --git a/fs/ext4/snapshot_inode.c b/fs/ext4/snapshot_inode.c
index 391aa92..73defa8 100644
--- a/fs/ext4/snapshot_inode.c
+++ b/fs/ext4/snapshot_inode.c
@@ -259,6 +259,313 @@ return err;
}

/*
+ * ext4_snapshot_count_blocks - count blocks and verify that
+ * snapshot blocks are excluded.
+ * @inode: snapshot we are merging
+ * @block: first block to test
+ * @count: no. of blocks to test
+ * @pblocks: pointer to counter of blocks
+ *
+ * Return <0 on error or if blocks are not excluded.
+ */
+static int ext4_snapshot_count_blocks(struct inode *inode, ext4_fsblk_t block,
+ unsigned long count, int *pblocks)
+{
+ int err;
+
+ /* test that blocks are excluded and update blocks counter */
+ err = ext4_snapshot_test_excluded(inode, block, count);
+ if (err)
+ return err;
+ *pblocks += count;
+ return 0;
+}
+
+/**
+ * ext4_snapshot_count_data - count blocks on an array of data blocks
+ * and verify that snapshot blocks are excluded.
+ * @inode: snapshot we are merging
+ * @first: array of block numbers
+ * @last: points immediately past the end of array
+ * @pblocks: pointer to counter of branch blocks
+ *
+ * We accumulate contiguous runs of blocks to test they are excluded.
+ *
+ * Return <0 on error or if blocks are not excluded.
+ */
+static int ext4_snapshot_count_data(struct inode *inode,
+ __le32 *first, __le32 *last,
+ int *pblocks)
+{
+ ext4_fsblk_t block = 0; /* Starting block # of a run */
+ unsigned long count = 0; /* Number of blocks in the run */
+ __le32 *block_p = NULL; /* Pointer into inode/ind
+ corresponding to block */
+ ext4_fsblk_t nr; /* Current block # */
+ __le32 *p; /* Pointer into inode/ind
+ for current block */
+ int err = 0;
+
+ for (p = first; p < last; p++) {
+ nr = le32_to_cpu(*p);
+ if (nr) {
+ /* accumulate blocks to test if they're contiguous */
+ if (count == 0) {
+ block = nr;
+ block_p = p;
+ count = 1;
+ } else if (nr == block + count) {
+ count++;
+ } else {
+ err = ext4_snapshot_count_blocks(inode,
+ block, count, pblocks);
+ if (err)
+ return err;
+ block = nr;
+ block_p = p;
+ count = 1;
+ }
+ }
+ }
+
+ if (count > 0)
+ err = ext4_snapshot_count_blocks(inode, block, count, pblocks);
+ return err;
+}
+
+/**
+ * ext4_snapshot_count_branches - count blocks on an array of branches
+ * and verify that snapshot blocks are excluded.
+ * @inode: snapshot we are merging
+ * @first: array of block numbers
+ * @last: pointer immediately past the end of array
+ * @depth: depth of the branches to count
+ * @pblocks: pointer to counter of branch blocks
+ *
+ * Return <0 on error or if blocks are not excluded.
+ */
+static int ext4_snapshot_count_branches(struct inode *inode,
+ __le32 *first, __le32 *last, int depth,
+ int *pblocks)
+{
+ ext4_fsblk_t nr;
+ __le32 *p;
+ int err = 0;
+
+ if (depth--) {
+ struct buffer_head *bh;
+ int addr_per_block = EXT4_ADDR_PER_BLOCK(inode->i_sb);
+ p = last;
+ while (--p >= first) {
+ nr = le32_to_cpu(*p);
+ if (!nr)
+ continue; /* A hole */
+
+ if (!ext4_data_block_valid(EXT4_SB(inode->i_sb),
+ nr, 1)) {
+ EXT4_ERROR_INODE(inode,
+ "invalid indirect mapped "
+ "block %lu (level %d)",
+ (unsigned long) nr, depth);
+ break;
+ }
+
+ /* Go read the buffer for the next level down */
+ bh = sb_bread(inode->i_sb, nr);
+
+ /*
+ * A read failure? Report error and skip this branch
+ * (should be rare).
+ */
+ if (!bh) {
+ EXT4_ERROR_INODE_BLOCK(inode, nr,
+ "Read failure");
+ continue;
+ }
+
+ /* This counts the entire branch. Bottom up. */
+ BUFFER_TRACE(bh, "count child branches");
+ err = ext4_snapshot_count_branches(inode,
+ (__le32 *)bh->b_data,
+ (__le32 *)bh->b_data + addr_per_block,
+ depth, pblocks);
+ if (err)
+ break;
+ /* Count the parent block */
+ err = ext4_snapshot_count_blocks(inode, nr, 1, pblocks);
+ if (err)
+ break;
+ }
+ } else {
+ /* We have reached the bottom of the tree. */
+ BUFFER_TRACE(parent_bh, "count data blocks");
+ err = ext4_snapshot_count_data(inode, first, last, pblocks);
+ }
+ return err;
+}
+
+/*
+ * ext4_move_branches - move an array of branches
+ * @handle: JBD handle for this transaction
+ * @src: inode we're moving blocks from
+ * @ps: array of src block numbers
+ * @pd: array of dst block numbers
+ * @depth: depth of the branches to move
+ * @count: max branches to move
+ * @pmoved: pointer to counter of moved blocks
+ *
+ * We move whole branches from src to dst, skipping the holes in src
+ * and stopping at the first branch that needs to be merged at higher level.
+ * Called from ext4_snapshot_merge_blocks() under snapshot_mutex.
+ * Returns the number of merged branches.
+ * Return <0 on error or if blocks are not excluded.
+ */
+static int ext4_move_branches(handle_t *handle, struct inode *src,
+ __le32 *ps, __le32 *pd, int depth,
+ int count, int *pmoved)
+{
+ int i, err;
+
+ for (i = 0; i < count; i++, ps++, pd++) {
+ __le32 s = *ps, d = *pd;
+ if (s && d && depth)
+ /* can't move or skip entire branch, need to merge
+ these 2 branches */
+ break;
+ if (!s || d)
+ /* skip holes in src and mapped data blocks in dst */
+ continue;
+
+ /* count moved blocks (and verify they are excluded) */
+ err = ext4_snapshot_count_branches(src, ps, ps+1, depth,
+ pmoved);
+ if (err)
+ return err;
+
+ /* move the entire branch from src to dst inode */
+ *pd = s;
+ *ps = 0;
+ }
+ return i;
+}
+
+/*
+ * ext4_snapshot_merge_blocks - merge blocks from @src to @dst inode
+ * @handle: JBD handle for this transaction
+ * @src: inode we're merging blocks from
+ * @dst: inode we're merging blocks to
+ * @iblock: inode offset to first data block to merge
+ * @maxblocks: inode range of data blocks to merge
+ *
+ * Merges @maxblocks data blocks starting at @iblock and all the indirect
+ * blocks that map them.
+ * Called from ext4_snapshot_merge() under snapshot_mutex.
+ * Returns the merged blocks range and <0 on error.
+ */
+int ext4_snapshot_merge_blocks(handle_t *handle,
+ struct inode *src, struct inode *dst,
+ sector_t iblock, unsigned long maxblocks)
+{
+ Indirect S[4], D[4], *pS, *pD;
+ int offsets[4];
+ int ks, kd, depth, count;
+ int ptrs = EXT4_ADDR_PER_BLOCK(src->i_sb);
+ int ptrs_bits = EXT4_ADDR_PER_BLOCK_BITS(src->i_sb);
+ int data_ptrs_bits, data_ptrs_mask, max_ptrs;
+ int moved = 0, err;
+
+ depth = ext4_block_to_path(src, iblock, offsets, NULL);
+ if (depth < 3)
+ /* snapshot blocks are mapped with double and triple
+ indirect blocks */
+ return -1;
+
+ memset(D, 0, sizeof(D));
+ memset(S, 0, sizeof(S));
+ pD = ext4_get_branch(dst, depth, offsets, D, &err);
+ kd = (pD ? pD - D : depth - 1);
+ if (err)
+ goto out;
+ pS = ext4_get_branch(src, depth, offsets, S, &err);
+ ks = (pS ? pS - S : depth - 1);
+ if (err)
+ goto out;
+
+ if (ks < 1 || kd < 1) {
+ /* snapshot double and triple tree roots are pre-allocated */
+ err = -EIO;
+ goto out;
+ }
+
+ if (ks < kd) {
+ /* nothing to move from src to dst */
+ count = ext4_blks_to_skip(src, iblock, maxblocks,
+ S, depth, offsets, ks);
+ snapshot_debug(3, "skipping src snapshot (%u) holes: "
+ "block=0x%llx, count=0x%x\n", src->i_generation,
+ SNAPSHOT_BLOCK(iblock), count);
+ err = count;
+ goto out;
+ }
+
+ /* move branches from level kd in src to dst */
+ pS = S+kd;
+ pD = D+kd;
+
+ /* compute max branches that can be moved */
+ data_ptrs_bits = ptrs_bits * (depth - kd - 1);
+ data_ptrs_mask = (1 << data_ptrs_bits) - 1;
+ max_ptrs = (maxblocks >> data_ptrs_bits) + 1;
+ if (max_ptrs > ptrs-offsets[kd])
+ max_ptrs = ptrs-offsets[kd];
+
+ /* get write access for the splice point */
+ err = ext4_journal_get_write_access_inode(handle, src, pS->bh);
+ if (err)
+ goto out;
+ err = ext4_journal_get_write_access_inode(handle, dst, pD->bh);
+ if (err)
+ goto out;
+
+ /* move as many whole branches as possible */
+ err = ext4_move_branches(handle, src, pS->p, pD->p, depth-1-kd,
+ max_ptrs, &moved);
+ if (err < 0)
+ goto out;
+ count = err;
+ if (moved) {
+ snapshot_debug(3, "moved snapshot (%u) -> snapshot (%u) "
+ "branches: block=0x%llx, count=0x%x, k=%d/%d, "
+ "moved_blocks=%d\n", src->i_generation,
+ dst->i_generation, SNAPSHOT_BLOCK(iblock),
+ count, kd, depth, moved);
+ /* update src and dst inodes blocks usage */
+ dquot_free_block(src, moved);
+ dquot_alloc_block_nofail(dst, moved);
+ err = ext4_handle_dirty_metadata(handle, NULL, pD->bh);
+ if (err)
+ goto out;
+ err = ext4_handle_dirty_metadata(handle, NULL, pS->bh);
+ if (err)
+ goto out;
+ }
+
+ /* we merged at least 1 partial branch and optionally count-1 full
+ branches */
+ err = (count << data_ptrs_bits) -
+ (SNAPSHOT_BLOCK(iblock) & data_ptrs_mask);
+out:
+ /* count_branch_blocks may use the entire depth of S */
+ for (ks = 1; ks < depth; ks++) {
+ if (S[ks].bh)
+ brelse(S[ks].bh);
+ if (ks <= kd)
+ brelse(D[ks].bh);
+ }
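+ /* keep a negative err; comparing int with unsigned long would hide it */
+ return (err < 0 || (unsigned long)err < maxblocks) ? err : (int)maxblocks;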
+}
+
+/*
* ext4_snapshot_get_block_access() - called from ext4_snapshot_read_through()
* on snapshot file access.
* return value <0 indicates access not granted
--
1.7.4.1
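
For readers following the patch, the run accumulation described in the
ext4_snapshot_count_data() comment ("We accumulate contiguous runs of
blocks") can be modelled in user space roughly as below. This is only a
sketch, not the kernel code: collect_runs() and struct run are hypothetical
names, plain uint32_t values stand in for the __le32 on-disk pointers, and
recording a run stands in for the ext4_snapshot_count_blocks() call that
tests the run against the exclude bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of physical block numbers (hypothetical helper type). */
struct run {
	uint32_t start;		/* first block number of the run */
	unsigned long len;	/* number of blocks in the run */
};

/*
 * Walk the pointer array [first, last) and record each contiguous run in
 * out[]. Zero entries are holes; they are skipped without ending the
 * current run, which mirrors the loop in the patch. Returns the number of
 * runs found. out[] must have room for (last - first) entries.
 */
static size_t collect_runs(const uint32_t *first, const uint32_t *last,
			   struct run *out)
{
	uint32_t block = 0;		/* starting block of the current run */
	unsigned long count = 0;	/* length of the current run */
	size_t nruns = 0;
	const uint32_t *p;

	assert(first && last && out && first <= last);

	for (p = first; p < last; p++) {
		uint32_t nr = *p;

		if (!nr)
			continue;		/* hole */
		if (count == 0) {
			block = nr;		/* first mapped block seen */
			count = 1;
		} else if (nr == block + count) {
			count++;		/* extends the current run */
		} else {
			out[nruns].start = block;	/* flush finished run */
			out[nruns].len = count;
			nruns++;
			block = nr;
			count = 1;
		}
	}
	if (count > 0) {
		out[nruns].start = block;	/* flush the trailing run */
		out[nruns].len = count;
		nruns++;
	}
	return nruns;
}
```

Batching this way means one exclude-bitmap test per run rather than one per
block, which is presumably why the patch bothers to accumulate runs.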

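The slot-by-slot decision in ext4_move_branches(), and the way
ext4_snapshot_merge_blocks() turns the returned branch count into a block
range, can be sketched in user space as follows. The names move_slots() and
merged_range() are hypothetical, the arrays stand in for the indirect block
contents, and "block" is assumed to be the value that SNAPSHOT_BLOCK(iblock)
yields in the patch. Unlike the patch, the sketch counts moved slots rather
than the blocks below each moved branch, and it does no journaling or
exclude-bitmap checks.

```c
#include <assert.h>
#include <stdint.h>

/*
 * src[] and dst[] hold the block pointers of the same tree level in the
 * two snapshots; 0 means a hole. depth is the number of levels below.
 * Returns the number of slots handled and adds moved slots to *moved.
 */
static int move_slots(uint32_t *src, uint32_t *dst, int depth, int count,
		      int *moved)
{
	int i;

	assert(src && dst && moved);

	for (i = 0; i < count; i++) {
		uint32_t s = src[i], d = dst[i];

		if (s && d && depth)
			break;		/* both mapped: merge one level down */
		if (!s || d)
			continue;	/* hole in src, or dst already mapped */
		dst[i] = s;		/* hand the whole branch over to dst */
		src[i] = 0;
		(*moved)++;
	}
	return i;
}

/*
 * Each handled slot at level kd covers 2^(ptrs_bits * (depth - kd - 1))
 * data blocks. The first slot is only partially covered when block does
 * not start on a branch boundary, so that offset is subtracted, and the
 * result is clamped to maxblocks as in the patch.
 */
static long merged_range(unsigned long block, int count, int ptrs_bits,
			 int depth, int kd, long maxblocks)
{
	int data_ptrs_bits = ptrs_bits * (depth - kd - 1);
	long data_ptrs_mask = (1L << data_ptrs_bits) - 1;
	long merged = ((long)count << data_ptrs_bits) -
		      (long)(block & (unsigned long)data_ptrs_mask);

	return merged < maxblocks ? merged : maxblocks;
}
```

As a worked case, with 4K blocks there are 1024 pointers per indirect block
(ptrs_bits = 10). For a double indirect mapping (depth = 3) spliced at
kd = 1, two handled branches starting 100 blocks into the first one cover
2 * 1024 - 100 = 1948 data blocks.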

2011-06-07 15:56:59

by Lukas Czerner

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

Hi Amir,

thanks very much for the resend. I'll take a look at the whole patch
series, but first I want to bring up one important thing.

While this is a huge feature for ext4 (regardless of how
intrusive it is for the usual code paths), and while we already have
patches on the list with people interested in looking into them, you
should clearly state what the gain is, what the use case is (and
I know you have one), and why it is better than other approaches. You
know, advertise it a bit in the marketing way :).

There is some confusion among developers about what the benefits of
ext4 snapshots actually are in comparison to btrfs, or in comparison to
the new dm_multisnap code. I know that you have done quite a lot of
testing to ensure that it does not change old ext4 behavior when
snapshots are disabled, and that it works well when enabled, but have
you done any performance-related benchmarks ? Do you have any
expectations of how it should behave under different workloads ?

It would be great to see and be able to confirm that ext4 snapshots are
really a win, not only on the feature side, but on the performance side
as well. I know that there are people out there still undecided or
having a strange feeling about your snapshot work. But who can blame
them, when we have not seen any hard data on this matter ?

So I, for myself, and I believe there are others, would like to see some
benchmark numbers and a comparison (both features and performance) with at
least the new dm-multisnap code and probably btrfs and plain ext4 as well.

Thanks!
-Lukas


On Tue, 7 Jun 2011, [email protected] wrote:

> Hi All,
>
> I am resending the snapshots patch series as per Lukas's request.
> This time, the snapshot*.c files have not been omitted, as in
> the previous posting.
>
> The series is still based on ext4 dev branch sometime in the preparation
> for 3.0 merge window. It was not yet rebased on 3.0-rc1, so punch holes
> changes have not been addressed yet.
>
> As always, I advocate online review of the patches at:
> https://github.com/amir73il/ext4-snapshots/commits/for-ext4-v1
> but if you insist on doing it the old way, I won't complain.
>
> Thanks,
> Amir.
>
> [PATCH v1 01/36] ext4: EXT4 snapshots (Experimental)
> [PATCH v1 02/36] ext4: snapshot debugging support
> [PATCH v1 03/36] ext4: snapshot hooks - inside JBD hooks
> [PATCH v1 04/36] ext4: snapshot hooks - block bitmap access
> [PATCH v1 05/36] ext4: snapshot hooks - delete blocks
> [PATCH v1 06/36] ext4: snapshot hooks - move data blocks
> [PATCH v1 07/36] ext4: snapshot hooks - direct I/O
> [PATCH v1 08/36] ext4: snapshot hooks - move extent file data blocks
> [PATCH v1 09/36] ext4: snapshot file
> [PATCH v1 10/36] ext4: snapshot file - read through to block device
> [PATCH v1 11/36] ext4: snapshot file - permissions
> [PATCH v1 12/36] ext4: snapshot file - store on disk
> [PATCH v1 13/36] ext4: snapshot file - increase maximum file size limit to 16TB
> [PATCH v1 14/36] ext4: snapshot block operations
> [PATCH v1 15/36] ext4: snapshot block operation - copy blocks to snapshot
> [PATCH v1 16/36] ext4: snapshot block operation - move blocks to snapshot
> [PATCH v1 17/36] ext4: snapshot block operation - copy block bitmap to snapshot
> [PATCH v1 18/36] ext4: snapshot control
> [PATCH v1 19/36] ext4: snapshot control - init new snapshot
> [PATCH v1 20/36] ext4: snapshot control - fix new snapshot
> [PATCH v1 21/36] ext4: snapshot control - reserve disk space for snapshot
> [PATCH v1 22/36] ext4: snapshot journaled - increase transaction credits
> [PATCH v1 23/36] ext4: snapshot journaled - implement journal_release_buffer()
> [PATCH v1 24/36] ext4: snapshot journaled - bypass to save credits
> [PATCH v1 25/36] ext4: snapshot journaled - cache last COW tid in journal_head
> [PATCH v1 26/36] ext4: snapshot journaled - trace COW/buffer credits
> [PATCH v1 27/36] ext4: snapshot list support
> [PATCH v1 28/36] ext4: snapshot list - read through to previous snapshot
> [PATCH v1 29/36] ext4: snapshot race conditions - concurrent COW bitmap operations
> [PATCH v1 30/36] ext4: snapshot race conditions - concurrent COW operations
> [PATCH v1 31/36] ext4: snapshot race conditions - tracked reads
> [PATCH v1 32/36] ext4: snapshot exclude - the exclude bitmap
> [PATCH v1 33/36] ext4: snapshot cleanup
> [PATCH v1 34/36] ext4: snapshot cleanup - shrink deleted snapshots
> [PATCH v1 35/36] ext4: snapshot cleanup - merge shrunk snapshots
> [PATCH v1 36/36] ext4: snapshot rocompat - enable rw mount
>
> fs/ext4/Kconfig | 11 +
> fs/ext4/Makefile | 3 +
> fs/ext4/balloc.c | 132 +++
> fs/ext4/ext4.h | 188 ++++-
> fs/ext4/ext4_jbd2.c | 162 ++++-
> fs/ext4/ext4_jbd2.h | 266 ++++++-
> fs/ext4/extents.c | 157 ++++-
> fs/ext4/file.c | 11 +-
> fs/ext4/ialloc.c | 19 +-
> fs/ext4/inode.c | 668 +++++++++++++--
> fs/ext4/ioctl.c | 120 +++
> fs/ext4/mballoc.c | 161 ++++-
> fs/ext4/move_extent.c | 3 +-
> fs/ext4/namei.c | 9 +
> fs/ext4/resize.c | 19 +-
> fs/ext4/snapshot.c | 1000 ++++++++++++++++++++++
> fs/ext4/snapshot.h | 690 ++++++++++++++++
> fs/ext4/snapshot_buffer.c | 393 +++++++++
> fs/ext4/snapshot_ctl.c | 2002 +++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/snapshot_debug.c | 107 +++
> fs/ext4/snapshot_debug.h | 105 +++
> fs/ext4/snapshot_inode.c | 960 ++++++++++++++++++++++
> fs/ext4/super.c | 157 ++++-
> fs/ext4/xattr.c | 4 +-
> 24 files changed, 7182 insertions(+), 165 deletions(-)
>
>

--

2011-06-07 16:32:00

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Tue, Jun 7, 2011 at 6:56 PM, Lukas Czerner <[email protected]> wrote:
> Hi Amir,
>
> thanks very much for the resend. I'll take a look at the whole patch
> series, but first I want to bring up one important thing.
>
> While this being a huge feature for ext4 (regardless on how
> intrusive it is for the usual code paths) and while we already have
> patches in the list with people interesting in looking into them, you
> should clearly clarify what is the gain of it, what is the use case (and
> I know you have one), and why it is better than other approaches. You
> know, advertise it a bit in the marketing way :).

Hi Lukas,

Thank you for pointing out the marketing aspect.

I must admit that my use case rather speaks for itself.
CTERA develops a NAS device which is specialized for
backing up local networks, and snapshots give the NAS a time
dimension without paying for it in disk space and performance.

The reason for not going with btrfs 3 years ago is clear.
So why not go with it now instead of moving forward to
ext4 with snapshots?
Part of the answer lies in the possibility to run fsck -x,
which gets rid of the snapshots in the case of fs corruption
and gets you back to good old stable and consistent ext4.

>
> There is some confusion among developers on what actually are benefits
> of ext4 snapshots in comparison to btrfs, or in comparison to the new
> dm_multisnap code. I know that you have done quite a lot of testing to
> assure that it does not actually change old ext4 behavior when snapshot
> disabled, and that it works well when enabled, but have you done any
> performance related benchmarks ? Do you have any expectations on how it
> should behave in different work loads ?
>
> It would be great to see and be able to confirm that ext4 snapshots are
> really a win, not only on the feature side, but on the performance side
> as well. I know that there are people out there still undecided or
> having a strange feeling about your snapshot work. But who can blame
> them, when we have not seen any hard data on this matter ?

Ehm.. I did present this benchmark on LSF:
http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560

unless you snoozed ;-)
It shows performance vs. ext4 w/o snapshots and with snapshots
and while taking snapshots.
I did not compare with btrfs, but I bet there are ext4 vs. btrfs
benchmarks out there.
dm-multisnap is better than dm-snap only when it comes to overhead
per snapshot. It still copies every written block, which is far from
being the case in ext4 snapshots.

>
> So I, for myself, and I believe there are others, would like to see some
> benchmark numbers and comparison (both, features and performance) with at
> least new dm-multisnap code and probably btrfs and plain ext4 as well.
>
> Thanks!
> -Lukas
>
>
> On Tue, 7 Jun 2011, [email protected] wrote:
>
>> Hi All,
>>
>> I am resending the snapshots patch series as per Lukas's request.
>> This time, the snapshot*.c files have not been omitted, as in
>> the previous posting.
>>
>> The series is still based on ext4 dev branch sometime in the preparation
>> for 3.0 merge window. It was not yet rebased on 3.0-rc1, so punch holes
>> changes have not been addressed yet.
>>
>> As always, I advocate online review of the patches at:
>> https://github.com/amir73il/ext4-snapshots/commits/for-ext4-v1
>> but if you insist on doing it the old way, I won't complain.
>>
>> Thanks,
>> Amir.
>>
>> [PATCH v1 01/36] ext4: EXT4 snapshots (Experimental)
>> [PATCH v1 02/36] ext4: snapshot debugging support
>> [PATCH v1 03/36] ext4: snapshot hooks - inside JBD hooks
>> [PATCH v1 04/36] ext4: snapshot hooks - block bitmap access
>> [PATCH v1 05/36] ext4: snapshot hooks - delete blocks
>> [PATCH v1 06/36] ext4: snapshot hooks - move data blocks
>> [PATCH v1 07/36] ext4: snapshot hooks - direct I/O
>> [PATCH v1 08/36] ext4: snapshot hooks - move extent file data blocks
>> [PATCH v1 09/36] ext4: snapshot file
>> [PATCH v1 10/36] ext4: snapshot file - read through to block device
>> [PATCH v1 11/36] ext4: snapshot file - permissions
>> [PATCH v1 12/36] ext4: snapshot file - store on disk
>> [PATCH v1 13/36] ext4: snapshot file - increase maximum file size limit to 16TB
>> [PATCH v1 14/36] ext4: snapshot block operations
>> [PATCH v1 15/36] ext4: snapshot block operation - copy blocks to snapshot
>> [PATCH v1 16/36] ext4: snapshot block operation - move blocks to snapshot
>> [PATCH v1 17/36] ext4: snapshot block operation - copy block bitmap to snapshot
>> [PATCH v1 18/36] ext4: snapshot control
>> [PATCH v1 19/36] ext4: snapshot control - init new snapshot
>> [PATCH v1 20/36] ext4: snapshot control - fix new snapshot
>> [PATCH v1 21/36] ext4: snapshot control - reserve disk space for snapshot
>> [PATCH v1 22/36] ext4: snapshot journaled - increase transaction credits
>> [PATCH v1 23/36] ext4: snapshot journaled - implement journal_release_buffer()
>> [PATCH v1 24/36] ext4: snapshot journaled - bypass to save credits
>> [PATCH v1 25/36] ext4: snapshot journaled - cache last COW tid in journal_head
>> [PATCH v1 26/36] ext4: snapshot journaled - trace COW/buffer credits
>> [PATCH v1 27/36] ext4: snapshot list support
>> [PATCH v1 28/36] ext4: snapshot list - read through to previous snapshot
>> [PATCH v1 29/36] ext4: snapshot race conditions - concurrent COW bitmap operations
>> [PATCH v1 30/36] ext4: snapshot race conditions - concurrent COW operations
>> [PATCH v1 31/36] ext4: snapshot race conditions - tracked reads
>> [PATCH v1 32/36] ext4: snapshot exclude - the exclude bitmap
>> [PATCH v1 33/36] ext4: snapshot cleanup
>> [PATCH v1 34/36] ext4: snapshot cleanup - shrink deleted snapshots
>> [PATCH v1 35/36] ext4: snapshot cleanup - merge shrunk snapshots
>> [PATCH v1 36/36] ext4: snapshot rocompat - enable rw mount
>>
>>  fs/ext4/Kconfig           |   11 +
>>  fs/ext4/Makefile          |    3 +
>>  fs/ext4/balloc.c          |  132 +++
>>  fs/ext4/ext4.h            |  188 ++++-
>>  fs/ext4/ext4_jbd2.c       |  162 ++++-
>>  fs/ext4/ext4_jbd2.h       |  266 ++++++-
>>  fs/ext4/extents.c         |  157 ++++-
>>  fs/ext4/file.c            |   11 +-
>>  fs/ext4/ialloc.c          |   19 +-
>>  fs/ext4/inode.c           |  668 +++++++++++++--
>>  fs/ext4/ioctl.c           |  120 +++
>>  fs/ext4/mballoc.c         |  161 ++++-
>>  fs/ext4/move_extent.c     |    3 +-
>>  fs/ext4/namei.c           |    9 +
>>  fs/ext4/resize.c          |   19 +-
>>  fs/ext4/snapshot.c        | 1000 ++++++++++++++++++++++
>>  fs/ext4/snapshot.h        |  690 ++++++++++++++++
>>  fs/ext4/snapshot_buffer.c |  393 +++++++++
>>  fs/ext4/snapshot_ctl.c    | 2002 +++++++++++++++++++++++++++++++++++++++++++++
>>  fs/ext4/snapshot_debug.c  |  107 +++
>>  fs/ext4/snapshot_debug.h  |  105 +++
>>  fs/ext4/snapshot_inode.c  |  960 ++++++++++++++++++++++
>>  fs/ext4/super.c           |  157 ++++-
>>  fs/ext4/xattr.c           |    4 +-
>>  24 files changed, 7182 insertions(+), 165 deletions(-)
>>
>>
>
> --
>

2011-06-08 10:09:33

by Lukas Czerner

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Tue, 7 Jun 2011, Amir G. wrote:

> On Tue, Jun 7, 2011 at 6:56 PM, Lukas Czerner <[email protected]> wrote:
> > Hi Amir,
> >
> > thanks very much for the resend. I'll take a look at the whole patch
> > series, but first I want to bring up one important thing.
> >
> > While this being a huge feature for ext4 (regardless on how
> > intrusive it is for the usual code paths) and while we already have
> > patches in the list with people interesting in looking into them, you
> > should clearly clarify what is the gain of it, what is the use case (and
> > I know you have one), and why it is better than other approaches. You
> > know, advertise it a bit in the marketing way :).
>
> Hi Lukas,
>
> Thank you for pointing out the marketing aspect.
>
> I must admit that my user-case rather speaks for itself.
> CTERA develops a NAS device which is specialized for
> backing up local networks and snapshots gives the NAS a time
> dimension without paying for it in disk space and performance.
>
> The reason for not going with btrfs 3 years ago is clear.
> So why not go with it now instead of moving forward to
> ext4 with snapshots?
> Part of the answer lies in the possibility to run fsck -x,
> which gets rid of the snapshots in the case of fs corruption
> and gets you back to good old stable and consistent ext4.

But that is not even a real reason, is it ? When you need snapshots,
well, then you just need them and do not want to get rid of them. When fs
corruption appears, then it's bad in any case and fsck should be
able to more or less fix it.

So you're saying that when corruption appears, then you *have to* blast
all snapshots ? I am not sure how btrfs is going to deal with it, but it
does not seem like an advantage at all, so why are you presenting it as such ?

>
> >
> > There is some confusion among developers on what actually are benefits
> > of ext4 snapshots in comparison to btrfs, or in comparison to the new
> > dm_multisnap code. I know that you have done quite a lot of testing to
> > assure that it does not actually change old ext4 behavior when snapshot
> > disabled, and that it works well when enabled, but have you done any
> > performance related benchmarks ? Do you have any expectations on how it
> > should behave in different work loads ?
> >
> > It would be great to see and be able to confirm that ext4 snapshots are
> > really a win, not only on the feature side, but on the performance side
> > as well. I know that there are people out there still undecided or
> > having a strange feeling about your snapshot work. But who can blame
> > them, when we have not seen any hard data on this matter ?
>
> Ehm.. I did present this benchmark on LSF:
> http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
>
> unless you snoozed ;-)
> it shows performance vs. ext4 w/o snapshots and with snapshots
> and while taking snapshots.

I believe that you just missed the fact that not everyone attended LSF
and your lightning talk, but that's ok.

It seems to me that random writes are usually faster with your snapshot
code regardless of whether you use snapshots or not. Is that because of
non-snapshot-related changes you've made ?

Also random reads seem to be slower with snapshots. I suspect that
this is because of read through, so is the reason for the slowdown that it
was CPU bound ? I do not see any CPU utilization data.

The postmark results seem quite odd: it is actually a lot faster with
one snapshot and a lot slower with multiple snapshots. Do you have any
idea what is going on ?

> I did not compare with btrfs, but I bet there are ext4 vs. btrfs
> benchmarks out there.
> dm-multisnap is better than dm-snap only when it comes to overhead
> per snapshot. it still copies every written block, which is far from
> being the case in ext4 snapshots.

Nevertheless, I still have not seen any comparison with the other
snapshotting possibilities we have. Note that an ext4 to btrfs comparison
is not enough, because it does not tell us how the cost of snapshots in
ext4 (with vs. without snapshots) compares to the cost of snapshots in
btrfs (with vs. without snapshots). The reason for this is that btrfs
performance is very likely to scale up, but ext4 is pretty much done in
that matter and I do not expect any huge performance leaps in the future.

Also, rejecting dm-multisnap based on this statement is not enough, show
us some numbers.

I believe that it is not very convenient for you, because this feature
supports your business case and you do not necessarily want to find out
that there might be a better way, especially after the work you have
done already.

So it might be unpleasant for you that people ask questions and delay
the inclusion of ext4 snapshots. But what you see as obstacles people
are throwing at you is really just caution, especially when it comes to
ext4, which is seen as a simple, stable, reliable and predictable Linux
filesystem, but I bet you understand.

And one last note: changing the snapshot format in the future, when we
have snapshots compatible with the 64bit feature, seems just wrong to
me. Adding some features or changing the implementation a bit is ok,
but a format change is different. When the code is upstream and
stable it is just wrong.

Thanks!
-Lukas

>
> >
> > So I, for myself, and I believe there are others, would like to see some
> > benchmark numbers and comparison (both, features and performance) with at
> > least new dm-multisnap code and probably btrfs and plain ext4 as well.
> >
> > Thanks!
> > -Lukas
> >

2011-06-08 14:04:56

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Wed, Jun 8, 2011 at 1:09 PM, Lukas Czerner <[email protected]> wrote:
> On Tue, 7 Jun 2011, Amir G. wrote:
>
>> On Tue, Jun 7, 2011 at 6:56 PM, Lukas Czerner <[email protected]> wrote:
>> > Hi Amir,
>> >
>> > thanks very much for the resend. I'll take a look at the whole patch
>> > series, but first I want to bring up one important thing.
>> >
>> > While this being a huge feature for ext4 (regardless on how
>> > intrusive it is for the usual code paths) and while we already have
>> > patches in the list with people interesting in looking into them, you
>> > should clearly clarify what is the gain of it, what is the use case (and
>> > I know you have one), and why it is better than other approaches. You
>> > know, advertise it a bit in the marketing way :).
>>
>> Hi Lukas,
>>
>> Thank you for pointing out the marketing aspect.
>>
>> I must admit that my user-case rather speaks for itself.
>> CTERA develops a NAS device which is specialized for
>> backing up local networks and snapshots gives the NAS a time
>> dimension without paying for it in disk space and performance.
>>
>> The reason for not going with btrfs 3 years ago is clear.
>> So why not go with it now instead of moving forward to
>> ext4 with snapshots?
>> Part of the answer lies in the possibility to run fsck -x,
>> which gets rid of the snapshots in the case of fs corruption
>> and gets you back to good old stable and consistent ext4.
>
> But that is not even a real reason, is it ? When you need snapshots,
> well, then you just need it and do no want to get rid of it. When fs
> corruption appears, then it's bad in any case and the fsck should be
> able to more or less fix it.
>
> So you're saying that when corruption appears, then you *have to* blast
> all snapshots ? I am not sure how btrfs is going to deal with it, but it
> does seem like an advantage at all, why are you presenting it as such ?
>

Hi Lukas,

First of all, thank you for being strict with me.
I admit to having lousy marketing skills...

The market I am targeting is the sys admins who
are very cautious about their 'data' and are therefore
reluctant to migrate from ext3 to ext4, not to speak of
btrfs.

To this market I say, you can have snapshots of your
'data' on ext4 without risking the proven stability of ext4.
The snapshots of the 'data' are not guaranteed to be as
stable (being a new feature), but because the snapshots
are second to 'data' in ext4 snapshots, corrupted snapshots
will not risk the 'data'.

During 1 year of next3 in production systems, we found bugs.
But none of the bugs corrupted 'data'. For all of the bugs which
caused the file system to contain errors, the errors were restricted
to snapshot files, and in those worst cases we could always
go to emergency plan B (plan A being fsck -p) and run fsck -x,
which always solved the problem.

The customer was always consulted before resorting to 'plan B'
and was given the chance to copy out 'data' from the snapshots
(it was always possible) before we discarded them.

Needless to say, the said bugs were fixed and ext4 snapshots
will enjoy the stability of next3 and the 'fail safe' nature of the
solution, which was proven several times in the field.


>>
>> >
>> > There is some confusion among developers on what actually are benefits
>> > of ext4 snapshots in comparison to btrfs, or in comparison to the new
>> > dm_multisnap code. I know that you have done quite a lot of testing to
>> > assure that it does not actually change old ext4 behavior when snapshot
>> > disabled, and that it works well when enabled, but have you done any
>> > performance related benchmarks ? Do you have any expectations on how it
>> > should behave in different work loads ?
>> >
>> > It would be great to see and be able to confirm that ext4 snapshots are
>> > really a win, not only on the feature side, but on the performance side
>> > as well. I know that there are people out there still undecided or
>> > having a strange feeling about your snapshot work. But who can blame
>> > them, when we have not seen any hard data on this matter ?
>>
>> Ehm.. I did present this benchmark on LSF:
>> http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
>>
>> unless you snoozed ;-)
>> it shows performance vs. ext4 w/o snapshots and with snapshots
>> and while taking snapshots.
>
> I believe that you just missed the fact that not everyone has attended LSF
> and your lightning talk, but that's ok.

That's not really OK. I should have posted the results
and analysis on my wiki (the results are there).

>
> It seems to me that random writes are usually faster with your snapshot
> code regardless of whether you use snapshots or not. Is that because of
> non snapshot related changes you've made ?

Not that I know of.
I can explain why random write onesnap is faster than nosnap
and why 1snappermin is faster than onesnap, but I am not
sure about nosnap vs. plain ext4.

>
> Also random reads seem to be slower with snapshots, I suspect that
> this is because of read through, so is the reason for the slowdown that it
> was CPU bound ? I do not see any CPU utilization data.
>

Only the 1snappermin is slower.
I suspect it has to do with the fs freezes, but I admit I have not
looked into it.

> The postmark results seems quite odd, it is actually a lot faster with
> one snapshot and a lot slower with multiple snapshots, do you have an
> idea what is going on ?
>

The name onesnap is misleading. It should have been
existingsnaps.
The important factor is whether or not snapshots are taken during the test.
In the 1snappermin case, postmark is the only test that exposes the
weak spot of ext4 snapshots performance: deletes/truncates.
create file + delete file with existing snapshots has no overhead (no COW).
create file + take snapshot + delete file has the overhead of moving the
deleted blocks to the snapshot.
With regard to the speedup of onesnap, postmark randomizes the file
creates/writes, so it may be a similar effect to random write.
I did not investigate this.
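
To make the two postmark cases concrete, here is a toy model (not the
ext4 code). It assumes, as a simplification, that a deleted block has to
be moved to the snapshot only if it was already in use when the latest
snapshot was taken. The function and variable names are made up for
illustration.

```python
# Toy model of the delete-path cost; hypothetical names, not ext4 code.
# Assumption: a block must be preserved (moved to the snapshot) only if
# it was already in use when the most recent snapshot was taken.

def blocks_moved_on_delete(file_blocks, snapshot_blocks):
    """Return the blocks that must be moved to the snapshot when a file
    owning `file_blocks` is deleted.

    snapshot_blocks: set of block numbers that were in use when the
    latest snapshot was taken (empty set if there is no snapshot).
    """
    return sorted(b for b in file_blocks if b in snapshot_blocks)


new_file = [10, 11, 12]

# Case 1: a snapshot already exists, the file is created afterwards and
# then deleted. None of its blocks are part of the snapshot.
case_existing = blocks_moved_on_delete(new_file, {1, 2, 3})

# Case 2: the file is created, a snapshot is taken, then the file is
# deleted. All of its blocks are part of the snapshot.
case_during = blocks_moved_on_delete(new_file, {1, 2, 3, 10, 11, 12})

print(len(case_existing), len(case_during))
```

With these inputs the first case moves no blocks and the second moves
all three, which is the delete/truncate overhead that shows up in
1snappermin.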

>> I did not compare with btrfs, but I bet there are ext4 vs. btrfs
>> benchmarks out there.
>> dm-multisnap is better than dm-snap only when it comes to overhead
>> per snapshot. it still copies every written block, which is far from
>> being the case in ext4 snapshots.
>
> Nevertheless, I still have not seen any comparison with other
> snapshotting possibilities we have. Note that ext4 to btrfs comparison
> is not enough, because we do not know what is the difference between
> the difference of ext4 with/without snapshots and btrfs with/without
> snapshots. The reason for this is that btrfs performance is very likely
> to scale up, but ext4 is pretty much done in that matter and I do not
> expect any huge performance leaps in the future.
>
> Also, rejecting dm-multisnap based on this statement is not enough, show
> us some numbers.

Well, if you come to understand the difference between fs level and dm
level snapshots, you will see why I am rejecting dm-multisnap
(performance wise only!).

Anyway #1: I have already answered these questions 2 years ago and I
think the answers are still valid both for LVM and btrfs:
http://sourceforge.net/apps/mediawiki/next3/index.php?title=FAQ#Why_use_Next3_snapshots_and_not_LVM_snapshots.3F

Anyway #2: I need to give you some numbers ;-)

>
> I believe that it is not very convenient for you, because this feature
> support your business case and you do not necessarily want to find out
> that there might be a better way, especially after the work you have
> done already.

Your analysis of my motives is correct :-)
The use of the term 'better way' I reject.
I think that ext4/btrfs/LVM snapshots are apples and oranges and hamburgers.
The question of whether the world needs ext4 snapshots is
perfectly valid, but going back to the food analogy, I think it's
a case of "the proof of the pudding is in the eating".
I have no doubt that if ext4 snapshots are merged, many people will use it.
And I think that is a good enough (if not the best)
reason for inclusion.


>
> So it might be unpleasant for you that people ask questions and delaying
> the inclusion of ext4 snapshots. But what you see as obstacles people
> are throwing at you is really just caution, especially when it comes to
> ext4 which is seen as a simple, stable, reliable and predictable linux
> filesystem, but I bet you understand.
>

Yes, I understand. As evidence, I posted the "core patches"
to get them reviewed for "safety" and "stability" rather than
"functionality" (and that didn't work out well, but I understand that as well).

> And one last note, I also think that the snapshot format change in the
> future, when we'll have snapshots with 64bit feature compatible seems
> just wrong to me. Adding some features or changing the implementation a
> bit is ok, but format change is different. When the code is upstream and
> stable it is just wrong.

What can I say, I understand why it looks bad, but is 64bit code
upstream and stable? Hell no! e2fsprogs 64bit is not out yet!
There is no reason to call it 'format change'.
It's going to be a new format used only for 64bit fs, which are not
even out there yet. And when they are finally out there, they won't
have snapshots until the new format is implemented.

And more importantly, say I do implement a new 48bit logical
offsets file format, so my employer can provide snapshots on
>16TB volumes in future releases.
I will not recommend that my employer use this format on <16TB volumes,
because there is nothing wrong with staying with the simple and well
tested indirect mapped snapshot format in future releases.
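
For reference, the 16TB boundary falls out of simple arithmetic on the
width of a block number. The sketch below assumes a 4 KiB block size
(the thread does not state one) and uses a made-up helper name.

```python
# Back-of-the-envelope arithmetic; assumes 4 KiB blocks (an assumption,
# the thread does not state the block size).
BLOCK_SIZE = 4096
TIB = 1 << 40


def max_bytes(address_bits, block_size=BLOCK_SIZE):
    """Largest size addressable with `address_bits`-bit block numbers."""
    return (1 << address_bits) * block_size


limit_32 = max_bytes(32) // TIB   # 32-bit block numbers -> 16 TiB
limit_48 = max_bytes(48) // TIB   # 48-bit block numbers -> 1048576 TiB (1 EiB)
print(limit_32, limit_48)
```

So a format limited to 32-bit offsets covers volumes up to 16TB, and
anything larger needs the wider (48bit) logical offsets.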

Thanks for your time and patience,
Amir.

2011-06-08 14:41:45

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On 6/8/11 9:04 AM, Amir G. wrote:
>> And one last note, I also think that the snapshot format change in the
>> > future, when we'll have snapshots with 64bit feature compatible seems
>> > just wrong to me. Adding some features or changing the implementation a
>> > bit is ok, but format change is different. When the code is upstream and
>> > stable it is just wrong.
> What can I say, I understand why it looks bad, but is 64bit code
> upstream and stable? Hell no! e2fsprogs 64bit is not out yet!
> There is no reason to call it 'format change'.
> It's going to be a new format used only for 64bit fs, which are not
> even out there yet. And when they are finally out there, they won't
> have snapshots until the new format is implemented.

Well, the on-disk format for 64-bit (48-bit?) ext4 is there & fixed; it's
just that there is no released userspace which can properly handle it, right?

I don't anticipate ext4 format changes for >16T, or am I missing something?

-Eric

2011-06-08 15:01:46

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Wed, Jun 8, 2011 at 5:41 PM, Eric Sandeen <[email protected]> wrote:
> On 6/8/11 9:04 AM, Amir G. wrote:
>>> And one last note, I also think that the snapshot format change in the
>>> > future, when we'll have snapshots with 64bit feature compatible seems
>>> > just wrong to me. Adding some features or changing the implementation a
>>> > bit is ok, but format change is different. When the code is upstream and
>>> > stable it is just wrong.
>> What can I say, I understand why it looks bad, but is 64bit code
>> upstream and stable? Hell no! e2fsprogs 64bit is not out yet!
>> There is no reason to call it 'format change'.
>> It's going to be a new format used only for 64bit fs, which are not
>> even out there yet. And when they are finally out there, they won't
>> have snapshots until the new format is implemented.
>
> Well, the on-disk format for 64-bit (48-bit?) ext4 is there & fixed; it's
> just that there is no released userspace which can properly handle it, right?

I don't know, you tell me.
Are there many users out there using 64bit feature, without the proper
user space tools?

>
> I don't anticipate ext4 format changes for >16T, or am I missing something?
>
> -Eric
>

Argh! I wish I hadn't missed the Monday call (it's
not at a good time for me).
This whole 'format change' issue has gotten out of control
and I find it hard to present my case properly in scattered emails.

The message I am trying to get through is:
There is a 32bit snapshot file format, which is implemented and well tested.
There is no 64bit snapshot file format implemented yet, so the
64bit and snapshot features are mutually exclusive.
If and when a 64bit snapshot file format is implemented, it will be
a new type of extent mapped file (v2) with 48bit logical addresses.
Is this a 'format change'? Call it what you will, but it shouldn't
affect anything in existing structures. It should only affect the
not-yet-existing structure of the 64bit snapshot file.

Does this answer your question?

Amir.

2011-06-08 15:23:00

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On 6/8/11 10:01 AM, Amir G. wrote:
> On Wed, Jun 8, 2011 at 5:41 PM, Eric Sandeen <[email protected]> wrote:
>> On 6/8/11 9:04 AM, Amir G. wrote:
>>>> And one last note, I also think that the snapshot format change in the
>>>>> future, when we'll have snapshots with 64bit feature compatible seems
>>>>> just wrong to me. Adding some features or changing the implementation a
>>>>> bit is ok, but format change is different. When the code is upstream and
>>>>> stable it is just wrong.
>>> What can I say, I understand why it looks bad, but is 64bit code
>>> upstream and stable? Hell no! e2fsprogs 64bit is not out yet!
>>> There is no reason to call it 'format change'.
>>> It's going to be a new format used only for 64bit fs, which are not
>>> even out there yet. And when they are finally out there, they won't
>>> have snapshots until the new format is implemented.
>>
>> Well, the on-disk format for 64-bit (48-bit?) ext4 is there & fixed; it's
>> just that there is no released userspace which can properly handle it, right?
>
> I don't know, you tell me.
> Are there many users out there using 64bit feature, without the proper
> user space tools?

No, but that doesn't mean the disk format has to change when the tools
come out... I just don't want to confuse "there are no tools" with
"the disk format is unstable" - Andreas et al. have been using
that format for years.

>>
>> I don't anticipate ext4 format changes for >16T, or am I missing something?
>>
>> -Eric
>>
>
> Argh! I wish I hadn't missed the Monday call (it's
> not in a good time for me).
> This whole 'format change' has gone out of control
> and I find it hard to present my case properly on scattered emails.

Sorry; I may have just misunderstood...

> The message I am trying to get through is:
> There is 32bit snapshot file format, which is implemented and well tested.
> There is 64bit snapshot file format, which is not implemented yet, so
> 64bit and snapshot feature are mutually exclusive.
> If and when 64bit snapshot file format will be implemented, it will be
> a new type of extent mapped file (v2) with 48bit logical addresses.
> Is this a 'format change'? Call it what you will, but it shouldn't
> affect anything on existing structures. It should only affect the
> non-existing structure of 64bit snapshot file.
>
> Does this answer your question?

Yes, I guess I had misunderstood your point; I thought you were
implying that ext4's format had to change to support 64-bit, so why
not change snapshots along with it....

But you're just saying that you wish to push 32-bit snapshots which only
work with certain sizes of ext4 filesystems now, and later you will
release a new snapshot format which works with the larger filesystems.
Right?

(I don't actually know if we'll ever have 64-bit ext4, though, there
are still so many scaling issues beyond just being able to mkfs,
mount, growfs etc ... it's a serious game of catch-up with xfs
in that space, IMHO, which has been doing it well for years now...)

Still, pushing snapshots upstream which will have an on-disk format
more limited than the rest of the filesystem's on-disk format
does strike me as suboptimal from a pure technical design POV.

What if we proposed, say, xattr code that could only apply xattrs
to files located in the first 16T? I don't think it'd be accepted.

I understand that you have a history and a format and a business case,
but that really should not change whether we do it right the first time,
upstream, IMHO... But I'm just the peanut gallery, here.... ;)

-Eric

> Amir.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-06-08 15:33:25

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Wed, Jun 8, 2011 at 6:22 PM, Eric Sandeen <[email protected]> wrote:
> On 6/8/11 10:01 AM, Amir G. wrote:
>> On Wed, Jun 8, 2011 at 5:41 PM, Eric Sandeen <[email protected]> wrote:
>>> On 6/8/11 9:04 AM, Amir G. wrote:
>>>>> And one last note, I also think that the snapshot format change in the
>>>>>> future, when we'll have snapshots with 64bit feature compatible seems
>>>>>> just wrong to me. Adding some features or changing the implementation a
>>>>>> bit is ok, but format change is different. When the code is upstream and
>>>>>> stable it is just wrong.
>>>> What can I say, I understand why it looks bad, but is 64bit code
>>>> upstream and stable? Hell no! e2fsprogs 64bit is not out yet!
>>>> There is no reason to call it 'format change'.
>>>> It's going to be a new format used only for 64bit fs, which are not
>>>> even out there yet. And when they are finally out there, they won't
>>>> have snapshots until the new format is implemented.
>>>
>>> Well, the on-disk format for 64-bit (48-bit?) ext4 is there & fixed; it's
>>> just that there is no released userspace which can properly handle it, right?
>>
>> I don't know, you tell me.
>> Are there many users out there using 64bit feature, without the proper
>> user space tools?
>
> No, but that doesn't mean the disk format has to change when the tools
> come out... I just don't want to confuse "there are no tools" with
> "the disk format is unstable" - Andreas et al. have been using
> that format for years.
>
>>>
>>> I don't anticipate ext4 format changes for >16T, or am I missing something?
>>>
>>> -Eric
>>>
>>
>> Argh! I wish I hadn't missed the Monday call (it's
>> not in a good time for me).
>> This whole 'format change' has gone out of control
>> and I find it hard to present my case properly on scattered emails.
>
> Sorry; I may have just misunderstood...
>
>> The message I am trying to get through is:
>> There is 32bit snapshot file format, which is implemented and well tested.
>> There is 64bit snapshot file format, which is not implemented yet, so
>> 64bit and snapshot feature are mutually exclusive.
>> If and when 64bit snapshot file format will be implemented, it will be
>> a new type of extent mapped file (v2) with 48bit logical addresses.
>> Is this a 'format change'? Call it what you will, but it shouldn't
>> affect anything on existing structures. It should only affect the
>> non-existing structure of 64bit snapshot file.
>>
>> Does this answer your question?
>
> Yes, I guess I had misunderstood your point; I thought you were
> implying that ext4's format had to change to support 64-bit, so why
> not change snapshots along with it....
>
> But you're just saying that you wish to push 32-bit snapshots which only
> work with certain sizes of ext4 filesystems now, and later you will
> release a new snapshot format which works with the larger filesystems.
> Right?

Right. Where 'Larger filesystems' := 64bit block addresses.

>
> (I don't actually know if we'll ever have 64-bit ext4, though, there
> are still so many scaling issues beyond just being able to mkfs,
> mount, growfs etc ... it's a serious game of catch-up with xfs
> in that space, IMHO, which has been doing it well for years now...)

All the more reason to push a snapshot file format that works well
with 32bit ext4.

>
> Still, pushing snapshots upstream which will have an on-disk format
> more limited than the rest of the filesystem's on-disk format
> does strike me as suboptimal from a pure technical design POV.
>
> What if we proposed, say, xattr code that could only apply xattrs
> to files located in the first 16T? I don't think it'd be accepted.

That is not a correct analogy. The correct analogy is not supporting
xattrs on 64-bit ext4. Whether it makes sense or not for snapshots
depends IMHO on whether people find snapshots on 32bit ext4 only
useful or not.

I naturally think that people will find it useful.
Anyone can add snapshots to his existing 32-bit ext4.
No one can migrate the same fs to 64-bit...

>
> I understand that you have a history and a format and a business case,
> but that really should not change whether we do it right the first time,
> upstream, IMHO... But I'm just the peanut gallery, here.... ;)
>
> -Eric
>
>> Amir.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2011-06-08 15:39:09

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Wed, 8 Jun 2011, Amir G. wrote:

> On Wed, Jun 8, 2011 at 1:09 PM, Lukas Czerner <[email protected]> wrote:
> > On Tue, 7 Jun 2011, Amir G. wrote:
> >
> >> On Tue, Jun 7, 2011 at 6:56 PM, Lukas Czerner <[email protected]> wrote:
> >> > Hi Amir,
> >> >
> >> > thanks very much for the resend. I'll take a look at the whole patch
> >> > series, but first I want to bring up one important thing.
> >> >
> >> > While this being a huge feature for ext4 (regardless on how
> >> > intrusive it is for the usual code paths) and while we already have
> >> > patches in the list with people interested in looking into them, you
> >> > should clearly clarify what is the gain of it, what is the use case (and
> >> > I know you have one), and why it is better than other approaches. You
> >> > know, advertise it a bit in the marketing way :).
> >>
> >> Hi Lukas,
> >>
> >> Thank you for pointing out the marketing aspect.
> >>
> >> I must admit that my use case rather speaks for itself.
> >> CTERA develops a NAS device which is specialized for
> >> backing up local networks and snapshots gives the NAS a time
> >> dimension without paying for it in disk space and performance.
> >>
> >> The reason for not going with btrfs 3 years ago is clear.
> >> So why not go with it now instead of moving forward to
> >> ext4 with snapshots?
> >> Part of the answer lies in the possibility to run fsck -x,
> >> which gets rid of the snapshots in the case of fs corruption
> >> and gets you back to good old stable and consistent ext4.
> >
> > But that is not even a real reason, is it ? When you need snapshots,
> > well, then you just need it and do not want to get rid of it. When fs
> > corruption appears, then it's bad in any case and the fsck should be
> > able to more or less fix it.
> >
> > So you're saying that when corruption appears, then you *have to* blast
> > all snapshots ? I am not sure how btrfs is going to deal with it, but it
> > does not seem like an advantage at all, why are you presenting it as such ?
> >
>
> Hi Lukas,
>
> First of all, thank you for being strict with me.
> I admit to having lousy marketing skills...
>
> The market I am targeting is the sys admins who
> are very cautious about their 'data' and are therefore
> reluctant to migrate from ext3 to ext4, not to speak of
> btrfs.

Well, that's why I am concerned with merging the ext4 snapshots. This is
exactly the reason why people will get nervous when you try to push a
huge change like ext4 snapshots into the stable code base. Yes, when you
do not compile it in, it does not affect the fs very much, but try to
tell people that ext4 is not the old-good-stable-ext4 when you enable
this feature. And I do not believe that snapshot code does not interfere
with the old ext4 code paths, so there is a place for horrible bugs
waiting for us.

>
> To this market I say, you can have snapshots of your
> 'data' on ext4 without risking the proven stability of ext4.
> The snapshots of the 'data' are not guaranteed to be as
> stable (being a new feature), but because the snapshots
> are second to 'data' in ext4 snapshots, corrupted snapshots
> will not risk the 'data'.
>
> During 1 year of next3 in production systems, we found bugs.
> But none of the bugs corrupted 'data'. All of the bugs which
> caused file system to contain errors, the errors were restricted
> to snapshot files and in those worst cases, we could always
> go to emergency plan B (plan A being fsck -p) and run fsck -x
> which always solved the problem.

It does not matter that much how long or how many of your embedded
production systems have been out there. The fact is that the work load
variation is really very limited, hence very limited testing.

>
> The customer was always consulted before resorting to 'plan B'
> and was given the chance to copy out 'data' from the snapshots
> (it was always possible) before we discard them.

So it is true, when you have an fs problem (corruption) you have to
blast off all your snapshots ?

>
> Needless to say, the said bugs were fixed and ext4 snapshots
> will enjoy the stability of next3 and the 'fail safe' nature of the
> solution, which was proven several times on the field.
>
>
> >>
> >> >
> >> > There is some confusion among developers on what actually are benefits
> >> > of ext4 snapshots in comparison to btrfs, or in comparison to the new
> >> > dm_multisnap code. I know that you have done quite a lot of testing to
> >> > assure that it does not actually change old ext4 behavior when snapshot
> >> > disabled, and that it works well when enabled, but have you done any
> >> > performance related benchmarks ? Do you have any expectations on how it
> >> > should behave in different work loads ?
> >> >
> >> > It would be great to see and be able to confirm that ext4 snapshots are
> >> > really a win, not only on the feature side, but on the performance side
> >> > as well. I know that there are people out there still undecided or
> >> > having a strange feeling about your snapshot work. But who can blame
> >> > them, when we have not seen any hard data on this matter ?
> >>
> >> Ehm.. I did present this benchmark on LSF:
> >> http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
> >>
> >> unless you snoozed ;-)
> >> it shows performance vs. ext4 w/o snapshots and with snapshots
> >> and while taking snapshots.
> >
> > I believe that you just missed the fact that not everyone has attended LSF
> > and your lightning talk, but that's ok.
>
> That's not really OK. I should have posted the results
> and analysis on my wiki (the results are there).
>
> >
> > It seems to me that random writes are usually faster with your snapshot
> > code regardless of whether you use snapshots or not. Is that because of
> > non snapshot related changes you've made ?
>
> Not that I know of.
> I can explain why random write onesnap is faster than nosnap
> and why 1snappermin is faster than onesnap, but I am not
> sure about nosnap vs. plain ext4.
>
> >
> > Also random reads seem to be slower with snapshots, I suspect that
> > this is because of read through, so is the reason for the slowdown that it
> > was CPU bound ? I do not see any CPU utilization data.
> >
>
> Only the 1snappermin is slower.
> I suspect it has to do with the fs freezes, but I admit I have not
> looked into it.
>
> > The postmark results seems quite odd, it is actually a lot faster with
> > one snapshot and a lot slower with multiple snapshots, do you have an
> > idea what is going on ?
> >
>
> The name onesnap is misleading. It should have been
> existingsnaps.
> The important factor is whether or not snapshots are taken during the test.
> In the 1snappermin case, postmark is the only test that exposes the
> weak spot of ext4 snapshots performance - deletes/truncates.
> create file+delete file with existing snapshots has no overhead (no COW).
> create file+take snapshot+delete file has the overhead of moving the
> deleted blocks to snapshot.
> With regards to speed up of onesnap, postmark is randomizing the file
> creates/write so it may be a similar effect to random write.
> I did not investigate this.
>
> >> I did not compare with btrfs, but I bet there are ext4 vs. btrfs
> >> benchmarks out there.
> >> dm-multisnap is better than dm-snap only when it comes to overhead
> >> per snapshot. it still copies every written block, which is far from
> >> being the case in ext4 snapshots.
> >
> > Nevertheless, I still have not seen any comparison with other
> > snapshotting possibilities we have. Note that ext4 to btrfs comparison
> > is not enough, because we do not know what is the difference between
> > the difference of ext4 with/without snapshots and btrfs with/without
> > snapshots. The reason for this is that btrfs performance is very likely
> > to scale up, but ext4 is pretty much done in that matter and I do not
> > expect any huge performance leaps in the future.
> >
> > Also, rejecting dm-multisnap based on this statement is not enough, show
> > us some numbers.
>
> Well, if you come to understand the difference between fs level and dm
> level snapshots, you will see why I am rejecting dm-multisnap
> (performance wise only!).

But I do understand the difference. And also, when it comes to fs level
snapshotting I would suspect that it would do something we can not do
with the current solutions, for example per-file or per-directory snapshots,
can ext4 snapshots do that ?

>
> Anyway #1: I have already answered these questions 2 years ago and I
> think the answers are still valid both for LVM and btrfs:
> http://sourceforge.net/apps/mediawiki/next3/index.php?title=FAQ#Why_use_Next3_snapshots_and_not_LVM_snapshots.3F

But again, it was two years ago and even back then you did not have any
numbers proving your statements.

>
> Anyway #2: I need to give you some numbers ;-)

That would be great. Thanks!

>
> >
> > I believe that it is not very convenient for you, because this feature
> > support your business case and you do not necessarily want to find out
> > that there might be a better way, especially after the work you have
> > done already.
>
> Your analysis of my motives is correct :-)
> The use of the term 'better way' I reject.
> I think that ext4/btrfs/LVM snapshots are apples and oranges and hamburgers.

But they are really not, because otherwise they would complement each
other, but they are all trying to do the same thing, except btrfs has
it for free.

> The question of whether the world needs ext4 snapshots is
> perfectly valid, but going back to the food analogy, I think it's
> a case of "the proof of the pudding is in the eating".
> I have no doubt that if ext4 snapshots are merged, many people will use it.

Well, I would like to have your confidence. Why do you think so ? They
will use it for what ? Doing backups ? We can do this easily with LVM
without any risk of compromising existing filesystem at all. On desktop
? I very much doubt that since you can not do per directory (or per
file) snapshots, can you ?

> And I think that is a good enough (if not the best)
> reason for inclusion.

It would be of course, except you're the only one saying that.

Thanks!
-Lukas

2011-06-08 15:59:51

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Wed, Jun 8, 2011 at 6:38 PM, Lukas Czerner <[email protected]> wrote:
> On Wed, 8 Jun 2011, Amir G. wrote:
>
>> On Wed, Jun 8, 2011 at 1:09 PM, Lukas Czerner <[email protected]> wrote:
>> > On Tue, 7 Jun 2011, Amir G. wrote:
>> >
>> >> On Tue, Jun 7, 2011 at 6:56 PM, Lukas Czerner <[email protected]> wrote:
>> >> > Hi Amir,
>> >> >
>> >> > thanks very much for the resend. I'll take a look at the whole patch
>> >> > series, but first I want to bring up one important thing.
>> >> >
>> >> > While this being a huge feature for ext4 (regardless on how
>> >> > intrusive it is for the usual code paths) and while we already have
>> >> > patches in the list with people interested in looking into them, you
>> >> > should clearly clarify what is the gain of it, what is the use case (and
>> >> > I know you have one), and why it is better than other approaches. You
>> >> > know, advertise it a bit in the marketing way :).
>> >>
>> >> Hi Lukas,
>> >>
>> >> Thank you for pointing out the marketing aspect.
>> >>
>> >> I must admit that my use case rather speaks for itself.
>> >> CTERA develops a NAS device which is specialized for
>> >> backing up local networks and snapshots gives the NAS a time
>> >> dimension without paying for it in disk space and performance.
>> >>
>> >> The reason for not going with btrfs 3 years ago is clear.
>> >> So why not go with it now instead of moving forward to
>> >> ext4 with snapshots?
>> >> Part of the answer lies in the possibility to run fsck -x,
>> >> which gets rid of the snapshots in the case of fs corruption
>> >> and gets you back to good old stable and consistent ext4.
>> >
>> > But that is not even a real reason, is it ? When you need snapshots,
>> > well, then you just need it and do not want to get rid of it. When fs
>> > corruption appears, then it's bad in any case and the fsck should be
>> > able to more or less fix it.
>> >
>> > So you're saying that when corruption appears, then you *have to* blast
>> > all snapshots ? I am not sure how btrfs is going to deal with it, but it
>> > does not seem like an advantage at all, why are you presenting it as such ?
>> >
>>
>> Hi Lukas,
>>
>> First of all, thank you for being strict with me.
>> I admit to having lousy marketing skills...
>>
>> The market I am targeting is the sys admins who
>> are very cautious about their 'data' and are therefore
>> reluctant to migrate from ext3 to ext4, not to speak of
>> btrfs.
>
> Well, that's why I am concerned with merging the ext4 snapshots. This is
> exactly the reason why people will get nervous when you try to push a
> huge change like ext4 snapshots into the stable code base. Yes, when you
> do not compile it in, it does not affect the fs very much, but try to
> tell people that ext4 is not the old-good-stable-ext4 when you enable
> this feature. And I do not believe that snapshot code does not interfere
> with the old ext4 code paths, so there is a place for horrible bugs
> waiting for us.
>
>>
>> To this market I say, you can have snapshots of your
>> 'data' on ext4 without risking the proven stability of ext4.
>> The snapshots of the 'data' are not guaranteed to be as
>> stable (being a new feature), but because the snapshots
>> are second to 'data' in ext4 snapshots, corrupted snapshots
>> will not risk the 'data'.
>>
>> During 1 year of next3 in production systems, we found bugs.
>> But none of the bugs corrupted 'data'. For all of the bugs which
>> caused the file system to contain errors, the errors were restricted
>> to snapshot files, and in those worst cases we could always
>> go to emergency plan B (plan A being fsck -p) and run fsck -x,
>> which always solved the problem.
>
> It does not matter that much how long or how much your embedded
> production systems are out there. The fact is that it is really very
> limited work load variation, hence very limited testing.

for the record, the embedded systems are x86_64 dual core,
but yes, it's true that the load variation is limited.
I am not saying there are no bugs, I'm just saying the 'fail safe'
always worked.


>
>>
>> The customer was always consulted before resorting to 'plan B'
>> and was given the chance to copy out 'data' from the snapshots
>> (it was always possible) before we discard them.
>
> So it is true, when you have an fs problem (corruption) you have to
> blast off all your snapshots ?

No, most of the time the problem could be solved by fsck -p
without discarding snapshots.
Only for the really hard cases, we had to discard the snapshots.

>
>>
>> Needless to say, the said bugs were fixed and ext4 snapshots
>> will enjoy the stability of next3 and the 'fail safe' nature of the
>> solution, which was proven several times in the field.
>>
>>
>> >>
>> >> >
>> >> > There is some confusion among developers on what actually are benefits
>> >> > of ext4 snapshots in comparison to btrfs, or in comparison to the new
>> >> > dm_multisnap code. I know that you have done quite a lot of testing to
>> >> > assure that it does not actually change old ext4 behavior when snapshot
>> >> > disabled, and that it works well when enabled, but have you done any
>> >> > performance related benchmarks ? Do you have any expectations on how it
>> >> > should behave in different work loads ?
>> >> >
>> >> > It would be great to see and be able to confirm that ext4 snapshots are
>> >> > really a win, not only on the feature side, but on the performance side
>> >> > as well. I know that there are people out there still undecided or
>> >> > having a strange feeling about your snapshot work. But who can blame
>> >> > them, when we have not seen any hard data on this matter ?
>> >>
>> >> Ehm.. I did present this benchmark on LSF:
>> >> http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
>> >>
>> >> unless you snoozed ;-)
>> >> it shows performance vs. ext4 w/o snapshots and with snapshots
>> >> and while taking snapshots.
>> >
>> > I believe that you just missed the fact that not everyone has attended LSF
>> > and your lightning talk, but that's ok.
>>
>> That's not really OK. I should have posted the results
>> and analysis on my wiki (the results are there).
>>
>> >
>> > It seems to me that random writes are usually faster with your snapshot
>> > code regardless whether you use snapshots or not. Is that because of
>> > non snapshot related changes you've made ?
>>
>> Not that I know of.
>> I can explain why random write onesnap is faster than nosnap
>> and why 1snappermin is faster than onesnap, but I am not
>> sure about nosnap vs. plain ext4.
>>
>> >
>> > Also random reads seem to be slower with snapshots. I suspect that
>> > this is because of read through, so is the reason for the slowdown that it
>> > was CPU bound ? I do not see any CPU utilization data.
>> >
>>
>> Only the 1snappermin is slower.
>> I suspect it has to do with the fs freezes, but I admit I have not
>> looked into it.
>>
>> > The postmark results seem quite odd, it is actually a lot faster with
>> > one snapshot and a lot slower with multiple snapshots, do you have an
>> > idea what is going on ?
>> >
>>
>> The name onesnap is misleading. It should have been
>> existingsnaps.
>> The important factor is whether or not snapshots are taken during the test.
>> In the 1snappermin case, postmark is the only test that exposes the
>> weak spot of ext4 snapshots performance - deletes/truncates.
>> create file+delete file with existing snapshots has no overhead (no COW).
>> create file+take snapshot+delete file has the overhead of moving the
>> deleted blocks to snapshot.
>> With regards to speed up of onesnap, postmark is randomizing the file
>> creates/write so it may be a similar effect to random write.
>> I did not investigate this.
>>
>> >> I did not compare with btrfs, but I bet there are ext4 vs. btrfs
>> >> benchmarks out there.
>> >> dm-multisnap is better than dm-snap only when it comes to overhead
>> >> per snapshot. it still copies every written block, which is far from
>> >> being the case in ext4 snapshots.
>> >
>> > Nevertheless, I still have not seen any comparison with other
>> > snapshotting possibilities we have. Note that ext4 to btrfs comparison
>> > is not enough, because we do not know what is the difference between
>> > the difference of ext4 with/without snapshots and btrfs with/without
>> > snapshots. The reason for this is that btrfs performance is very likely
>> > to scale up, but ext4 is pretty much done in that matter and I do not
>> > expect any huge performance leaps in the future.
>> >
>> > Also, rejecting dm-multisnap based on this statement is not enough, show
>> > us some numbers.
>>
>> Well, if you come to understand the difference between fs level and dm
>> level snapshots, you will see why i am rejecting dm-multisnap
>> (performance wise only!).
>
> But I do understand the difference. And also, when it comes to fs level
> snapshotting I would suspect that it would do something we can not do
> with the current solutions, for example per-file or per-directory snapshots,
> can ext4 snapshots do that ?

Nope.

>
>>
>> Anyway #1: I have already answered these questions 2 years ago and I
>> think the answers are still valid both for LVM and btrfs:
>> http://sourceforge.net/apps/mediawiki/next3/index.php?title=FAQ#Why_use_Next3_snapshots_and_not_LVM_snapshots.3F
>
> But again, it was two years ago and even back then you have not had any
> numbers proving your statements.
>
>>
>> Anyway #2: I need to give you some numbers ;-)
>
> That would be great. Thanks!
>
>>
>> >
>> > I believe that it is not very convenient for you, because this feature
>> > supports your business case and you do not necessarily want to find out
>> > that there might be a better way, especially after the work you have
>> > done already.
>>
>> Your analysis of my motives is correct :-)
>> The use of the term 'better way' I reject.
>> I think that ext4/btrfs/LVM snapshots are apples and oranges and hamburgers.
>
> But they are really not, because otherwise it would complement each
> other, but they are all trying to do the same thing, except btrfs has
> it for free.

apples and oranges don't complement each other.
they are (non-equal) alternatives.

>
>> The question of whether the world needs ext4 snapshots is
>> perfectly valid, but going back to the food analogy, I think it's
>> a case of "the proof of the pudding is in the eating".
>> I have no doubt that if ext4 snapshots are merged, many people will use it.
>
> Well, I would like to have your confidence. Why do you think so ? They
> will use it for what ? Doing backups ? We can do this easily with LVM
> without any risk of compromising existing filesystem at all. On desktop

LVM snapshots are not meant to be long lived snapshots.
As temporary snapshots they are fine, but with ext4 snapshots
you can easily retain monthly/weekly snapshots without the
need to allocate the space for it in advance and without the
'vanish' quality of LVM snapshots.

> ? I very much doubt that since you can not do per directory (or per
> file) snapshots, can you ?

No, I can't.

>
>> And I think that is a good enough (if not the best)
>> reason for inclusion.
>
> It would be of course, except you're the only one saying that.
>

I had several people approach me who found the feature interesting
for their application. Some are developers I met at LSF, some are
users that found next3 interesting. One distro (OpenNode) has even
announced support for next3.

The incremental filesystem backup (a la ZFS send/recv) is a 'killer app'
in my opinion (and in the opinion of sys admins that use ZFS).
Ext4 snapshots enable that technology.

Amir.

2011-06-08 16:20:15

by Mike Snitzer

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Wed, Jun 8, 2011 at 11:59 AM, Amir G. <[email protected]> wrote:
> On Wed, Jun 8, 2011 at 6:38 PM, Lukas Czerner <[email protected]> wrote:
>> Amir said:

>>> The question of whether the world needs ext4 snapshots is
>>> perfectly valid, but going back to the food analogy, I think it's
>>> a case of "the proof of the pudding is in the eating".
>>> I have no doubt that if ext4 snapshots are merged, many people will use it.
>>
>> Well, I would like to have your confidence. Why do you think so ? They
>> will use it for what ? Doing backups ? We can do this easily with LVM
>> without any risk of compromising existing filesystem at all. On desktop
>
> LVM snapshots are not meant to be long lived snapshots.
> As temporary snapshots they are fine, but with ext4 snapshots
> you can easily retain monthly/weekly snapshots without the
> need to allocate the space for it in advance and without the
> 'vanish' quality of LVM snapshots.

In that old sf.net wiki you say:
Why use Next3 snapshots and not LVM snapshots?
* Performance: only small overhead to write performance with snapshots

Fair claim against current LVM snapshot (but not multisnap).

In this thread you're being very terse on the performance hit you
assert multisnap has that ext4 snapshots do not. Can you please be
more specific?

In your most recent post it seems you're focusing on "LVM snapshots"
and attributing the deficiencies of old-style LVM snapshots
(non-shared exception store causing N-way copy-out) to dm-multisnap?

Again, nobody will dispute that the existing dm-snapshot target has
poor performance that requires snapshots be short-lived. But
multisnap does _not_ suffer from those performance problems.

Mike

2011-06-09 01:59:50

by Yongqiang Yang

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

> But I do understand the difference. And also, when it comes to fs level
> snapshotting I would suspect that it would do something we can not do
> with the current solutions, for example per-file or per-directory snapshots,
> can ext4 snapshots do that ?
Hi Lukas,

I noticed that there is no answer to this question in the thread. I
can answer that ext4 can snapshot per-file or
per-directory, and can exclude some files or directories from being
snapshotted.

--
Best Wishes
Yongqiang Yang

2011-06-09 03:18:12

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
>> But I do understand the difference. And also, when it comes to fs level
>> snapshotting I would suspect that it would do something we can not do
>> with the current solutions, for example per-file or per-directory snapshots,
>> can ext4 snapshots do that ?
> Hi Lukas,
>
> I noticed that there is no answer to this question in the thread. I

I think I answered this question with No it can't ;-)

> can give the question the answer that ext4 can snapshot per-file or
> per-directory, and can exclude some files or directories from being
> snapshotted.
>

So the full answer is that ext4 snapshot CAN exclude
certain files/dirs from snapshot, but this feature is not fully implemented yet
(I have it in a dev branch)

> --
> Best Wishes
> Yongqiang Yang
>

2011-06-09 03:51:08

by Yongqiang Yang

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
> On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
>>> But I do understand the difference. And also, when it comes to fs level
>>> snapshotting I would suspect that it would do something we can not do
>>> with the current solutions, for example per-file or per-directory snapshots,
>>> can ext4 snapshots do that ?
>> Hi Lukas,
>>
>> I noticed that there is no answer to this question in the thread. I
>
> I think I answered this question with No it can't ;-)
I think this can be implemented easily with chattr and by adding a check in
should_snapshot() or should_move_data().

And I thought Lukas was asking whether ext4 snapshots can do this
easily. So I said YES :-)

>
>> can give the question the answer that ext4 can snapshot per-file or
>> per-directory, and can exclude some files or directories from being
>> snapshotted.
>>
>
> So the full answer is that ext4 snapshot CAN exclude
> certain files/dirs from snapshot, but this feature is not fully implemented yet
> (I have it in a dev branch)
>
>> --
>> Best Wishes
>> Yongqiang Yang
>>
>



--
Best Wishes
Yongqiang Yang

2011-06-09 06:50:50

by Lukas Czerner

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, 9 Jun 2011, Yongqiang Yang wrote:

> On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
> > On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
> >>> But I do understand the difference. And also, when it comes to fs level
> >>> snapshotting I would suspect that it would do something we can not do
> >>> with the current solutions, for example per-file or per-directory snapshots,
> >>> can ext4 snapshots do that ?
> >> Hi Lukas,
> >>
> >> I noticed that there is no answer to this question in the thread. I
> >
> > I think I answered this question with No it can't ;-)
> I think this can be implemented easily with chattr and by adding a check in
> should_snapshot() or should_move_data().
>
> And I thought Lukas was asking whether ext4 snapshots can do this
> easily. So I said YES :-)

Cool, finally something interesting :). So, how will it work ? Does that
require any format changes again :) ? Can you exclude the whole root and
then selectively pick the directories or files you are interested in ?

How does rollback work with ext4 snapshots ? Can you selectively roll
back one file, or the whole directory subtree even when you're
snapshotting more ?

You see, when it comes to the full fs snapshots I am not convinced that
it is *very* useful, yes it might have some users, but you can always
take the safe way and do lvm snapshots (or better use the new multisnap)
for backup, without need to modify stable filesystem code.

Also, I do not buy the whole argument of "not having to create separate disk
space for snapshot". It is actually better for sysadmins, because you
have perfect control on what is going on, how much space is used for
your snapshots and how much is used by your data. You can always easily
extend the snapshot volume, or let it die silently when it is too old
and too big.

How does it actually work on ext4 snapshots ? When you're going to
rewrite a file, you will never know how much disk space it'll take in
advance, am I right ? Is the filesystem accounting for the snapshot size
as well ? or is it hidden ?

Thanks!
-Lukas


2011-06-09 07:57:13

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
> On Thu, 9 Jun 2011, Yongqiang Yang wrote:
>
>> On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
>> > On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
>> >>> But I do understand the difference. And also, when it comes to fs level
>> >>> snapshotting I would suspect that it would do something we can not do
>> >>> with the current solutions, for example per-file or per-directory snapshots,
>> >>> can ext4 snapshots do that ?
>> >> Hi Lukas,
>> >>
>> >> I noticed that there is no answer to this question in the thread. I
>> >
>> > I think I answered this question with No it can't ;-)
>> I think this can be implemented easily with chattr and by adding a check in
>> should_snapshot() or should_move_data().
>>
>> And I thought Lukas was asking whether ext4 snapshots can do this
>> easily. So I said YES :-)
>
> Cool, finally something interesting :). So, how will it work ? Does that
> require any format changes again :) ? Can you exclude the whole root and
> then selectively pick the directories or files you are interested in ?

The design is actually very simple and not as powerful as you
probably desire.
I hate to get into the design of future features when we haven't
even ACKed the current feature yet, but since you're the only one
who did any review, I owe you that much ;-)

To exclude a file from snapshot it needs to have the NOCOW_FL flag.
Ironically, btrfs has already added that flag in parallel to me (for the
same purpose), so the flag is already reserved in the code :-)

To avoid some transition issues and keep it really simple,
I disallow changing the NOCOW_FL
for regular files and only allow changing it for directories.
The NOCOW_FL is inherited from the parent directory,
so setting/clearing the flag on a directory means:
"All files/subdirs will be created excluded/not-excluded from now on".

Inside the snapshot image, excluded directories, which are not really
excluded, show normally, but excluded files are shown with zero length.
Making the files disappear is hard, and their blocks may have already
been reused, so we cannot allow access to them.
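
The inheritance rule above can be illustrated with a toy in-memory model.
This is only a sketch in Python, not the kernel code from the dev branch:
the Inode class and the create() and set_nocow() names are hypothetical,
and the boolean "nocow" merely stands in for NOCOW_FL.

```python
import errno
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Inode:
    """Toy inode: a name, a type and the exclude flag (hypothetical model)."""
    name: str
    is_dir: bool
    nocow: bool = False                      # stands in for NOCOW_FL
    children: Dict[str, "Inode"] = field(default_factory=dict)

    def create(self, name: str, is_dir: bool = False) -> "Inode":
        """Create a child; it inherits the flag from its parent at creation time."""
        if not self.is_dir:
            raise NotADirectoryError(errno.ENOTDIR, "parent is not a directory")
        child = Inode(name, is_dir, nocow=self.nocow)
        self.children[name] = child
        return child

    def set_nocow(self, value: bool) -> None:
        """Only directories may change the flag; existing children are untouched."""
        if not self.is_dir:
            raise PermissionError(errno.EPERM, "flag is fixed on regular files")
        self.nocow = value


root = Inode("/", is_dir=True)
kept = root.create("kept.txt")               # created before the flag was set
root.set_nocow(True)
scratch = root.create("scratch.dat")         # created excluded from now on
subdir = root.create("tmp", is_dir=True)     # subdirs inherit the flag too
print(kept.nocow, scratch.nocow, subdir.nocow)
```

In this model a file that already existed when the directory flag was set
stays included, which is the "from now on" part of the rule.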

>
> How does rollback work with ext4 snapshots ? Can you selectively roll
> back one file, or the whole directory subtree even when you're
> snapshotting more ?

So there is actually no inherent "rollback" feature, not for a file/dir
and not for the entire fs.
It's a drawback of ext4 snapshots, but hey, cp/rsync from snapshot
still works for file/dir ;-)
As for full "fs" rollback: a revert tool has been developed (by students),
which requires external storage to export the "revert patch".
This tool is going to be enhanced to use LVM snapshot storage
and the LVM --merge option to implement ext4 "revert to snapshot" with Yum.

>
> You see, when it comes to the full fs snapshots I am not convinced that
> it is *very* useful, yes it might have some users, but you can always
> take the safe way and do lvm snapshots (or better use the new multisnap)
> for backup, without need to modify stable filesystem code.
>

You think like a developer. Try talking to some sys admins.
Especially ones that worked with Solaris/ZFS or NetApp.
See what they think about snapshots and about the LVM alternative...
Snapshots have addictive qualities. Once you've used them, you can't
go back to not having them.
Imagine how people used to live before the 'Undo' button and imagine
that your employer forced you to use an editor without an Undo button.
This is the kind of feedback I got from sys admins that moved from Solaris
to Linux.


> Also, I do not buy the whole argument of "not having to create separate disk
> space for snapshot". It is actually better for sysadmins, because you
> have perfect control on what is going on, how much space is used for
> your snapshots and how much is used by your data. You can always easily
> extend the snapshot volume, or let it die silently when it is too old
> and too big.
>

Seriously, Lukas, talk to sys admins.
Letting the snapshot die silently is the worst possible thing that a snapshots
implementation can do (for long lived snapshots).


> How does it actually work on ext4 snapshots ? When you're going to
> rewrite a file, you will never know how much disk space it'll take in
> advance, am I right ? Is the filesystem accounting for the snapshot size
> as well ? or is it hidden ?

It's not hidden, it's accounted for as a regular file (usually owned by root).
You need a bit of scripting to gather the disk space used by snapshots (du).
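
A minimal sketch of that scripting, in Python rather than du itself. It
assumes the snapshot files are regular files collected in one directory
(the directory is a parameter; no real path is implied) and that they may
be sparse, so the allocated size is reported next to the apparent size.
The function name snapshot_usage() is made up for this example.

```python
import os
from typing import Dict, Tuple


def snapshot_usage(snap_dir: str) -> Dict[str, Tuple[int, int]]:
    """Map each regular file in snap_dir to (apparent_size, allocated_bytes)."""
    usage = {}
    with os.scandir(snap_dir) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue                      # skip subdirectories and symlinks
            st = entry.stat(follow_symlinks=False)
            # st_blocks counts 512-byte units, which is what plain du reports on
            usage[entry.name] = (st.st_size, st.st_blocks * 512)
    return usage


def total_allocated(snap_dir: str) -> int:
    """Total disk space actually held by the files in snap_dir, in bytes."""
    return sum(allocated for _, allocated in snapshot_usage(snap_dir).values())
```

Calling total_allocated() on the directory that holds the snapshot files
gives the number an admin would compare against the free space of the fs.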

In ANY snapshots implementation, you can get ENOSPC on operations,
which traditionally could not produce this error.
This statement is also true for thin provisioning implementations.
The question is how the implementation handles these situations.

What I came to realize at LSF is that my implementation is the only
one (compared with LVM and btrfs) that tries to deal with the ENOSPC issue and
does a good job most of the time.

I deal with it by reserving space for metadata COW on snapshot
take, so if a future ENOSPC during metadata COW is possible,
snapshot take will fail with ENOSPC.
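
The idea of failing early can be shown with a toy accounting model. This
is a sketch under simplifying assumptions, not the ext4 code: the ToyFs
class and its method names are invented here, and the worst-case number of
metadata blocks is simply passed in by the caller.

```python
import errno


class ToyFs:
    """Toy block accounting; the numbers and names are illustrative only."""

    def __init__(self, free_blocks: int) -> None:
        self.free_blocks = free_blocks
        self.reserved_blocks = 0   # held back for future metadata COW

    def take_snapshot(self, metadata_blocks: int) -> None:
        """Fail up front if the worst-case metadata COW cannot be covered."""
        if metadata_blocks > self.free_blocks:
            raise OSError(errno.ENOSPC, "cannot reserve space for metadata COW")
        self.free_blocks -= metadata_blocks
        self.reserved_blocks += metadata_blocks

    def cow_metadata_block(self) -> None:
        """Metadata COW draws from the reservation instead of the free pool."""
        assert self.reserved_blocks > 0, "reservation should cover this block"
        self.reserved_blocks -= 1


fs = ToyFs(free_blocks=100)
fs.take_snapshot(metadata_blocks=40)        # succeeds: 60 free, 40 reserved
try:
    fs.take_snapshot(metadata_blocks=80)    # more than the 60 blocks left
except OSError as err:
    print(err.errno == errno.ENOSPC)
```

The error surfaces at snapshot take, where an admin can react to it,
instead of later in the middle of a metadata update.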

As for ENOSPC during regular file rewrite, that's not such a big problem.
The application simply gets ENOSPC as if the file was sparse to begin
with. It may not be pleasant if the application has fallocated the space
and used mmap/close without msync...
The only way I see around this issue is reserving space at mmap time
(and returning ENOSPC at that time). Again, this issue is shared
with btrfs, but is easier to fix (I think) with ext4 snapshots.





2011-06-09 08:13:34

by David Lang

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, 9 Jun 2011, Amir G. wrote:

> On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
>> On Thu, 9 Jun 2011, Yongqiang Yang wrote:
>>
>>> On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
>>>> On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
>>
>> You see, when it comes to the full fs snapshots I am not convinced that
>> it is *very* useful, yes it might have some users, but you can always
>> take the safe way and do lvm snapshots (or better use the new multisnap)
>> for backup, without need to modify stable filesystem code.
>>
>
> You think like a developer. Try talking to some sys admins.
> Especially ones that worked with Solaris/ZFS or NetApp.
> See what they think about snapshots and about the LVM alternative...
> Snapshots have addictive qualities. Once you've used them, you can't
> go back to not having them.
> Imagine how people used to live before the 'Undo' button and imagine
> that your employer forced you to use an editor without an Undo button.
> This is the kind of feedback I got from sys admins that moved from Solaris
> to Linux.

as a sysadmin, it's a _wonderful_ tool to have for any system that has
people editing/saving files on it directly.

>
>> Also, I do not buy the whole argument of "not having to create separate disk
>> space for snapshot". It is actually better for sysadmins, because you
>> have perfect control on what is going on, how much space is used for
>> your snapshots and how much is used by your data. You can always easily
>> extend the snapshot volume, or let it die silently when it is too old
>> and too big.
>>
>
> Seriously, Lukas, talk to sys admins.
> Letting the snapshot die silently is the worst possible thing that a snapshots
> implementation can do (for long lived snapshots).

that depends on the site policy.

sometimes it is better to lose snapshots than to run out of disk space
and halt the system, sometimes you would rather halt the system.

the policy of what happens when you run out of space should not be a
kernel decision, the desired behavior varies far too much.

this includes being able to say things like "I want to always have 10% of
my disk allocated to snapshots, but if there's more free space, go ahead
and use it, but always keep at least 10% of the disk free so that you
don't have to halt new writes while you clear space"

or

"if you run out of space, try and keep the oldest snapshot and the newest
snapshot, delete other snapshots in between before touching either of
these"
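
a user-space policy like the second one could look roughly like the
sketch below. it is only an illustration in Python: the Snapshot tuple and
pick_victims() are made-up names, the space freed by deleting each
snapshot is assumed to be known, and deleting the older in-between
snapshots first is an arbitrary choice.

```python
from typing import List, NamedTuple


class Snapshot(NamedTuple):
    name: str
    created: int      # e.g. seconds since the epoch
    size: int         # bytes that deleting it would free (assumed known)


def pick_victims(snapshots: List[Snapshot], bytes_needed: int) -> List[Snapshot]:
    """Choose in-between snapshots to delete, sparing the oldest and newest."""
    ordered = sorted(snapshots, key=lambda s: s.created)
    middle = ordered[1:-1]            # everything except oldest and newest
    victims, freed = [], 0
    for snap in middle:               # assumption: drop older ones first
        if freed >= bytes_needed:
            break
        victims.append(snap)
        freed += snap.size
    return victims


snaps = [
    Snapshot("mon", 1, 10),
    Snapshot("tue", 2, 20),
    Snapshot("wed", 3, 30),
    Snapshot("thu", 4, 40),
]
print([s.name for s in pick_victims(snaps, 25)])
```

if the in-between snapshots do not free enough space, the function returns
all of them and leaves the decision about the oldest and newest snapshot
to the caller, which is where the site policy belongs.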

>> How does it actually work on ext4 snapshots ? When you're going to
>> rewrite a file, you will never know how much disk space it'll take in
>> advance, am I right ? Is the filesystem accounting for the snapshot size
>> as well ? or is it hidden ?
>
> It's not hidden, it's accounted for as a regular file (usually owned by root).
> You need a bit of scripting to gather the disk space used by snapshots (du).

the worst case when you re-write a file is that it will take the full
amount of space that the file currently takes (as if you wrote a new copy
of the file and some process had a filehandle open on the old copy,
preventing the space from being reclaimed, so it's far from being a new
problem)

see the note above about the need to be able to remove snapshots when you
are out of space.

since snapshots tend to be small compared to the filesystems they protect
(not in all cases, but if you are covering the entire system with one
snapshot that would be the way to bet), having the ability to put the
snapshot metadata off on a smaller/faster disk would be helpful.

having the ability to snapshot just specific files/directories would be a
killer feature IMHO

David Lang

2011-06-09 08:46:40

by Lukas Czerner

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, 9 Jun 2011, Amir G. wrote:

> On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
> > On Thu, 9 Jun 2011, Yongqiang Yang wrote:
> >
> >> On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
> >> > On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
> >> >>> But I do understand the difference. And also, when it comes to fs level
> >> >>> snapshotting I would suspect that it would do something we can not do
> >> >>> with the current solutions, for example per-file or per-directory snapshots,
> >> >>> can ext4 snapshots do that ?
> >> >> Hi Lukas,
> >> >>
> >> >> I noticed that there is no answer to this question in the thread. I
> >> >
> >> > I think I answered this question with No it can't ;-)
> >> I think this can be implemented easily with chattr and by adding a check in
> >> should_snapshot() or should_move_data().
> >>
> >> And I thought Lukas was asking whether ext4 snapshots can do this
> >> easily. So I said YES :-)
> >
> > Cool, finally something interesting :). So, how will it work ? Does that
> > require any format changes again :) ? Can you exclude the whole root and
> > then selectively pick the directories or files you are interested in ?
>
> The design is actually very simple and not as powerful as you
> probably desire.
> I hate to get into the design of future features, when we haven't
> even ACKed the current feature yet, but since you're the only one
> who did any review, I owe you that much ;-)

Thanks Amir!

You have to understand that I am still not convinced that ext4 snapshot
in its current state is really what we want to have in ext4. Especially
given the very basic features it provides, without any knowledge on how
it can be extended (but you're slowly providing that information, so
thanks for that). And especially facing the new dm-multisnap, I really
wonder if it is worth it.

If we want filesystem level snapshotting we can try to do it right with
all the benefits that snapshots on that level bring. But what I see
now is not even remotely the case. And I have the feeling that all the
features that might be interesting for snapshotting at file system
level are just a hack and not inherent in the design. But that is
probably because your goal was to snapshot the whole filesystem for
backup purposes, but that's not what I would expect from fs level
snapshots. I really hope you understand my point.

>
> To exclude a file from snapshot it needs to have the NOCOW_FL flag.
> Ironically, btrfs has already added that flag in parallel to me (for the
> same purpose), so the flag is already reserved in the code :-)
>
> To avoid some transition issues and keep it really simple,
> I disallow changing the NOCOW_FL
> for regular files and only allow changing it for directories.
> The NOCOW_FL is inherited from the parent directory,
> so setting/clearing the flag on a directory means:
> "All files/subdirs will be created excluded/not-excluded from now on".
>
> Inside the snapshot image, excluded directories, which are not really
> excluded, show normally, but excluded files are shown with zero length.
> Making the files disappear is hard, and their blocks may have already
> been reused, so we cannot allow access to them.
>
> >
> > How does rollback work with ext4 snapshots ? Can you selectively roll
> > back one file, or the whole directory subtree even when you're
> > snapshotting more ?
>
> So there is actually no inherent "rollback" feature, not for a file/dir
> and not for the entire fs.
> It's a drawback of ext4 snapshots, but hey, cp/rsync from snapshot
> still works for file/dir ;-)
> As for full "fs" rollback: a revert tool has been developed (by students),
> which requires external storage to export the "revert patch".
> This tool is going to be enhanced to use LVM snapshot storage
> and the LVM --merge option to implement ext4 "revert to snapshot" with Yum.

And that is the problem. Because at this level you should be able to do
it without very much trouble, because being at file system level you
should have all the information. Do not get me wrong, I am not saying
that this is easy, but it should be "from design". Exporting the
"revert patch" to the external storage, or exporting snapshot to LVM
format to be able to merge it...that is all just hacks, because the
design itself does not account for that possibility.

>
> >
> > You see, when it comes to the full fs snapshots I am not convinced that
> > it is *very* useful, yes it might have some users, but you can always
> > take the safe way and do lvm snapshots (or better use the new multisnap)
> > for backup, without need to modify stable filesystem code.
> >
>
> You think like a developer. Try talking to some sys admins.
> Especially ones that worked with Solaris/ZFS or NetApp.
> See what they think about snapshots and about the LVM alternative...
> Snapshots have addictive qualities. Once you've used them, you can't
> go back to not having them.
> Imagine how people used to live before the 'Undo' button and imagine
> that your employer forced you to use an editor without an Undo button.
> This is the kind of feedback I got from sys admins that moved from Solaris
> to Linux.

Exactly, so if we want fs level snapshots, it should use that
privilege, not hack its way to do things like roll back, or
excludes+includes. Ext4 was not meant to work that way, nor were your
snapshots designed to work that way. If we are considering backups only,
because that is what your ext4 snapshots can provide now, I would prefer
to use LVM. But yes, we all need to know how the new multisnap works
out.

>
>
> > Also, I do not buy the whole argument of "not having to create separate disk
> > space for snapshot". It is actually better for sysadmins, because you
> > have perfect control on what is going on, how much space is used for
> > your snapshots and how much is used by your data. You can always easily
> > extend the snapshot volume, or let it die silently when it is too old
> > and too big.
> >
>
> Seriously, Lukas, talk to sys admins.
> Letting the snapshot die silently is the worst possible thing that a snapshots
> implementation can do (for long lived snapshots).

Oh no, you misunderstood. Even with your snapshots you'll have to delete
old snapshots someday, because otherwise you'll run out of space. With
LVM, however, you have pre-reserved space for it, so even if your snapshot
volume gets full, it does not affect your filesystem whatsoever. And,
as an administrator, you can decide whether to extend the snapshot volume
to let it live longer, or just let it be and it will die eventually.

And, as far as I know, the new multisnap will notify the admin when the
snapshot volume approaches the watermark, the same way that, for example,
thinly provisioned storage would do. But again, with your snapshots it
will give you ENOSPC when the snapshots grow too big, and at the end
of the day, you need to create data to be able to back it up :), so having
snapshots separate from your fs volume makes sense.

>
>
> > How does it actually work on ext4 snapshots ? When you're going to
> > rewrite a file, you will never know how much disk space it'll take in
> > advance, am I right ? Is the filesystem accounting for the snapshot size
> > as well ? or is it hidden ?
>
> It's not hidden, it's accounted for as a regular file (usually owned by root).
> You need a bit of scripting to gather the disk space used by snapshots (du).
>
> In ANY snapshots implementation, you can get ENOSPC on operations,
> which traditionally could not produce this error.
> This statement is also true for thin provisioning implementations.
> The question is how the implementation handles these situations.
>
> What I came to realize on LSF, is that my implementation is the only
> one (of LVM and btrfs) that tries to deal with the ENOSPC issue and
> does a good job most of the time.
>
> I deal with it by reserving space for metadata COW on snapshot
> take, so if a future ENOSPC during metadata COW is possible,
> snapshot take will fail with ENOSPC.
>
> As for ENOSPC during regular file rewrite, that's not such a big problem.
> The application simply gets ENOSPC as if the file was sparse to begin
> with. It may not be pleasant if the application have fallocated the space
> and used mmap/close without msync...
> The only way I see around this issue is reserving space on mmap time
> (and returning ENOSPC at that time), but again, this issue is shared
> with btrfs, but is easier to fix (I think) with ext4 snapshots.

Yes, I do understand that ext4 snapshots are doing well in that aspect,
but as I said, having snapshots separate from your file system gives
you the advantage of not running into ENOSPC on your file system until you
really fill it with data.

Granted, I have to take a look at the multisnap code to see what it can
do and compare it with ext4 snapshots, because really, if it is good
enough and you are able to do snapshotting backups as you do with
your approach, I do not see a reason to complicate our life in
ext4.

Thanks!
-Lukas

2011-06-09 10:06:39

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, Jun 9, 2011 at 11:13 AM, <[email protected]> wrote:
> On Thu, 9 Jun 2011, Amir G. wrote:
>
>> On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
>>>
>>> On Thu, 9 Jun 2011, Yongqiang Yang wrote:
>>>
>>>> On Thu, Jun 9, 2011 at 11:18 AM, Amir G.
>>>> <[email protected]> wrote:
>>>>>
>>>>> On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]>
>>>>> wrote:
>>>
>>> You see, when it comes to the full fs snapshots I am not convinced that
>>> it is *very* useful, yes it might have some users, but you can always
>>> take the safe way and do lvm snapshots (or better use the new multisnap)
>>> for backup, without need to modify stable filesystem code.
>>>
>>
>> You think like a developer. Try talking to some sys admins.
>> Especially ones that worked with Solaris/ZFS or NetApp.
>> See what they think about snapshots and about the LVM alternative...
>> Snapshots have addictive qualities. Once you've used them, you can't
>> go back to not having them.
>> Imagine how people used to live before the 'Undo' button and imagine
>> that your employer forced you to use an editor without an Undo button.
>> This is the kind of feedback I got from sys admins that moved from Solaris
>> to Linux.
>
> as a sysadmin, it's a _wonderful_ tool to have for any system that has
> people editing/saving files on directly.

Thank you, David. Finally some positive feedback from the people
for whom the feature is intended :-)

>
>>
>>> Also, I do not buy the whole argument of "not have to create separate
>>> disk
>>> space for snapshot". It is actually better for sysadmins, because you
>>> have perfect control on what is going on, how much space is used for
>>> your snapshots and how much is used by your data. You can always easily
>>> extend the snapshot volume, or let it die silently when it is too old
>>> and too big.
>>>
>>
>> Seriously, Lukas, talk to sys admins.
>> Letting the snapshot die silently is the worst possible thing that a
>> snapshots
>> implementation can do (for long lived snapshots).
>
> that depends on the site policy.
>
> sometimes it is better to lose snapshots than to run out of disk space and
> halt the system, sometimes you would rather halt the system.
>
> the policy of what happens when you run out of space should not be a kernel
> decision, the desired behavior varies far too much.
>
> this includes being able to say things like "I want to always have 10% of my
> disk allocated to snapshots, but if there's more free space, go ahead and
> use it, but always keep at least 10% of the disk free so that you don't have
> to halt new writes while you clear space"
>
> or
>
> "if you run out of space, try and keep the oldest snapshot and the newest
> snapshot, delete other snapshots in between before touching either of these"
>

I fully agree.
AFAIK, there is no user space tool for Linux that manages snapshots at this level.
The only snapshot manager I know about is snapper:
http://en.opensuse.org/Portal:Snapper, to which we are working on adding
ext4 snapshots support.
To the best of my knowledge, Snapper does not have a free-space-based policy,
but it could be improved to monitor free disk space.

A tool like that does not need any further kernel changes from
ext4 and btrfs to implement the policies suggested above.
However, with LVM snapshots, some of these policies (like use whatever space you
have free in the filesystem) are simply not possible.
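
The policies David describes could live entirely in a user-space tool.
Below is a minimal Python sketch of the "keep the oldest and the newest
snapshot" rule combined with a free-space target. The Snapshot record, the
pick_snapshots_to_delete() name, the order in which the two protected
snapshots are sacrificed, and the assumption that deleting a snapshot frees
exactly its reported size are all hypothetical; none of them come from
ext4 snapshots or snapper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    name: str
    created: int  # creation time, e.g. seconds since the epoch
    size: int     # bytes assumed to be freed if this snapshot is deleted


def pick_snapshots_to_delete(snapshots, free_bytes, total_bytes,
                             min_free_ratio=0.10):
    """Return the snapshots to delete, in order, until the free-space
    target is met.

    Snapshots between the oldest and the newest are selected first
    (oldest first). The oldest and newest are selected only as a last
    resort; that last-resort order is an assumption of this sketch.
    """
    target = int(total_bytes * min_free_ratio)
    if free_bytes >= target or not snapshots:
        return []

    ordered = sorted(snapshots, key=lambda s: s.created)
    middle = ordered[1:-1]
    protected = [ordered[0]]
    if len(ordered) > 1:
        protected.append(ordered[-1])

    to_delete = []
    for snap in middle + protected:
        if free_bytes >= target:
            break
        to_delete.append(snap)
        free_bytes += snap.size
    return to_delete
```

A real tool would obtain the sizes with something like du on the snapshot
files and would have to account for blocks shared between snapshots, which
this sketch ignores.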


>>> How does it actually work on ext4 snapshots ? When you're going to
>>> rewrite a file, you will never know how much disk space it'll take in
>>> advance, am I right ? Is the filesystem accounting for the snapshot size
>>> as well ? or is it hidden ?
>>
>> It's not hidden, it's accounted for as a regular file (usually owned by
>> root).
>> You need a bit of scripting to gather the disk space used by snapshots
>> (du).
>
> the worst case when you re-write a file is that it will take the full amount
> of space that the file currently takes (as if you wrote a new copy of the
> file and some process had a filehandle open on the old copy, preventing the
> space from being reclaimed, so it's far from being a new problem)

No, it's a new problem.
When you have a large db, which does random writes to an existing db file,
it does not expect ENOSPC when updating an existing record or index.
Only by keeping enough free disk space in the system at all times can you
avoid this kind of problem.
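
To make the failure mode concrete: an in-place overwrite of an existing
record normally needs no new blocks, so many applications never check for
ENOSPC there. With a snapshot active, the overwrite may have to allocate
new blocks. Below is a minimal sketch of an application that treats ENOSPC
on overwrite as a normal outcome. overwrite_record() is a hypothetical
helper, and whether the error surfaces at write or at fsync time is an
assumption that depends on the filesystem.

```python
import errno
import os


def overwrite_record(fd, offset, payload):
    """Overwrite bytes in place at `offset`.

    Returns True on success and False if the filesystem reported ENOSPC.
    Any other OSError is re-raised.
    """
    view = memoryview(payload)
    try:
        # pwrite() may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.pwrite(fd, view, offset)
            view = view[written:]
            offset += written
        # Delayed allocation can postpone the error until writeback,
        # so fsync() is kept inside the same try block.
        os.fsync(fd)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise
    return True
```

The caller can then fail the transaction cleanly, or ask an administrator
to free space, instead of assuming the update cannot fail.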

>
> see the note above about the need to be able to remove snapshots when you
> are out of space.
>
> since snapshots tend to be small compared to the filesystems they protect
> (not in all cases, but if you are covering the entire system with one
> snapshot that would be the way to bet), having the ability to put the
> snapshot metadata off on a smaller/faster disk would be helpful.

Helpful for which workload?
For reading from snapshots? Yes, that would be faster.
For writing to the filesystem? I demonstrated that the performance
overhead is near zero.

>
> having the ability to snapshot just specific files/directories would be a
> killer feature IMHO

I agree with that, but I don't think ext4 will be able to provide
that to the full extent.

>
> David Lang
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2011-06-09 10:17:41

by Lukas Czerner

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, 9 Jun 2011, Amir G. wrote:

> On Thu, Jun 9, 2011 at 11:13 AM, <[email protected]> wrote:
> > On Thu, 9 Jun 2011, Amir G. wrote:
> >
> >> On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
> >>>
> >>> On Thu, 9 Jun 2011, Yongqiang Yang wrote:
> >>>
> >>>> On Thu, Jun 9, 2011 at 11:18 AM, Amir G.
> >>>> <[email protected]> wrote:
> >>>>>
> >>>>> On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]>
> >>>>> wrote:
> >>>
> >>> You see, when it comes to the full fs snapshots I am not convinced that
> >>> it is *very* useful, yes it might have some users, but you can always
> >>> take the safe way and do lvm snapshots (or better use the new multisnap)
> >>> for backup, without need to modify stable filesystem code.
> >>>
> >>
> >> You think like a developer. Try talking to some sys admins.
> >> Especially ones that worked with Solaris/ZFS or NetApp.
> >> See what they think about snapshots and about the LVM alternative...
> >> Snapshots have addictive qualities. Once you've used them, you can't
> >> go back to not having them.
> >> Imagine how people used to live before the 'Undo' button and imagine
> >> that your employer forced you to use an editor without an Undo button.
> >> This is the kind of feedback I got from sys admins that moved from Solaris
> >> to Linux.
> >
> > as a sysadmin, it's a _wonderful_ tool to have for any system that has
> > people editing/saving files on directly.
>
> Thank you david. Finally some positive feedback from the people
> for whom the feature is intended for :-)

No one is arguing about the advantages of snapshots. I think that we are
all clear on this. Snapshots are useful.

>
> >
> >>
> >>> Also, I do not buy the whole argument of "not have to create separate
> >>> disk
> >>> space for snapshot". It is actually better for sysadmins, because you
> >>> have perfect control on what is going on, how much space is used for
> >>> your snapshots and how much is used by your data. You can always easily
> >>> extend the snapshot volume, or let it die silently when it is too old
> >>> and too big.
> >>>
> >>
> >> Seriously, Lukas, talk to sys admins.
> >> Letting the snapshot die silently is the worst possible thing that a
> >> snapshots
> >> implementation can do (for long lived snapshots).
> >
> > that depends on the site policy.
> >
> > sometimes it is better to lose snapshots than to run out of disk space and
> > halt the system, sometimes you would rather halt the system.
> >
> > the policy of what happens when you run out of space should not be a kernel
> > decision, the desired behavior varies far too much.
> >
> > this includes being able to say things like "I want to always have 10% of my
> > disk allocated to snapshots, but if there's more free space, go ahead and
> > use it, but always keep at least 10% of the disk free so that you don't have
> > to halt new writes while you clear space"
> >
> > or
> >
> > "if you run out of space, try and keep the oldest snapshot and the newest
> > snapshot, delete other snapshots in between before touching either of these"
> >
>
> I fully agree.
> AFAIK, there is no user space tool to manage snapshots to this level for Linux.
> The only snapshot manager I know about is snapper:
> http://en.opensuse.org/Portal:Snapper, which we are working on adding
> ext4 snapshots support to.
> Snapper does not have the free space based policy to the best of my knowledge,
> but it could be improved to monitor free disk space.
>
> A tool like that does not need any further kernel changes from
> ext4 and btrfs to implement the policies suggested above.
> However, with LVM snapshots, some of these policies (like use whatever space you
> have free in the filesystem) are simply not possible.

And why is that? With LVM you can shrink or extend volumes at will, so I
do not think this is a problem at all. Moreover, you can always add more
drives and resize your existing volumes onto them.

>
>
> >>> How does it actually work on ext4 snapshots ? When you're going to
> >>> rewrite a file, you will never know how much disk space it'll take in
> >>> advance, am I right ? Is the filesystem accounting for the snapshot size
> >>> as well ? or is it hidden ?
> >>
> >> It's not hidden, it's accounted for as a regular file (usually owned by
> >> root).
> >> You need a bit of scripting to gather the disk space used by snapshots
> >> (du).
> >
> > the worst case when you re-write a file is that it will take the full amount
> > of space that the file currently takes (as if you wrote a new copy of the
> > file and some process had a filehandle open on the old copy, preventing the
> > space from being reclaimed, so it's far from being a new problem)
>
> No. it's a new problem.
> When you have a large db, which does random writes to an exiting db file,
> it does not expect ENOSPC, when updating an existing record or index.
> Only by keeping enough free disk space in the system at all times, can you
> avoid this kind of problems.

You can very well avoid this kind of problem when you separate
filesystem and snapshots, and that is what LVM can do easily.

>
> >
> > see the note above about the need to be able to remove snapshots when you
> > are out of space.
> >
> > since snapshots tend to be small compared to the filesystems they protect
> > (not in all cases, but if you are covering the entire system with one
> > snapshot that would be the way to bet), having the ability to put the
> > snapshot metadata off on a smaller/faster disk would be helpful.

Very easy to do with dm-multisnap.

>
> Helpful for which workload?
> For reading from snapshots? Yes, that would be faster.
> For writing to the filesystem? I demonstrated that the performance
> overhead is near zero.
>
> >
> > having the ability to snapshot just specific files/directories would be a
> > killer feature IMHO
>
> I agree to that, but I don't think the ext4 will be able to provide
> that to the full extent.

And that is a huge drawback for fs level snapshots.

>
> >
> > David Lang
> >
>

2011-06-09 10:54:16

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, Jun 9, 2011 at 11:46 AM, Lukas Czerner <[email protected]> wrote:
> On Thu, 9 Jun 2011, Amir G. wrote:
>
>> On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
>> > On Thu, 9 Jun 2011, Yongqiang Yang wrote:
>> >
>> >> On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
>> >> > On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
>> >> >>> But I do understand the difference. And also, when it comes to fs level
>> >> >>> snapshotting I would suspect that it would do something we can not do
>> >> >>> with the current solutions, for example per-file or per-directory snapshots,
>> >> >>> can ext4 snapshots do that ?
>> >> >> Hi Lukas,
>> >> >>
>> >> >> I noticed that there is no answer to this question in the thread. I
>> >> >
>> >> > I think I answered this question with No it can't ;-)
>> >> I think this can be implemented easily by chattr and adding check in
>> >> should_snapshot() or should_move_data().
>> >>
>> >> And I thought Lukas was focusing on whether ext4-snapshots can do this
>> >> easily. So I said YES :-)
>> >
>> > Cool, finally something interesting :). So, how it'll work ? Does that
>> > require any format changes again:) ? Can you exclude the whole root and
>> > then selectively pick the directories or files you are interested in ?
>>
>> The design is actually very simple and not as powerful as you
>> probably desire.
>> I hate to get into the design of future features, when we haven't
>> even ACKed the current feature yet, but since you're the only one
>> who did any review, I owe you that much ;-)
>
> Thanks Amir!
>
> You have to understand that I am still not convinced that ext4 snapshot
> in its current state is really what we want to have in ext4. Especially
> given the very basic features it provides, without any knowledge on how
> it can be extended (but you're slowly providing that information, so
> thanks for that). And especially facing the new dm-multisnap, I really
> wonder if it is worth it.

Did you not see my post on LVM vs. Ext4 snapshots?
https://lkml.org/lkml/2011/6/8/296
dm-multisnap is much better than dm-snap, but it's not perfect.
And ext4 snapshots aren't perfect either, but they do bring some
new interesting options for sys admins.

>
> If we want filesystem level snapshotting we can try to do it right with
> all the benefits that snapshots on that level brings. But what I see
> now, is not even remotely the case. And I have the feeling that all the
> features that might be interesting for snapshotting at file system
> level, are just a hack and not inherent from the design. But that is
> probably because your goal was to snapshot the whole filesystem for the
> backup purposes, but that's not what I would expect from fs level
> snapshots. I really hope you understand my point.
>

I think I understand the point. The reason that ext4 snapshots are
less powerful than, say, btrfs snapshots, is not because of my design,
it is because I was building on top of a 20-year-old on-disk format (ext2), which
was extended 2 times already, but remained mostly backwards compatible.
There is only so much you can do without block reference counts and this
is all that I was trying to do.


>>
>> To exclude a file from snapshot it needs to have the NOCOW_FL flag.
>> Ironically, btrfs have already added that flag in parallel to me (for the
>> same purpose) so the flag is already reserved in the code :-)
>>
>> To avoid some transition issues and keep it really simple,
>> I disallow changing the NOCOW_FL
>> for regular file and only allow to change it for directories.
>> The NOCOW_FL is inherited from the parent directory,
>> so setting/clearing the flag on a directory means:
>> "All files/subdirs will be created excluded/not-excluded from now on".
>>
>> Inside the snapshot image, excluded directories, which are not really
>> excluded, show normally, but excluded files are shown with zero length,
>> because making the files disappear is hard, but their blocks may have already
>> been reused, so we cannot allow access to them.
>>
>> >
>> > How does rollback work with ext4 snapshots ? Can you selectively roll
>> > back one file, or the whole directory subtree even when you're
>> > snapshotting more ?
>>
>> So there is actually no inherent "rollback" feature, not for a file/dir
>> and not for the entire fs.
>> It's a drawback of ext4 snapshots, but hey, cp/rsync from snapshot
>> still works for file/dir ;-)
>> As for full "fs" rollback. A revert tool has been developed (by students),
>> which requires an external storage to export the "revert patch".
>> This tool is going to be enhanced to use LVM snapshot storage
>> and LVM --merge option to implement ext4 "revert to snapshot" with Yum.
>
> And that is the problem. Because at this level you should be able to do
> it without very much trouble, because being at file system level you
> should have all the information. Do not get me wrong, I am not saying
> that this is easy, but it should be "from design". Exporting the
> "revert patch" to the external storage, or exporting snapshot to LVM
> format to be able to merge it...that is all just hacks, because the
> design itself does not count with that possibility.
>

The design makes a conscious choice to keep snapshots *inside*
the filesystem.
This is both an advantage (no need to change on-disk format and checking tools)
and a disadvantage (you cannot mount a snapshot without mounting the fs first).


>>
>> >
>> > You see, when it comes to the full fs snapshots I am not convinced that
>> > it is *very* useful, yes it might have some users, but you can always
>> > take the safe way and do lvm snapshots (or better use the new multisnap)
>> > for backup, without need to modify stable filesystem code.
>> >
>>
>> You think like a developer. Try talking to some sys admins.
>> Especially ones that worked with Solaris/ZFS or NetApp.
>> See what they think about snapshots and about the LVM alternative...
>> Snapshots have addictive qualities. Once you've used them, you can't
>> go back to not having them.
>> Imagine how people used to live before the 'Undo' button and imagine
>> that your employer forced you to use an editor without an Undo button.
>> This is the kind of feedback I got from sys admins that moved from Solaris
>> to Linux.
>
> Exactly, so if we want fs level snapshots, it should use that
> privilege, not hack its way to do things like rollback or
> excludes+includes. Ext4 was not meant to work that way, nor were your
> snapshots designed to work that way. If we are considering backups only,
> because that is what your ext4 snapshots can provide now, I would prefer
> to use LVM. But yes, we all need to know how the new multisnap works
> out.
>

Why do you keep saying 'backup only'?
There is a huge difference between having long lived snapshots,
like CTERA products have, and temporary snapshots for backup
purposes (for which LVM is adequate).

>>
>>
>> > Also, I do not buy the whole argument of "not have to create separate disk
>> > space for snapshot". It is actually better for sysadmins, because you
>> > have perfect control on what is going on, how much space is used for
>> > your snapshots and how much is used by your data. You can always easily
>> > extend the snapshot volume, or let it die silently when it is too old
>> > and too big.
>> >
>>
>> Seriously, Lukas, talk to sys admins.
>> Letting the snapshot die silently is the worst possible thing that a snapshots
>> implementation can do (for long lived snapshots).
>
> Oh no, you misunderstood. Even with your snapshots you'll have to delete
> old snapshots someday, because otherwise you'll run out of space. With
> LVM, however, you have pre-reserved space for it, so even if your snapshot
> volume gets full, it does not affect your filesystem whatsoever. And,
> as an administrator, you can decide whether to extend the snapshot volume
> to let it live longer, or just let it be and it will die eventually.
>
> And, as far as I know, the new multisnap will notify the admin when the
> snapshot volume approaches the watermark, the same way that, for example,
> thinly provisioned storage would do. But again, with your snapshots it
> will give you ENOSPC when the snapshots grow too big, and at the end
> of the day, you need to create data to be able to back it up :), so having
> snapshots separate from your fs volume makes sense.
>

Yes, one day you will run out of space, and you will get a warning
before that if you are using a CTERA product.
You won't be getting the warning from the kernel snapshots code, but from
a disk space monitoring daemon.
And when you get the warning (or ENOSPC if you ignored the warnings)
you will have 2 options:
1. add disks and resize the fs
2. delete some snapshots

When using a CTERA product, you do not have to pre-partition your disk
space between fs and snapshots - they are thinly provisioned, which
is a big advantage for a product which does not require being an IT expert to
operate it.


>>
>>
>> > How does it actually work on ext4 snapshots ? When you're going to
>> > rewrite a file, you will never know how much disk space it'll take in
>> > advance, am I right ? Is the filesystem accounting for the snapshot size
>> > as well ? or is it hidden ?
>>
>> It's not hidden, it's accounted for as a regular file (usually owned by root).
>> You need a bit of scripting to gather the disk space used by snapshots (du).
>>
>> In ANY snapshots implementation, you can get ENOSPC on operations,
>> which traditionally could not produce this error.
>> This statement is also true for thin provisioning implementations.
>> The question is how the implementation handles these situations.
>>
>> What I came to realize on LSF, is that my implementation is the only
>> one (of LVM and btrfs) that tries to deal with the ENOSPC issue and
>> does a good job most of the time.
>>
>> I deal with it by reserving space for metadata COW on snapshot
>> take, so if a future ENOSPC during metadata COW is possible,
>> snapshot take will fail with ENOSPC.
>>
>> As for ENOSPC during regular file rewrite, that's not such a big problem.
>> The application simply gets ENOSPC as if the file was sparse to begin
>> with. It may not be pleasant if the application have fallocated the space
>> and used mmap/close without msync...
>> The only way I see around this issue is reserving space on mmap time
>> (and returning ENOSPC at that time), but again, this issue is shared
>> with btrfs, but is easier to fix (I think) with ext4 snapshots.
>
> Yes, I do understand that ext4 snapshots are doing well in that aspect,
> but as I said, having snapshots separate from your file system gives
> you the advantage of not running into ENOSPC on your file system until you
> really fill it with data.

It should be, as David wrote, a choice for the sys admin.
Because ext4 snapshots are thinly provisioned, you can always say
"use 10% for snapshots and 90% for data" (like you would with LVM),
but you cannot say "reserve 10% for snapshots, 50% for data and the
rest to either" when you administer LVM snapshots.
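
As a rough illustration of the "reserve 10% for snapshots, 50% for data and
the rest to either" policy, here is a minimal Python sketch of the check a
user-space tool might perform before letting one consumer grow. The
may_allocate() function and its parameters are hypothetical; nothing in the
ext4 snapshots code enforces such a split.

```python
def may_allocate(kind, nbytes, snap_used, data_used, total,
                 snap_reserved_ratio=0.10, data_reserved_ratio=0.50):
    """Return True if `kind` ("snapshots" or "data") may grow by `nbytes`
    without eating into space reserved for the other consumer.

    Each consumer is guaranteed its reserved share of `total`; whatever
    is left over can be used by either of them.
    """
    used = {"snapshots": snap_used, "data": data_used}
    if kind not in used:
        raise ValueError("kind must be 'snapshots' or 'data'")
    reserved = {
        "snapshots": int(total * snap_reserved_ratio),
        "data": int(total * data_reserved_ratio),
    }
    other = "data" if kind == "snapshots" else "snapshots"
    # Space the other consumer is still entitled to but has not used yet.
    other_unused_reservation = max(0, reserved[other] - used[other])
    free_after = total - snap_used - data_used - nbytes
    return free_after >= other_unused_reservation
```

A tool built on this would respond to a False result by deleting snapshots
or warning the administrator, since the filesystem itself would simply keep
allocating until ENOSPC.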

You are confusing user functionality with functionality provided by the kernel.
LVM happens to check watermarks in the kernel because of its design.
That doesn't mean that the same thing cannot be accomplished for ext4
snapshots by user tools.
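
A watermark check in user space can be very small. Below is a minimal
sketch using Python's standard shutil.disk_usage(). The check_watermarks()
name and the 10% and 5% thresholds are assumptions for illustration, not
values used by any existing daemon.

```python
import shutil


def check_watermarks(usage, warn_free_ratio=0.10, critical_free_ratio=0.05):
    """Classify free space as "ok", "warn" or "critical".

    `usage` is any object with `total` and `free` attributes in bytes,
    such as the value returned by shutil.disk_usage().
    """
    free_ratio = usage.free / usage.total
    if free_ratio < critical_free_ratio:
        return "critical"
    if free_ratio < warn_free_ratio:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Example: check the root filesystem once; a daemon would poll.
    print(check_watermarks(shutil.disk_usage("/")))
```

A monitoring daemon would call this periodically and notify the
administrator, or trigger snapshot deletion, when the state changes.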


>
> Granted, I have to take a look at the multisnap code, to see what it can
> do and compare it with ext4 snapshots, because really, if it is good
> enough and you will be able to do snapshotting backups as you do with
> your approach, I do not see the reason why to complicate our life in
> ext4.
>

I don't know how you intend to determine if dm-multisnap is 'good enough'.
I don't claim to have the capability myself to determine if ext4 snapshots
are 'good enough'.
I just try to present the technical differences between the 3 solutions
(LVM, ext4, btrfs) and claim that each has its advantages and disadvantages
over the others.
I wish more sys admins and end users would provide feedback, though I don't
know how many of them are following LKML.

Amir.

2011-06-09 13:00:01

by Lukas Czerner

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, 9 Jun 2011, Amir G. wrote:

> On Thu, Jun 9, 2011 at 11:46 AM, Lukas Czerner <[email protected]> wrote:
> > On Thu, 9 Jun 2011, Amir G. wrote:
> >
> >> On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
> >> > On Thu, 9 Jun 2011, Yongqiang Yang wrote:
> >> >
> >> >> On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
> >> >> > On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
> >> >> >>> But I do understand the difference. And also, when it comes to fs level
> >> >> >>> snapshotting I would suspect that it would do something we can not do
> >> >> >>> with the current solutions, for example per-file or per-directory snapshots,
> >> >> >>> can ext4 snapshots do that ?
> >> >> >> Hi Lukas,
> >> >> >>
> >> >> >> I noticed that there is no answer to this question in the thread. I
> >> >> >
> >> >> > I think I answered this question with No it can't ;-)
> >> >> I think this can be implemented easily by chattr and adding check in
> >> >> should_snapshot() or should_move_data().
> >> >>
> >> >> And I thought Lukas was focusing on whether ext4-snapshots can do this
> >> >> easily. So I said YES :-)
> >> >
> >> > Cool, finally something interesting :). So, how it'll work ? Does that
> >> > require any format changes again:) ? Can you exclude the whole root and
> >> > then selectively pick the directories or files you are interested in ?
> >>
> >> The design is actually very simple and not as powerful as you
> >> probably desire.
> >> I hate to get into the design of future features, when we haven't
> >> even ACKed the current feature yet, but since you're the only one
> >> who did any review, I owe you that much ;-)
> >
> > Thanks Amir!
> >
> > You have to understand that I am still not convinced that ext4 snapshot
> > in its current state is really what we want to have in ext4. Especially
> > given the very basic features it provides, without any knowledge on how
> > it can be extended (but you're slowly providing that information, so
> > thanks for that). And especially facing the new dm-multisnap, I really
> > wonder if it is worth it.
>
> Did you not see my post on LVM vs. Ext4 snapshots?
> https://lkml.org/lkml/2011/6/8/296
> dm-multisnap is much better than dm-snap, but it's not perfect.
> And ext4 snapshots aren't perfect either, but they do bring some
> new interesting options for sys admins.
>
> >
> > If we want filesystem level snapshotting we can try to do it right with
> > all the benefits that snapshots on that level brings. But what I see
> > now, is not even remotely the case. And I have the feeling that all the
> > features that might be interesting for snapshotting at file system
> > level, are just a hack and not inherent from the design. But that is
> > probably because your goal was to snapshot the whole filesystem for the
> > backup purposes, but that's not what I would expect from fs level
> > snapshots. I really hope you understand my point.
> >
>
> I think I understand the point. The reason that ext4 snapshots are
> less powerful than, say, btrfs snapshots, is not because of my design,
> it is because I was building on top of a 20-year-old on-disk format (ext2), which
> was extended 2 times already, but remained mostly backwards compatible.
> There is only so much you can do without block reference counts and this
> is all that I was trying to do.

And I can imagine it works well enough. But given that we have a better,
more generic solution, which does not require hacking a stable filesystem,
I am becoming more and more opposed to merging ext4 snapshots. And if
anyone wishes to have some fancy fs level snapshotting features (which
ext4 snapshots cannot provide for the reasons you have pointed out), they
can always turn to btrfs, which has been designed that way, unlike ext4.

>
>
> >>
> >> To exclude a file from snapshot it needs to have the NOCOW_FL flag.
> >> Ironically, btrfs have already added that flag in parallel to me (for the
> >> same purpose) so the flag is already reserved in the code :-)
> >>
> >> To avoid some transition issues and keep it really simple,
> >> I disallow changing the NOCOW_FL
> >> for regular file and only allow to change it for directories.
> >> The NOCOW_FL is inherited from the parent directory,
> >> so setting/clearing the flag on a directory means:
> >> "All files/subdirs will be created excluded/not-excluded from now on".
> >>
> >> Inside the snapshot image, excluded directories, which are not really
> >> excluded, show normally, but excluded files are shown with zero length,
> >> because making the files disappear is hard, but their blocks may have already
> >> been reused, so we cannot allow access to them.
> >>
> >> >
> >> > How does rollback work with ext4 snapshots ? Can you selectively roll
> >> > back one file, or the whole directory subtree even when you're
> >> > snapshotting more ?
> >>
> >> So there is actually no inherent "rollback" feature, not for a file/dir
> >> and not for the entire fs.
> >> It's a drawback of ext4 snapshots, but hey, cp/rsync from snapshot
> >> still works for file/dir ;-)
> >> As for full "fs" rollback. A revert tool has been developed (by students),
> >> which requires an external storage to export the "revert patch".
> >> This tool is going to be enhanced to use LVM snapshot storage
> >> and LVM --merge option to implement ext4 "revert to snapshot" with Yum.
> >
> > And that is the problem. Because at this level you should be able to do
> > it without very much trouble, because being at file system level you
> > should have all the information. Do not get me wrong, I am not saying
> > that this is easy, but is should be "from design". Exporting the
> > "revert patch" to the external storage, or exporting snapshot to LVM
> > format to be able to merge it...that is all just hacks, because the
> > design itself does not count with that possibility.
> >
>
> The design makes a conscious choice to keep snapshots *inside*
> the filesystem.
> This is both an advantage (no need to change on-disk format and checking tools)
> and disadvantage (you cannot mount a snapshot without mounting the fs first).

And that's where ext4 snapshots lose. With dm you do not need to change
the on-disk format, the tools or the filesystem itself, and you can mount
the snapshot without also mounting the origin.

>
>
> >>
> >> >
> >> > You see, when it comes to the full fs snapshots I am not convinced that
> >> > it is *very* useful, yes it might have some users, but you can alway
> >> > take the safe way and do lvm snapshots (or better use the new multisnap)
> >> > for backup, without need to modify stable filesystem code.
> >> >
> >>
> >> You think like a developer. Try talking to some sys admins.
> >> Especially ones that worked with Solaris/ZFS or NetApp.
> >> See what they think about snapshots and about the LVM alternative...
> >> Snapshots have addictive qualities. Ones you've used them, you can't
> >> go back to not having them.
> >> Imagine how people used to live before the 'Undo' button and imagine
> >> that your employer forced you to use an editor without an Undo button.
> >> This is the kind of feedback I got from sys admins that moved from Solaris
> >> to Linux.
> >
> > Exactly, so if we want fs level snapshots, it should use that
> > privilege no hack its way to do things like roll back, or
> > excludes+includes. Ext4 was not meant to work that way, nor was your
> > snapshots designed to work that way. If we are considering backups only,
> > because that is what you ext4 snaphosts can provide now, I would prefer
> > to use LVM. But yes, we all need to know how the new multisnap works
> > out.
> >
>
> Why do you keep saying 'backup only'?
> There is a huge difference between having long lived snapshots,
> like CTERA products have, and temporary snapshot for backup
> purpose (for which LVM is adequate).

dm's multisnapshots are designed to be long lived and can be used as
such.

>
> >>
> >>
> >> > Also, I do not buy the whole argument of "not have to create separate disk
> >> > space for snapshot". It is actually better for sysadmins, because you
> >> > have perfect control on what is going on, how much space is used for
> >> > your snapshots and how much is used by your data. You can always easily
> >> > extend the snapshot volume, or let it die silently when it is too old
> >> > and too big.
> >> >
> >>
> >> Seriously, Lukas, talk to sys admins.
> >> Letting the snapshot die silently is the worst possible thing that a snapshots
> >> implementation can do (for long lived snapshots).
> >
> > Oh, no you misunderstood. Even with your snapshots you'll have to delete
> > old snapshots someday, because otherwise you'll run out of space. With
> > LVM however, you have prereserved space for it, so even if your snapshot
> > volume gets full, it does not affect your filesystem what so ever. And,
> > as a administrator, you can decide whether to extend the snapshot volume
> > to let it live longer, or just let it be and it will die eventually.
> >
> > And, as far as I know, the new multisnap will notify the admin when the
> > snapshot volume approaches the watermark the same way that for example
> > thinly provisioned storage would do. But again, with your snapshots it
> > will give you ENOSPC when the snapshot grow too big, and at the end
> > of the day, you need to create data to be able to backup it:), so having
> > snapshots separate from your fs volume makes sense.
> >
>
> Yes, one day you will run out of space and will be getting a warning
> before that, if you are using a CTERA product.
> You won't be getting the warning from the kernel snapshots code, but from
> disk space monitoring daemon.
> And when you get the warning (or ENOSPC if you ignored the warnings)
> you will have 2 options:
> 1. add disks and resize the fs
> 2. delete some snapshots
>
> When using a CTERA product, you not have to pre-partition your disk
> space between fs and snapshots - they are thinly provisioned, which
> is a big advantage for a product which does not require being an IT expert to
> operate it.

The dm multisnapshot code uses thin provisioning; you just have to pick
the volume and that's it.

>
>
> >>
> >>
> >> > How does it actually work on ext4 snapshots ? When you're going to
> >> > rewrite a file, you will never know how much disk space it'll take in
> >> > advance, am I right ? Is the filesystem accounting for the snapshot size
> >> > as well ? or is it hidden ?
> >>
> >> It's not hidden, it's accounted for as a regular file (usually owned by root).
> >> You need a bit of scripting to gather the disk space used by snapshots (du).
> >>
> >> In ANY snapshots implementation, you can get ENOSPC on operations,
> >> which traditionally could not produce this error.
> >> This statement is also true for thin provisioning implementations.
> >> The question is how the implementation handles these situations.
> >>
> >> What I came to realize on LSF, is that my implementation is the only
> >> one (of LVM and btrfs) that tries to deal with the ENOSPC issue and
> >> does a good job most of the time.
> >>
> >> I deal with it by reserving space for metadata COW on snapshot
> >> take, so if a future ENOSPC during metadata COW is possible,
> >> snapshot take will fail with ENOSPC.
> >>
> >> As for ENOSPC during regular file rewrite, that's not such a big problem.
> >> The application simply gets ENOSPC as if the file was sparse to begin
> >> with. It may not be pleasant if the application have fallocated the space
> >> and used mmap/close without msync...
> >> The only way I see around this issue is reserving space on mmap time
> >> (and returning ENOSPC at that time), but again, this issue is shared
> >> with btrfs, but is easier to fix (I think) with ext4 snapshots.
> >
> > Yes, I do understand that ext4 snaphosts are doing well in that aspect,
> > but as I said, having snapshots separate from your file system gives
> > you advantage of not running into ENOSPC on your file system until you
> > really fill it with data.
>
> It should be, as David wrote, a choice to the sys admin.
> Because ext4 snapshots are thinly provisioned, you can always say
> "use 10% for snapshots and 90% for data" (like you would with LVM),
> But you cannot say "reserve 10% for snapshots 50% for data and the
> rest to either" when you administer LVM snapshots.

I am not sure how this can be managed with the multisnap target, but I do
not see a reason why it cannot be done, given that both data and
snapshots can be allocated from within the same pool.
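
To make the provisioning policy under discussion concrete ("reserve 10% for
snapshots, 50% for data and the rest to either"), here is a minimal Python
sketch of a shared pool with guaranteed minimums. It is an illustration only:
the Pool class and its fields are hypothetical and correspond neither to the
ext4 snapshot patches nor to any dm target interface.

```python
class Pool:
    """Toy model of one block pool shared by data and snapshots."""

    def __init__(self, total, min_data, min_snap):
        assert min_data + min_snap <= total
        self.total = total
        self.guaranteed = {"data": min_data, "snap": min_snap}
        self.used = {"data": 0, "snap": 0}

    def allocate(self, who, blocks):
        """Grant the request only if the other consumer's guarantee stays available."""
        other = "snap" if who == "data" else "data"
        # Blocks still owed to the other consumer under its guarantee.
        owed = max(0, self.guaranteed[other] - self.used[other])
        free = self.total - self.used["data"] - self.used["snap"]
        if blocks > free - owed:
            return False  # the caller would see ENOSPC
        self.used[who] += blocks
        return True


pool = Pool(total=100, min_data=50, min_snap=10)
print(pool.allocate("data", 90))  # True: data may grow into the shared 40%
print(pool.allocate("data", 1))   # False: the last 10 blocks are held for snapshots
print(pool.allocate("snap", 10))  # True: the snapshot guarantee is honoured
```

Whether a given snapshot implementation can enforce such minimums is exactly
the open question here; the sketch only shows that the policy itself is simple
once both consumers draw from one pool.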

>
> You are confusing user functionality with functionality provided by the kernel.
> LVM happens to check water marks in the kernel because of it's design.
> That doesn't mean that the same thing cannot be accomplished for ext4
> snapshots by user tools.

That was not my point; I was simply saying that it is not an advantage of
ext4 snapshots.

>
>
> >
> > Granted, I have to take a look at the multisnap code, to see what it can
> > do and compare it with ext4 snapshots, because really, if it is good
> > enough and you will be able to do snapshotting backups as you do with
> > your approach, I do not see the reason why to complicate our life in
> > ext4.
> >
>
> I don't know how you intend to determine if dm-multisnap is 'good enough'.
> I don't claim to have the capability myself to determine if ext4 snapshots
> are 'good enough'.
> I just try to present the technical differences between the 3 solutions
> (LVM,ext4,btrfs) and claim that each have their advantages and disadvantages
> over others.
> I wish more sys admins and end users would provide feedback, though I don't
> know how many of them are following LKML.

I do. When it can do long lived snapshots without any obvious headaches,
it is good enough. Your only counter-argument was that LVM snapshotting
is slow, which is not much of an argument now that we have multisnap
almost ready. I am not even talking about features, because clearly
multisnap has a superset of the features that ext4 snapshots have. And no,
I am not counting per-file or per-directory snapshotting, because clearly
those are just hacks and it was not designed that way.

-Lukas
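
The NOCOW_FL rules quoted above (the flag is inherited from the parent
directory at creation time, may be changed on directories only, and affects
only files created "from now on") can be modelled in a few lines of Python.
This is a sketch of the described semantics, not code from the patch series:
the Node class, its method names and the choice of PermissionError are all
hypothetical.

```python
class Node:
    """Toy inode: a directory or regular file carrying a NOCOW (excluded) flag."""

    def __init__(self, name, is_dir, parent=None):
        self.name = name
        self.is_dir = is_dir
        # The flag is inherited from the parent directory at creation time.
        self.nocow = parent.nocow if parent is not None else False
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def set_nocow(self, value):
        """Change the flag; allowed for directories only, as described."""
        if not self.is_dir:
            # The real error code is not stated in the thread; this is an assumption.
            raise PermissionError("NOCOW_FL cannot be changed on a regular file")
        # Existing children keep their flag; only future creations are affected.
        self.nocow = value


root = Node("/", is_dir=True)
old = Node("old.log", is_dir=False, parent=root)
root.set_nocow(True)
new = Node("new.log", is_dir=False, parent=root)
print(old.nocow, new.nocow)  # False True
```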

2011-06-10 07:06:59

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, Jun 9, 2011 at 3:59 PM, Lukas Czerner <[email protected]> wrote:
> On Thu, 9 Jun 2011, Amir G. wrote:
>
>> On Thu, Jun 9, 2011 at 11:46 AM, Lukas Czerner <[email protected]> wrote:
>> > On Thu, 9 Jun 2011, Amir G. wrote:
>> >
>> >> On Thu, Jun 9, 2011 at 9:50 AM, Lukas Czerner <[email protected]> wrote:
>> >> > On Thu, 9 Jun 2011, Yongqiang Yang wrote:
>> >> >
>> >> >> On Thu, Jun 9, 2011 at 11:18 AM, Amir G. <[email protected]> wrote:
>> >> >> > On Thu, Jun 9, 2011 at 4:59 AM, Yongqiang Yang <[email protected]> wrote:
>> >> >> >>> But I do understand the difference. And also, when it comes to fs level
>> >> >> >>> snapshotting I would suspect that it would do something we can not do
>> >> >> >>> with the current solutions, for example per-file or per-directory snapshots,
>> >> >> >>> cat ext4 snapshots do that ?
>> >> >> >> Hi Lukas,
>> >> >> >>
>> >> >> >> I noticed that there is no answer to this question in the thread. I
>> >> >> >
>> >> >> > I think I answered this question with No it can't ;-)
>> >> >> I think this can be implemented easily by chattr and adding check in
>> >> >> should_snapshot() or should_move_data().
>> >> >>
>> >> >> And I thought Lukas are focusing on if ext4-snapshots can do this
>> >> >> easily. So i said YES:-)
>> >> >
>> >> > Cool, finally something interesting :). So, how it'll work ? Does that
>> >> > require any format changes again:) ? Can you exclude the whole root and
>> >> > then selectively pick the directories or files you are interested in ?
>> >>
>> >> The design is actually very simple and not as powerful as you
>> >> probably desire.
>> >> I hate to get into the design of future features, when we haven't
>> >> even ACKed the current feature yet, but since you're the only one
>> >> did any review, I owe you that much ;-)
>> >
>> > Thanks Amir!
>> >
>> > You have to understand that I am still not convinced that ext4 snapshot
>> > in its current state is really what we want to have in ext4. Especially
>> > given the very basic features it provides, without any knowledge on how
>> > it can be extended (but you're slowly providing that information, so
>> > thanks for that). And especially facing the new dm-multisnap, I really
>> > wonder if it is worth it.
>>
>> Did you not see my post on LVM vs. Ext4 snapshots?
>> https://lkml.org/lkml/2011/6/8/296
>> dm-multisnap is much better than dm-snap, but it's not perfect.
>> And ext4 snapshots aren't perfect either, but they do bring some
>> new interesting options for sys admins.
>>
>> >
>> > If we want filesystem level snapshotting we can try to do it right with
>> > all the benefits that snapshots on that level brings. But what I see
>> > now, is not even remotely the case. And I have the feeling that all the
>> > features that might be interesting for snapshotting at file system
>> > level, are just a hack and not inherent from the design. But that is
>> > probably because your goal was to snapshot the whole filesystem for the
>> > backup purposes, but that's not what I would expect from fs level
>> > snapshots. I really hope you understand my point.
>> >
>>
>> I think I understand the point. The reason that ext4 snapshots are
>> less powerful then, say, btrfs snapshots, is not because of my design,
>> it is because I was building on top a 20 year old on-disk format (ext2), which
>> was extended 2 times already, but remained mostly backwards compatible.
>> There is only so much you can do without block reference counts and this
>> is all that I was trying to do.
>
> And I can imagine it works well enough. But given that we have better,
> more generic solution, which does not require hacking stable filesystem
> I am becoming more and more against ext4 snapshots to be merged. And if
> anyone wishes to have some fancy fs level snapshoting features (which
> ext4 snapshots can no provide from the resons you have pointed out), you
> can always turn to btrfs, which has been designed that way unlike ext4.
>
>>
>>
>> >>
>> >> To exclude a file from snapshot it needs to have the NOCOW_FL flag.
>> >> Ironically, btrfs have already added that flag in parallel to me (for the
>> >> same purpose) so the flag it is already reserved in the code :-)
>> >>
>> >> To avoid some transition issues and keep it really simple,
>> >> I disallow changing the NOCOW_FL
>> >> for regular file and only allow to change it for directories.
>> >> The NOCOW_FL is inherited from the parent directory,
>> >> so setting/clearing the flag on a directory means:
>> >> "All files/subdirs will be created excluded/not-excluded from now on".
>> >>
>> >> Inside the snapshot image, excluded directories, which are not really
>> >> excluded, show normally, but excluded files are shown with zero length,
>> >> because making the files disappear is hard, but their blocks may have already
>> >> been reused, so we cannot allow access to them.
>> >>
>> >> >
>> >> > How does rollback work with ext4 snapshots ? Can you selectively roll
>> >> > back one file, or the whole directory subtree even when you're
>> >> > snapshotting more ?
>> >>
>> >> So there is actually no inherent "rollback" feature, not for a file/dir
>> >> and not for the entire fs.
>> >> It's a drawback of ext4 snapshots, but hey, cp/rsync from snapshot
>> >> still works for file/dir ;-)
>> >> As for full "fs" rollback. A revert tool has been developed (by students),
>> >> which requires an external storage to export the "revert patch".
>> >> This tool is going to be enhanced to use LVM snapshot storage
>> >> and LVM --merge option to implement ext4 "revert to snapshot" with Yum.
>> >
>> > And that is the problem. Because at this level you should be able to do
>> > it without very much trouble, because being at file system level you
>> > should have all the information. Do not get me wrong, I am not saying
>> > that this is easy, but is should be "from design". Exporting the
>> > "revert patch" to the external storage, or exporting snapshot to LVM
>> > format to be able to merge it...that is all just hacks, because the
>> > design itself does not count with that possibility.
>> >
>>
>> The design makes a conscious choice to keep snapshots *inside*
>> the filesystem.
>> This is both an advantage (no need to change on-disk format and checking tools)
>> and disadvantage (you cannot mount a snapshot without mounting the fs first).
>
> And thats where ext4 snapshots loose. With dm you do not need to change
> on-disk format, tools or filesystem itself, and you can mount the snapshot
> without also mounting the origin.
>
>>
>>
>> >>
>> >> >
>> >> > You see, when it comes to the full fs snapshots I am not convinced that
>> >> > it is *very* useful, yes it might have some users, but you can alway
>> >> > take the safe way and do lvm snapshots (or better use the new multisnap)
>> >> > for backup, without need to modify stable filesystem code.
>> >> >
>> >>
>> >> You think like a developer. Try talking to some sys admins.
>> >> Especially ones that worked with Solaris/ZFS or NetApp.
>> >> See what they think about snapshots and about the LVM alternative...
>> >> Snapshots have addictive qualities. Ones you've used them, you can't
>> >> go back to not having them.
>> >> Imagine how people used to live before the 'Undo' button and imagine
>> >> that your employer forced you to use an editor without an Undo button.
>> >> This is the kind of feedback I got from sys admins that moved from Solaris
>> >> to Linux.
>> >
>> > Exactly, so if we want fs level snapshots, it should use that
>> > privilege no hack its way to do things like roll back, or
>> > excludes+includes. Ext4 was not meant to work that way, nor was your
>> > snapshots designed to work that way. If we are considering backups only,
>> > because that is what you ext4 snaphosts can provide now, I would prefer
>> > to use LVM. But yes, we all need to know how the new multisnap works
>> > out.
>> >
>>
>> Why do you keep saying 'backup only'?
>> There is a huge difference between having long lived snapshots,
>> like CTERA products have, and temporary snapshot for backup
>> purpose (for which LVM is adequate).
>
> dm's multisnapshots are designed to be long lived and can be used as
> such.
>
>>
>> >>
>> >>
>> >> > Also, I do not buy the whole argument of "not have to create separate disk
>> >> > space for snapshot". It is actually better for sysadmins, because you
>> >> > have perfect control on what is going on, how much space is used for
>> >> > your snapshots and how much is used by your data. You can always easily
>> >> > extend the snapshot volume, or let it die silently when it is too old
>> >> > and too big.
>> >> >
>> >>
>> >> Seriously, Lukas, talk to sys admins.
>> >> Letting the snapshot die silently is the worst possible thing that a snapshots
>> >> implementation can do (for long lived snapshots).
>> >
>> > Oh, no you misunderstood. Even with your snapshots you'll have to delete
>> > old snapshots someday, because otherwise you'll run out of space. With
>> > LVM however, you have prereserved space for it, so even if your snapshot
>> > volume gets full, it does not affect your filesystem what so ever. And,
>> > as a administrator, you can decide whether to extend the snapshot volume
>> > to let it live longer, or just let it be and it will die eventually.
>> >
>> > And, as far as I know, the new multisnap will notify the admin when the
>> > snapshot volume approaches the watermark the same way that for example
>> > thinly provisioned storage would do. But again, with your snapshots it
>> > will give you ENOSPC when the snapshot grow too big, and at the end
>> > of the day, you need to create data to be able to backup it:), so having
>> > snapshots separate from your fs volume makes sense.
>> >
>>
>> Yes, one day you will run out of space and will be getting a warning
>> before that, if you are using a CTERA product.
>> You won't be getting the warning from the kernel snapshots code, but from
>> disk space monitoring daemon.
>> And when you get the warning (or ENOSPC if you ignored the warnings)
>> you will have 2 options:
>> 1. add disks and resize the fs
>> 2. delete some snapshots
>>
>> When using a CTERA product, you not have to pre-partition your disk
>> space between fs and snapshots - they are thinly provisioned, which
>> is a big advantage for a product which does not require being an IT expert to
>> operate it.
>
> dm multisnapshot code is using thin provisioning, you just have to pick
> the volume and that's it.
>
>>
>>
>> >>
>> >>
>> >> > How does it actually work on ext4 snapshots ? When you're going to
>> >> > rewrite a file, you will never know how much disk space it'll take in
>> >> > advance, am I right ? Is the filesystem accounting for the snapshot size
>> >> > as well ? or is it hidden ?
>> >>
>> >> It's not hidden, it's accounted for as a regular file (usually owned by root).
>> >> You need a bit of scripting to gather the disk space used by snapshots (du).
>> >>
>> >> In ANY snapshots implementation, you can get ENOSPC on operations,
>> >> which traditionally could not produce this error.
>> >> This statement is also true for thin provisioning implementations.
>> >> The question is how the implementation handles these situations.
>> >>
>> >> What I came to realize on LSF, is that my implementation is the only
>> >> one (of LVM and btrfs) that tries to deal with the ENOSPC issue and
>> >> does a good job most of the time.
>> >>
>> >> I deal with it by reserving space for metadata COW on snapshot
>> >> take, so if a future ENOSPC during metadata COW is possible,
>> >> snapshot take will fail with ENOSPC.
>> >>
>> >> As for ENOSPC during regular file rewrite, that's not such a big problem.
>> >> The application simply gets ENOSPC as if the file was sparse to begin
>> >> with. It may not be pleasant if the application have fallocated the space
>> >> and used mmap/close without msync...
>> >> The only way I see around this issue is reserving space on mmap time
>> >> (and returning ENOSPC at that time), but again, this issue is shared
>> >> with btrfs, but is easier to fix (I think) with ext4 snapshots.
>> >
>> > Yes, I do understand that ext4 snaphosts are doing well in that aspect,
>> > but as I said, having snapshots separate from your file system gives
>> > you advantage of not running into ENOSPC on your file system until you
>> > really fill it with data.
>>
>> It should be, as David wrote, a choice to the sys admin.
>> Because ext4 snapshots are thinly provisioned, you can always say
>> "use 10% for snapshots and 90% for data" (like you would with LVM),
>> But you cannot say "reserve 10% for snapshots 50% for data and the
>> rest to either" when you administer LVM snapshots.
>
> I am not sure how can this be managed with multisnap target, but I do
> not see a reason why it can not be done, given that both data and
> snapshots can be allocated from within the same pool.
>
>>
>> You are confusing user functionality with functionality provided by the kernel.
>> LVM happens to check water marks in the kernel because of it's design.
>> That doesn't mean that the same thing cannot be accomplished for ext4
>> snapshots by user tools.
>
> That was not my point, I was simply saying that it is not ext4 snapshots
> advantage.
>
>>
>>
>> >
>> > Granted, I have to take a look at the multisnap code, to see what it can
>> > do and compare it with ext4 snapshots, because really, if it is good
>> > enough and you will be able to do snapshotting backups as you do with
>> > your approach, I do not see the reason why to complicate our life in
>> > ext4.
>> >
>>
>> I don't know how you intend to determine if dm-multisnap is 'good enough'.
>> I don't claim to have the capability myself to determine if ext4 snapshots
>> are 'good enough'.
>> I just try to present the technical differences between the 3 solutions
>> (LVM,ext4,btrfs) and claim that each have their advantages and disadvantages
>> over others.
>> I wish more sys admins and end users would provide feedback, though I don't
>> know how many of them are following LKML.
>
> I do. When it can do long lived snapshots without any obvious headaches
> it is good enough. Your only contra argument was that lvm snapshotting
> is slow, which is not that big argument now when we have multisnap
> almost ready. I am not even talking about features, because clearly
> mutlisnap has superset of the features that ext4 does - no I am not
> counting per-file or per-directory snapshotting because clearly those
> are just hacks and it was not designed that way.
>

Hi Lukas,

I am very glad to have you as my reviewer and critic :-)
I am saying that with all honesty, because I know that you are impartial
and have no anti-ext4 agenda.

LVM multisnap does look like a big leap forward, but you should not
be blinded by the promised features before you inspect the implementation,
the same way you are inspecting ext4 snapshots now...

I could suggest that you put your root fs on a QCOW2 file exported as NBD.
That would give you both thin provisioning and snapshots, but you know
perfectly well, that this is not a 'good enough' solution.
I'm not saying that LVM is comparable to a QCOW2 virtual volume.
I'm just saying we (myself included) should carefully examine the alternatives
before making a ruling against one of them.

Amir.
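
The ENOSPC strategy quoted above, reserving space for metadata COW when the
snapshot is taken so that the take itself fails rather than a later metadata
write, can be sketched as follows. The ToyFs class is hypothetical and the
reservation size is a made-up parameter; how the patches actually compute the
reserve is not stated in this thread.

```python
import errno


class ToyFs:
    """Toy filesystem tracking free and reserved blocks."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.reserved = 0

    def snapshot_take(self, metadata_reserve):
        """Fail up front with ENOSPC if the metadata COW reserve cannot be set aside."""
        if metadata_reserve > self.free_blocks - self.reserved:
            raise OSError(errno.ENOSPC, "cannot reserve space for metadata COW")
        self.reserved += metadata_reserve

    def cow_metadata_block(self):
        """Metadata COW draws on the reserve (assumed to be sized correctly)."""
        assert self.reserved > 0
        self.reserved -= 1
        self.free_blocks -= 1

    def write_data_block(self):
        """Data writes may only use unreserved space and can still see ENOSPC."""
        if self.free_blocks - self.reserved <= 0:
            raise OSError(errno.ENOSPC, "no space left for data")
        self.free_blocks -= 1


fs = ToyFs(free_blocks=10)
fs.snapshot_take(metadata_reserve=4)  # succeeds: 4 of 10 blocks set aside
for _ in range(6):
    fs.write_data_block()             # data may use the 6 unreserved blocks
try:
    fs.write_data_block()
except OSError as e:
    print(e.errno == errno.ENOSPC)    # True: the data write is refused
fs.cow_metadata_block()               # still succeeds, served from the reserve
```

In this model a regular file rewrite can still get ENOSPC, as in the
description, while metadata COW is covered by the reserve.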

2011-06-10 09:35:29

by Lukas Czerner

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Fri, 10 Jun 2011, Amir G. wrote:

--snip--
> >
> >>
> >>
> >> >
> >> > Granted, I have to take a look at the multisnap code, to see what it can
> >> > do and compare it with ext4 snapshots, because really, if it is good
> >> > enough and you will be able to do snapshotting backups as you do with
> >> > your approach, I do not see the reason why to complicate our life in
> >> > ext4.
> >> >
> >>
> >> I don't know how you intend to determine if dm-multisnap is 'good enough'.
> >> I don't claim to have the capability myself to determine if ext4 snapshots
> >> are 'good enough'.
> >> I just try to present the technical differences between the 3 solutions
> >> (LVM,ext4,btrfs) and claim that each have their advantages and disadvantages
> >> over others.
> >> I wish more sys admins and end users would provide feedback, though I don't
> >> know how many of them are following LKML.
> >
> > I do. When it can do long lived snapshots without any obvious headaches
> > it is good enough. Your only contra argument was that lvm snapshotting
> > is slow, which is not that big argument now when we have multisnap
> > almost ready. I am not even talking about features, because clearly
> > mutlisnap has superset of the features that ext4 does - no I am not
> > counting per-file or per-directory snapshotting because clearly those
> > are just hacks and it was not designed that way.
> >
>
> Hi Lukas,
>
> I am very glad to have you as my reviewer and critic :-)
> I am saying that with all honesty, because I know that you are impartial
> and have no anti-ext4 agenda.
>
> LVM multisnap does look like a big leap forward, but you should not
> be blinded by the promised feature, before you inspect the implementation,
> the same as you are doing to ext4 snapshots now...
>
> I could suggest that you put your root fs on a QCOW2 file exported as NBD.
> That would give you both thin provisioning and snapshots, but you know
> perfectly well, that this is not a 'good enough' solution.
> I'm not saying that LVM is comparable to QCOW2 virtual volume.
> I'm just saying we (included myself) should carefully examine the alternatives
> before make a ruling against one of them.
>
> Amir.
>

Hi Amir,

that is why I spoke with several dm people and all of them had the same
opinion. When you are not using the advantage of being at fs level,
there is no reason to have snapshotting at this level.

And no, I am not blinded. I am trying to understand why multisnap is the
huge win everyone says it is, so I already asked ejt to step in and
give us an overview of how dm-multisnap works and why it is better
than the old implementation. Also, I am trying it myself, and so far
it works quite well. I might have some numbers later.

Thanks!
-Lukas

2011-06-10 12:02:44

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Fri, Jun 10, 2011 at 12:00 PM, Lukas Czerner <[email protected]> wrote:
> On Fri, 10 Jun 2011, Amir G. wrote:
>
> --snip--
>> >
>> >>
>> >>
>> >> >
>> >> > Granted, I have to take a look at the multisnap code, to see what it can
>> >> > do and compare it with ext4 snapshots, because really, if it is good
>> >> > enough and you will be able to do snapshotting backups as you do with
>> >> > your approach, I do not see the reason why to complicate our life in
>> >> > ext4.
>> >> >
>> >>
>> >> I don't know how you intend to determine if dm-multisnap is 'good enough'.
>> >> I don't claim to have the capability myself to determine if ext4 snapshots
>> >> are 'good enough'.
>> >> I just try to present the technical differences between the 3 solutions
>> >> (LVM,ext4,btrfs) and claim that each have their advantages and disadvantages
>> >> over others.
>> >> I wish more sys admins and end users would provide feedback, though I don't
>> >> know how many of them are following LKML.
>> >
>> > I do. When it can do long lived snapshots without any obvious headaches
>> > it is good enough. Your only contra argument was that lvm snapshotting
>> > is slow, which is not that big argument now when we have multisnap
>> > almost ready. I am not even talking about features, because clearly
>> > mutlisnap has superset of the features that ext4 does - no I am not
>> > counting per-file or per-directory snapshotting because clearly those
>> > are just hacks and it was not designed that way.
>> >
>>
>> Hi Lukas,
>>
>> I am very glad to have you as my reviewer and critic :-)
>> I am saying that with all honesty, because I know that you are impartial
>> and have no anti-ext4 agenda.
>>
>> LVM multisnap does look like a big leap forward, but you should not
>> be blinded by the promised feature, before you inspect the implementation,
>> the same as you are doing to ext4 snapshots now...
>>
>> I could suggest that you put your root fs on a QCOW2 file exported as NBD.
>> That would give you both thin provisioning and snapshots, but you know
>> perfectly well, that this is not a 'good enough' solution.
>> I'm not saying that LVM is comparable to QCOW2 virtual volume.
>> I'm just saying we (included myself) should carefully examine the alternatives
>> before make a ruling against one of them.
>>
>> Amir.
>>
>
> Hi Amir,
>
> that is why I spoke with several dm people and all of them had the same
> opinion. When you are not using the advantage of being at fs level,
> there is no reason to have shapshoting at this level.
>
> And no, I am not blinded. I am trying to understand why is multisnap a
> huge win everyone is saying, so I already asked ejt to step in and
> give us an overview on how dm-multisnap works and why is it better
> than the old implementation. Also I am trying it myslef, and so far
> it works quite well. I might have some numbers later.
>
> Thanks!
> -Lukas
>

Wow, if you can provide numbers that would be great!
If you can also run the same tests on the same machine with my
ext4dev module that would be awesome!
The module on next3.sf.net is for kernel 2.6.38, but I can send you
a module for kernel 2.6.39 or 3.0-rc1 if you like.

Thanks!
Amir.

2011-06-10 22:52:18

by Valdis Klētnieks

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Thu, 09 Jun 2011 13:54:13 +0300, "Amir G." said:

> Why do you keep saying 'backup only'?
> There is a huge difference between having long lived snapshots,
> like CTERA products have, and temporary snapshot for backup
> purpose (for which LVM is adequate).

I must have blinked somewhere - I'm not convinced LVM is even "adequate" for
backup purposes. In particular, how does an LVM-level snapshot deal with the
"metadata in memory" problem (basically the exact same problem as running fsck
on a disk partition that is already mounted)?



2011-06-11 01:09:29

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Sat, Jun 11, 2011 at 1:51 AM, <[email protected]> wrote:
> On Thu, 09 Jun 2011 13:54:13 +0300, "Amir G." said:
>
>> Why do you keep saying 'backup only'?
>> There is a huge difference between having long lived snapshots,
>> like CTERA products have, and temporary snapshot for backup
>> purpose (for which LVM is adequate).
>
> I must have blinked somewhere - I'm not convinced LVM is even "adequate" for
> backup purposes. In particular, how does an LVM-level snapshot deal with the
> "metadata in memory" problem (basically the exact same problem as running fsck
> on a disk partition that is already mounted)?
>
>

It uses the filesystem freeze API.
Same as ext4 snapshots.

Amir.
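
For reference, the freeze API mentioned here is reachable from userspace
through the FIFREEZE and FITHAW ioctls. The Python sketch below derives the
ioctl numbers from their definitions in linux/fs.h, _IOWR('X', 119, int) and
_IOWR('X', 120, int), assuming the generic Linux ioctl encoding (as on x86
and ARM; some architectures encode the direction bits differently). The
frozen() helper is hypothetical, needs root, and only shows the
freeze / snapshot / thaw ordering.

```python
import fcntl
import os
from contextlib import contextmanager


def _iowr(type_char, nr, size):
    """Encode an ioctl number the way the generic Linux _IOWR() macro does."""
    ioc_read_write = 3  # _IOC_READ | _IOC_WRITE
    return (ioc_read_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


FIFREEZE = _iowr("X", 119, 4)  # 0xC0045877
FITHAW = _iowr("X", 120, 4)    # 0xC0045878


@contextmanager
def frozen(mountpoint):
    """Freeze the filesystem at mountpoint, yield, then thaw it again."""
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)
        try:
            yield
        finally:
            # Always thaw, even if taking the snapshot failed.
            fcntl.ioctl(fd, FITHAW, 0)
    finally:
        os.close(fd)
```

A caller would wrap the snapshot step in "with frozen(mountpoint):" so that
dirty in-memory metadata is flushed and the on-disk state is consistent while
the snapshot is taken.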

2011-06-13 09:56:56

by Amir G.

Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Fri, Jun 10, 2011 at 12:00 PM, Lukas Czerner <[email protected]> wrote:
>
> --snip--
>
> Hi Amir,
>
> that is why I spoke with several dm people and all of them had the same
> opinion. When you are not using the advantage of being at fs level,
> there is no reason to have shapshoting at this level.
>
> And no, I am not blinded. I am trying to understand why is multisnap a
> huge win everyone is saying, so I already asked ejt to step in and
> give us an overview on how dm-multisnap works and why is it better
> than the old implementation. Also I am trying it myslef, and so far
> it works quite well. I might have some numbers later.
>

(Dropping LKML - had enough of that attention for 1 week...)

Hi Lukas,

So did you get any numbers? Joe said you were not able to get good results.

Did you come to understand the drawbacks of multisnap (physical fragmentation)?

Did it make you change your mind about ext4 snapshots?

I am planning to join the ext4 weekly call today and ask if people think that
we still have open issues with ext4 snapshots, which must be resolved
before the merge.

I have 2 questions that should be answered before the merge:
1. Should 32bit ext4 move to the 48bit snapshot file format after the
format is implemented for 64bit ext4?
2. Should the exclude bitmap be allocated only at mkfs time, or should it
also be possible to allocate it with tune2fs?
Allocating it later will enable snapshots on an existing fs, but will
result in a sub-optimal on-disk layout.

If anyone has opinions on these 2 questions, please make them heard here or on
the call today.

Thanks,
Amir.

2011-06-13 10:54:55

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Mon, 13 Jun 2011, Amir G. wrote:

> On Fri, Jun 10, 2011 at 12:00 PM, Lukas Czerner <[email protected]> wrote:
> >
> > --snip--
> >
> > Hi Amir,
> >
> > that is why I spoke with several dm people and all of them had the same
> > opinion. When you are not using the advantage of being at fs level,
> > there is no reason to have shapshoting at this level.
> >
> > And no, I am not blinded. I am trying to understand why is multisnap a
> > huge win everyone is saying, so I already asked ejt to step in and
> > give us an overview on how dm-multisnap works and why is it better
> > than the old implementation. Also I am trying it myslef, and so far
> > it works quite well. I might have some numbers later.
> >
>
> (Dropping LKML - had enough of that attention for 1 week...)
>
> Hi Lukas,
>
> So did you get any numbers? Joe said you were not able to get good results.

Hi, yes, I did get some bad numbers, but that was due to a stupid setup I
created :) with the metadata and data volumes on the same drive, just in
different partitions. In the postmark test the performance drop was about
100%, which is quite expected as it probably caused a LOT of seeks.

But when I separated data and metadata I got very good results. The results
differ with the data block size used by dm.

Postmark results for the filesystem on the bare device ("bare") and on
dm-multisnap with different data block sizes (bs):

Metric                                     bare       bs=128       bs=256       bs=512      bs=1024      bs=2048      bs=4096
Total_duration                              113          146          151          134          119          128          131
Duration_of_transactions                     76          118           96           96           70           76           84
Transactions/s                           657.89       423.73       520.83       520.83       714.29       657.89       595.24
Files_created/s                          661.60       512.06       495.11       557.92       628.24       584.07       570.69
Creation_alone/s                        2000.00      1923.08       943.40      1515.15      1470.59      1190.48      1851.85
Creation_mixed_with_transaction/s        325.80       209.84       257.93       257.93       353.73       325.80       294.77
Read/s                                   328.61       211.64       260.15       260.15       356.77       328.61       297.31
Append/s                                 329.28       212.08       260.68       260.68       357.50       329.28       297.92
Deleted/s                                661.60       512.06       495.11       557.92       628.24       584.07       570.69
Deletion_alone/s                        1980.88      1904.69       934.38      1500.67      1456.53      1179.10      1834.15
Deletion_mixed_with_transaction/s        332.09       213.89       262.91       262.91       360.56       332.09       300.46
Read_B/s                            24363052.00  18856334.00  18231952.00  20544960.00  23134662.00  21508006.00  21015456.00
Write_B/s                           76242168.00  59009348.00  57055396.00  64293764.00  72398024.00  67307536.00  65766144.00

I chose postmark because it does a lot of file operations and is quite
metadata intensive, although it is still a very simple and limited test.
However, you can see that with a data block size of 1024B I got almost the
same results as on the bare device. It means that there was almost no
performance drop, and I suspect that if I put the metadata on an SSD it
would not be noticeable at all.
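
To put "almost the same" into numbers, the total durations reported above
can be compared against the bare device. This is just a small sketch of
that arithmetic; it uses only the Total_duration column from the results,
and the dictionary and function names are invented for the example.

```python
# Total_duration values copied from the postmark results above
# (postmark reports durations in seconds; treat the unit as an assumption).
TOTAL_DURATION = {
    "bare": 113,
    "bs=128": 146,
    "bs=256": 151,
    "bs=512": 134,
    "bs=1024": 119,
    "bs=2048": 128,
    "bs=4096": 131,
}


def slowdown_percent(config: str) -> float:
    """Extra run time of `config` relative to the bare device, in percent."""
    return (TOTAL_DURATION[config] / TOTAL_DURATION["bare"] - 1.0) * 100.0


if __name__ == "__main__":
    for name in TOTAL_DURATION:
        print(f"{name:8s} {slowdown_percent(name):+6.1f}%")
```

By this measure bs=1024 takes roughly 5% longer than the bare device, while
bs=256 takes roughly a third longer.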

We can see that the run times drop until a bs of 1024B and rise
afterwards. I suspect that we are dealing with two variables with
opposite effects. The thinp target works better with bigger block sizes as
it has less metadata to work with, but on the other hand snapshots are
then more expensive, because we have to do COW rather than the simple
write that suffices when we are changing the whole block. But 1024 seems
quite reasonable, and I also think that by putting metadata on an SSD
(which is easily doable) we can very well address the first one.

>
> Did you come to understand the drawbacks of multisnap (physical fragmentation)?

Yes I did, but fragmentation is a problem for any thinly provisioned
storage. I also understand that your snapshot files have a problem with
fragmentation too.

>
> Did it make you change your mind about ext4 snapshots?

From the first time I was interested in ext4 snapshots; however, as I
came to understand how they work (I must admit not *very* deeply), it all
seems like a hack to solve your problem at the time (several years
ago).

And now, when I see how the new dm-multisnap target works, what features it
has and how it performs (more-or-less), it seems to me that it is a lot more
flexible and desirable way of doing this.

On the other hand, your snapshots disturb the quite calm waters of a
stable filesystem with a very poor set of features and very limited
possibilities for improvement, not to mention the maintenance burden. But
yes, they might perform a bit better.

So to sum it up, I see that dm-multisnap has a superset of the features your
ext4 snapshots have, it performs well enough, it is a more generic solution
for all filesystems, it is also more flexible, it does not require
intrusive changes to stable fs code, and it has better possibilities for
future improvements.

So even if the final decision does not belong to me, I think that we do
not need this code in ext4. If your snapshots were *real* filesystem-level
snapshots with all the cool features that provides, the situation would be
quite different; however, even then I would be wondering whether it is worth
it, when we have btrfs here and now, ready to use, and improving every
day to get to enterprise level (it will, hopefully, be the default
filesystem in Fedora 16, which is a huge step forward to enterprise
environments).

And here I would very much like to see other ext4 developers' opinions,
because they were really quiet on this matter and it is time to put
the cards on the table, so ?...


>
> I am planning to join the ext4 weekly call today and ask if people think that
> we still have open issues with ext4 snapshots, which must be resolved
> before the merge.
>
> I have 2 questions that should be answered before the merge:
> 1. Should 32bit ext4 move to 48bit snapshot file format after the
> format is implemented for 64bit ext4?
> 2. Should exclude bitmap be allocated only on mkfs time or should it
> also be possible to allocate it with tune2fs?
> Allocating it later will enable snapshots on existing fs, but will
> have sub-optimal on-disk layout.
>
> If anyone has opinions on these 2 questions, please make them heard here or on
> the call today.
>
> Thanks,
> Amir.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2011-06-13 12:56:10

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Mon, Jun 13, 2011 at 1:54 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 13 Jun 2011, Amir G. wrote:
>
>> On Fri, Jun 10, 2011 at 12:00 PM, Lukas Czerner <[email protected]> wrote:
>> >
>> > --snip--
>> >
>> > Hi Amir,
>> >
>> > that is why I spoke with several dm people and all of them had the same
>> > opinion. When you are not using the advantage of being at fs level,
>> > there is no reason to have shapshoting at this level.
>> >
>> > And no, I am not blinded. I am trying to understand why is multisnap a
>> > huge win everyone is saying, so I already asked ejt to step in and
>> > give us an overview on how dm-multisnap works and why is it better
>> > than the old implementation. Also I am trying it myslef, and so far
>> > it works quite well. I might have some numbers later.
>> >
>>
>> (Dropping LKML - had enough of that attention for 1 week...)
>>
>> Hi Lukas,
>>
>> So did you get any numbers? Joe said you were not able to get good results.
>
> Hi, yes I did had some bad numbers, but it was due to stupid setup I
> have created :) metadata and data volume on the same drive, but in the
> different partition. In the postmark test the performance drop was about
> 100% and that is quite expected as it probably caused a LOT of seeks.
>
> But when I separated data and metadata I have very good results. Results
> differs with the data block size used by dm.
>
> Filesystem on bare device.
> 113 76 657.89 661.60 2000.00 325.80 328.61 329.28 661.60 1980.88 332.09
> 24363052.00 76242168.00
>
> dm-multisnap
> bs=128
> 146 118 423.73 512.06 1923.08 209.84 211.64 212.08 512.06 1904.69 213.89
> 18856334.00 59009348.00
>
> bs=256
> 151 96 520.83 495.11 943.40 257.93 260.15 260.68 495.11 934.38 262.91
> 18231952.00 57055396.00
>
> bs=512
> 134 96 520.83 557.92 1515.15 257.93 260.15 260.68 557.92 1500.67 262.91
> 20544960.00 64293764.00
>
> bs=1024
> 119 70 714.29 628.24 1470.59 353.73 356.77 357.50 628.24 1456.53 360.56
> 23134662.00 72398024.00
>
> bs=2048
> 128 76 657.89 584.07 1190.48 325.80 328.61 329.28 584.07 1179.10 332.09
> 21508006.00 67307536.00
>
> bs=4096
> 131 84 595.24 570.69 1851.85 294.77 297.31 297.92 570.69 1834.15 300.46
> 21015456.00 65766144.00
>
> Legend:
> -----------------------------------------------------------------------
> Total_duration Duration_of_transactions Transactions/s
> Files_created/s Creation_alone/s Creation_mixed_with_transaction/s
> Read/s Append/s Deleted/s Deletion_alone/s
> Deletion_mixed_with_transaction/s Read_B/s Write_B/s
> -----------------------------------------------------------------------
>
> I choosed postmark because it is doing a lot of operation on the file
> and it is quite metadata intensive. Although it is still very simple
> and limited test. However you can see that with data block size 1024B I
> received almost the same results as in the case of bare device. It means
> that there was almost none performance drop and I suspect that if I put
> metadata on the SSD it would not be noticeable at all.
>
> We can see that results are dropping to the bs of 1024B and rising
> afterwards. I suspect that we are dealing with two variables with
> opposite outcome. Thinp target works better with bigger block sizes as
> it has less metadata to work with, but in the other hand snapshots are
> then more expensive, because we have to deal with COW rather than simple
> write when we are changing the whole block. But 1024 seems quite
> reasonable and I also think that putting metadata on SSD (which is
> easily doable) we can very well address the first one.
>

SSD may be doable for enterprise servers, but I don't have one in my laptop :-(

I think that Joe will agree with me that this is not the benchmark he
is concerned about.
It is clear to me that any operations applied to a new thinly
provisioned file system will perform well, sometimes even better than
on the bare device.

Did you use subdirs in postmark?
If you did, ext3 will try to spread subdirs under root all over the disk
(not sure about ext4) and postmark will be slower on the bare device.

The benchmark which is relevant to the drawbacks of multisnap is aging
a filesystem to the point that its metadata is physically laid out very
differently than was intended.

Here is a suggested real life test:

1. DATE=START_DATE; take snapshot $DATE
2. git checkout <mainline daily git tag>
3. time -o LOGFILE --append make
4. DATE+=DAY; goto 1

Repeat this test until the volume fills up to several orders of magnitude
more than the size of RAM on your system and observe how build time
changes over time.
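
The loop above could be driven by something like the following. It is only
a sketch under assumptions: the snapshot command is left as a parameter
because it depends on the setup being tested (ext4 snapshots or multisnap),
the list of daily tags has to be supplied by the caller, the script is
expected to run inside the kernel git tree on the volume under test, and
all function and parameter names are made up for the example.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Sequence, Tuple


def _run(cmd: Sequence[str]) -> None:
    """Run a command in the current directory and fail loudly on error."""
    subprocess.run(list(cmd), check=True)


def age_and_time(
    tags: Iterable[str],
    snapshot_cmd: Sequence[str],
    logfile: str,
    run: Callable[[Sequence[str]], None] = _run,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, float]]:
    """For each daily tag: take a snapshot, check out the tag, time the build.

    `snapshot_cmd` is a placeholder for whatever takes a snapshot on the
    setup under test; the tag is appended to it as the snapshot name.
    """
    results = []
    for tag in tags:
        run([*snapshot_cmd, tag])        # 1. take a snapshot named after the day
        run(["git", "checkout", tag])    # 2. move the tree to that day's tag
        start = clock()
        run(["make"])                    # 3. timed build
        elapsed = clock() - start
        with open(logfile, "a") as log:  # append, like `time -o LOGFILE --append`
            log.write(f"{tag}\t{elapsed:.1f}\n")
        results.append((tag, elapsed))   # 4. continue with the next day
    return results
```

Deciding when to stop (the volume holding far more than the size of RAM) is
left to the caller, for example by limiting the list of tags.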

>>
>> Did you come to understand the drawbacks of multisnap (physical fragmentation)?
>
> Yes I did, but the fragmentation is problem for any thinly provisioned
> storage. I also understand that your snapshot files has also proble with
> fragmentation.
>

It's true. ext4 snapshots generate fragmented *files*, but they do not
fragment the filesystem metadata. And that happens only on specific
workloads of in-place writes, like a large db or a virtual image.

One difference is that ext4 snapshots can do effective auto defrag by using
the inode context, which is not available to multisnap.
The other big difference is that ext4 snapshots give precedence to main
fs performance, while multisnap doesn't even have the notion of a main fs.
All thinp and snapshot targets are writable and get equal treatment.

>>
>> Did it make you change your mind about ext4 snapshots?
>
> From the first time I was interested in ext4 snapshgots, however as I
> came to understand how it works (I must admit not *very* deeply) it all
> seems like a hack to solve your problem at the time (several years
> ago).

The problem was not mine, it was a problem for all Linux users who wanted
snapshots. The future does look brighter for them, but CTERA customers
don't have to wait for the future...

>
> And now, when I see how the new dm-multisnap target works, what features it has,
> how it performs (more-or-less) it seems to me that it is a lot more
> flexible and desirable way of doing this.
>
> On the other hand your snapshots disrupts a quite calm water of
> stable filesystem with a very poor set of features and very limited
> possibilities of improvements. Not talking about maintaining burden. But
> yes, it might perform a bit better.
>
> So to sum it up I see that dm-multisnap has superset of features your
> ext4 snapshots has, in performs well enough, it is more generic solution
> for all filesystems, it is also more flexibile, it does not require
> intrusive change into stable fs code, and it has better possibilities of
> future improvements.
>
> Do even if the final decision does not belong to me, I think that we do
> not need this code in ext4. If your snapshots were a *real* filesystem level
> snapshots with all the cool features it provides, the situation would be
> quite different, however even then I would be thinking if it is worth
> it, when we have btrfs here and now, ready to use, and improving every
> day to get at enterprise level (it will, hopefully, be a default
> filesystem in Fedora 16, which is huge step forward to enterprise
> environment).
>
> And here I would very much like to see other ext4 developers opinions,
> because they were really quiet on this matter and it is time to reveal
> the cards on the table, so ?...
>
>
>>
>> I am planning to join the ext4 weekly call today and ask if people think that
>> we still have open issues with ext4 snapshots, which must be resolved
>> before the merge.
>>
>> I have 2 questions that should be answered before the merge:
>> 1. Should 32bit ext4 move to 48bit snapshot file format after the
>> format is implemented for 64bit ext4?
>> 2. Should exclude bitmap be allocated only on mkfs time or should it
>> also be possible to allocate it with tune2fs?
>> Allocating it later will enable snapshots on existing fs, but will
>> have sub-optimal on-disk layout.
>>
>> If anyone has opinions on these 2 questions, please make them heard here or on
>> the call today.
>>
>> Thanks,
>> Amir.

2011-06-13 13:11:58

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Mon, 13 Jun 2011, Amir G. wrote:

> On Mon, Jun 13, 2011 at 1:54 PM, Lukas Czerner <[email protected]> wrote:
> > On Mon, 13 Jun 2011, Amir G. wrote:
> >
> >> On Fri, Jun 10, 2011 at 12:00 PM, Lukas Czerner <[email protected]> wrote:
> >> >
> >> > --snip--
> >> >
> >> > Hi Amir,
> >> >
> >> > that is why I spoke with several dm people and all of them had the same
> >> > opinion. When you are not using the advantage of being at fs level,
> >> > there is no reason to have shapshoting at this level.
> >> >
> >> > And no, I am not blinded. I am trying to understand why is multisnap a
> >> > huge win everyone is saying, so I already asked ejt to step in and
> >> > give us an overview on how dm-multisnap works and why is it better
> >> > than the old implementation. Also I am trying it myslef, and so far
> >> > it works quite well. I might have some numbers later.
> >> >
> >>
> >> (Dropping LKML - had enough of that attention for 1 week...)
> >>
> >> Hi Lukas,
> >>
> >> So did you get any numbers? Joe said you were not able to get good results.
> >
> > Hi, yes I did had some bad numbers, but it was due to stupid setup I
> > have created :) metadata and data volume on the same drive, but in the
> > different partition. In the postmark test the performance drop was about
> > 100% and that is quite expected as it probably caused a LOT of seeks.
> >
> > But when I separated data and metadata I have very good results. Results
> > differs with the data block size used by dm.
> >
> > Filesystem on bare device.
> > 113 76 657.89 661.60 2000.00 325.80 328.61 329.28 661.60 1980.88 332.09
> > 24363052.00 76242168.00
> >
> > dm-multisnap
> > bs=128
> > 146 118 423.73 512.06 1923.08 209.84 211.64 212.08 512.06 1904.69 213.89
> > 18856334.00 59009348.00
> >
> > bs=256
> > 151 96 520.83 495.11 943.40 257.93 260.15 260.68 495.11 934.38 262.91
> > 18231952.00 57055396.00
> >
> > bs=512
> > 134 96 520.83 557.92 1515.15 257.93 260.15 260.68 557.92 1500.67 262.91
> > 20544960.00 64293764.00
> >
> > bs=1024
> > 119 70 714.29 628.24 1470.59 353.73 356.77 357.50 628.24 1456.53 360.56
> > 23134662.00 72398024.00
> >
> > bs=2048
> > 128 76 657.89 584.07 1190.48 325.80 328.61 329.28 584.07 1179.10 332.09
> > 21508006.00 67307536.00
> >
> > bs=4096
> > 131 84 595.24 570.69 1851.85 294.77 297.31 297.92 570.69 1834.15 300.46
> > 21015456.00 65766144.00
> >
> > Legend:
> > -----------------------------------------------------------------------
> > Total_duration Duration_of_transactions Transactions/s
> > Files_created/s Creation_alone/s Creation_mixed_with_transaction/s
> > Read/s Append/s Deleted/s Deletion_alone/s
> > Deletion_mixed_with_transaction/s Read_B/s Write_B/s
> > -----------------------------------------------------------------------
> >
> > I choosed postmark because it is doing a lot of operation on the file
> > and it is quite metadata intensive. Although it is still very simple
> > and limited test. However you can see that with data block size 1024B I
> > received almost the same results as in the case of bare device. It means
> > that there was almost none performance drop and I suspect that if I put
> > metadata on the SSD it would not be noticeable at all.
> >
> > We can see that results are dropping to the bs of 1024B and rising
> > afterwards. I suspect that we are dealing with two variables with
> > opposite outcome. Thinp target works better with bigger block sizes as
> > it has less metadata to work with, but in the other hand snapshots are
> > then more expensive, because we have to deal with COW rather than simple
> > write when we are changing the whole block. But 1024 seems quite
> > reasonable and I also think that putting metadata on SSD (which is
> > easily doable) we can very well address the first one.
> >
>
> SSD may be doable for enterprise servers, but I don't have one in my laptop :-(
>
> I think that Joe will agree with me that this is not the benchmark he
> is concerned about.
> It is clear to me that any operations applied to a new thin
> provisioned file system, will
> perform well, sometimes even better than on bare device.
>
> Did you use subdirs in postmark?
> If you do, ext3 will try to spread subdirs under root all over the disk
> (not sure about ext4) and postmark will be slower on bare device.
>
> The benchmark which is relevant to the drawbacks of multisnap is aging
> a filesystem
> to the point that it's metadata is physically layed out very
> differently than was intended.
>
> Here is a suggested real life test:
>
> 1. DATE=START_DATE; take snapshot $DATE
> 2. git checkout <mainline daily git tag>
> 3. time -o LOGFILE --append make
> 4. DATE+=DAY; goto 1
>
> Repeat this test until the volume fills up to several orders of
> magnitude more than the size
> of RAM on your system and observe how build time changes over time.

I am very much aware of the fact that this benchmark is not ideal, but
it gives us _some_ numbers - since you did not :). And you specifically
asked for it. Hopefully I'll have some time to do better benchmarks but
it'll take a while as I have some other stuff to do now.

>
> >>
> >> Did you come to understand the drawbacks of multisnap (physical fragmentation)?
> >
> > Yes I did, but the fragmentation is problem for any thinly provisioned
> > storage. I also understand that your snapshot files has also proble with
> > fragmentation.
> >
>
> It's true. ext4 snapshots generates fragmented *files*, but it does not fragment
> the filesystem metadata. And only on specific workloads of in-place writes,
> like large db or virtual image.
>
> One difference is that ext4 snapshots can do effective auto defrag by using
> the inode context, which is not available for multisnap.

No, it is not, but off the top of my head... we can use temporal locality
to pack frequently accessed blocks together. There is definitely room
for improvement.

> The other big difference is that ext4 snapshots gives precedence to main
> fs performance, while multisnap hasn't even the notion of a main fs.
> All thinp and snapshot targets are writable and get equal treatment.

I am sorry, what do you mean by that? Is it that when you mount the
snapshot, the reads will actually have lower importance?

>
> >>
> >> Did it make you change your mind about ext4 snapshots?
> >
> > From the first time I was interested in ext4 snapshgots, however as I
> > came to understand how it works (I must admit not *very* deeply) it all
> > seems like a hack to solve your problem at the time (several years
> > ago).
>
> The problem was not mine, it's was for all Linux users who wanted snapshots.
> The future does look brighter for them, but CTERA customers don't have to wait
> for the future...

I do understand, but as Eric said, your business case is not a reason to
push this hack into the kernel.

>
> >
> > And now, when I see how the new dm-multisnap target works, what features it has,
> > how it performs (more-or-less) it seems to me that it is a lot more
> > flexible and desirable way of doing this.
> >
> > On the other hand your snapshots disrupts a quite calm water of
> > stable filesystem with a very poor set of features and very limited
> > possibilities of improvements. Not talking about maintaining burden. But
> > yes, it might perform a bit better.
> >
> > So to sum it up I see that dm-multisnap has superset of features your
> > ext4 snapshots has, in performs well enough, it is more generic solution
> > for all filesystems, it is also more flexibile, it does not require
> > intrusive change into stable fs code, and it has better possibilities of
> > future improvements.
> >
> > Do even if the final decision does not belong to me, I think that we do
> > not need this code in ext4. If your snapshots were a *real* filesystem level
> > snapshots with all the cool features it provides, the situation would be
> > quite different, however even then I would be thinking if it is worth
> > it, when we have btrfs here and now, ready to use, and improving every
> > day to get at enterprise level (it will, hopefully, be a default
> > filesystem in Fedora 16, which is huge step forward to enterprise
> > environment).
> >
> > And here I would very much like to see other ext4 developers opinions,
> > because they were really quiet on this matter and it is time to reveal
> > the cards on the table, so ?...
> >
> >
> >>
> >> I am planning to join the ext4 weekly call today and ask if people think that
> >> we still have open issues with ext4 snapshots, which must be resolved
> >> before the merge.
> >>
> >> I have 2 questions that should be answered before the merge:
> >> 1. Should 32bit ext4 move to 48bit snapshot file format after the
> >> format is implemented for 64bit ext4?
> >> 2. Should exclude bitmap be allocated only on mkfs time or should it
> >> also be possible to allocate it with tune2fs?
> >> Allocating it later will enable snapshots on existing fs, but will
> >> have sub-optimal on-disk layout.
> >>
> >> If anyone has opinions on these 2 questions, please make them heard here or on
> >> the call today.
> >>
> >> Thanks,
> >> Amir.

2011-06-13 13:26:17

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Mon, Jun 13, 2011 at 4:11 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 13 Jun 2011, Amir G. wrote:
>

-snip-
>> >>
>> >> Did you come to understand the drawbacks of multisnap (physical fragmentation)?
>> >
>> > Yes I did, but the fragmentation is problem for any thinly provisioned
>> > storage. I also understand that your snapshot files has also proble with
>> > fragmentation.
>> >
>>
>> It's true. ext4 snapshots generates fragmented *files*, but it does not fragment
>> the filesystem metadata. And only on specific workloads of in-place writes,
>> like large db or virtual image.
>>
>> One difference is that ext4 snapshots can do effective auto defrag by using
>> the inode context, which is not available for multisnap.
>
> No it is not, but from top of my head .. we can use time locality to
> pack frequently accessed blocks together. Definitely there is a place
> for improvements.
>
>> The other big difference is that ext4 snapshots gives precedence to main
>> fs performance, while multisnap hasn't even the notion of a main fs.
>> All thinp and snapshot targets are writable and get equal treatment.
>
> I am sorry, what do you mean by that ? Is it that when you mount the
> snapshot, the reads will actually have lower importance ?
>

With ext4, snapshot reads may cause extra seeks, but the main fs will
stay optimized for reads.
With thinp and multisnap, there is no optimization for reads from one
specific target, but I admit that can change in the future when auto defrag
heuristics are applied to multisnap.

Amir.

2011-06-13 13:50:07

by Joe Thornber

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Mon, Jun 13, 2011 at 04:26:16PM +0300, Amir G. wrote:
> with ext4, snapshots reads may cause extra seeks, but main fs will
> stay optimized for reads.
> with thinp and multisnap, there is no optimization for read from one
> specific target, but I admit that can change in the future when auto defrag
> heuristics are applies to multisnap.

I'm going to keep things symmetrical. A lot of use cases involve
pointing at an arbitrary snapshot and saying "that's now my master".
This is connected to why I support arbitrary depth of recursive
snapshots too.

- Joe

--
lvm-devel mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/lvm-devel

2011-06-21 11:06:18

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Tue, Jun 7, 2011 at 6:07 PM, <[email protected]> wrote:
> Hi All,
>
> I am resending the snapshots patch series as per Lukas's request.
> This time, the snapshot*.c files have not been omitted, as in
> the previous posting.
>
> The series is still based on ext4 dev branch sometime in the preparation
> for 3.0 merge window. It was not yet rebased on 3.0-rc1, so punch holes
> changes have not been addressed yet.
>
> As always, I advocate online review of the patches at:
> https://github.com/amir73il/ext4-snapshots/commits/for-ext4-v1
> but if you insist on doing it the old way, I won't complain.
>
> Thanks,
> Amir.
>

Hi Ted,

To answer your question about a possible diet for the snapshot patches,
following are some diffstats on groups of patches (by functionality).
The diffstats include only ext4 core files (excluding the new snapshot* files).

As you can see, removing the shrink/merge functionality for multiple
snapshots will remove the need for the exclude bitmap and reduce the changes
to core files by ~300 insertions.

There is some advantage in not having to add the exclude bitmap to an
existing fs, but I think the win is not so big, assuming that we will want
to have the shrink/merge functionality eventually.

Cheers,
Amir.

> [PATCH v1 01/36] ext4: EXT4 snapshots (Experimental)
> [PATCH v1 02/36] ext4: snapshot debugging support

Generic stuff.

15 files changed, 122 insertions(+), 1 deletions(-)

> [PATCH v1 03/36] ext4: snapshot hooks - inside JBD hooks
> [PATCH v1 04/36] ext4: snapshot hooks - block bitmap access
> [PATCH v1 05/36] ext4: snapshot hooks - delete blocks
> [PATCH v1 06/36] ext4: snapshot hooks - move data blocks
> [PATCH v1 07/36] ext4: snapshot hooks - direct I/O
> [PATCH v1 08/36] ext4: snapshot hooks - move extent file data blocks

Most of the code in this group handles MOW.

7 files changed, 550 insertions(+), 68 deletions(-)

> [PATCH v1 09/36] ext4: snapshot file
> [PATCH v1 10/36] ext4: snapshot file - read through to block device
> [PATCH v1 11/36] ext4: snapshot file - permissions
> [PATCH v1 12/36] ext4: snapshot file - store on disk
> [PATCH v1 13/36] ext4: snapshot file - increase maximum file size limit to 16TB

Implementation of special snapshot file.

7 files changed, 220 insertions(+), 14 deletions(-)

> [PATCH v1 14/36] ext4: snapshot block operations
> [PATCH v1 15/36] ext4: snapshot block operation - copy blocks to snapshot
> [PATCH v1 16/36] ext4: snapshot block operation - move blocks to snapshot
> [PATCH v1 17/36] ext4: snapshot block operation - copy block bitmap to snapshot

Copy/move to snapshot file operations.

5 files changed, 126 insertions(+), 24 deletions(-)

> [PATCH v1 18/36] ext4: snapshot control
> [PATCH v1 19/36] ext4: snapshot control - init new snapshot
> [PATCH v1 20/36] ext4: snapshot control - fix new snapshot
> [PATCH v1 21/36] ext4: snapshot control - reserve disk space for snapshot

Mostly new code in ioctl.c.

6 files changed, 171 insertions(+), 3 deletions(-)

> [PATCH v1 22/36] ext4: snapshot journaled - increase transaction credits
> [PATCH v1 23/36] ext4: snapshot journaled - implement journal_release_buffer()
> [PATCH v1 24/36] ext4: snapshot journaled - bypass to save credits
> [PATCH v1 25/36] ext4: snapshot journaled - cache last COW tid in journal_head

Helper functions to handle extra COW credits.

7 files changed, 284 insertions(+), 31 deletions(-)

> [PATCH v1 26/36] ext4: snapshot journaled - trace COW/buffer credits

Trace/debug for extra COW credits.

3 files changed, 108 insertions(+), 0 deletions(-)

> [PATCH v1 27/36] ext4: snapshot list support
> [PATCH v1 28/36] ext4: snapshot list - read through to previous snapshot

Not much to gain from dropping snapshot list support...

2 files changed, 2 insertions(+), 0 deletions(-)

> [PATCH v1 29/36] ext4: snapshot race conditions - concurrent COW bitmap operations
> [PATCH v1 30/36] ext4: snapshot race conditions - concurrent COW operations
> [PATCH v1 31/36] ext4: snapshot race conditions - tracked reads

We must handle race conditions.

2 files changed, 32 insertions(+), 0 deletions(-)

> [PATCH v1 32/36] ext4: snapshot exclude - the exclude bitmap

We do not need the exclude bitmap if we do not shrink/merge snapshots.

5 files changed, 226 insertions(+), 3 deletions(-)

> [PATCH v1 33/36] ext4: snapshot cleanup
> [PATCH v1 34/36] ext4: snapshot cleanup - shrink deleted snapshots
> [PATCH v1 35/36] ext4: snapshot cleanup - merge shrunk snapshots

We could support multiple snapshots without shrinking/merging deleted snapshots,
but that means that only the disk space of the oldest snapshot is reclaimed on delete.

2 files changed, 74 insertions(+), 22 deletions(-)
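
As a quick check of the "~300 insertions" estimate for dropping shrink/merge,
the insertion counts of the exclude bitmap group and the cleanup group can be
summed. This is only a sketch of the arithmetic; the variable names are
illustrative and the numbers are copied from the two diffstat summaries above.

```python
# Insertions to ext4 core files, copied from the diffstat summaries above.
exclude_bitmap_insertions = 226  # patch 32: the exclude bitmap
cleanup_insertions = 74          # patches 33-35: shrink/merge cleanup

# Core-file insertions that would go away without shrink/merge.
saved = exclude_bitmap_insertions + cleanup_insertions
print(saved)  # 300
```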

> [PATCH v1 36/36] ext4: snapshot rocompat - enable rw mount
>
>  fs/ext4/Kconfig           |   11 +
>  fs/ext4/Makefile          |    3 +
>  fs/ext4/balloc.c          |  132 +++
>  fs/ext4/ext4.h            |  188 ++++-
>  fs/ext4/ext4_jbd2.c       |  162 ++++-
>  fs/ext4/ext4_jbd2.h       |  266 ++++++-
>  fs/ext4/extents.c         |  157 ++++-
>  fs/ext4/file.c            |   11 +-
>  fs/ext4/ialloc.c          |   19 +-
>  fs/ext4/inode.c           |  668 +++++++++++++--
>  fs/ext4/ioctl.c           |  120 +++
>  fs/ext4/mballoc.c         |  161 ++++-
>  fs/ext4/move_extent.c     |    3 +-
>  fs/ext4/namei.c           |    9 +
>  fs/ext4/resize.c          |   19 +-
>  fs/ext4/snapshot.c        | 1000 ++++++++++++++++++++++
>  fs/ext4/snapshot.h        |  690 ++++++++++++++++
>  fs/ext4/snapshot_buffer.c |  393 +++++++++
>  fs/ext4/snapshot_ctl.c    | 2002 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/ext4/snapshot_debug.c  |  107 +++
>  fs/ext4/snapshot_debug.h  |  105 +++
>  fs/ext4/snapshot_inode.c  |  960 ++++++++++++++++++++++
>  fs/ext4/super.c           |  157 ++++-
>  fs/ext4/xattr.c           |    4 +-
>  24 files changed, 7182 insertions(+), 165 deletions(-)
>
>

2011-06-21 15:45:40

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On 2011-06-21, at 5:06 AM, "Amir G." <[email protected]> wrote:
> On Tue, Jun 7, 2011 at 6:07 PM, <[email protected]> wrote:
>>
>> I am resending the snapshots patch series as per Lukas's request.
>> This time, the snapshot*.c files have not been omitted, as in
>> the previous posting.
>>
>> The series is still based on ext4 dev branch sometime in the preparation
>> for 3.0 merge window. It was not yet rebased on 3.0-rc1, so punch holes
>> changes have not been addressed yet.
>>
>> As always, I advocate online review of the patches at:
>> https://github.com/amir73il/ext4-snapshots/commits/for-ext4-v1
>> but if you insist on doing it the old way, I won't complain.
>>
>>
>
> To answer your question about a possible diet for the snapshot patches,
> following are some diffstats on groups of patches (by functionality).
> The diffstats include only ext4 core files (excluding the new snapshot* files).
>
> As you can see, removing the shrink/merge functionality for multiple
> snapshots will remove the need for the exclude bitmap and reduce the changes
> to core files by ~300 insertions.
>
> There is some advantage in not having to add the exclude bitmap to an existing
> fs, but I think the win is not so big, assuming that we will eventually want
> the shrink/merge functionality.

Wouldn't that also mean that if shrink/merge is added later, it needs another feature flag, plus added complexity from another row in the feature test matrix?


>
>> [PATCH v1 01/36] ext4: EXT4 snapshots (Experimental)
>> [PATCH v1 02/36] ext4: snapshot debugging support
>
> Generic stuff.
>
> 15 files changed, 122 insertions(+), 1 deletions(-)
>
>> [PATCH v1 03/36] ext4: snapshot hooks - inside JBD hooks
>> [PATCH v1 04/36] ext4: snapshot hooks - block bitmap access
>> [PATCH v1 05/36] ext4: snapshot hooks - delete blocks
>> [PATCH v1 06/36] ext4: snapshot hooks - move data blocks
>> [PATCH v1 07/36] ext4: snapshot hooks - direct I/O
>> [PATCH v1 08/36] ext4: snapshot hooks - move extent file data blocks
>
> Most of the code in this group handles MOW.
>
> 7 files changed, 550 insertions(+), 68 deletions(-)
>
>> [PATCH v1 09/36] ext4: snapshot file
>> [PATCH v1 10/36] ext4: snapshot file - read through to block device
>> [PATCH v1 11/36] ext4: snapshot file - permissions
>> [PATCH v1 12/36] ext4: snapshot file - store on disk
>> [PATCH v1 13/36] ext4: snapshot file - increase maximum file size limit to 16TB
>
> Implementation of special snapshot file.
>
> 7 files changed, 220 insertions(+), 14 deletions(-)
>
>> [PATCH v1 14/36] ext4: snapshot block operations
>> [PATCH v1 15/36] ext4: snapshot block operation - copy blocks to snapshot
>> [PATCH v1 16/36] ext4: snapshot block operation - move blocks to snapshot
>> [PATCH v1 17/36] ext4: snapshot block operation - copy block bitmap to snapshot
>
> Copy/move to snapshot file operations.
>
> 5 files changed, 126 insertions(+), 24 deletions(-)
>
>> [PATCH v1 18/36] ext4: snapshot control
>> [PATCH v1 19/36] ext4: snapshot control - init new snapshot
>> [PATCH v1 20/36] ext4: snapshot control - fix new snapshot
>> [PATCH v1 21/36] ext4: snapshot control - reserve disk space for snapshot
>
> Mostly new code in ioctl.c.
>
> 6 files changed, 171 insertions(+), 3 deletions(-)
>
>> [PATCH v1 22/36] ext4: snapshot journaled - increase transaction credits
>> [PATCH v1 23/36] ext4: snapshot journaled - implement journal_release_buffer()
>> [PATCH v1 24/36] ext4: snapshot journaled - bypass to save credits
>> [PATCH v1 25/36] ext4: snapshot journaled - cache last COW tid in journal_head
>
> Helper functions to handle extra COW credits.
>
> 7 files changed, 284 insertions(+), 31 deletions(-)
>
>> [PATCH v1 26/36] ext4: snapshot journaled - trace COW/buffer credits
>
> Trace/debug for extra COW credits.
>
> 3 files changed, 108 insertions(+), 0 deletions(-)
>
>> [PATCH v1 27/36] ext4: snapshot list support
>> [PATCH v1 28/36] ext4: snapshot list - read through to previous snapshot
>
> Not much to gain from dropping snapshot list support...
>
> 2 files changed, 2 insertions(+), 0 deletions(-)
>
>> [PATCH v1 29/36] ext4: snapshot race conditions - concurrent COW bitmap operations
>> [PATCH v1 30/36] ext4: snapshot race conditions - concurrent COW operations
>> [PATCH v1 31/36] ext4: snapshot race conditions - tracked reads
>
> We must handle race conditions.
>
> 2 files changed, 32 insertions(+), 0 deletions(-)
>
>> [PATCH v1 32/36] ext4: snapshot exclude - the exclude bitmap
>
> We do not need the exclude bitmap if we do not shrink/merge snapshots.
>
> 5 files changed, 226 insertions(+), 3 deletions(-)
>
>> [PATCH v1 33/36] ext4: snapshot cleanup
>> [PATCH v1 34/36] ext4: snapshot cleanup - shrink deleted snapshots
>> [PATCH v1 35/36] ext4: snapshot cleanup - merge shrunk snapshots
>
> We could support multiple snapshots without shrinking/merging deleted snapshots,
> but that means that only the disk space of the oldest snapshot is reclaimed on delete.
>
> 2 files changed, 74 insertions(+), 22 deletions(-)
>
>> [PATCH v1 36/36] ext4: snapshot rocompat - enable rw mount
>>
>> fs/ext4/Kconfig | 11 +
>> fs/ext4/Makefile | 3 +
>> fs/ext4/balloc.c | 132 +++
>> fs/ext4/ext4.h | 188 ++++-
>> fs/ext4/ext4_jbd2.c | 162 ++++-
>> fs/ext4/ext4_jbd2.h | 266 ++++++-
>> fs/ext4/extents.c | 157 ++++-
>> fs/ext4/file.c | 11 +-
>> fs/ext4/ialloc.c | 19 +-
>> fs/ext4/inode.c | 668 +++++++++++++--
>> fs/ext4/ioctl.c | 120 +++
>> fs/ext4/mballoc.c | 161 ++++-
>> fs/ext4/move_extent.c | 3 +-
>> fs/ext4/namei.c | 9 +
>> fs/ext4/resize.c | 19 +-
>> fs/ext4/snapshot.c | 1000 ++++++++++++++++++++++
>> fs/ext4/snapshot.h | 690 ++++++++++++++++
>> fs/ext4/snapshot_buffer.c | 393 +++++++++
>> fs/ext4/snapshot_ctl.c | 2002 +++++++++++++++++++++++++++++++++++++++++++++
>> fs/ext4/snapshot_debug.c | 107 +++
>> fs/ext4/snapshot_debug.h | 105 +++
>> fs/ext4/snapshot_inode.c | 960 ++++++++++++++++++++++
>> fs/ext4/super.c | 157 ++++-
>> fs/ext4/xattr.c | 4 +-
>> 24 files changed, 7182 insertions(+), 165 deletions(-)
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-06-22 06:38:25

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH v1 00/30] Ext4 snapshots

On Tue, Jun 21, 2011 at 6:45 PM, Andreas Dilger <[email protected]> wrote:
> On 2011-06-21, at 5:06 AM, "Amir G." <[email protected]> wrote:
>> On Tue, Jun 7, 2011 at 6:07 PM, <[email protected]> wrote:
>>>
>>> I am resending the snapshots patch series as per Lukas's request.
>>> This time, the snapshot*.c files have not been omitted, as in
>>> the previous posting.
>>>
>>> The series is still based on ext4 dev branch sometime in the preparation
>>> for 3.0 merge window. It was not yet rebased on 3.0-rc1, so punch holes
>>> changes have not been addressed yet.
>>>
>>> As always, I advocate online review of the patches at:
>>> https://github.com/amir73il/ext4-snapshots/commits/for-ext4-v1
>>> but if you insist on doing it the old way, I won't complain.
>>>
>>>
>>
>> To answer your question about a possible diet for the snapshot patches,
>> following are some diffstats on groups of patches (by functionality).
>> The diffstats include only ext4 core files (excluding the new snapshot* files).
>>
>> As you can see, removing the shrink/merge functionality for multiple
>> snapshots will remove the need for the exclude bitmap and reduce the changes
>> to core files by ~300 insertions.
>>
>> There is some advantage in not having to add the exclude bitmap to an existing
>> fs, but I think the win is not so big, assuming that we will eventually want
>> the shrink/merge functionality.
>
> Wouldn't that also mean that if shrink/merge is added later, it needs another feature flag, plus added complexity from another row in the feature test matrix?
>

Sure, it would add another row in the matrix, and we don't want that either.
Shrink/merge capability could be derived from the exclude_bitmap
(compatible) feature.
In fact, shrink/merge can still be done without the exclude bitmap. The
shrink will not be as efficient as with exclude_bitmap, but the shrink code
works just the same (i.e. it doesn't have to check for the exclude_bitmap
feature).

BTW, I have just learned that ZFS disk space is not reclaimed at all unless
a complete part of the snapshot history (including the oldest snapshot) is
deleted.
This is essentially how ext4 snapshots would work without the shrink/merge
patches and exclude_bitmap, but it is not very useful for long snapshot
retention policies (e.g. 1 yearly, 3 monthly, 4 weekly, ...).
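
To make the retention problem concrete, here is a toy model of reclaiming
space without shrink/merge. It is only a sketch under the assumption stated
above (space is freed only for a run of deleted snapshots that starts at the
oldest one); the function name, the tuple layout and the sizes are made up for
illustration and do not come from the patches.

```python
def reclaimable_without_merge(snapshots):
    """Return the space freed when deleted snapshots can only be dropped oldest-first.

    snapshots: list of (name, size, deleted) tuples ordered oldest to newest.
    """
    freed = 0
    for name, size, deleted in snapshots:
        if not deleted:
            break  # in this model, a live older snapshot pins everything newer
        freed += size
    return freed


# Hypothetical history, oldest first; sizes are arbitrary units.
history = [
    ("yearly", 50, False),
    ("monthly-1", 20, True),
    ("weekly-1", 5, True),
]
print(reclaimable_without_merge(history))  # 0
```

With a long-lived yearly snapshot at the head of the list, deleting the
monthly and weekly snapshots frees nothing in this model, which is the case
that shrink/merge is meant to address.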


>
>>
>>> [PATCH v1 01/36] ext4: EXT4 snapshots (Experimental)
>>> [PATCH v1 02/36] ext4: snapshot debugging support
>>
>> Generic stuff.
>>
>> 15 files changed, 122 insertions(+), 1 deletions(-)
>>
>>> [PATCH v1 03/36] ext4: snapshot hooks - inside JBD hooks
>>> [PATCH v1 04/36] ext4: snapshot hooks - block bitmap access
>>> [PATCH v1 05/36] ext4: snapshot hooks - delete blocks
>>> [PATCH v1 06/36] ext4: snapshot hooks - move data blocks
>>> [PATCH v1 07/36] ext4: snapshot hooks - direct I/O
>>> [PATCH v1 08/36] ext4: snapshot hooks - move extent file data blocks
>>
>> Most of the code in this group handles MOW.
>>
>> 7 files changed, 550 insertions(+), 68 deletions(-)
>>
>>> [PATCH v1 09/36] ext4: snapshot file
>>> [PATCH v1 10/36] ext4: snapshot file - read through to block device
>>> [PATCH v1 11/36] ext4: snapshot file - permissions
>>> [PATCH v1 12/36] ext4: snapshot file - store on disk
>>> [PATCH v1 13/36] ext4: snapshot file - increase maximum file size limit to 16TB
>>
>> Implementation of special snapshot file.
>>
>> 7 files changed, 220 insertions(+), 14 deletions(-)
>>
>>> [PATCH v1 14/36] ext4: snapshot block operations
>>> [PATCH v1 15/36] ext4: snapshot block operation - copy blocks to snapshot
>>> [PATCH v1 16/36] ext4: snapshot block operation - move blocks to snapshot
>>> [PATCH v1 17/36] ext4: snapshot block operation - copy block bitmap to snapshot
>>
>> Copy/move to snapshot file operations.
>>
>> 5 files changed, 126 insertions(+), 24 deletions(-)
>>
>>> [PATCH v1 18/36] ext4: snapshot control
>>> [PATCH v1 19/36] ext4: snapshot control - init new snapshot
>>> [PATCH v1 20/36] ext4: snapshot control - fix new snapshot
>>> [PATCH v1 21/36] ext4: snapshot control - reserve disk space for snapshot
>>
>> Mostly new code in ioctl.c.
>>
>> 6 files changed, 171 insertions(+), 3 deletions(-)
>>
>>> [PATCH v1 22/36] ext4: snapshot journaled - increase transaction credits
>>> [PATCH v1 23/36] ext4: snapshot journaled - implement journal_release_buffer()
>>> [PATCH v1 24/36] ext4: snapshot journaled - bypass to save credits
>>> [PATCH v1 25/36] ext4: snapshot journaled - cache last COW tid in journal_head
>>
>> Helper functions to handle extra COW credits.
>>
>> 7 files changed, 284 insertions(+), 31 deletions(-)
>>
>>> [PATCH v1 26/36] ext4: snapshot journaled - trace COW/buffer credits
>>
>> Trace/debug for extra COW credits.
>>
>> 3 files changed, 108 insertions(+), 0 deletions(-)
>>
>>> [PATCH v1 27/36] ext4: snapshot list support
>>> [PATCH v1 28/36] ext4: snapshot list - read through to previous snapshot
>>
>> Not much to gain from dropping snapshot list support...
>>
>> 2 files changed, 2 insertions(+), 0 deletions(-)
>>
>>> [PATCH v1 29/36] ext4: snapshot race conditions - concurrent COW bitmap operations
>>> [PATCH v1 30/36] ext4: snapshot race conditions - concurrent COW operations
>>> [PATCH v1 31/36] ext4: snapshot race conditions - tracked reads
>>
>> We must handle race conditions.
>>
>> 2 files changed, 32 insertions(+), 0 deletions(-)
>>
>>> [PATCH v1 32/36] ext4: snapshot exclude - the exclude bitmap
>>
>> We do not need the exclude bitmap if we do not shrink/merge snapshots.
>>
>> 5 files changed, 226 insertions(+), 3 deletions(-)
>>
>>> [PATCH v1 33/36] ext4: snapshot cleanup
>>> [PATCH v1 34/36] ext4: snapshot cleanup - shrink deleted snapshots
>>> [PATCH v1 35/36] ext4: snapshot cleanup - merge shrunk snapshots
>>
>> We could support multiple snapshots without shrinking/merging deleted snapshots,
>> but that means that only the disk space of the oldest snapshot is reclaimed on delete.
>>
>> 2 files changed, 74 insertions(+), 22 deletions(-)
>>
>>> [PATCH v1 36/36] ext4: snapshot rocompat - enable rw mount
>>>
>>>  fs/ext4/Kconfig           |   11 +
>>>  fs/ext4/Makefile          |    3 +
>>>  fs/ext4/balloc.c          |  132 +++
>>>  fs/ext4/ext4.h            |  188 ++++-
>>>  fs/ext4/ext4_jbd2.c       |  162 ++++-
>>>  fs/ext4/ext4_jbd2.h       |  266 ++++++-
>>>  fs/ext4/extents.c         |  157 ++++-
>>>  fs/ext4/file.c            |   11 +-
>>>  fs/ext4/ialloc.c          |   19 +-
>>>  fs/ext4/inode.c           |  668 +++++++++++++--
>>>  fs/ext4/ioctl.c           |  120 +++
>>>  fs/ext4/mballoc.c         |  161 ++++-
>>>  fs/ext4/move_extent.c     |    3 +-
>>>  fs/ext4/namei.c           |    9 +
>>>  fs/ext4/resize.c          |   19 +-
>>>  fs/ext4/snapshot.c        | 1000 ++++++++++++++++++++++
>>>  fs/ext4/snapshot.h        |  690 ++++++++++++++++
>>>  fs/ext4/snapshot_buffer.c |  393 +++++++++
>>>  fs/ext4/snapshot_ctl.c    | 2002 +++++++++++++++++++++++++++++++++++++++++++++
>>>  fs/ext4/snapshot_debug.c  |  107 +++
>>>  fs/ext4/snapshot_debug.h  |  105 +++
>>>  fs/ext4/snapshot_inode.c  |  960 ++++++++++++++++++++++
>>>  fs/ext4/super.c           |  157 ++++-
>>>  fs/ext4/xattr.c           |    4 +-
>>>  24 files changed, 7182 insertions(+), 165 deletions(-)
>>>
>>>