2011-05-09 16:42:13

by Amir G.

Subject: [PATCH RFC 00/30] Ext4 snapshots - core patches

The following patch series includes all the changes to core ext4 files,
which are needed for snapshots support. It adds some ~2K lines of code,
which will never be executed unless the following 2 conditions apply:
1. ext4 is built with CONFIG_EXT4_FS_SNAPSHOT
2. HAS_SNAPSHOT and EXCLUDE_BITMAP features are set by mke2fs/tune2fs

The remaining ~5K lines of code, added in new snapshot* files, were omitted
from this series to simplify the review and because they are not needed
when building ext4 without CONFIG_EXT4_FS_SNAPSHOT.
The full patches will be posted soon after I receive some comments.

Ted concluded my ext4 snapshots talk at LPC 2010 with the statement that
as long as the snapshot patches don't break anything when snapshot support
is disabled, he will pull them, so the main goal when reviewing this series
should be to prove that it is safe to pull the patches.

REVIEWING
---------
To make it easy for reviewers, I will provide some pointers:
- EXT4_SNAPSHOTS(sb) is defined to (0) (in snapshot.h) when ext4 is built
without snapshots support.
- EXT4_SNAPSHOTS(sb) is defined to test the HAS_SNAPSHOT feature when ext4
is built with snapshots support.
- All the ext4_snapshot_XXX functions added by the patches are defined as
NOP macros in snapshot.h when ext4 is built without snapshots support.
- Various flags defined by the patches (like EXT4_MB_HINT_COWING) will never
get set if EXT4_SNAPSHOTS(sb) is false, so testing them will also be false.
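
To illustrate the pointers above, here is a minimal user-space sketch of the
compile-out pattern. It is not code from the series: every demo_* and DEMO_*
name is a hypothetical stand-in, and only the 0x0080 feature bit value is
taken from patch 01. When DEMO_CONFIG_SNAPSHOT is not defined, the feature
test is the constant (0) and the hook is a NOP macro, so the snapshot path
can never be taken.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct super_block: one feature word only. */
struct demo_sb {
	unsigned int ro_compat;
};

#define DEMO_RO_COMPAT_HAS_SNAPSHOT 0x0080

#ifdef DEMO_CONFIG_SNAPSHOT
/* Built with snapshot support: test the feature flag at run time. */
#define DEMO_SNAPSHOTS(sb) \
	(((sb)->ro_compat & DEMO_RO_COMPAT_HAS_SNAPSHOT) != 0)

static int demo_snapshot_get_write_access(struct demo_sb *sb)
{
	/* A real hook would COW the buffer here; the sketch only reports. */
	return DEMO_SNAPSHOTS(sb) ? 1 : 0;
}
#else
/* Built without snapshot support: everything collapses to constants. */
#define DEMO_SNAPSHOTS(sb) (0)
#define demo_snapshot_get_write_access(sb) (0)
#endif

/* Caller side: the same source compiles in both configurations. */
int demo_get_write_access(struct demo_sb *sb)
{
	int err = 0;

	if (!err)
		err = demo_snapshot_get_write_access(sb);
	return err;
}

/* A flag that is only ever set under DEMO_SNAPSHOTS(sb) stays false. */
bool demo_hint_cowing(struct demo_sb *sb)
{
	(void)sb; /* unused when the macro expands to (0) */
	return DEMO_SNAPSHOTS(sb) ? true : false;
}
```

Built without -DDEMO_CONFIG_SNAPSHOT, both functions return 0/false even for
a superblock that has the feature bit set, which is the property a reviewer
would want to confirm for the real macros in snapshot.h.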

MERGING
-------
These patches are based on Ted's current master branch + alloc_semp removal
patches. Although the alloc_semp removal is an independent (and in my eyes
a good) change, it is also required by snapshot patches, to avoid circular
locking dependency during COW allocations.

Merging with Allison's punch holes patches should be straightforward, since
the hard part, namely Yongqiang's split extent refactoring patches, was
already merged by Ted.

Merging with Ted's big alloc patches is going to be a bit more challenging,
since the big alloc patches do a lot of renaming and refactoring. However,
since the has_snapshot and big_alloc features will never work together,
at least testing the code together is not a big concern.

TESTING
-------
Apart from extensive testing of the snapshots feature functionality, we
also ran xfstests with snapshots enabled while taking a snapshot every minute.
More importantly, we ran xfstests with snapshots support disabled at compile
time and with snapshot support enabled but without the has_snapshot feature.
These xfstests were run with 1K and 4K block sizes, on x86 and x86_64.
The 1K blocksize tests are important for the alloc_semp removal patches.
No problems were found apart from one (test 225 hung), which already
exists in the master branch.

CREDITS
-------
The snapshots patches originate in my implementation of the Next3 filesystem
for CTERA Networks.
The porting of the Next3 snapshot patches to ext4 patches is attributed to
Aditya Dani, Shardul Mangade, Piyush Nimbalkar and Harshad Shirwadkar from
the Pune Institute of Computer Technology (PICT).
The implementation of extents move-on-write, delayed move-on-write and much
of the cleanup work on these patches was carried out by Yongqiang Yang from
the Institute of Computing Technology, Chinese Academy of Sciences.


[PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)
[PATCH RFC 02/30] ext4: snapshot debugging support
[PATCH RFC 03/30] ext4: snapshot hooks - inside JBD hooks
[PATCH RFC 04/30] ext4: snapshot hooks - block bitmap access
[PATCH RFC 05/30] ext4: snapshot hooks - delete blocks
[PATCH RFC 06/30] ext4: snapshot hooks - move data blocks
[PATCH RFC 07/30] ext4: snapshot hooks - direct I/O
[PATCH RFC 08/30] ext4: snapshot hooks - move extent file data blocks
[PATCH RFC 09/30] ext4: snapshot file
[PATCH RFC 10/30] ext4: snapshot file - read through to block device
[PATCH RFC 11/30] ext4: snapshot file - permissions
[PATCH RFC 12/30] ext4: snapshot file - store on disk
[PATCH RFC 13/30] ext4: snapshot file - increase maximum file size limit to 16TB
[PATCH RFC 14/30] ext4: snapshot block operations
[PATCH RFC 15/30] ext4: snapshot block operation - copy blocks to snapshot
[PATCH RFC 16/30] ext4: snapshot block operation - move blocks to snapshot
[PATCH RFC 17/30] ext4: snapshot control
[PATCH RFC 18/30] ext4: snapshot control - fix new snapshot
[PATCH RFC 19/30] ext4: snapshot control - reserve disk space for snapshot
[PATCH RFC 20/30] ext4: snapshot journaled - increase transaction credits
[PATCH RFC 21/30] ext4: snapshot journaled - implement journal_release_buffer()
[PATCH RFC 22/30] ext4: snapshot journaled - bypass to save credits
[PATCH RFC 23/30] ext4: snapshot journaled - trace COW/buffer credits
[PATCH RFC 24/30] ext4: snapshot list support
[PATCH RFC 25/30] ext4: snapshot race conditions - concurrent COW operations
[PATCH RFC 26/30] ext4: snapshot race conditions - tracked reads
[PATCH RFC 27/30] ext4: snapshot exclude - the exclude bitmap
[PATCH RFC 28/30] ext4: snapshot cleanup
[PATCH RFC 29/30] ext4: snapshot cleanup - shrink deleted snapshots
[PATCH RFC 30/30] ext4: snapshot rocompat - enable rw mount


2011-05-09 16:42:44

by Amir G.

Subject: [PATCH RFC 02/30] ext4: snapshot debugging support

From: Amir Goldstein <[email protected]>

Control snapshot debug level via debugfs entry /ext4/snapshot-debug
and induce delay tests via debugfs entries /ext4/test-XXX-delay-msec.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/Makefile | 1 +
fs/ext4/mballoc.c | 3 +
fs/ext4/snapshot.h | 9 ++++
fs/ext4/snapshot_debug.h | 105 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 118 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 16a779d..9981306 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -21,3 +21,4 @@ ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot.o snapshot_ctl.o
ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_inode.o snapshot_buffer.o
+ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_debug.o
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4952b7b..42961bf 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2657,10 +2657,13 @@ static void __init ext4_create_debugfs_entry(void)
S_IRUGO | S_IWUSR,
debugfs_dir,
&mb_enable_debug);
+ if (debugfs_dir)
+ ext4_snapshot_create_debugfs_entry(debugfs_dir);
}

static void ext4_remove_debugfs_entry(void)
{
+ ext4_snapshot_remove_debugfs_entry();
debugfs_remove(debugfs_debug);
debugfs_remove(debugfs_dir);
}
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
index a927090..52bfa52 100644
--- a/fs/ext4/snapshot.h
+++ b/fs/ext4/snapshot.h
@@ -18,6 +18,7 @@
#include <linux/version.h>
#include <linux/delay.h>
#include "ext4.h"
+#include "snapshot_debug.h"


/*
@@ -109,6 +110,14 @@
static inline void snapshot_size_extend(struct inode *inode,
ext4_fsblk_t blocks)
{
+#ifdef CONFIG_EXT4_DEBUG
+ ext4_fsblk_t old_blocks = SNAPSHOT_PROGRESS(inode);
+ ext4_fsblk_t max_blocks = SNAPSHOT_BLOCKS(inode);
+
+ /* sleep total of tunable delay unit over 100% progress */
+ snapshot_test_delay_progress(SNAPTEST_DELETE,
+ old_blocks, blocks, max_blocks);
+#endif
i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
}

diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
index e69de29..f893eb1 100644
--- a/fs/ext4/snapshot_debug.h
+++ b/fs/ext4/snapshot_debug.h
@@ -0,0 +1,105 @@
+/*
+ * linux/fs/ext4/snapshot_debug.h
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshot debugging.
+ */
+
+#ifndef _LINUX_EXT4_SNAPSHOT_DEBUG_H
+#define _LINUX_EXT4_SNAPSHOT_DEBUG_H
+
+#if defined(CONFIG_EXT4_FS_SNAPSHOT) && defined(CONFIG_EXT4_DEBUG)
+#include <linux/delay.h>
+
+#define SNAPSHOT_INDENT_MAX 4
+#define SNAPSHOT_INDENT_STR "\t\t\t\t"
+
+#define SNAPTEST_TAKE 0
+#define SNAPTEST_DELETE 1
+#define SNAPTEST_COW 2
+#define SNAPTEST_READ 3
+#define SNAPTEST_BITMAP 4
+#define SNAPSHOT_TESTS_NUM 5
+
+extern const char *snapshot_indent;
+extern u8 snapshot_enable_debug;
+extern u16 snapshot_enable_test[SNAPSHOT_TESTS_NUM];
+extern u8 cow_cache_enabled;
+
+#define snapshot_test_delay(i) \
+ do { \
+ if (snapshot_enable_test[i]) \
+ msleep_interruptible(snapshot_enable_test[i]); \
+ } while (0)
+
+/*
+ * Sleep 1ms every 'blocks_per_ms', amounting to the total test delay
+ * over 100% of progress (when 'to' reaches 'max').
+ * snapshot_enable_test[i] (msec) is limited to 64K and max (blocks_count)
+ * is likely much more than 64K, so 'blocks_per_ms' is likely non zero.
+ */
+#define snapshot_test_delay_progress(i, from, to, max) \
+ do { \
+ if (snapshot_enable_test[i] && \
+ (max) > snapshot_enable_test[i] && \
+ (from) <= (to) && (to) <= (max)) { \
+ unsigned long blocks_per_ms = \
+ do_div((max), snapshot_enable_test[i]); \
+ unsigned long x = do_div((from), blocks_per_ms);\
+ unsigned long y = do_div((to), blocks_per_ms); \
+ if (y > x) \
+ msleep_interruptible(y - x); \
+ } \
+ } while (0)
+
+#define snapshot_debug_l(n, l, f, a...) \
+ do { \
+ if ((n) <= snapshot_enable_debug && \
+ (l) <= SNAPSHOT_INDENT_MAX) { \
+ printk(KERN_DEBUG "snapshot: %s" f, \
+ snapshot_indent - (l), \
+ ## a); \
+ } \
+ } while (0)
+
+#define snapshot_debug(n, f, a...) snapshot_debug_l(n, 0, f, ## a)
+
+#define snapshot_debug_once(n, f, a...) \
+ do { \
+ static bool __once; \
+ if (!__once) { \
+ snapshot_debug(n, f, ## a); \
+ __once = true; \
+ } \
+ } while (0)
+
+extern void ext4_snapshot_create_debugfs_entry(struct dentry *debugfs_dir);
+extern void ext4_snapshot_remove_debugfs_entry(void);
+
+#else
+#define snapshot_enable_debug (0)
+#define snapshot_test_delay(i)
+#define snapshot_test_delay_progress(i, from, to, max)
+#define snapshot_debug(n, f, a...)
+#define snapshot_debug_l(n, l, f, a...)
+#define snapshot_debug_once(n, f, a...)
+#define ext4_snapshot_create_debugfs_entry(d)
+#define ext4_snapshot_remove_debugfs_entry()
+#endif
+
+
+/* debug levels */
+#define SNAP_ERR 1 /* errors and summary */
+#define SNAP_WARN 2 /* warnings */
+#define SNAP_INFO 3 /* info */
+#define SNAP_DEBUG 4 /* debug */
+#define SNAP_DUMP 5 /* dump snapshot file */
+
+#endif /* _LINUX_EXT4_SNAPSHOT_DEBUG_H */
--
1.7.0.4


2011-05-09 16:42:42

by Amir G.

Subject: [PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)

From: Amir Goldstein <[email protected]>

Built-in snapshots support for ext4.
Requires that the filesystem has the has_snapshot and exclude_bitmap
features and that block size is equal to system page size.
Snapshots are not supported with 64bit and meta_bg features and the
filesystem must be mounted with ordered data mode.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/Kconfig | 11 +++
fs/ext4/Makefile | 2 +
fs/ext4/balloc.c | 1 +
fs/ext4/ext4.h | 15 ++++
fs/ext4/ext4_jbd2.c | 3 +
fs/ext4/ext4_jbd2.h | 25 ++++++
fs/ext4/extents.c | 3 +
fs/ext4/file.c | 1 +
fs/ext4/ialloc.c | 1 +
fs/ext4/inode.c | 3 +
fs/ext4/ioctl.c | 3 +
fs/ext4/mballoc.c | 1 +
fs/ext4/namei.c | 1 +
fs/ext4/resize.c | 1 +
fs/ext4/snapshot.h | 193 ++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/super.c | 43 ++++++++++
16 files changed, 307 insertions(+), 0 deletions(-)
create mode 100644 fs/ext4/snapshot.h
create mode 100644 fs/ext4/snapshot_debug.h

diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index 9ed1bb1..8970525 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -83,3 +83,14 @@ config EXT4_DEBUG

If you select Y here, then you will be able to turn on debugging
with a command such as "echo 1 > /sys/kernel/debug/ext4/mballoc-debug"
+
+config EXT4_FS_SNAPSHOT
+ bool "EXT4 snapshots (Experimental)"
+ depends on EXT4_FS && EXPERIMENTAL
+ default n
+ help
+ Built-in snapshots support for ext4.
+ Requires that the filesystem has the has_snapshot and exclude_bitmap
+ features and that block size is equal to system page size.
+ Snapshots are not supported with 64bit and meta_bg features and the
+ filesystem must be mounted with ordered data mode.
diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 058b54d..16a779d 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -19,3 +19,5 @@ ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o page-io.o \
ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
+ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot.o snapshot_ctl.o
+ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_inode.o snapshot_buffer.o
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1288f80..350f502 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -20,6 +20,7 @@
#include "ext4.h"
#include "ext4_jbd2.h"
#include "mballoc.h"
+#include "snapshot.h"

/*
* balloc.c contains the blocks allocation and deallocation routines
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f495b22..ca25e67 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -886,6 +886,20 @@ struct ext4_inode_info {
#define EXT2_FLAGS_SIGNED_HASH 0x0001 /* Signed dirhash in use */
#define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
#define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */
+#define EXT4_FLAGS_IS_SNAPSHOT 0x0010 /* Is a snapshot image */
+#define EXT4_FLAGS_FIX_SNAPSHOT 0x0020 /* Corrupted snapshot */
+#define EXT4_FLAGS_FIX_EXCLUDE 0x0040 /* Bad exclude bitmap */
+
+#define EXT4_SET_FLAGS(sb, mask) \
+ do { \
+ EXT4_SB(sb)->s_es->s_flags |= cpu_to_le32(mask); \
+ } while (0)
+#define EXT4_CLEAR_FLAGS(sb, mask) \
+ do { \
+ EXT4_SB(sb)->s_es->s_flags &= ~cpu_to_le32(mask);\
+ } while (0)
+#define EXT4_TEST_FLAGS(sb, mask) \
+ (EXT4_SB(sb)->s_es->s_flags & cpu_to_le32(mask))

/*
* Mount flags
@@ -1351,6 +1365,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
+#define EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT 0x0080 /* Ext4 has snapshots */

#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 6e272ef..560020d 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -1,8 +1,11 @@
/*
* Interface between ext4 and JBD
+ *
+ * Snapshot metadata COW hooks, Amir Goldstein <[email protected]>, 2011
*/

#include "ext4_jbd2.h"
+#include "snapshot.h"

#include <trace/events/ext4.h>

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index e25e99b..8ffffb1 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -10,6 +10,8 @@
* option, any later version, incorporated herein by reference.
*
* Ext4-specific journaling extensions.
+ *
+ * Snapshot extra COW credits, Amir Goldstein <[email protected]>, 2011
*/

#ifndef _EXT4_JBD2_H
@@ -18,6 +20,7 @@
#include <linux/fs.h>
#include <linux/jbd2.h>
#include "ext4.h"
+#include "snapshot.h"

#define EXT4_JOURNAL(inode) (EXT4_SB((inode)->i_sb)->s_journal)

@@ -272,6 +275,11 @@ static inline int ext4_should_journal_data(struct inode *inode)
return 0;
if (!S_ISREG(inode->i_mode))
return 1;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* snapshots enforce ordered data */
+ return 0;
+#endif
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
return 1;
if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
@@ -285,6 +293,11 @@ static inline int ext4_should_order_data(struct inode *inode)
return 0;
if (!S_ISREG(inode->i_mode))
return 0;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* snapshots enforce ordered data */
+ return 1;
+#endif
if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return 0;
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
@@ -298,6 +311,11 @@ static inline int ext4_should_writeback_data(struct inode *inode)
return 0;
if (EXT4_JOURNAL(inode) == NULL)
return 1;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* snapshots enforce ordered data */
+ return 0;
+#endif
if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
return 0;
if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
@@ -320,6 +338,11 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 0;
if (!S_ISREG(inode->i_mode))
return 0;
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ /* XXX: should snapshots support dioread_nolock? */
+ return 0;
+#endif
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
return 0;
if (ext4_should_journal_data(inode))
@@ -327,4 +350,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 1;
}

+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+#endif
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7296cd1..0c3ea93 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -18,6 +18,8 @@
* You should have received a copy of the GNU General Public Licens
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
+ *
+ * Snapshot move-on-write (MOW), Yongqiang Yang <[email protected]>, 2011
*/

/*
@@ -43,6 +45,7 @@
#include <linux/fiemap.h>
#include "ext4_jbd2.h"
#include "ext4_extents.h"
+#include "snapshot.h"

static int ext4_ext_truncate_extend_restart(handle_t *handle,
struct inode *inode,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 7b80d54..60b3b19 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -28,6 +28,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "snapshot.h"

/*
* Called when an inode is released. Note that this is different
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 2fd3b0e..831d49a 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -28,6 +28,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "snapshot.h"

#include <trace/events/ext4.h>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4ccb6eb..a597ff1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -20,6 +20,8 @@
* ([email protected])
*
* Assorted race fixes, rewrite of ext4_get_block() by Al Viro, 2000
+ *
+ * Snapshot inode extensions, Amir Goldstein <[email protected]>, 2011
*/

#include <linux/module.h>
@@ -49,6 +51,7 @@
#include "ext4_extents.h"

#include <trace/events/ext4.h>
+#include "snapshot.h"

#define MPAGE_DA_EXTENT_TAIL 0x01

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index eb3bc2f..a426332 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -5,6 +5,8 @@
* Remy Card ([email protected])
* Laboratoire MASI - Institut Blaise Pascal
* Universite Pierre et Marie Curie (Paris VI)
+ *
+ * Snapshot control API, Amir Goldstein <[email protected]>, 2011
*/

#include <linux/fs.h>
@@ -17,6 +19,7 @@
#include <asm/uaccess.h>
#include "ext4_jbd2.h"
#include "ext4.h"
+#include "snapshot.h"

long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2be85af..4952b7b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -25,6 +25,7 @@
#include <linux/debugfs.h>
#include <linux/slab.h>
#include <trace/events/ext4.h>
+#include "snapshot.h"

/*
* MUSTDO:
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index ad87584..b70fa13 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -39,6 +39,7 @@

#include "xattr.h"
#include "acl.h"
+#include "snapshot.h"

/*
* define how far ahead to read directories while searching them.
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index a11c00a..ee9b999 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -15,6 +15,7 @@
#include <linux/slab.h>

#include "ext4_jbd2.h"
+#include "snapshot.h"

#define outside(b, first, last) ((b) < (first) || (b) >= (last))
#define inside(b, first, last) ((b) >= (first) && (b) < (last))
diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
new file mode 100644
index 0000000..a927090
--- /dev/null
+++ b/fs/ext4/snapshot.h
@@ -0,0 +1,193 @@
+/*
+ * linux/fs/ext4/snapshot.h
+ *
+ * Written by Amir Goldstein <[email protected]>, 2008
+ *
+ * Copyright (C) 2008-2011 CTERA Networks
+ *
+ * This file is part of the Linux kernel and is made available under
+ * the terms of the GNU General Public License, version 2, or at your
+ * option, any later version, incorporated herein by reference.
+ *
+ * Ext4 snapshot extensions.
+ */
+
+#ifndef _LINUX_EXT4_SNAPSHOT_H
+#define _LINUX_EXT4_SNAPSHOT_H
+
+#include <linux/version.h>
+#include <linux/delay.h>
+#include "ext4.h"
+
+
+/*
+ * use signed 64bit for snapshot image addresses
+ * negative addresses are used to reference snapshot meta blocks
+ */
+#define ext4_snapblk_t long long
+
+/*
+ * We assert that file system block size == page size (on mount time)
+ * and that the first file system block is block 0 (on snapshot create).
+ * Snapshot inode direct blocks are reserved for snapshot meta blocks.
+ * Snapshot inode single indirect blocks are not used.
+ * Snapshot image starts at the first double indirect block, so all blocks in
+ * Snapshot image block group blocks are mapped by a single DIND block:
+ * 4k: 32k blocks_per_group = 32 IND (4k) blocks = 32 groups per DIND
+ * 8k: 64k blocks_per_group = 32 IND (8k) blocks = 64 groups per DIND
+ * 16k: 128k blocks_per_group = 32 IND (16k) blocks = 128 groups per DIND
+ */
+#define SNAPSHOT_BLOCK_SIZE PAGE_SIZE
+#define SNAPSHOT_BLOCK_SIZE_BITS PAGE_SHIFT
+#define SNAPSHOT_ADDR_PER_BLOCK (SNAPSHOT_BLOCK_SIZE / sizeof(__u32))
+#define SNAPSHOT_ADDR_PER_BLOCK_BITS (SNAPSHOT_BLOCK_SIZE_BITS - 2)
+#define SNAPSHOT_DIR_BLOCKS EXT4_NDIR_BLOCKS
+#define SNAPSHOT_IND_BLOCKS SNAPSHOT_ADDR_PER_BLOCK
+
+#define SNAPSHOT_BLOCKS_PER_GROUP_BITS (SNAPSHOT_BLOCK_SIZE_BITS + 3)
+#define SNAPSHOT_BLOCKS_PER_GROUP \
+ (1<<SNAPSHOT_BLOCKS_PER_GROUP_BITS) /* 8*PAGE_SIZE */
+#define SNAPSHOT_BLOCK_GROUP(block) \
+ ((block)>>SNAPSHOT_BLOCKS_PER_GROUP_BITS)
+#define SNAPSHOT_BLOCK_GROUP_OFFSET(block) \
+ ((block)&(SNAPSHOT_BLOCKS_PER_GROUP-1))
+#define SNAPSHOT_BLOCK_TUPLE(block) \
+ (ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP_OFFSET(block), \
+ (ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP(block)
+#define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS \
+ (SNAPSHOT_BLOCKS_PER_GROUP_BITS-SNAPSHOT_ADDR_PER_BLOCK_BITS)
+#define SNAPSHOT_IND_PER_BLOCK_GROUP \
+ (1<<SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */
+#define SNAPSHOT_DIND_BLOCK_GROUPS_BITS \
+ (SNAPSHOT_ADDR_PER_BLOCK_BITS-SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
+#define SNAPSHOT_DIND_BLOCK_GROUPS \
+ (1<<SNAPSHOT_DIND_BLOCK_GROUPS_BITS)
+
+#define SNAPSHOT_BLOCK_OFFSET \
+ (SNAPSHOT_DIR_BLOCKS+SNAPSHOT_IND_BLOCKS)
+#define SNAPSHOT_BLOCK(iblock) \
+ ((ext4_snapblk_t)(iblock) - SNAPSHOT_BLOCK_OFFSET)
+#define SNAPSHOT_IBLOCK(block) \
+ (ext4_fsblk_t)((block) + SNAPSHOT_BLOCK_OFFSET)
+
+
+
+#ifdef CONFIG_EXT4_FS_SNAPSHOT
+#define EXT4_SNAPSHOT_VERSION "ext4 snapshot v1.0.13-6 (2-May-2010)"
+
+#define SNAPSHOT_BYTES_OFFSET \
+ (SNAPSHOT_BLOCK_OFFSET << SNAPSHOT_BLOCK_SIZE_BITS)
+#define SNAPSHOT_ISIZE(size) \
+ ((size) + SNAPSHOT_BYTES_OFFSET)
+/* Snapshot block device size is recorded in i_disksize */
+#define SNAPSHOT_SET_SIZE(inode, size) \
+ (EXT4_I(inode)->i_disksize = SNAPSHOT_ISIZE(size))
+#define SNAPSHOT_SIZE(inode) \
+ (EXT4_I(inode)->i_disksize - SNAPSHOT_BYTES_OFFSET)
+#define SNAPSHOT_SET_BLOCKS(inode, blocks) \
+ SNAPSHOT_SET_SIZE((inode), \
+ (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS)
+#define SNAPSHOT_BLOCKS(inode) \
+ (ext4_fsblk_t)(SNAPSHOT_SIZE(inode) >> SNAPSHOT_BLOCK_SIZE_BITS)
+/* Snapshot shrink/merge/clean progress is exported via i_size */
+#define SNAPSHOT_PROGRESS(inode) \
+ (ext4_fsblk_t)((inode)->i_size >> SNAPSHOT_BLOCK_SIZE_BITS)
+#define SNAPSHOT_SET_ENABLED(inode) \
+ i_size_write((inode), SNAPSHOT_SIZE(inode))
+#define SNAPSHOT_SET_PROGRESS(inode, blocks) \
+ snapshot_size_extend((inode), (blocks))
+/* Disabled/deleted snapshot i_size is 1 block, to allow read of super block */
+#define SNAPSHOT_SET_DISABLED(inode) \
+ snapshot_size_truncate((inode), 1)
+/* Removed snapshot i_size and i_disksize are 0, since all blocks were freed */
+#define SNAPSHOT_SET_REMOVED(inode) \
+ do { \
+ EXT4_I(inode)->i_disksize = 0; \
+ snapshot_size_truncate((inode), 0); \
+ } while (0)
+
+static inline void snapshot_size_extend(struct inode *inode,
+ ext4_fsblk_t blocks)
+{
+ i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
+}
+
+static inline void snapshot_size_truncate(struct inode *inode,
+ ext4_fsblk_t blocks)
+{
+ loff_t i_size = (loff_t)blocks << SNAPSHOT_BLOCK_SIZE_BITS;
+
+ i_size_write(inode, i_size);
+ truncate_inode_pages(&inode->i_data, i_size);
+}
+
+/* Is ext4 configured for snapshots support? */
+static inline int EXT4_SNAPSHOTS(struct super_block *sb)
+{
+ return EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT);
+}
+
+#define ext4_snapshot_cow(handle, inode, block, bh, cow) 0
+
+#define ext4_snapshot_move(handle, inode, block, pcount, move) (0)
+
+/*
+ * Block access functions
+ */
+
+
+
+/* snapshot_ctl.c */
+
+
+static inline int init_ext4_snapshot(void)
+{
+ return 0;
+}
+
+static inline void exit_ext4_snapshot(void)
+{
+}
+
+
+
+
+
+#else /* CONFIG_EXT4_FS_SNAPSHOT */
+
+/* Snapshot NOP macros */
+#define EXT4_SNAPSHOTS(sb) (0)
+#define SNAPMAP_ISCOW(cmd) (0)
+#define SNAPMAP_ISMOVE(cmd) (0)
+#define SNAPMAP_ISSYNC(cmd) (0)
+#define IS_COWING(handle) (0)
+
+#define ext4_snapshot_load(sb, es, ro) (0)
+#define ext4_snapshot_destroy(sb)
+#define init_ext4_snapshot() (0)
+#define exit_ext4_snapshot()
+#define ext4_snapshot_active(sbi) (0)
+#define ext4_snapshot_file(inode) (0)
+#define ext4_snapshot_should_move_data(inode) (0)
+#define ext4_snapshot_test_excluded(handle, inode, block_to_free, count) (0)
+#define ext4_snapshot_list(inode) (0)
+#define ext4_snapshot_get_flags(ei, filp)
+#define ext4_snapshot_set_flags(handle, inode, flags) (0)
+#define ext4_snapshot_take(inode) (0)
+#define ext4_snapshot_update(inode_i_sb, cleanup, zero) (0)
+#define ext4_snapshot_has_active(sb) (NULL)
+#define ext4_snapshot_get_bitmap_access(handle, sb, grp, bh) (0)
+#define ext4_snapshot_get_write_access(handle, inode, bh) (0)
+#define ext4_snapshot_get_create_access(handle, bh) (0)
+#define ext4_snapshot_excluded(ac_inode) (0)
+#define ext4_snapshot_get_delete_access(handle, inode, block, pcount) (0)
+
+#define ext4_snapshot_get_move_access(handle, inode, block, pcount, move) (0)
+#define ext4_snapshot_start_pending_cow(sbh)
+#define ext4_snapshot_end_pending_cow(sbh)
+#define ext4_snapshot_is_active(inode) (0)
+#define ext4_snapshot_mow_in_tid(inode) (1)
+
+#endif /* CONFIG_EXT4_FS_SNAPSHOT */
+#endif /* _LINUX_EXT4_SNAPSHOT_H */
diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
new file mode 100644
index 0000000..e69de29
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 414167a..2c345d1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -48,6 +48,7 @@
#include "xattr.h"
#include "acl.h"
#include "mballoc.h"
+#include "snapshot.h"

#define CREATE_TRACE_POINTS
#include <trace/events/ext4.h>
@@ -2612,6 +2613,24 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
return 0;
}
}
+ /* Enforce snapshots requirements: */
+ if (EXT4_SNAPSHOTS(sb)) {
+ if (EXT4_HAS_INCOMPAT_FEATURE(sb,
+ EXT4_FEATURE_INCOMPAT_META_BG|
+ EXT4_FEATURE_INCOMPAT_64BIT)) {
+ ext4_msg(sb, KERN_ERR,
+ "has_snapshot feature cannot be mixed with "
+ "features: meta_bg, 64bit");
+ return 0;
+ }
+ if (EXT4_TEST_FLAGS(sb, EXT4_FLAGS_IS_SNAPSHOT)) {
+ ext4_msg(sb, KERN_ERR,
+ "A snapshot image must be mounted read-only. "
+ "If this is an exported snapshot image, you "
+ "must run fsck -xy to make it writable.");
+ return 0;
+ }
+ }
return 1;
}

@@ -3194,6 +3213,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);

+ /* Enforce snapshots blocksize == pagesize */
+ if (EXT4_SNAPSHOTS(sb) && blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR,
+ "snapshots require that filesystem blocksize "
+ "(%d) be equal to system page size (%lu)",
+ blocksize, PAGE_SIZE);
+ goto failed_mount;
+ }
+
if (blocksize < EXT4_MIN_BLOCK_SIZE ||
blocksize > EXT4_MAX_BLOCK_SIZE) {
ext4_msg(sb, KERN_ERR,
@@ -3540,6 +3568,15 @@ no_journal:
goto failed_mount_wq;
}

+ /* Enforce journal ordered mode with snapshots */
+ if (EXT4_SNAPSHOTS(sb) && !(sb->s_flags & MS_RDONLY) &&
+ (!EXT4_SB(sb)->s_journal ||
+ test_opt(sb, DATA_FLAGS) != EXT4_MOUNT_ORDERED_DATA)) {
+ ext4_msg(sb, KERN_ERR,
+ "snapshots require journal ordered mode");
+ goto failed_mount4;
+ }
+
/*
* The jbd2_journal_load will have done any necessary log recovery,
* so we can safely mount the rest of the filesystem now.
@@ -4878,10 +4915,15 @@ static int __init ext4_init_fs(void)
err = register_filesystem(&ext4_fs_type);
if (err)
goto out;
+ err = init_ext4_snapshot();
+ if (err)
+ goto out_fs;

ext4_li_info = NULL;
mutex_init(&ext4_li_mtx);
return 0;
+out_fs:
+ unregister_filesystem(&ext4_fs_type);
out:
unregister_as_ext2();
unregister_as_ext3();
@@ -4905,6 +4947,7 @@ out7:

static void __exit ext4_exit_fs(void)
{
+ exit_ext4_snapshot();
ext4_destroy_lazyinit_thread();
unregister_as_ext2();
unregister_as_ext3();
--
1.7.0.4


2011-05-09 16:42:47

by Amir G.

Subject: [PATCH RFC 03/30] ext4: snapshot hooks - inside JBD hooks

From: Amir Goldstein <[email protected]>

Before every metadata buffer write, the journal API is called,
namely, one of the ext4_journal_get_XXX_access() functions.
We use these journal hooks to call the snapshot API, namely
ext4_snapshot_get_XXX_access(), to COW the metadata buffer before
it is modified for the first time.
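
The hook ordering described above can be sketched in user space as follows.
This is only an illustration under stated assumptions: the demo_* names and
the cowed/cow_copies bookkeeping are hypothetical, and the real snapshot
functions are not part of this series. The sketch shows the journal access
call running first, the snapshot hook running only if it succeeded and the
caller did not ask to exclude the buffer, and the copy happening once per
buffer.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head. */
struct demo_buffer {
	int journaled;   /* journal write access was granted */
	int cowed;       /* buffer was already copied to the snapshot */
	int cow_copies;  /* number of copies made, for illustration */
};

/* Stand-in for jbd2_journal_get_write_access(); always succeeds here. */
static int demo_journal_get_write_access(struct demo_buffer *bh)
{
	bh->journaled = 1;
	return 0;
}

/* Stand-in for the snapshot hook: copy only before the first change. */
static int demo_snapshot_get_write_access(struct demo_buffer *bh)
{
	if (!bh->cowed) {
		bh->cowed = 1;
		bh->cow_copies++;
	}
	return 0;
}

/*
 * Journal hook with the snapshot call inside it. 'exclude' skips the
 * snapshot hook, mirroring the extra argument added by the patch.
 */
int demo_get_write_access(struct demo_buffer *bh, int exclude)
{
	int err = demo_journal_get_write_access(bh);

	if (!err && !exclude)
		err = demo_snapshot_get_write_access(bh);
	return err;
}
```

Calling demo_get_write_access() twice on the same buffer with exclude set
to 0 leaves cow_copies at 1, while a buffer accessed with exclude set to 1
is journaled but never copied.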

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 9 +++++++--
fs/ext4/ext4_jbd2.h | 15 +++++++++++----
fs/ext4/extents.c | 3 ++-
fs/ext4/inode.c | 22 +++++++++++++++-------
fs/ext4/move_extent.c | 3 ++-
5 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 560020d..833969b 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -23,13 +23,16 @@ int __ext4_journal_get_undo_access(const char *where, unsigned int line,
return err;
}

-int __ext4_journal_get_write_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh)
+int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
+ handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, int exclude)
{
int err = 0;

if (ext4_handle_valid(handle)) {
err = jbd2_journal_get_write_access(handle, bh);
+ if (!err && !exclude)
+ err = ext4_snapshot_get_write_access(handle, inode, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
@@ -111,6 +114,8 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,

if (ext4_handle_valid(handle)) {
err = jbd2_journal_get_create_access(handle, bh);
+ if (!err)
+ err = ext4_snapshot_get_create_access(handle, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 8ffffb1..75662f7 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -132,9 +132,9 @@ void ext4_journal_abort_handle(const char *caller, unsigned int line,
int __ext4_journal_get_undo_access(const char *where, unsigned int line,
handle_t *handle, struct buffer_head *bh);

-int __ext4_journal_get_write_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh);
-
+int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
+ handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, int exclude);
int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
int is_metadata, struct inode *inode,
struct buffer_head *bh, ext4_fsblk_t blocknr);
@@ -151,8 +151,15 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,

#define ext4_journal_get_undo_access(handle, bh) \
__ext4_journal_get_undo_access(__func__, __LINE__, (handle), (bh))
+#define ext4_journal_get_write_access_exclude(handle, bh) \
+ __ext4_journal_get_write_access_inode(__func__, __LINE__, \
+ (handle), NULL, (bh), 1)
#define ext4_journal_get_write_access(handle, bh) \
- __ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
+ __ext4_journal_get_write_access_inode(__func__, __LINE__, \
+ (handle), NULL, (bh), 0)
+#define ext4_journal_get_write_access_inode(handle, inode, bh) \
+ __ext4_journal_get_write_access_inode(__func__, __LINE__, \
+ (handle), (inode), (bh), 0)
#define ext4_forget(handle, is_metadata, inode, bh, block_nr) \
__ext4_forget(__func__, __LINE__, (handle), (is_metadata), (inode), \
(bh), (block_nr))
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0c3ea93..c8cab3d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -77,7 +77,8 @@ static int ext4_ext_get_access(handle_t *handle, struct inode *inode,
{
if (path->p_bh) {
/* path points to block */
- return ext4_journal_get_write_access(handle, path->p_bh);
+ return ext4_journal_get_write_access_inode(handle,
+ inode, path->p_bh);
}
/* path points to leaf/index in inode body */
/* we use in-core data, no need to protect them */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a597ff1..b848072 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -874,7 +874,8 @@ static int ext4_splice_branch(handle_t *handle, struct inode *inode,
*/
if (where->bh) {
BUFFER_TRACE(where->bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, where->bh);
+ err = ext4_journal_get_write_access_inode(handle, inode,
+ where->bh);
if (err)
goto err_out;
}
@@ -4172,7 +4173,8 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
goto out_err;
if (bh) {
BUFFER_TRACE(bh, "retaking write access");
- err = ext4_journal_get_write_access(handle, bh);
+ err = ext4_journal_get_write_access_inode(handle,
+ inode, bh);
if (unlikely(err))
goto out_err;
}
@@ -4223,7 +4225,8 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,

if (this_bh) { /* For indirect block */
BUFFER_TRACE(this_bh, "get_write_access");
- err = ext4_journal_get_write_access(handle, this_bh);
+ err = ext4_journal_get_write_access_inode(handle, inode,
+ this_bh);
/* Important: if we can't update the indirect pointers
* to the blocks, we can't free them. */
if (err)
@@ -4386,8 +4389,8 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
* pointed to by an indirect block: journal it
*/
BUFFER_TRACE(parent_bh, "get_write_access");
- if (!ext4_journal_get_write_access(handle,
- parent_bh)){
+ if (!ext4_journal_get_write_access_inode(
+ handle, inode, parent_bh)){
*p = 0;
BUFFER_TRACE(parent_bh,
"call ext4_handle_dirty_metadata");
@@ -4759,9 +4762,14 @@ has_buffer:

int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
{
- /* We have all inode data except xattrs in memory here. */
- return __ext4_get_inode_loc(inode, iloc,
+ int in_mem = (!EXT4_SNAPSHOTS(inode->i_sb) &&
!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
+
+ /*
+ * We have all inode's data except xattrs in memory here,
+ * but we must always read-in the entire inode block for COW.
+ */
+ return __ext4_get_inode_loc(inode, iloc, in_mem);
}

void ext4_set_inode_flags(struct inode *inode)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index b9f3e78..ad5409a 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -421,7 +421,8 @@ mext_insert_extents(handle_t *handle, struct inode *orig_inode,

if (depth) {
/* Register to journal */
- ret = ext4_journal_get_write_access(handle, orig_path->p_bh);
+ ret = ext4_journal_get_write_access_inode(handle,
+ orig_inode, orig_path->p_bh);
if (ret)
return ret;
}
--
1.7.0.4


2011-05-09 16:42:50

by Amir G.

Subject: [PATCH RFC 04/30] ext4: snapshot hooks - block bitmap access

From: Amir Goldstein <[email protected]>

The API ext4_handle_get_bitmap_access() is used instead of
ext4_journal_get_write_access() before modifying a block bitmap
while allocating or freeing blocks. The bitmap access API is
used to initialize the COW bitmap for that group.
The old ext4_journal_get_undo_access() API was removed because it
is no longer used in the code.
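
A user-space sketch of lazy COW bitmap initialization follows. It rests on assumptions: struct group is hypothetical, and copying the block bitmap on first access is only one plausible reading of "initialize the COW bitmap"; the real ext4_snapshot_get_bitmap_access() lives in the snapshot files omitted from this series.

```c
#include <assert.h>
#include <string.h>

#define BITMAP_BYTES 8

/* Hypothetical per-group state, for illustration only. */
struct group {
	unsigned char block_bitmap[BITMAP_BYTES];
	unsigned char cow_bitmap[BITMAP_BYTES];
	int cow_initialized;
};

/*
 * Stand-in for ext4_snapshot_get_bitmap_access(): the first time a
 * group's block bitmap is about to be modified, record its current
 * contents as the COW bitmap (assumed behaviour).
 */
static int snapshot_get_bitmap_access(struct group *g)
{
	assert(g != NULL);
	if (!g->cow_initialized) {
		memcpy(g->cow_bitmap, g->block_bitmap, BITMAP_BYTES);
		g->cow_initialized = 1;
	}
	return 0;
}
```

Later calls for the same group are no-ops, so subsequent allocations do not change what the COW bitmap recorded.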

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 10 +++++++---
fs/ext4/ext4_jbd2.h | 10 ++++++----
fs/ext4/mballoc.c | 7 ++++---
3 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 833969b..c44c362 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -9,13 +9,17 @@

#include <trace/events/ext4.h>

-int __ext4_journal_get_undo_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh)
+int __ext4_handle_get_bitmap_access(const char *where, unsigned int line,
+ handle_t *handle, struct super_block *sb,
+ ext4_group_t group, struct buffer_head *bh)
{
int err = 0;

if (ext4_handle_valid(handle)) {
- err = jbd2_journal_get_undo_access(handle, bh);
+ err = jbd2_journal_get_write_access(handle, bh);
+ if (!err)
+ err = ext4_snapshot_get_bitmap_access(handle, sb,
+ group, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 75662f7..707b810 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -129,8 +129,9 @@ void ext4_journal_abort_handle(const char *caller, unsigned int line,
const char *err_fn,
struct buffer_head *bh, handle_t *handle, int err);

-int __ext4_journal_get_undo_access(const char *where, unsigned int line,
- handle_t *handle, struct buffer_head *bh);
+int __ext4_handle_get_bitmap_access(const char *where, unsigned int line,
+ handle_t *handle, struct super_block *sb,
+ ext4_group_t group, struct buffer_head *bh);

int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
handle_t *handle, struct inode *inode,
@@ -149,8 +150,9 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
int __ext4_handle_dirty_super(const char *where, unsigned int line,
handle_t *handle, struct super_block *sb);

-#define ext4_journal_get_undo_access(handle, bh) \
- __ext4_journal_get_undo_access(__func__, __LINE__, (handle), (bh))
+#define ext4_handle_get_bitmap_access(handle, sb, group, bh) \
+ __ext4_handle_get_bitmap_access(__func__, __LINE__, \
+ (handle), (sb), (group), (bh))
#define ext4_journal_get_write_access_exclude(handle, bh) \
__ext4_journal_get_write_access_inode(__func__, __LINE__, \
(handle), NULL, (bh), 1)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 42961bf..e8bfd8d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2747,7 +2747,8 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
if (!bitmap_bh)
goto out_err;

- err = ext4_journal_get_write_access(handle, bitmap_bh);
+ err = ext4_handle_get_bitmap_access(handle, sb, ac->ac_b_ex.fe_group,
+ bitmap_bh);
if (err)
goto out_err;

@@ -4543,7 +4544,7 @@ do_more:
}

BUFFER_TRACE(bitmap_bh, "getting write access");
- err = ext4_journal_get_write_access(handle, bitmap_bh);
+ err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
if (err)
goto error_return;

@@ -4692,7 +4693,7 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
}

BUFFER_TRACE(bitmap_bh, "getting write access");
- err = ext4_journal_get_write_access(handle, bitmap_bh);
+ err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
if (err)
goto error_return;

--
1.7.0.4


2011-05-09 16:42:58

by Amir G.

Subject: [PATCH RFC 05/30] ext4: snapshot hooks - delete blocks

From: Amir Goldstein <[email protected]>

Before deleting file blocks in ext4_free_blocks(),
we call the snapshot API ext4_snapshot_get_delete_access(),
to optionally move the blocks to the snapshot file instead of
freeing them.
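
The move-or-free loop can be sketched in user space as below. This is a simplified model, not the kernel code: the arrays and get_delete_access() are hypothetical, and the group-boundary "overflow" handling of the real function is left out. As in the patch, the hook trims maxblocks to the run it can handle in one call and returns a positive value when that run was moved to the snapshot.

```c
#include <assert.h>

#define NBLOCKS 16

/* Hypothetical model state: one flag per block. */
static int in_snapshot[NBLOCKS];	/* block is in use by the snapshot */
static int moved[NBLOCKS];		/* block was moved to the snapshot */
static int freed[NBLOCKS];		/* block was freed */

/*
 * Stand-in for ext4_snapshot_get_delete_access(): trims *maxblocks to
 * the leading run of blocks with the same state. Returns the run
 * length if the run was moved to the snapshot, 0 if it may be freed.
 */
static int get_delete_access(int block, int *maxblocks)
{
	int want_move = in_snapshot[block];
	int n = 0;

	while (n < *maxblocks && in_snapshot[block + n] == want_move) {
		if (want_move)
			moved[block + n] = 1;
		n++;
	}
	*maxblocks = n;
	return want_move ? n : 0;
}

static void free_blocks(int block, int count)
{
	assert(block >= 0 && count >= 0 && block + count <= NBLOCKS);
	while (count > 0) {
		int maxblocks = count;
		int ret = get_delete_access(block, &maxblocks);
		int i;

		if (ret == 0)	/* not needed by snapshot: really free */
			for (i = 0; i < maxblocks; i++)
				freed[block + i] = 1;
		/* either way, skip past the blocks just handled */
		block += maxblocks;
		count -= maxblocks;
	}
}
```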

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 10 +++++++---
fs/ext4/mballoc.c | 30 +++++++++++++++++++++++++++---
2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ca25e67..4e9e46a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1729,9 +1729,13 @@ extern int ext4_mb_reserve_blocks(struct super_block *, int);
extern void ext4_discard_preallocations(struct inode *);
extern int __init ext4_init_mballoc(void);
extern void ext4_exit_mballoc(void);
-extern void ext4_free_blocks(handle_t *handle, struct inode *inode,
- struct buffer_head *bh, ext4_fsblk_t block,
- unsigned long count, int flags);
+extern void __ext4_free_blocks(const char *where, unsigned int line,
+ handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, ext4_fsblk_t block,
+ unsigned long count, int flags);
+#define ext4_free_blocks(handle, inode, bh, block, count, flags) \
+ __ext4_free_blocks(__func__, __LINE__ , (handle), (inode), (bh), \
+ (block), (count), (flags))
extern int ext4_mb_add_groupinfo(struct super_block *sb,
ext4_group_t i, struct ext4_group_desc *desc);
extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e8bfd8d..3b1c6d1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4445,9 +4445,9 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
* @count: number of blocks to count
* @metadata: Are these metadata blocks
*/
-void ext4_free_blocks(handle_t *handle, struct inode *inode,
- struct buffer_head *bh, ext4_fsblk_t block,
- unsigned long count, int flags)
+void __ext4_free_blocks(const char *where, unsigned int line, handle_t *handle,
+ struct inode *inode, struct buffer_head *bh,
+ ext4_fsblk_t block, unsigned long count, int flags)
{
struct buffer_head *bitmap_bh = NULL;
struct super_block *sb = inode->i_sb;
@@ -4461,6 +4461,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
struct ext4_buddy e4b;
int err = 0;
int ret;
+ int maxblocks;

if (bh) {
if (block)
@@ -4543,6 +4544,29 @@ do_more:
goto error_return;
}

+ maxblocks = count;
+ ret = ext4_snapshot_get_delete_access(handle, inode,
+ block, &maxblocks);
+ if (ret < 0) {
+ ext4_journal_abort_handle(where, line, __func__,
+ NULL, handle, ret);
+ err = ret;
+ goto error_return;
+ }
+ if (ret > 0) {
+ /* 'ret' blocks were moved to snapshot - skip them */
+ block += maxblocks;
+ count -= maxblocks;
+ count += overflow;
+ cond_resched();
+ if (count > 0)
+ goto do_more;
+ /* no more blocks to free/move to snapshot */
+ ext4_mark_super_dirty(sb);
+ goto error_return;
+ }
+ overflow += count - maxblocks;
+ count = maxblocks;
BUFFER_TRACE(bitmap_bh, "getting write access");
err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
if (err)
--
1.7.0.4


2011-05-09 16:43:09

by Amir G.

Subject: [PATCH RFC 06/30] ext4: snapshot hooks - move data blocks

From: Amir Goldstein <[email protected]>

Before every regular file data buffer write, the function
ext4_get_block() is called to map the buffer to disk. We add a
new function ext4_get_block_mow(), which is called when the blocks
may need to be moved to the snapshot. We use this hook to call the
snapshot API ext4_snapshot_get_move_access(), to optionally move
the block to the snapshot file.
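
The two-call protocol used by ext4_ind_map_blocks() in this patch (lookup first, then move) can be modelled roughly as follows. All names here are hypothetical user-space stand-ins; the flag values and the trivial allocator are assumptions made for the sketch only.

```c
#include <assert.h>
#include <stddef.h>

#define GET_BLOCKS_CREATE	0x1	/* stand-in for EXT4_GET_BLOCKS_CREATE */
#define GET_BLOCKS_MOW		0x2	/* ..._MOVE_ON_WRITE */
#define MAP_MAPPED		0x1	/* stand-in for EXT4_MAP_MAPPED */
#define MAP_REMAP		0x2	/* stand-in for EXT4_MAP_REMAP */

struct map {
	int pblk;
	unsigned int flags;
};

/* Hypothetical model of one logical file block. */
struct file_block {
	int pblk;		/* physical block the file points to */
	int snapshot_pblk;	/* block handed to the snapshot, 0 if none */
	int needs_move;		/* snapshot still depends on pblk */
};

static int next_free = 100;	/* trivial allocator for the sketch */

static int map_block(struct file_block *fb, struct map *m, unsigned int flags)
{
	assert(fb != NULL && m != NULL);
	m->pblk = fb->pblk;
	m->flags = MAP_MAPPED;
	if (!(flags & GET_BLOCKS_MOW) || !fb->needs_move)
		return 1;
	if (!(flags & GET_BLOCKS_CREATE)) {
		/* first call: only report that a remap is required */
		m->flags |= MAP_REMAP;
		return 1;
	}
	/* second call: old block goes to the snapshot, file gets a new one */
	fb->snapshot_pblk = fb->pblk;
	fb->pblk = next_free++;
	fb->needs_move = 0;
	m->pblk = fb->pblk;
	return 1;
}
```

The snapshot keeps the old physical block untouched, so no data is copied; only the file's mapping changes.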

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 15 +++-
fs/ext4/ext4_jbd2.h | 17 ++++
fs/ext4/inode.c | 242 +++++++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/mballoc.c | 23 +++++
4 files changed, 280 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4e9e46a..013eec2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -156,9 +156,10 @@ struct ext4_allocation_request {
#define EXT4_MAP_UNWRITTEN (1 << BH_Unwritten)
#define EXT4_MAP_BOUNDARY (1 << BH_Boundary)
#define EXT4_MAP_UNINIT (1 << BH_Uninit)
+#define EXT4_MAP_REMAP (1 << BH_Remap)
#define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
- EXT4_MAP_UNINIT)
+ EXT4_MAP_UNINIT | EXT4_MAP_REMAP)

struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
@@ -525,6 +526,12 @@ struct ext4_new_group_data {
/* Convert extent to initialized after IO complete */
#define EXT4_GET_BLOCKS_IO_CONVERT_EXT (EXT4_GET_BLOCKS_CONVERT|\
EXT4_GET_BLOCKS_CREATE_UNINIT_EXT)
+ /* Look up if mapped block is used by snapshot,
+ * if so and EXT4_GET_BLOCKS_CREATE is set, move it to snapshot
+ * and allocate a new block for new data.
+ * if EXT4_GET_BLOCKS_CREATE is not set, return REMAP flags.
+ */
+#define EXT4_GET_BLOCKS_MOVE_ON_WRITE 0x0100

/*
* Flags used by ext4_free_blocks
@@ -2128,10 +2135,16 @@ extern int ext4_bio_write_page(struct ext4_io_submit *io,
enum ext4_state_bits {
BH_Uninit /* blocks are allocated but uninitialized on disk */
= BH_JBDPrivateStart,
+ BH_Remap, /* Data block needs to be remapped,
+ * now used by snapshot to do mow
+ */
+ BH_Partial_Write, /* Buffer should be uptodate before write */
};

BUFFER_FNS(Uninit, uninit)
TAS_BUFFER_FNS(Uninit, uninit)
+BUFFER_FNS(Remap, remap)
+BUFFER_FNS(Partial_Write, partial_write)

/*
* Add new method to test wether block and inode bitmaps are properly
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 707b810..1c119cc 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -360,5 +360,22 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
}

#ifdef CONFIG_EXT4_FS_SNAPSHOT
+/*
+ * check if @inode data blocks should be moved-on-write
+ */
+static inline int ext4_snapshot_should_move_data(struct inode *inode)
+{
+ if (!EXT4_SNAPSHOTS(inode->i_sb))
+ return 0;
+ if (EXT4_JOURNAL(inode) == NULL)
+ return 0;
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ return 0;
+ /* when a data block is journaled, it is already COWed as metadata */
+ if (ext4_should_journal_data(inode))
+ return 0;
+ return 1;
+}
+
#endif
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b848072..3ed64bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -78,7 +78,8 @@ static int noalloc_get_block_write(struct inode *inode, sector_t iblock,
static int ext4_set_bh_endio(struct buffer_head *bh, struct inode *inode);
static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate);
static int __ext4_journalled_writepage(struct page *page, unsigned int len);
-static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh);
+static int ext4_bh_delay_or_unwritten_or_remap(handle_t *handle,
+ struct buffer_head *bh);

/*
* Test whether an inode is a fast symlink.
@@ -987,6 +988,51 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,

partial = ext4_get_branch(inode, depth, offsets, chain, &err);

+ err = 0;
+ if (!partial && (flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE)) {
+ BUG_ON(!ext4_snapshot_should_move_data(inode));
+ first_block = le32_to_cpu(chain[depth - 1].key);
+ if (!(flags & EXT4_GET_BLOCKS_CREATE)) {
+ /*
+ * First call from ext4_map_blocks():
+ * test if first_block should be moved to snapshot?
+ */
+ err = ext4_snapshot_get_move_access(handle, inode,
+ first_block,
+ &map->m_len, 0);
+ if (err < 0) {
+ /* cleanup the whole chain and exit */
+ partial = chain + depth - 1;
+ goto cleanup;
+ }
+ if (err > 0) {
+ /*
+ * Return EXT4_MAP_REMAP via map->m_flags
+ * to tell ext4_map_blocks() that the
+ * found block should be moved to snapshot.
+ */
+ map->m_flags |= EXT4_MAP_REMAP;
+ }
+ /*
+ * Set max. blocks to map to max. blocks, which
+ * ext4_snapshot_get_move_access() allows us to handle
+ * (move or not move) in one ext4_map_blocks() call.
+ */
+ err = 0;
+ } else if (map->m_flags & EXT4_MAP_REMAP &&
+ map->m_pblk == first_block) {
+ /*
+ * Second call from ext4_map_blocks():
+ * If mapped block hasn't changed, we can rely on the
+ * cached result from the first call.
+ */
+ err = 1;
+ }
+ }
+ if (err)
+ /* do not map found block - it should be moved to snapshot */
+ partial = chain + depth - 1;
+
/* Simplest case - block found, no allocation needed */
if (!partial) {
first_block = le32_to_cpu(chain[depth - 1].key);
@@ -1021,8 +1067,12 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
* Next look up the indirect map to count the totoal number of
* direct blocks to allocate for this branch.
*/
- count = ext4_blks_to_allocate(partial, indirect_blks,
- map->m_len, blocks_to_boundary);
+ if (map->m_flags & EXT4_MAP_REMAP) {
+ BUG_ON(indirect_blks != 0);
+ count = map->m_len;
+ } else
+ count = ext4_blks_to_allocate(partial, indirect_blks,
+ map->m_len, blocks_to_boundary);
/*
* Block out ext4_truncate while we alter the tree
*/
@@ -1030,6 +1080,23 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
&count, goal,
offsets + (partial - chain), partial);

+ if (map->m_flags & EXT4_MAP_REMAP) {
+ map->m_len = count;
+ /* move old block to snapshot */
+ err = ext4_snapshot_get_move_access(handle, inode,
+ le32_to_cpu(*(partial->p)),
+ &map->m_len, 1);
+ if (err <= 0) {
+ /* failed to move to snapshot - abort! */
+ err = err ? : -EIO;
+ ext4_journal_abort_handle(__func__, __LINE__,
+ "ext4_snapshot_get_move_access", NULL,
+ handle, err);
+ goto cleanup;
+ }
+ /* block moved to snapshot - continue to splice new block */
+ err = 0;
+ }
/*
* The ext4_splice_branch call will free and forget any buffers
* on the new chain if there is a failure, but that risks using
@@ -1045,7 +1112,8 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,

map->m_flags |= EXT4_MAP_NEW;

- ext4_update_inode_fsync_trans(handle, inode, 1);
+ if (!IS_COWING(handle))
+ ext4_update_inode_fsync_trans(handle, inode, 1);
got_it:
map->m_flags |= EXT4_MAP_MAPPED;
map->m_pblk = le32_to_cpu(chain[depth-1].key);
@@ -1291,7 +1359,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, 0);
} else {
- retval = ext4_ind_map_blocks(handle, inode, map, 0);
+ retval = ext4_ind_map_blocks(handle, inode, map,
+ flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
}
up_read((&EXT4_I(inode)->i_data_sem));

@@ -1312,7 +1381,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* ext4_ext_get_block() returns th create = 0
* with buffer head unmapped.
*/
- if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED)
+ if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED &&
+ !(map->m_flags & EXT4_MAP_REMAP))
return retval;

/*
@@ -1375,6 +1445,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);

up_write((&EXT4_I(inode)->i_data_sem));
+ /* Clear EXT4_MAP_REMAP, it is not needed any more. */
+ map->m_flags &= ~EXT4_MAP_REMAP;
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
int ret = check_block_validity(inode, map);
if (ret != 0)
@@ -1383,6 +1455,41 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
return retval;
}

+/*
+ * Block may need to be moved to snapshot and we need to writeback part of the
+ * existing block data to the new block, so make sure the buffer and page are
+ * uptodate before moving the existing block to snapshot.
+ */
+static int ext4_partial_write_begin(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh)
+{
+ struct ext4_map_blocks map;
+ int ret;
+
+ BUG_ON(!buffer_partial_write(bh));
+ BUG_ON(!bh->b_page || !PageLocked(bh->b_page));
+ map.m_lblk = iblock;
+ map.m_len = 1;
+
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret <= 0)
+ return ret;
+
+ if (!buffer_uptodate(bh) && !buffer_unwritten(bh)) {
+ /* map existing block for read */
+ map_bh(bh, inode->i_sb, map.m_pblk);
+ ll_rw_block(READ, 1, &bh);
+ wait_on_buffer(bh);
+ /* clear existing block mapping */
+ clear_buffer_mapped(bh);
+ if (!buffer_uptodate(bh))
+ return -EIO;
+ }
+ /* prevent zero out of page with BH_New flag in block_write_begin() */
+ SetPageUptodate(bh->b_page);
+ return 0;
+}
+
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096

@@ -1405,11 +1512,18 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
handle = ext4_journal_start(inode, dio_credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- return ret;
+ goto out;
}
started = 1;
}

+ if ((flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE) &&
+ buffer_partial_write(bh)) {
+ /* Read existing block data before moving it to snapshot */
+ ret = ext4_partial_write_begin(inode, iblock, bh);
+ if (ret < 0)
+ goto out;
+ }
ret = ext4_map_blocks(handle, inode, &map, flags);
if (ret > 0) {
map_bh(bh, inode->i_sb, map.m_pblk);
@@ -1417,11 +1531,30 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
ret = 0;
}
+out:
if (started)
ext4_journal_stop(handle);
+ /*
+ * BH_Partial_Write flags are only used to pass
+ * hints to this function and should be cleared on exit.
+ */
+ clear_buffer_partial_write(bh);
return ret;
}

+/*
+ * ext4_get_block_mow is used when a block may need to be moved to snapshot.
+ */
+int ext4_get_block_mow(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int flags = create ? EXT4_GET_BLOCKS_CREATE : 0;
+
+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE;
+ return _ext4_get_block(inode, iblock, bh, flags);
+}
+
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{
@@ -1600,6 +1733,45 @@ static void ext4_truncate_failed_write(struct inode *inode)
ext4_truncate(inode);
}

+/*
+ * Prepare for snapshot.
+ * Clear mapped flag of buffers,
+ * Set partial write flag of buffers in non-delayed-mow case.
+ */
+static void ext4_snapshot_write_begin(struct inode *inode,
+ struct page *page, unsigned len, int delay)
+{
+ struct buffer_head *bh = NULL;
+ /*
+ * XXX: We can also check ext4_snapshot_has_active() here and we don't
+ * need to unmap the buffers if there is no active snapshot, but the
+ * result must be valid throughout the writepage() operation and to
+ * guarantee this we have to know that the transaction is not restarted.
+ * Can we count on that?
+ */
+ if (!ext4_snapshot_should_move_data(inode))
+ return;
+
+ if (!page_has_buffers(page))
+ create_empty_buffers(page, inode->i_sb->s_blocksize, 0);
+ /* snapshots only work when blocksize == pagesize */
+ bh = page_buffers(page);
+ /*
+ * make sure that get_block() is called even if the buffer is
+ * mapped, but not if it is already a part of any transaction.
+ * in data=ordered, the only mode supported by ext4, all dirty
+ * data buffers are flushed on snapshot take via freeze_fs()
+ * API.
+ */
+ if (!buffer_jbd(bh) && !buffer_delay(bh)) {
+ clear_buffer_mapped(bh);
+ /* explicitly request move-on-write */
+ if (!delay && len < PAGE_CACHE_SIZE)
+ /* read block before moving it to snapshot */
+ set_buffer_partial_write(bh);
+ }
+}
+
static int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
static int ext4_write_begin(struct file *file, struct address_space *mapping,
@@ -1642,11 +1814,13 @@ retry:
goto out;
}
*pagep = page;
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ ext4_snapshot_write_begin(inode, page, len, 0);

if (ext4_should_dioread_nolock(inode))
ret = __block_write_begin(page, pos, len, ext4_get_block_write);
else
- ret = __block_write_begin(page, pos, len, ext4_get_block);
+ ret = __block_write_begin(page, pos, len, ext4_get_block_mow);

if (!ret && ext4_should_journal_data(inode)) {
ret = walk_page_buffers(handle, page_buffers(page),
@@ -2114,6 +2288,10 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
clear_buffer_delay(bh);
bh->b_blocknr = pblock;
}
+ if (buffer_remap(bh)) {
+ clear_buffer_remap(bh);
+ bh->b_blocknr = pblock;
+ }
if (buffer_unwritten(bh) ||
buffer_mapped(bh))
BUG_ON(bh->b_blocknr != pblock);
@@ -2123,7 +2301,8 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
}

/* skip page if block allocation undone */
- if (buffer_delay(bh) || buffer_unwritten(bh))
+ if (buffer_delay(bh) || buffer_unwritten(bh) ||
+ buffer_remap(bh))
skip_page = 1;
bh = bh->b_this_page;
block_start += bh->b_size;
@@ -2243,7 +2422,8 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
if ((mpd->b_size == 0) ||
((mpd->b_state & (1 << BH_Mapped)) &&
!(mpd->b_state & (1 << BH_Delay)) &&
- !(mpd->b_state & (1 << BH_Unwritten))))
+ !(mpd->b_state & (1 << BH_Unwritten)) &&
+ !(mpd->b_state & (1 << BH_Remap))))
goto submit_io;

handle = ext4_journal_current_handle();
@@ -2274,6 +2454,9 @@ static void mpage_da_map_and_submit(struct mpage_da_data *mpd)
get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
if (mpd->b_state & (1 << BH_Delay))
get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
+ if (mpd->b_state & (1 << BH_Remap))
+ get_blocks_flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE |
+ EXT4_GET_BLOCKS_DELALLOC_RESERVE;

blks = ext4_map_blocks(handle, mpd->inode, &map, get_blocks_flags);
if (blks < 0) {
@@ -2357,7 +2540,7 @@ submit_io:
}

#define BH_FLAGS ((1 << BH_Uptodate) | (1 << BH_Mapped) | \
- (1 << BH_Delay) | (1 << BH_Unwritten))
+ (1 << BH_Delay) | (1 << BH_Unwritten) | (1 << BH_Remap))

/*
* mpage_add_bh_to_extent - try to add one more block to extent of blocks
@@ -2434,9 +2617,11 @@ flush_it:
return;
}

-static int ext4_bh_delay_or_unwritten(handle_t *handle, struct buffer_head *bh)
+static int ext4_bh_delay_or_unwritten_or_remap(handle_t *handle,
+ struct buffer_head *bh)
{
- return (buffer_delay(bh) || buffer_unwritten(bh)) && buffer_dirty(bh);
+ return ((buffer_delay(bh) || buffer_unwritten(bh)) &&
+ buffer_dirty(bh)) || buffer_remap(bh);
}

/*
@@ -2456,6 +2641,8 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
{
struct ext4_map_blocks map;
int ret = 0;
+ handle_t *handle = ext4_journal_current_handle();
+ int flags = 0;
sector_t invalid_block = ~((sector_t) 0xffff);

if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
@@ -2467,12 +2654,15 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
map.m_lblk = iblock;
map.m_len = 1;

+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE;
+
/*
* first, we need to know whether the block is allocated already
* preallocated blocks are unmapped but should treated
* the same as allocated blocks.
*/
- ret = ext4_map_blocks(NULL, inode, &map, 0);
+ ret = ext4_map_blocks(handle, inode, &map, flags);
if (ret < 0)
return ret;
if (ret == 0) {
@@ -2492,6 +2682,11 @@ static int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
return 0;
}

+ if (map.m_flags & EXT4_MAP_REMAP) {
+ ret = ext4_da_reserve_space(inode, iblock);
+ if (ret < 0)
+ return ret;
+ }
map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;

@@ -2659,7 +2854,7 @@ static int ext4_writepage(struct page *page,
}
page_bufs = page_buffers(page);
if (walk_page_buffers(NULL, page_bufs, 0, len, NULL,
- ext4_bh_delay_or_unwritten)) {
+ ext4_bh_delay_or_unwritten_or_remap)) {
/*
* We don't want to do block allocation, so redirty
* the page and return. We may reach here when we do
@@ -2840,7 +3035,8 @@ static int write_cache_pages_da(struct address_space *mapping,
* Otherwise we won't make progress
* with the page in ext4_writepage
*/
- if (ext4_bh_delay_or_unwritten(NULL, bh)) {
+ if (ext4_bh_delay_or_unwritten_or_remap(
+ NULL, bh)) {
mpage_add_bh_to_extent(mpd, logical,
bh->b_size,
bh->b_state);
@@ -3175,6 +3371,8 @@ retry:
goto out;
}
*pagep = page;
+ if (EXT4_SNAPSHOTS(inode->i_sb))
+ ext4_snapshot_write_begin(inode, page, len, 1);

ret = __block_write_begin(page, pos, len, ext4_da_get_block_prep);
if (ret < 0) {
@@ -4002,6 +4200,18 @@ int ext4_block_truncate_page(handle_t *handle,
goto unlock;
}

+ /* check if block needs to be moved to snapshot before zeroing */
+ if (ext4_snapshot_should_move_data(inode)) {
+ err = ext4_get_block_mow(inode, iblock, bh, 1);
+ if (err)
+ goto unlock;
+ if (buffer_new(bh)) {
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
+ clear_buffer_new(bh);
+ }
+ }
+
if (ext4_should_journal_data(inode)) {
BUFFER_TRACE(bh, "get write access");
err = ext4_journal_get_write_access(handle, bh);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 3b1c6d1..5eced75 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3174,6 +3174,29 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa, *cpa = NULL;
ext4_fsblk_t goal_block;

+ /*
+ * All inode preallocations allocated before the time when the
+ * active snapshot is taken need to be discarded, otherwise blocks
+ * may be used by both a regular file and the snapshot file that we
+ * are taking, as in the case below.
+ *
+ * Case: A user takes a snapshot when an inode has a preallocation
+ * 12/512, of which 12/64 has been used by the inode. Here 12 is the
+ * logical block number. After the snapshot is taken, a user issues
+ * a write request on the 12th block, then an allocation on 12 is
+ * needed and the allocator will use blocks from the preallocations.
+ * As a result, the event above happens.
+ *
+ *
+ * For now, all preallocations are discarded.
+ *
+ * Please refer to code and comments about preallocation in
+ * mballoc.c for more information.
+ */
+ if (ext4_snapshot_active(EXT4_SB(ac->ac_inode->i_sb)) &&
+ !ext4_snapshot_mow_in_tid(ac->ac_inode)) {
+ ext4_discard_preallocations(ac->ac_inode);
+ }
/* only data can be preallocated */
if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
return 0;
--
1.7.0.4


2011-05-09 16:43:20

by Amir G.

Subject: [PATCH RFC 07/30] ext4: snapshot hooks - direct I/O

From: Amir Goldstein <[email protected]>

With indirect mapped files, direct I/O write is not allowed to
initialize holes, so stale data won't be exposed.
With snapshots, direct I/O write is not allowed to do move-on-write,
for the exact same reason.
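
The fallback rule can be stated as a small decision function. The sketch below is a hypothetical user-space model (struct blk and both functions are invented for illustration); it only encodes the rule that a direct I/O write is served directly when the block is already allocated and does not need move-on-write, and otherwise falls back to buffered I/O.

```c
#include <assert.h>
#include <stddef.h>

enum io_path { IO_DIRECT, IO_BUFFERED };

/* Hypothetical model of the state of one block. */
struct blk {
	int allocated;	/* 0 means the block is a hole */
	int needs_move;	/* block must be moved to snapshot before write */
};

/* Returns 1 if the block may be mapped for a direct write, else 0. */
static int dio_write_map(const struct blk *b)
{
	assert(b != NULL);
	if (!b->allocated)	/* holes are never initialized by DIO */
		return 0;
	if (b->needs_move)	/* move-on-write is never done by DIO */
		return 0;
	return 1;
}

static enum io_path choose_write_path(const struct blk *b)
{
	return dio_write_map(b) ? IO_DIRECT : IO_BUFFERED;
}
```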

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3ed64bb..476606b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1370,6 +1370,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
return ret;
}

+ if (retval > 0 && (map->m_flags & EXT4_MAP_REMAP) &&
+ (flags & EXT4_GET_BLOCKS_PRE_IO)) {
+ /*
+ * If mow is needed on the requested block and
+ * request comes from async-direct-io-write path,
+ * we return an unmapped buffer to fall back to buffered I/O.
+ */
+ map->m_flags &= ~EXT4_MAP_MAPPED;
+ return 0;
+ }
/* If it is only a block(s) look up */
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
return retval;
@@ -3678,6 +3688,29 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
}

/*
+ * ext4_get_block_dio_write is used when preparing for a DIO write
+ * to indirect mapped files with snapshots.
+ */
+int ext4_get_block_dio_write(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int flags = EXT4_GET_BLOCKS_CREATE;
+
+ /*
+ * DIO_SKIP_HOLES may ask to map direct I/O write with create=0,
+ * but we know this is a write, so we need to check if block
+ * needs to be moved to snapshot and fall back to buffered I/O.
+ * ext4_map_blocks() will return an unmapped buffer if block
+ * is not allocated or if it needs to be moved to snapshot.
+ */
+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE|
+ EXT4_GET_BLOCKS_PRE_IO;
+
+ return _ext4_get_block(inode, iblock, bh, flags);
+}
+
+/*
* O_DIRECT for ext3 (or indirect map) based files
*
* If the O_DIRECT write will extend the file then add this inode to the
@@ -3732,6 +3765,16 @@ retry:
ret = blockdev_direct_IO(rw, iocb, inode,
inode->i_sb->s_bdev, iov,
offset, nr_segs,
+ /*
+ * snapshots code gets here for DIO write
+ * to ind mapped files or outside i_size
+ * of extent mapped files and for DIO read
+ * to all files.
+ * XXX: isn't it possible to expose stale data
+ * on DIO read to newly allocated ind map
+ * blocks or newly MOWed blocks?
+ */
+ (rw == WRITE) ? ext4_get_block_dio_write :
ext4_get_block, NULL);

if (unlikely((rw & WRITE) && ret < 0)) {
@@ -3793,10 +3836,13 @@ out:
static int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
+ int flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
ext4_debug("ext4_get_block_write: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
+ if (ext4_snapshot_should_move_data(inode))
+ flags |= EXT4_GET_BLOCKS_MOVE_ON_WRITE;
+ return _ext4_get_block(inode, iblock, bh_result, flags);
}

static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
--
1.7.0.4


2011-05-09 16:43:29

by Amir G.

Subject: [PATCH RFC 08/30] ext4: snapshot hooks - move extent file data blocks

From: Amir Goldstein <[email protected]>

Extent mapped file data is moved into snapshot in ext4_ext_map_blocks().
If part of an extent is to be moved, the extent is split. Fragmentation
is light because of delayed-move-on-write.
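
The split-then-remap idea can be sketched as follows. This is a simplified
user-space model, not the ext4_split_extent() implementation: struct toy_extent
and toy_split_extent() are hypothetical, and the extent is modelled as a plain
(logical start, length, physical start) triple.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flat extent: len blocks at lblk map to pblk onwards. */
struct toy_extent {
	uint32_t lblk;
	uint32_t len;
	uint64_t pblk;
};

/*
 * Split *ex around [lblk, lblk + len) into at most three pieces.
 * Returns the number of pieces written to out[] and stores in *mid the
 * index of the piece that covers exactly the requested range.
 */
static inline int toy_split_extent(const struct toy_extent *ex,
				   uint32_t lblk, uint32_t len,
				   struct toy_extent out[3], int *mid)
{
	uint32_t end = ex->lblk + ex->len;
	int n = 0;

	assert(len > 0 && lblk >= ex->lblk && lblk + len <= end);

	if (lblk > ex->lblk)	/* left remainder keeps the old blocks */
		out[n++] = (struct toy_extent){ ex->lblk, lblk - ex->lblk,
						ex->pblk };
	*mid = n;		/* the piece that will be remapped */
	out[n++] = (struct toy_extent){ lblk, len,
					ex->pblk + (lblk - ex->lblk) };
	if (lblk + len < end)	/* right remainder keeps the old blocks */
		out[n++] = (struct toy_extent){ lblk + len, end - (lblk + len),
						ex->pblk + (lblk + len - ex->lblk) };
	return n;
}
```

After the split, move-on-write amounts to leaving the old physical blocks to
the snapshot and storing the newly allocated physical block in out[*mid].pblk,
which is the role ext4_ext_store_pblock() plays in the patch.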

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.h | 2 -
fs/ext4/extents.c | 143 +++++++++++++++++++++++++++++++++++++++++++++------
fs/ext4/inode.c | 3 +-
3 files changed, 128 insertions(+), 20 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 1c119cc..ea3a0a0 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -369,8 +369,6 @@ static inline int ext4_snapshot_should_move_data(struct inode *inode)
return 0;
if (EXT4_JOURNAL(inode) == NULL)
return 0;
- if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- return 0;
/* when a data block is journaled, it is already COWed as metadata */
if (ext4_should_journal_data(inode))
return 0;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c8cab3d..11fe058 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1256,11 +1256,10 @@ static int ext4_ext_search_left(struct inode *inode,
return 0;
}

- if (unlikely(*logical < (le32_to_cpu(ex->ee_block) + ee_len))) {
- EXT4_ERROR_INODE(inode,
- "logical %d < ee_block %d + ee_len %d!",
- *logical, le32_to_cpu(ex->ee_block), ee_len);
- return -EIO;
+ if (*logical < (le32_to_cpu(ex->ee_block) + ee_len)) {
+ *logical -= 1;
+ *phys = ext4_ext_pblock(ex) + *logical;
+ return 0;
}

*logical = le32_to_cpu(ex->ee_block) + ee_len - 1;
@@ -1324,11 +1323,10 @@ static int ext4_ext_search_right(struct inode *inode,
return 0;
}

- if (unlikely(*logical < (le32_to_cpu(ex->ee_block) + ee_len))) {
- EXT4_ERROR_INODE(inode,
- "logical %d < ee_block %d + ee_len %d!",
- *logical, le32_to_cpu(ex->ee_block), ee_len);
- return -EIO;
+ if (*logical < (le32_to_cpu(ex->ee_block) + ee_len)) {
+ *logical += 1;
+ *phys = ext4_ext_pblock(ex) + *logical;
+ return 0;
}

if (ex != EXT_LAST_EXTENT(path[depth].p_hdr)) {
@@ -3155,7 +3153,8 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
struct ext4_ext_path *path = NULL;
- struct ext4_extent newex, *ex;
+ struct ext4_extent newex, *ex = NULL;
+ ext4_fsblk_t oldblock = 0;
ext4_fsblk_t newblock = 0;
int err = 0, depth, ret;
unsigned int allocated = 0;
@@ -3184,7 +3183,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* number of remaining blocks in the extent */
allocated = ext4_ext_get_actual_len(&newex) -
(map->m_lblk - le32_to_cpu(newex.ee_block));
- goto out;
+ goto found;
}
}

@@ -3235,7 +3234,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (!ext4_ext_is_uninitialized(ex)) {
ext4_ext_put_in_cache(inode, ee_block,
ee_len, ee_start);
- goto out;
+ goto found;
}
ret = ext4_ext_handle_uninitialized_extents(handle,
inode, map, path, flags, allocated,
@@ -3256,6 +3255,59 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
ext4_ext_put_gap_in_cache(inode, path, map->m_lblk);
goto out2;
}
+
+ /*
+ * two cases:
+ * 1. the requested block is found.
+ * a. If EXT4_GET_BLOCKS_CREATE is not set, we will test
+ * if MOW is needed.
+ * b. If EXT4_GET_BLOCKS_CREATE is set, MOW will be done
+ * if MOW is needed.
+ *
+ * 2. the requested block is not found, EXT4_GET_BLOCKS_CREATE
+ * must be set and MOW must not be needed.
+ */
+found:
+ if (newblock && (flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE)) {
+ BUG_ON(!ext4_snapshot_should_move_data(inode));
+ /*
+ * Should move 1 block to snapshot?
+ *
+ * XXX With delayed-move-write support,
+ * multi-blocks should be moved each time.
+ */
+ allocated = allocated < map->m_len ? allocated : map->m_len;
+ err = ext4_snapshot_get_move_access(handle, inode, newblock,
+ &allocated, 0);
+ map->m_len = allocated;
+ if (err > 0) {
+ if (!(flags & EXT4_GET_BLOCKS_CREATE)) {
+ /* Do not map found block. */
+ map->m_flags |= EXT4_MAP_REMAP;
+ err = 0;
+ goto out;
+ } else {
+ oldblock = newblock;
+ }
+ } else if (err < 0)
+ goto out2;
+
+ if ((path == NULL) && (flags & EXT4_GET_BLOCKS_CREATE)) {
+ /* find extent for this block */
+ path = ext4_ext_find_extent(inode, map->m_lblk, NULL);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ path = NULL;
+ goto out2;
+ }
+ depth = ext_depth(inode);
+ ex = path[depth].p_ext;
+ }
+ }
+
+ if (!(flags & EXT4_GET_BLOCKS_CREATE))
+ goto out;
+
/*
* Okay, we need to do block allocation.
*/
@@ -3265,7 +3317,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
err = ext4_ext_search_left(inode, path, &ar.lleft, &ar.pleft);
if (err)
goto out2;
- ar.lright = map->m_lblk;
+ ar.lright = map->m_lblk + allocated;
err = ext4_ext_search_right(inode, path, &ar.lright, &ar.pright);
if (err)
goto out2;
@@ -3286,7 +3338,11 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
/* Check if we can really insert (m_lblk)::(m_lblk + m_len) extent */
newex.ee_block = cpu_to_le32(map->m_lblk);
newex.ee_len = cpu_to_le16(map->m_len);
- err = ext4_ext_check_overlap(inode, &newex, path);
+ if (oldblock) {
+ /* Overlap checking is not needed for MOW case. */
+ err = 0;
+ } else
+ err = ext4_ext_check_overlap(inode, &newex, path);
if (err)
allocated = ext4_ext_get_actual_len(&newex);
else
@@ -3337,7 +3393,55 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (err)
goto out2;

- err = ext4_ext_insert_extent(handle, inode, path, &newex, flags);
+ if (oldblock) {
+ /*
+ * Move oldblocks to snapshot.
+ */
+ map->m_len = ar.len;
+ err = ext4_snapshot_get_move_access(handle, inode,
+ oldblock, &map->m_len, 1);
+ if (err <= 0 || map->m_len < ar.len) {
+ /* failed to move to snapshot - abort! */
+ err = err ? : -EIO;
+ ext4_journal_abort_handle(__func__, __LINE__,
+ "ext4_snapshot_get_move_access", NULL,
+ handle, err);
+ } else {
+ /*
+ * Move to snapshot successfully.
+ * TODO merge extent after finishing MOW
+ */
+ err = ext4_split_extent(handle, inode, path, map, 0,
+ flags | EXT4_GET_BLOCKS_PRE_IO);
+ if (err < 0)
+ goto out;
+
+ /* extent tree may be changed. */
+ depth = ext_depth(inode);
+ ext4_ext_drop_refs(path);
+ path = ext4_ext_find_extent(inode, map->m_lblk, path);
+ if (IS_ERR(path)) {
+ err = PTR_ERR(path);
+ goto out;
+ }
+
+ /* just verify splitting. */
+ ex = path[depth].p_ext;
+ BUG_ON(le32_to_cpu(ex->ee_block) != map->m_lblk ||
+ ext4_ext_get_actual_len(ex) != map->m_len);
+
+ err = ext4_ext_get_access(handle, inode, path + depth);
+ if (!err) {
+ /* splice new blocks to the inode*/
+ ext4_ext_store_pblock(ex, newblock);
+ err = ext4_ext_dirty(handle, inode,
+ path + depth);
+ }
+ }
+
+ } else
+ err = ext4_ext_insert_extent(handle, inode,
+ path, &newex, flags);
if (err) {
/* free data blocks we just allocated */
/* not a good idea to call discard here directly,
@@ -3366,7 +3470,12 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an uninitialized extent.
*/
- if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
+ if (IS_COWING(handle)) {
+ /*
+ * snapshot does not support fdatasync and fsync
+ * and there is no need to cache extent
+ */
+ } else if ((flags & EXT4_GET_BLOCKS_UNINIT_EXT) == 0) {
ext4_ext_put_in_cache(inode, map->m_lblk, allocated, newblock);
ext4_update_inode_fsync_trans(handle, inode, 1);
} else
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 476606b..866ac36 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1357,7 +1357,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
*/
down_read((&EXT4_I(inode)->i_data_sem));
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
- retval = ext4_ext_map_blocks(handle, inode, map, 0);
+ retval = ext4_ext_map_blocks(handle, inode, map,
+ flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
} else {
retval = ext4_ind_map_blocks(handle, inode, map,
flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
--
1.7.0.4


2011-05-09 16:43:35

by Amir G.

Subject: [PATCH RFC 09/30] ext4: snapshot file

From: Amir Goldstein <[email protected]>

Ext4 snapshot implementation as a file inside the file system.
Snapshot files are marked with the snapfile flag and have special
read-only address space ops.
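
The flag layout this patch introduces can be checked with a small user-space
sketch. The flag values and bit numbers below are copied from the diff; the
toy_* helpers and TOY_STATE_OFFSET are hypothetical stand-ins (the real offset
is EXT4_STATE_LAST, and the patch's getter does not apply the mask that this
sketch applies).

```c
#include <assert.h>

/* Persistent inode flags and their bit numbers, as defined by the patch. */
#define SNAPFILE_FL		0x01000000UL
#define SNAPFILE_DELETED_FL	0x04000000UL
#define SNAPFILE_SHRUNK_FL	0x08000000UL
enum { INODE_SNAPFILE = 24, INODE_SNAPFILE_DELETED = 26,
       INODE_SNAPFILE_SHRUNK = 27 };

/* Dynamic snapshot state bits, as defined by the patch. */
enum { SNAPSTATE_LIST = 0, SNAPSTATE_ENABLED = 1, SNAPSTATE_ACTIVE = 2,
       SNAPSTATE_INUSE = 3, SNAPSTATE_DELETED = 4, SNAPSTATE_SHRUNK = 5,
       SNAPSTATE_OPEN = 6, SNAPSTATE_TAGGED = 7, SNAPSTATE_LAST };
#define SNAPSTATE_MASK	((1UL << SNAPSTATE_LAST) - 1)

/* Hypothetical stand-in for EXT4_STATE_LAST (assumption, not from the patch). */
#define TOY_STATE_OFFSET 8

/* Non-atomic multi-bit accessors on a plain word holding both bit groups. */
static inline unsigned long toy_get_snapstate(unsigned long word)
{
	return (word >> TOY_STATE_OFFSET) & SNAPSTATE_MASK;
}

static inline unsigned long toy_set_snapstate(unsigned long word,
					      unsigned long flags)
{
	return word | ((flags & SNAPSTATE_MASK) << TOY_STATE_OFFSET);
}

static inline unsigned long toy_clear_snapstate(unsigned long word,
						unsigned long flags)
{
	return word & ~((flags & SNAPSTATE_MASK) << TOY_STATE_OFFSET);
}
```

Packing the snapshot state above the ordinary state bits lets one word carry
both groups, which is why the accessors shift by the offset.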

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 83 +++++++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/ext4_jbd2.h | 2 +
fs/ext4/ialloc.c | 8 ++++-
fs/ext4/inode.c | 29 ++++++++++++++++++
fs/ext4/super.c | 9 +++++
5 files changed, 126 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 013eec2..4072036 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -361,17 +361,23 @@ struct flex_groups {
#define EXT4_EXTENTS_FL 0x00080000 /* Inode uses extents */
#define EXT4_EA_INODE_FL 0x00200000 /* Inode used for large EA */
#define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
+/* snapshot persistent flags */
+#define EXT4_SNAPFILE_FL 0x01000000 /* snapshot file */
+#define EXT4_SNAPFILE_DELETED_FL 0x04000000 /* snapshot is deleted */
+#define EXT4_SNAPFILE_SHRUNK_FL 0x08000000 /* snapshot was shrunk */
+/* end of snapshot flags */
#define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */

-#define EXT4_FL_USER_VISIBLE 0x004BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE 0x004B80FF /* User modifiable flags */
+
+#define EXT4_FL_USER_VISIBLE 0x014BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE 0x014B80FF /* User modifiable flags */

/* Flags that should be inherited by new inodes from their parent. */
#define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
EXT4_SYNC_FL | EXT4_IMMUTABLE_FL | EXT4_APPEND_FL |\
EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
- EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
+ EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL | EXT4_SNAPFILE_FL)

/* Flags that are appropriate for regular files (all but dir-specific ones). */
#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
@@ -418,6 +424,9 @@ enum {
EXT4_INODE_EXTENTS = 19, /* Inode uses extents */
EXT4_INODE_EA_INODE = 21, /* Inode used for large EA */
EXT4_INODE_EOFBLOCKS = 22, /* Blocks allocated beyond EOF */
+ EXT4_INODE_SNAPFILE = 24, /* Snapshot file/dir */
+ EXT4_INODE_SNAPFILE_DELETED = 26, /* Snapshot is deleted */
+ EXT4_INODE_SNAPFILE_SHRUNK = 27, /* Snapshot was shrunk */
EXT4_INODE_RESERVED = 31, /* reserved for ext4 lib */
};

@@ -464,6 +473,9 @@ static inline void ext4_check_flag_values(void)
CHECK_FLAG_VALUE(EXTENTS);
CHECK_FLAG_VALUE(EA_INODE);
CHECK_FLAG_VALUE(EOFBLOCKS);
+ CHECK_FLAG_VALUE(SNAPFILE);
+ CHECK_FLAG_VALUE(SNAPFILE_DELETED);
+ CHECK_FLAG_VALUE(SNAPFILE_SHRUNK);
CHECK_FLAG_VALUE(RESERVED);
}

@@ -803,6 +815,14 @@ struct ext4_inode_info {
struct list_head i_orphan; /* unlinked but open inodes */

/*
+ * In-memory snapshot list overrides i_orphan to link snapshot inodes,
+ * but unlike the real orphan list, the next snapshot inode number
+ * is stored in i_next_snapshot_ino and not in i_dtime
+ */
+#define i_snaplist i_orphan
+ __u32 i_next_snapshot_ino;
+
+ /*
* i_disksize keeps track of what the inode size is ON DISK, not
* in memory. During truncate, i_size is set to the new size by
* the VFS prior to calling ext4_truncate(), but the filesystem won't
@@ -1158,6 +1178,8 @@ struct ext4_sb_info {
u32 s_max_batch_time;
u32 s_min_batch_time;
struct block_device *journal_bdev;
+ struct mutex s_snapshot_mutex; /* protects 2 fields below: */
+ struct inode *s_active_snapshot; /* [ s_snapshot_mutex ] */
#ifdef CONFIG_JBD2_DEBUG
struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */
wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */
@@ -1274,8 +1296,31 @@ enum {
EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_DELALLOC_RESERVED, /* blks already reserved for delalloc */
+ EXT4_STATE_LAST
};

+/*
+ * Snapshot dynamic state flags (starting at offset EXT4_STATE_LAST)
+ * These flags are read by GETSNAPFLAGS ioctl and interpreted by the lssnap
+ * utility. Do not change these values.
+ */
+enum {
+ EXT4_SNAPSTATE_LIST = 0, /* snapshot is on list (S) */
+ EXT4_SNAPSTATE_ENABLED = 1, /* snapshot is enabled (n) */
+ EXT4_SNAPSTATE_ACTIVE = 2, /* snapshot is active (a) */
+ EXT4_SNAPSTATE_INUSE = 3, /* snapshot is in-use (p) */
+ EXT4_SNAPSTATE_DELETED = 4, /* snapshot is deleted (s) */
+ EXT4_SNAPSTATE_SHRUNK = 5, /* snapshot was shrunk (h) */
+ EXT4_SNAPSTATE_OPEN = 6, /* snapshot is mounted (o) */
+ EXT4_SNAPSTATE_TAGGED = 7, /* snapshot is tagged (t) */
+ EXT4_SNAPSTATE_LAST
+};
+
+#define EXT4_SNAPSTATE_MASK \
+ ((1UL << EXT4_SNAPSTATE_LAST) - 1)
+
+
+/* atomic single bit funcs */
#define EXT4_INODE_BIT_FNS(name, field, offset) \
static inline int ext4_test_inode_##name(struct inode *inode, int bit) \
{ \
@@ -1290,9 +1335,28 @@ static inline void ext4_clear_inode_##name(struct inode *inode, int bit) \
clear_bit(bit + (offset), &EXT4_I(inode)->i_##field); \
}

+/* non-atomic multi bit funcs */
+#define EXT4_INODE_FLAGS_FNS(name, field, offset) \
+static inline int ext4_get_##name##_flags(struct inode *inode) \
+{ \
+ return EXT4_I(inode)->i_##field >> (offset); \
+} \
+static inline void ext4_set_##name##_flags(struct inode *inode, \
+ unsigned long flags) \
+{ \
+ EXT4_I(inode)->i_##field |= (flags << (offset)); \
+} \
+static inline void ext4_clear_##name##_flags(struct inode *inode, \
+ unsigned long flags) \
+{ \
+ EXT4_I(inode)->i_##field &= ~(flags << (offset)); \
+}
+
EXT4_INODE_BIT_FNS(flag, flags, 0)
#if (BITS_PER_LONG < 64)
EXT4_INODE_BIT_FNS(state, state_flags, 0)
+EXT4_INODE_BIT_FNS(snapstate, state_flags, EXT4_STATE_LAST)
+EXT4_INODE_FLAGS_FNS(snapstate, state_flags, EXT4_STATE_LAST)

static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
{
@@ -1300,6 +1364,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
}
#else
EXT4_INODE_BIT_FNS(state, flags, 32)
+EXT4_INODE_BIT_FNS(snapstate, flags, 32 + EXT4_STATE_LAST)
+EXT4_INODE_FLAGS_FNS(snapstate, flags, 32 + EXT4_STATE_LAST)

static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
{
@@ -1314,6 +1380,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#endif

#define NEXT_ORPHAN(inode) EXT4_I(inode)->i_dtime
+#define NEXT_SNAPSHOT(inode) (EXT4_I(inode)->i_next_snapshot_ino)

/*
* Codes for operating systems
@@ -1781,6 +1848,10 @@ extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
+
+/* snapshot_inode.c */
+extern int ext4_snapshot_readpage(struct file *file, struct page *page);
+
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
@@ -2004,6 +2075,12 @@ struct ext4_group_info {
void *bb_bitmap;
#endif
struct rw_semaphore alloc_sem;
+ /*
+ * bg_cow_bitmap is reset to zero on mount time and on every snapshot
+ * take and initialized lazily on first block group write access.
+ * bg_cow_bitmap is protected by sb_bgl_lock().
+ */
+ unsigned long bg_cow_bitmap; /* COW bitmap cache */
ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block
* regions, index is order.
* bb_counters[3] = 5 means
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index ea3a0a0..e0fef0d 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -369,6 +369,8 @@ static inline int ext4_snapshot_should_move_data(struct inode *inode)
return 0;
if (EXT4_JOURNAL(inode) == NULL)
return 0;
+ if (ext4_snapshot_excluded(inode))
+ return 0;
/* when a data block is journaled, it is already COWed as metadata */
if (ext4_should_journal_data(inode))
return 0;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 831d49a..ba928a7 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1048,8 +1048,12 @@ got:
goto fail_free_drop;

if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
- /* set extent flag only for directory, file and normal symlink*/
- if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) {
+ /*
+ * Set extent flag only for non-snapshot file, directory
+ * and normal symlink
+ */
+ if ((S_ISREG(mode) && !ext4_snapshot_file(inode)) ||
+ S_ISDIR(mode) || S_ISLNK(mode)) {
ext4_set_inode_flag(inode, EXT4_INODE_EXTENTS);
ext4_ext_tree_init(handle, inode);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 866ac36..4ec5f02 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4162,9 +4162,38 @@ static const struct address_space_operations ext4_da_aops = {
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
+static int ext4_no_writepage(struct page *page,
+ struct writeback_control *wbc)
+{
+ unlock_page(page);
+ return -EIO;
+}
+
+/*
+ * Snapshot file page operations:
+ * always readpage (by page) with buffer tracked read.
+ * user cannot writepage or direct_IO to a snapshot file.
+ *
+ * snapshot file pages are written to disk after a COW operation in "ordered"
+ * mode and are never changed after that again, so there is no data corruption
+ * risk when using "ordered" mode on snapshot files.
+ * some snapshot data pages are written to disk by sync_dirty_buffer(), namely
+ * the snapshot COW bitmaps and a few initial blocks copied on snapshot_take().
+ */
+static const struct address_space_operations ext4_snapfile_aops = {
+ .readpage = ext4_readpage,
+ .readpages = ext4_readpages,
+ .writepage = ext4_no_writepage,
+ .bmap = ext4_bmap,
+ .invalidatepage = ext4_invalidatepage,
+ .releasepage = ext4_releasepage,
+};

void ext4_set_aops(struct inode *inode)
{
+ if (ext4_snapshot_file(inode))
+ inode->i_mapping->a_ops = &ext4_snapfile_aops;
+ else
if (ext4_should_order_data(inode) &&
test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2c345d1..e3ebd7d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -745,6 +745,8 @@ static void ext4_put_super(struct super_block *sb)
destroy_workqueue(sbi->dio_unwritten_wq);

lock_super(sb);
+ if (EXT4_SNAPSHOTS(sb))
+ ext4_snapshot_destroy(sb);
if (sb->s_dirt)
ext4_commit_super(sb, 1);

@@ -3474,6 +3476,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

sb->s_root = NULL;

+ mutex_init(&sbi->s_snapshot_mutex);
+ sbi->s_active_snapshot = NULL;
+
needs_recovery = (es->s_last_orphan != 0 ||
EXT4_HAS_INCOMPAT_FEATURE(sb,
EXT4_FEATURE_INCOMPAT_RECOVER));
@@ -3676,6 +3681,10 @@ no_journal:
goto failed_mount4;
};

+ if (EXT4_SNAPSHOTS(sb) &&
+ ext4_snapshot_load(sb, es, sb->s_flags & MS_RDONLY))
+ /* XXX: how can we fail and force read-only at this point? */
+ ext4_error(sb, "load snapshot failed\n");
EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
ext4_orphan_cleanup(sb, es);
EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
--
1.7.0.4


2011-05-09 16:43:41

by Amir G.

Subject: [PATCH RFC 10/30] ext4: snapshot file - read through to block device

From: Amir Goldstein <[email protected]>

On active snapshot page read, the function ext4_snapshot_get_block()
is called to map the page to a disk block. If the page is not mapped
in the snapshot file, a direct mapping to the block device is returned.
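
A minimal sketch of the read-through lookup, under the assumption (not stated
in this message) that logical block i of the snapshot file represents block i
of the volume. toy_snapshot_get_block() and the flat snap_map array are
hypothetical; the real ext4_snapshot_get_block() lives in the snapshot files
that are not part of this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_UNMAPPED 0	/* hypothetical marker: no block stored in snapshot */

/*
 * Return the physical block to read for snapshot logical block lblk.
 * snap_map[lblk] holds the block saved in the snapshot file, if any.
 */
static inline uint64_t toy_snapshot_get_block(const uint64_t *snap_map,
					      size_t nblocks, uint64_t lblk)
{
	assert(snap_map != NULL && lblk < nblocks);
	if (snap_map[lblk] != TOY_UNMAPPED)
		return snap_map[lblk];	/* saved copy of the old contents */
	return lblk;			/* unchanged: read the volume block */
}
```

Blocks that were never modified after the snapshot was taken cost no extra
space in this model, since the lookup simply reads them from the device.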

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4ec5f02..3acdbe5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4181,8 +4181,7 @@ static int ext4_no_writepage(struct page *page,
* the snapshot COW bitmaps and a few initial blocks copied on snapshot_take().
*/
static const struct address_space_operations ext4_snapfile_aops = {
- .readpage = ext4_readpage,
- .readpages = ext4_readpages,
+ .readpage = ext4_snapshot_readpage,
.writepage = ext4_no_writepage,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
--
1.7.0.4


2011-05-09 16:43:44

by Amir G.

Subject: [PATCH RFC 11/30] ext4: snapshot file - permissions

From: Amir Goldstein <[email protected]>

Enforce snapshot file permissions.
Write, truncate and unlink of snapshot inodes are not allowed.
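
The open-time check can be exercised in user space with the standard
O_ACCMODE mask. toy_snapshot_open_check() is a hypothetical helper that
mirrors the condition added to ext4_file_open(); the boolean argument stands
in for ext4_snapshot_file(inode).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Allow only read-only opens of snapshot files.
 * Returns 0 when the open may proceed, -EPERM otherwise.
 */
static inline int toy_snapshot_open_check(bool is_snapshot_file, int f_flags)
{
	assert((f_flags & O_ACCMODE) == O_RDONLY ||
	       (f_flags & O_ACCMODE) == O_WRONLY ||
	       (f_flags & O_ACCMODE) == O_RDWR);
	if (is_snapshot_file && (f_flags & O_ACCMODE) != O_RDONLY)
		return -EPERM;
	return 0;
}
```

Masking with O_ACCMODE matters because other flags such as O_APPEND may be
set in the same word as the access mode.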

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/file.c | 7 +++++++
fs/ext4/inode.c | 7 +++++++
fs/ext4/namei.c | 8 ++++++++
3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 60b3b19..f31e58e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -168,6 +168,13 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
struct path path;
char buf[64], *cp;

+ if (ext4_snapshot_file(inode) &&
+ (filp->f_flags & O_ACCMODE) != O_RDONLY)
+ /*
+ * allow only read-only access to snapshot files
+ */
+ return -EPERM;
+
if (unlikely(!(sbi->s_mount_flags & EXT4_MF_MNTDIR_SAMPLED) &&
!(sb->s_flags & MS_RDONLY))) {
sbi->s_mount_flags |= EXT4_MF_MNTDIR_SAMPLED;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3acdbe5..c3af773 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4748,6 +4748,13 @@ void ext4_truncate(struct inode *inode)
ext4_lblk_t last_block, max_block;
unsigned blocksize = inode->i_sb->s_blocksize;

+ /* prevent truncate of files on snapshot list */
+ if (ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "snapshot (%u) cannot be truncated!\n",
+ inode->i_generation);
+ return;
+ }
+
if (!ext4_can_truncate(inode))
return;

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b70fa13..02ba825 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2213,6 +2213,14 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
inode->i_ino, inode->i_nlink);
inode->i_nlink = 1;
}
+ /* prevent unlink of files on snapshot list */
+ if (inode->i_nlink == 1 &&
+ ext4_snapshot_list(inode)) {
+ snapshot_debug(1, "snapshot (%u) cannot be unlinked!\n",
+ inode->i_generation);
+ retval = -EPERM;
+ goto end_unlink;
+ }
retval = ext4_delete_entry(handle, dir, de, bh);
if (retval)
goto end_unlink;
--
1.7.0.4


2011-05-09 16:43:46

by Amir G.

Subject: [PATCH RFC 12/30] ext4: snapshot file - store on disk

From: Amir Goldstein <[email protected]>

A snapshot inode is stored differently in memory and on disk.
During store and load of a snapshot inode, some of the inode flags
and fields are converted.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 33 +++++++++++++++++++++++++++------
1 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c3af773..db1706f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5201,6 +5201,17 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
*/
for (block = 0; block < EXT4_N_BLOCKS; block++)
ei->i_data[block] = raw_inode->i_block[block];
+ /* snapshot on-disk list is stored in snapshot inode on-disk version */
+ if (ext4_snapshot_file(inode)) {
+ ei->i_next_snapshot_ino =
+ le32_to_cpu(raw_inode->i_disk_version);
+ /*
+ * snapshot volume size is stored in i_disksize.
+ * in-memory i_size of snapshot files is set to 0 (disabled).
+ * enabling a snapshot is setting i_size to i_disksize.
+ */
+ inode->i_size = 0;
+ }
INIT_LIST_HEAD(&ei->i_orphan);

/*
@@ -5465,12 +5476,22 @@ static int ext4_do_update_inode(handle_t *handle,
for (block = 0; block < EXT4_N_BLOCKS; block++)
raw_inode->i_block[block] = ei->i_data[block];

- raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
- if (ei->i_extra_isize) {
- if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
- raw_inode->i_version_hi =
- cpu_to_le32(inode->i_version >> 32);
- raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
+ if (ext4_snapshot_file(inode)) {
+ /*
+ * Snapshot on-disk list overrides snapshot on-disk version.
+ * Snapshot files are not writable and have a fixed version.
+ */
+ raw_inode->i_disk_version =
+ cpu_to_le32(ei->i_next_snapshot_ino);
+ } else {
+ raw_inode->i_disk_version = cpu_to_le32(inode->i_version);
+ if (ei->i_extra_isize) {
+ if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
+ raw_inode->i_version_hi =
+ cpu_to_le32(inode->i_version >> 32);
+ raw_inode->i_extra_isize =
+ cpu_to_le16(ei->i_extra_isize);
+ }
}

BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
--
1.7.0.4


2011-05-09 16:43:50

by Amir G.

Subject: [PATCH RFC 14/30] ext4: snapshot block operations

From: Amir Goldstein <[email protected]>

Core API of special snapshot file block operations.
The argument @create to the function ext4_getblk()
is re-interpreted as a snapshot block command argument. The old
argument values 0(=read) and 1(=create) preserve the original
behavior of the function. The bit field h_cowing in the current
transaction handle is used to prevent COW recursions.
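
The recursion guard can be illustrated with a toy handle. struct toy_handle
and toy_cow_block() are hypothetical; only the name h_cowing comes from the
text above. The nested call models a COW operation whose own block allocation
re-enters the COW hook.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transaction handle carrying the guard bit. */
struct toy_handle {
	unsigned int h_cowing:1;	/* set while a COW operation runs */
	int blocks_copied;
};

/*
 * Copy one block to the snapshot unless a COW operation is already in
 * progress on this handle. Returns 1 if a copy was made, 0 if skipped.
 */
static inline int toy_cow_block(struct toy_handle *h)
{
	assert(h != NULL);
	if (h->h_cowing)
		return 0;	/* nested call: do not COW the COW itself */
	h->h_cowing = 1;
	h->blocks_copied++;
	/* allocating the snapshot block re-enters the hook; it must be a no-op */
	(void)toy_cow_block(h);
	h->h_cowing = 0;
	return 1;
}
```

Without the guard bit, the nested call in this sketch would recurse without
bound.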

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 10 ++++++++++
fs/ext4/ext4_jbd2.h | 3 +++
fs/ext4/inode.c | 4 ++--
3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8f59322..c7fd33e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -557,6 +557,16 @@ struct ext4_new_group_data {
* if EXT4_GET_BLOCKS_CREATE is not set, return REMAP flags.
*/
#define EXT4_GET_BLOCKS_MOVE_ON_WRITE 0x0100
+/*
+ * snapshot_map_blocks() flags passed to ext4_map_blocks() for mapping
+ * blocks to snapshot.
+ */
+ /* handle COW race conditions */
+#define EXT4_GET_BLOCKS_COW 0x200
+ /* allocate only indirect blocks */
+#define EXT4_GET_BLOCKS_MOVE 0x400
+ /* bypass journal and sync allocated indirect blocks directly to disk */
+#define EXT4_GET_BLOCKS_SYNC 0x800

/*
* Flags used by ext4_free_blocks
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index e0fef0d..79b6594 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -173,6 +173,9 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define ext4_handle_dirty_super(handle, sb) \
__ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))

+#define trace_cow_add(handle, name, num)
+#define trace_cow_inc(handle, name)
+
handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks);
int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 425dabb..ba66545 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1600,8 +1600,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,

map.m_lblk = block;
map.m_len = 1;
- err = ext4_map_blocks(handle, inode, &map,
- create ? EXT4_GET_BLOCKS_CREATE : 0);
+ /* passing SNAPMAP flags on create argument */
+ err = ext4_map_blocks(handle, inode, &map, create);

if (err < 0)
*errp = err;
--
1.7.0.4


2011-05-09 16:43:52

by Amir G.

Subject: [PATCH RFC 15/30] ext4: snapshot block operation - copy blocks to snapshot

From: Amir Goldstein <[email protected]>

Implementation of copying blocks into a snapshot file.
This mechanism is used to copy-on-write metadata blocks to snapshot.
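
The bit-range helper that this patch adds to mballoc.c can be approximated in
user space. This is a sketch under assumptions: toy_test_bit() and
toy_test_bit_range() are hypothetical, use a plain byte array with
least-significant-bit-first numbering, and scan bit by bit instead of calling
mb_find_next_bit() or mb_find_next_zero_bit().

```c
#include <assert.h>
#include <stddef.h>

/* Test one bit in a byte-array bitmap (LSB-first within each byte). */
static inline int toy_test_bit(int bit, const unsigned char *addr)
{
	return (addr[bit >> 3] >> (bit & 7)) & 1;
}

/*
 * Starting at bit, measure the run of bits that share the value of the
 * first bit, looking at no more than *pcount bits.
 * Returns 1 for a run of set bits and 0 for a run of clear bits, and
 * stores the run length in *pcount.
 */
static inline int toy_test_bit_range(int bit, const unsigned char *addr,
				     int *pcount)
{
	int first, end, i;

	assert(addr != NULL && pcount != NULL && *pcount > 0);
	first = toy_test_bit(bit, addr);
	end = bit + *pcount;
	for (i = bit + 1; i < end; i++)
		if (toy_test_bit(i, addr) != first)
			break;
	*pcount = i - bit;
	return first;
}
```

A caller can use the returned run length to handle a whole range of blocks in
one step instead of testing them one at a time.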

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/inode.c | 40 ++++++++++++++++++++++++++++++++++++----
fs/ext4/mballoc.c | 18 ++++++++++++++++++
fs/ext4/resize.c | 10 +++++++++-
4 files changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c7fd33e..942cd9c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -122,6 +122,8 @@ typedef unsigned int ext4_group_t;
/* We are doing stream allocation */
#define EXT4_MB_STREAM_ALLOC 0x0800

+/* allocate blocks for active snapshot */
+#define EXT4_MB_HINT_COWING 0x02000

struct ext4_allocation_request {
/* target inode for block we're allocating */
@@ -1836,6 +1838,7 @@ extern void __ext4_free_blocks(const char *where, unsigned int line,
extern int ext4_mb_add_groupinfo(struct super_block *sb,
ext4_group_t i, struct ext4_group_desc *desc);
extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
+extern int ext4_mb_test_bit_range(int bit, void *addr, int *pcount);

/* inode.c */
struct buffer_head *ext4_getblk(handle_t *, struct inode *,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba66545..b930645 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -699,8 +699,17 @@ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
ar.goal = goal;
ar.len = target;
ar.logical = iblock;
- if (S_ISREG(inode->i_mode))
- /* enable in-core preallocation only for regular files */
+ if (IS_COWING(handle)) {
+ /*
+ * This hint is used to tell the allocator not to fail
+ * on quota limits and allow allocation from blocks which
+ * are reserved for snapshots.
+ * Failing allocation during COW operations would result
+ * in I/O error, which is not desirable.
+ */
+ ar.flags = EXT4_MB_HINT_COWING;
+ } else if (S_ISREG(inode->i_mode) && !ext4_snapshot_file(inode))
+ /* Enable preallocation only for non-snapshot regular files */
ar.flags = EXT4_MB_HINT_DATA;

current_block = ext4_mb_new_blocks(handle, &ar, err);
@@ -1359,6 +1368,21 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags)
{
int retval;
+ int cowing = 0;
+
+ if (handle && IS_COWING(handle)) {
+ /*
+ * locking order for locks validator:
+ * inode (VFS operation) -> active snapshot (COW operation)
+ *
+ * The i_data_sem lock is nested during COW operation, but
+ * the active snapshot i_data_sem write lock is not taken
+ * otherwise, because snapshot file has read-only aops and
+ * because truncate/unlink of active snapshot is not permitted.
+ */
+ BUG_ON(!ext4_snapshot_is_active(inode));
+ cowing = 1;
+ }

map->m_flags = 0;
ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
@@ -1368,7 +1392,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* Try to see if we can get the block without requesting a new
* file system block.
*/
- down_read((&EXT4_I(inode)->i_data_sem));
+ down_read_nested((&EXT4_I(inode)->i_data_sem), cowing);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map,
flags & EXT4_GET_BLOCKS_MOVE_ON_WRITE);
@@ -1427,7 +1451,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* the write lock of i_data_sem, and call get_blocks()
* with create == 1 flag.
*/
- down_write((&EXT4_I(inode)->i_data_sem));
+ down_write_nested((&EXT4_I(inode)->i_data_sem), cowing);

/*
* if the caller is from delayed allocation writeout path
@@ -1618,6 +1642,14 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
J_ASSERT(create != 0);
J_ASSERT(handle != NULL);

+ if (SNAPMAP_ISCOW(create)) {
+ /* COWing block or creating COW bitmap */
+ lock_buffer(bh);
+ clear_buffer_uptodate(bh);
+ /* flag locked buffer and return */
+ *errp = 1;
+ return bh;
+ }
/*
* Now that we do not always journal data, we should
* keep in mind whether this should always journal the
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 5eced75..d43f493 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -420,6 +420,24 @@ static inline int mb_find_next_bit(void *addr, int max, int start)
return ret;
}

+/*
+ * Find the largest range of set or clear bits.
+ * Return 1 for set bits and 0 for clear bits.
+ * Set *pcount to number of bits in range.
+ */
+int ext4_mb_test_bit_range(int bit, void *addr, int *pcount)
+{
+ int i, ret;
+
+ ret = mb_test_bit(bit, addr);
+ if (ret)
+ i = mb_find_next_zero_bit(addr, bit + *pcount, bit);
+ else
+ i = mb_find_next_bit(addr, bit + *pcount, bit);
+ *pcount = i - bit;
+ return ret ? 1 : 0;
+}
+
static void *mb_find_buddy(struct ext4_buddy *e4b, int order, int *max)
{
char *bb;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index ee9b999..06c11fd 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -673,7 +673,15 @@ static void update_backups(struct super_block *sb,
(err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA)))
break;

- bh = sb_getblk(sb, group * bpg + blk_off);
+ if (ext4_snapshot_has_active(sb))
+ /*
+ * test_and_cow() expects an uptodate buffer.
+ * Read the buffer here to suppress the
+ * "non uptodate buffer" warning.
+ */
+ bh = sb_bread(sb, group * bpg + blk_off);
+ else
+ bh = sb_getblk(sb, group * bpg + blk_off);
if (!bh) {
err = -EIO;
break;
--
1.7.0.4
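
Illustration (not part of the patch): the ext4_mb_test_bit_range() helper
added to mballoc.c above reports whether a given bit is set and how far the
run of equal bits extends, capped by the value the caller passes in *pcount.
The standalone C below is a minimal sketch of that contract on a plain byte
array. The names test_bit_sketch() and test_bit_range_sketch() are
hypothetical, and the LSB-first bit order within each byte is an assumption
made for the example; the kernel helpers mb_test_bit() and
mb_find_next_zero_bit() are not used here.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for mb_test_bit(): test one bit, LSB-first per byte. */
int test_bit_sketch(int bit, const unsigned char *map)
{
	return (map[bit / CHAR_BIT] >> (bit % CHAR_BIT)) & 1;
}

/*
 * Return 1 if `bit` is set and 0 if it is clear.
 * On entry, *pcount is the maximum number of bits to scan.
 * On return, *pcount is the length of the run of equal bits that
 * starts at `bit`, never more than the value passed in.
 */
int test_bit_range_sketch(int bit, const unsigned char *map, int *pcount)
{
	int first, i;

	assert(*pcount > 0);
	first = test_bit_sketch(bit, map);
	/* stop at the first bit that differs, or at the caller's cap */
	for (i = bit + 1; i < bit + *pcount; i++)
		if (test_bit_sketch(i, map) != first)
			break;
	*pcount = i - bit;
	return first;
}
```

A caller walking a block bitmap would call this repeatedly, advancing by the
returned *pcount each time, to visit whole runs of used or free blocks
instead of single bits.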


2011-05-09 16:43:54

by Amir G.

Subject: [PATCH RFC 16/30] ext4: snapshot block operation - move blocks to snapshot

From: Amir Goldstein <[email protected]>

Implementation of moving blocks into a snapshot file.
The move block command maps allocated blocks to the snapshot file,
allocating only the indirect blocks when needed.
This mechanism is used to move-on-write data blocks to the snapshot.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 62 +++++++++++++++++++++++++++++++++++++++---------------
1 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b930645..c3d4e7a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -676,6 +676,11 @@ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
new_blocks[index++] = current_block++;
count--;
}
+ if (blks == 0 && target == 0) {
+ /* mapping data blocks */
+ *err = 0;
+ return 0;
+ }
if (count > 0) {
/*
* save the new block number
@@ -777,10 +782,10 @@ failed_out:
* ext4_alloc_block() (normally -ENOSPC). Otherwise we set the chain
* as described above and return 0.
*/
-static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
- ext4_lblk_t iblock, int indirect_blks,
- int *blks, ext4_fsblk_t goal,
- ext4_lblk_t *offsets, Indirect *branch)
+static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
+ ext4_fsblk_t iblock, int indirect_blks,
+ int *blks, ext4_fsblk_t goal,
+ int *offsets, Indirect *branch, int flags)
{
int blocksize = inode->i_sb->s_blocksize;
int i, n = 0;
@@ -790,6 +795,22 @@ static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
ext4_fsblk_t new_blocks[4];
ext4_fsblk_t current_block;

+ if (SNAPMAP_ISMOVE(flags)) {
+ /* mapping snapshot block to block device block */
+ current_block = SNAPSHOT_BLOCK(iblock);
+ num = 0;
+ if (indirect_blks > 0) {
+ /* allocating only indirect blocks */
+ ext4_alloc_blocks(handle, inode, iblock, goal,
+ indirect_blks, 0, new_blocks, &err);
+ if (err)
+ return err;
+ }
+ /* charge snapshot file owner for moved blocks */
+ dquot_alloc_block_nofail(inode, *blks);
+ num = *blks;
+ new_blocks[indirect_blks] = current_block;
+ } else
num = ext4_alloc_blocks(handle, inode, iblock, goal, indirect_blks,
*blks, new_blocks, &err);
if (err)
@@ -861,8 +882,11 @@ failed:
}
for (i = n+1; i < indirect_blks; i++)
ext4_free_blocks(handle, inode, NULL, new_blocks[i], 1, 0);
-
- ext4_free_blocks(handle, inode, NULL, new_blocks[i], num, 0);
+ if (SNAPMAP_ISMOVE(flags) && num > 0)
+ /* don't charge snapshot file owner if move failed */
+ dquot_free_block(inode, num);
+ else if (num > 0)
+ ext4_free_blocks(handle, inode, NULL, new_blocks[i], num, 0);

return err;
}
@@ -882,9 +906,8 @@ failed:
* inode (->i_blocks, etc.). In case of success we end up with the full
* chain to new block and return 0.
*/
-static int ext4_splice_branch(handle_t *handle, struct inode *inode,
- ext4_lblk_t block, Indirect *where, int num,
- int blks)
+static int ext4_splice_branch_cow(handle_t *handle, struct inode *inode,
+ long block, Indirect *where, int num, int blks, int flags)
{
int i;
int err = 0;
@@ -951,8 +974,12 @@ err_out:
ext4_free_blocks(handle, inode, where[i].bh, 0, 1,
EXT4_FREE_BLOCKS_FORGET);
}
- ext4_free_blocks(handle, inode, NULL, le32_to_cpu(where[num].key),
- blks, 0);
+ if (SNAPMAP_ISMOVE(flags))
+ /* don't charge snapshot file owner if move failed */
+ dquot_free_block(inode, blks);
+ else
+ ext4_free_blocks(handle, inode, NULL,
+ le32_to_cpu(where[num].key), blks, 0);

return err;
}
@@ -1098,9 +1125,11 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
/*
* Block out ext4_truncate while we alter the tree
*/
- err = ext4_alloc_branch(handle, inode, map->m_lblk, indirect_blks,
- &count, goal,
- offsets + (partial - chain), partial);
+ err = ext4_alloc_branch_cow(handle, inode, map->m_lblk, indirect_blks,
+ &count, goal, offsets + (partial - chain),
+ partial, flags);
+ if (err)
+ goto cleanup;

if (map->m_flags & EXT4_MAP_REMAP) {
map->m_len = count;
@@ -1126,9 +1155,8 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
* credits cannot be returned. Can we handle this somehow? We
* may need to return -EAGAIN upwards in the worst case. --sct
*/
- if (!err)
- err = ext4_splice_branch(handle, inode, map->m_lblk,
- partial, indirect_blks, count);
+ err = ext4_splice_branch_cow(handle, inode, map->m_lblk, partial,
+ indirect_blks, count, flags);
if (err)
goto cleanup;

--
1.7.0.4


2011-05-09 16:43:48

by Amir G.

Subject: [PATCH RFC 13/30] ext4: snapshot file - increase maximum file size limit to 16TB

From: Amir Goldstein <[email protected]>

Files larger than 2TB use the Ext4 huge_file flag to store i_blocks
in units of file system blocks, so the upper limit on the actual size
of a snapshot is increased from 512*2^32 = 2TB to 4K*2^32 = 16TB,
which is also the upper limit on file system size.
To map 2^32 logical blocks, 4 triple indirect blocks are used instead
of just one. The extra 3 triple indirect blocks are stored in place
of direct blocks, which are not in use by snapshot files.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 13 +++++++++++++
fs/ext4/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/super.c | 5 ++++-
3 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4072036..8f59322 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -333,6 +333,19 @@ struct flex_groups {
#define EXT4_DIND_BLOCK (EXT4_IND_BLOCK + 1)
#define EXT4_TIND_BLOCK (EXT4_DIND_BLOCK + 1)
#define EXT4_N_BLOCKS (EXT4_TIND_BLOCK + 1)
+/*
+ * Snapshot files have different indirection mapping that can map up to 2^32
+ * logical blocks, so they can cover the mapped filesystem block address space.
+ * Ext4 must use either 4K or 8K blocks (depending on PAGE_SIZE).
+ * With 8K blocks, 1 triple indirect block maps 2^33 logical blocks.
+ * With 4K blocks (the system default), each triple indirect block maps 2^30
+ * logical blocks, so 4 triple indirect blocks map 2^32 logical blocks.
+ * Snapshot files in small filesystems (<= 4G), use only 1 double indirect
+ * block to map the entire filesystem.
+ */
+#define EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS 3
+#define EXT4_SNAPSHOT_N_BLOCKS (EXT4_TIND_BLOCK + 1 + \
+ EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS)

/*
* Inode flags
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index db1706f..425dabb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -335,6 +335,7 @@ static int ext4_block_to_path(struct inode *inode,
double_blocks = (1 << (ptrs_bits * 2));
int n = 0;
int final = 0;
+ int tind;

if (i_block < direct_blocks) {
offsets[n++] = i_block;
@@ -354,6 +355,18 @@ static int ext4_block_to_path(struct inode *inode,
offsets[n++] = (i_block >> ptrs_bits) & (ptrs - 1);
offsets[n++] = i_block & (ptrs - 1);
final = ptrs;
+ } else if (ext4_snapshot_file(inode) &&
+ (i_block >> (ptrs_bits * 3)) <
+ EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS + 1) {
+ tind = i_block >> (ptrs_bits * 3);
+ BUG_ON(tind == 0);
+ /* use up to 4 triple indirect blocks to map 2^32 blocks */
+ i_block -= (tind << (ptrs_bits * 3));
+ offsets[n++] = (EXT4_TIND_BLOCK + tind) % EXT4_NDIR_BLOCKS;
+ offsets[n++] = i_block >> (ptrs_bits * 2);
+ offsets[n++] = (i_block >> ptrs_bits) & (ptrs - 1);
+ offsets[n++] = i_block & (ptrs - 1);
+ final = ptrs;
} else {
ext4_warning(inode->i_sb, "block %lu > max in inode %lu",
i_block + direct_blocks +
@@ -4748,6 +4761,13 @@ void ext4_truncate(struct inode *inode)
ext4_lblk_t last_block, max_block;
unsigned blocksize = inode->i_sb->s_blocksize;

+ /* prevent partial truncate of snapshot files */
+ if (ext4_snapshot_file(inode) && inode->i_size != 0) {
+ snapshot_debug(1, "snapshot file (%lu) cannot be partly "
+ "truncated!\n", inode->i_ino);
+ return;
+ }
+
/* prevent truncate of files on snapshot list */
if (ext4_snapshot_list(inode)) {
snapshot_debug(1, "snapshot (%u) cannot be truncated!\n",
@@ -4861,6 +4881,10 @@ do_indirects:
/* Kill the remaining (whole) subtrees */
switch (offsets[0]) {
default:
+ if (ext4_snapshot_file(inode) &&
+ offsets[0] < EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS)
+ /* Freeing snapshot extra tind branches */
+ break;
nr = i_data[EXT4_IND_BLOCK];
if (nr) {
ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 1);
@@ -4882,6 +4906,19 @@ do_indirects:
;
}

+ if (ext4_snapshot_file(inode)) {
+ int i;
+
+ /* Kill the remaining snapshot file triple indirect trees */
+ for (i = 0; i < EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS; i++) {
+ nr = i_data[i];
+ if (!nr)
+ continue;
+ ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 3);
+ i_data[i] = 0;
+ }
+ }
+
out_unlock:
up_write(&ei->i_data_sem);
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
@@ -5114,7 +5151,8 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
struct super_block *sb = inode->i_sb;

if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
- EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) {
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE) ||
+ ext4_snapshot_file(inode)) {
/* we are using combined 48 bit field */
i_blocks = ((u64)le16_to_cpu(raw_inode->i_blocks_high)) << 32 |
le32_to_cpu(raw_inode->i_blocks_lo);
@@ -5353,7 +5391,9 @@ static int ext4_inode_blocks_set(handle_t *handle,
ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
return 0;
}
- if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_HUGE_FILE))
+ /* snapshot files may be represented as huge files */
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_HUGE_FILE) &&
+ !ext4_snapshot_file(inode))
return -EFBIG;

if (i_blocks <= 0xffffffffffffULL) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e3ebd7d..d26831a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2316,7 +2316,7 @@ static loff_t ext4_max_bitmap_size(int bits, int has_huge_files)

res += 1LL << (bits-2);
res += 1LL << (2*(bits-2));
- res += 1LL << (3*(bits-2));
+ res += (1LL + EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS) << (3*(bits-2));
res <<= bits;
if (res > upper_limit)
res = upper_limit;
@@ -3259,6 +3259,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

has_huge_files = EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_HUGE_FILE);
+ if (EXT4_SNAPSHOTS(sb))
+ /* Snapshot files are huge files */
+ has_huge_files = 1;
sbi->s_bitmap_maxbytes = ext4_max_bitmap_size(sb->s_blocksize_bits,
has_huge_files);
sb->s_maxbytes = ext4_max_size(sb->s_blocksize_bits, has_huge_files);
--
1.7.0.4
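
Illustration (not part of the patch): the size arithmetic in the commit
message and in the ext4.h comment above can be checked with a few lines of
standalone C. This is only a sketch under stated assumptions: block pointers
are 4 bytes wide, the index passed to split_tind_index() is already relative
to the start of the triple indirect range (direct, indirect and double
indirect blocks subtracted), and all function and type names are
hypothetical. The choice of i_data slot for each extra tree is deliberately
left out.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EXTRA_TIND 3	/* mirrors EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS */

/* log2 of pointers per block: 4-byte pointers, so blocksize_bits - 2 */
unsigned ptrs_bits_for(unsigned blocksize_bits)
{
	return blocksize_bits - 2;
}

/* number of logical blocks mapped by one triple indirect block */
uint64_t tind_capacity(unsigned blocksize_bits)
{
	return 1ULL << (3 * ptrs_bits_for(blocksize_bits));
}

/* bytes mapped by the regular plus the extra triple indirect trees */
uint64_t snapshot_tind_bytes(unsigned blocksize_bits)
{
	return ((1ULL + SKETCH_EXTRA_TIND) * tind_capacity(blocksize_bits))
		<< blocksize_bits;
}

struct tind_path {
	unsigned tind;		/* 0 = regular tree, 1..3 = extra trees */
	unsigned off[3];	/* offsets at the three indirection levels */
};

/* split an index inside the triple indirect range into its path */
struct tind_path split_tind_index(uint64_t idx, unsigned blocksize_bits)
{
	unsigned bits = ptrs_bits_for(blocksize_bits);
	unsigned mask = (1u << bits) - 1;
	struct tind_path p;

	p.tind = (unsigned)(idx >> (3 * bits));
	assert(p.tind <= SKETCH_EXTRA_TIND);
	idx -= (uint64_t)p.tind << (3 * bits);
	p.off[0] = (unsigned)(idx >> (2 * bits));
	p.off[1] = (unsigned)(idx >> bits) & mask;
	p.off[2] = (unsigned)idx & mask;
	return p;
}
```

With 4K blocks (blocksize_bits = 12), tind_capacity() is 2^30 and
snapshot_tind_bytes() is 2^44 bytes, which matches the 16TB figure quoted in
the commit message.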


2011-05-09 16:43:56

by Amir G.

Subject: [PATCH RFC 17/30] ext4: snapshot control

From: Amir Goldstein <[email protected]>

Snapshot control with chsnap/lssnap.
Take/delete snapshot with chsnap +/-S.
Enable/disable snapshot with chsnap +/-n.
Show snapshot status with lssnap.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 2 +
fs/ext4/ioctl.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 119 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 942cd9c..3cf6602 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -597,6 +597,8 @@ struct ext4_new_group_data {
/* note ioctl 10 reserved for an early version of the FIEMAP ioctl */
/* note ioctl 11 reserved for filesystem-independent FIEMAP ioctl */
#define EXT4_IOC_ALLOC_DA_BLKS _IO('f', 12)
+#define EXT4_IOC_GETSNAPFLAGS _IOR('f', 13, long)
+#define EXT4_IOC_SETSNAPFLAGS _IOW('f', 14, long)
#define EXT4_IOC_MOVE_EXT _IOWR('f', 15, struct move_extent)

#if defined(__KERNEL__) && defined(CONFIG_COMPAT)
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a426332..e100c04 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -83,6 +83,21 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
if (!capable(CAP_SYS_RESOURCE))
goto flags_out;
}
+
+ /*
+ * The SNAPFILE flag can only be changed on directories by
+ * the relevant capability.
+ * It can only be inherited by regular files.
+ */
+ if ((flags ^ oldflags) & EXT4_SNAPFILE_FL) {
+ if (!S_ISDIR(inode->i_mode)) {
+ err = -ENOTDIR;
+ goto flags_out;
+ }
+ if (!capable(CAP_SYS_RESOURCE))
+ goto flags_out;
+ }
+
if (oldflags & EXT4_EXTENTS_FL) {
/* We don't support clearning extent flags */
if (!(flags & EXT4_EXTENTS_FL)) {
@@ -139,6 +154,102 @@ flags_out:
mnt_drop_write(filp->f_path.mnt);
return err;
}
+ case EXT4_IOC_GETSNAPFLAGS:
+ if (!EXT4_SNAPSHOTS(inode->i_sb))
+ return -EOPNOTSUPP;
+
+ ext4_snapshot_get_flags(inode, filp);
+ flags = ext4_get_snapstate_flags(inode);
+ return put_user(flags, (int __user *) arg);
+
+ case EXT4_IOC_SETSNAPFLAGS: {
+ handle_t *handle = NULL;
+ struct ext4_iloc iloc;
+ unsigned int oldflags;
+ int err;
+
+ if (!EXT4_SNAPSHOTS(inode->i_sb))
+ return -EOPNOTSUPP;
+
+ if (!is_owner_or_cap(inode))
+ return -EACCES;
+
+ if (get_user(flags, (int __user *) arg))
+ return -EFAULT;
+
+ /*
+ * Snapshot file state flags can only be changed by
+ * the relevant capability and under snapshot_mutex lock.
+ */
+ if (!ext4_snapshot_file(inode) ||
+ !capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+
+ err = mnt_want_write(filp->f_path.mnt);
+ if (err)
+ return err;
+
+ /* update snapshot 'open' flag under i_mutex */
+ mutex_lock(&inode->i_mutex);
+ ext4_snapshot_get_flags(inode, filp);
+ oldflags = ext4_get_snapstate_flags(inode);
+
+ /*
+ * snapshot_mutex should be held throughout the trio
+ * snapshot_{set_flags,take,update}(). It must be taken
+ * before starting the transaction, otherwise
+ * journal_lock_updates() inside snapshot_take()
+ * can deadlock:
+ * A: journal_start()
+ * A: snapshot_mutex_lock()
+ * B: journal_start()
+ * B: snapshot_mutex_lock() (waiting for A)
+ * A: journal_stop()
+ * A: snapshot_take() ->
+ * A: journal_lock_updates() (waiting for B)
+ */
+ mutex_lock(&EXT4_SB(inode->i_sb)->s_snapshot_mutex);
+
+ handle = ext4_journal_start(inode, 1);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto snapflags_out;
+ }
+ err = ext4_reserve_inode_write(handle, inode, &iloc);
+ if (err)
+ goto snapflags_err;
+
+ err = ext4_snapshot_set_flags(handle, inode, flags);
+ if (err)
+ goto snapflags_err;
+
+ err = ext4_mark_iloc_dirty(handle, inode, &iloc);
+snapflags_err:
+ ext4_journal_stop(handle);
+ if (err)
+ goto snapflags_out;
+
+ if (!(oldflags & 1UL<<EXT4_SNAPSTATE_LIST) &&
+ (flags & 1UL<<EXT4_SNAPSTATE_LIST))
+ /* setting list flag - take snapshot */
+ err = ext4_snapshot_take(inode);
+snapflags_out:
+ if ((oldflags|flags) & 1UL<<EXT4_SNAPSTATE_LIST) {
+ /* if clearing list flag, cleanup snapshot list */
+ int ret;
+
+ /* update/cleanup snapshots list even if take failed */
+ ret = ext4_snapshot_update(inode->i_sb,
+ !(flags & 1UL<<EXT4_SNAPSTATE_LIST), 0);
+ if (!err)
+ err = ret;
+ }
+
+ mutex_unlock(&EXT4_SB(inode->i_sb)->s_snapshot_mutex);
+ mutex_unlock(&inode->i_mutex);
+ mnt_drop_write(filp->f_path.mnt);
+ return err;
+ }
case EXT4_IOC_GETVERSION:
case EXT4_IOC_GETVERSION_OLD:
return put_user(inode->i_generation, (int __user *) arg);
@@ -210,6 +321,8 @@ setversion_out:

if (get_user(n_blocks_count, (__u32 __user *)arg))
return -EFAULT;
+ /* avoid snapshot_take() in the middle of group_extend() */
+ mutex_lock(&EXT4_SB(sb)->s_snapshot_mutex);

err = mnt_want_write(filp->f_path.mnt);
if (err)
@@ -223,6 +336,7 @@ setversion_out:
}
if (err == 0)
err = err2;
+ mutex_unlock(&EXT4_SB(sb)->s_snapshot_mutex);
mnt_drop_write(filp->f_path.mnt);

return err;
@@ -285,6 +399,8 @@ mext_out:
if (err)
return err;

+ /* avoid snapshot_take() in the middle of group_add() */
+ mutex_lock(&EXT4_SB(sb)->s_snapshot_mutex);
err = ext4_group_add(sb, &input);
if (EXT4_SB(sb)->s_journal) {
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
@@ -293,6 +409,7 @@ mext_out:
}
if (err == 0)
err = err2;
+ mutex_unlock(&EXT4_SB(sb)->s_snapshot_mutex);
mnt_drop_write(filp->f_path.mnt);

return err;
--
1.7.0.4
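
Illustration (not part of the patch): the tail of the EXT4_IOC_SETSNAPFLAGS
handler above decides what to do from the old and new state of the 'list'
flag. The sketch below restates that decision as a pure function. It is a
simplified model: error paths are ignored, the bit number used for the list
flag is a hypothetical placeholder (the value of EXT4_SNAPSTATE_LIST is not
shown in this series), and struct snap_plan and plan_snapshot_actions() are
names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_SNAPSTATE_LIST	0	/* hypothetical bit number */
#define SKETCH_LIST_MASK	(1UL << SKETCH_SNAPSTATE_LIST)

struct snap_plan {
	bool take;	/* call ext4_snapshot_take() */
	bool update;	/* call ext4_snapshot_update() */
	bool cleanup;	/* pass cleanup=1 to ext4_snapshot_update() */
};

struct snap_plan plan_snapshot_actions(unsigned long oldflags,
				       unsigned long newflags)
{
	struct snap_plan plan = { false, false, false };
	bool was_listed = (oldflags & SKETCH_LIST_MASK) != 0;
	bool is_listed = (newflags & SKETCH_LIST_MASK) != 0;

	/* setting the list flag takes a new snapshot */
	plan.take = !was_listed && is_listed;
	/* the snapshot list is refreshed if the flag is set before or after */
	plan.update = was_listed || is_listed;
	/* clearing the list flag asks the update to clean up the list */
	plan.cleanup = plan.update && !is_listed;
	return plan;
}
```

In the handler itself the update step also runs when the take step fails, so
that the snapshot list is left in a consistent state.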


2011-05-09 16:43:59

by Amir G.

Subject: [PATCH RFC 18/30] ext4: snapshot control - fix new snapshot

From: Amir Goldstein <[email protected]>

On snapshot take, after copying the pre-allocated blocks, some are
fixed to make the snapshot image appear as a valid Ext4 file system.
The has_snapshot flag and the last_snapshot field are cleared from the
super block, and all snapshot inodes are cleared (to appear as empty
inodes).

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/inode.c | 4 ++--
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3cf6602..c04a031 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1850,6 +1850,8 @@ struct buffer_head *ext4_bread(handle_t *, struct inode *,
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);

+extern blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei);
extern struct inode *ext4_iget(struct super_block *, unsigned long);
extern int ext4_write_inode(struct inode *, struct writeback_control *);
extern int ext4_setattr(struct dentry *, struct iattr *);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c3d4e7a..4bc60f1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5203,8 +5203,8 @@ void ext4_get_inode_flags(struct ext4_inode_info *ei)
} while (cmpxchg(&ei->i_flags, old_fl, new_fl) != old_fl);
}

-static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
- struct ext4_inode_info *ei)
+blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei)
{
blkcnt_t i_blocks ;
struct inode *inode = &(ei->vfs_inode);
--
1.7.0.4


2011-05-09 16:44:03

by Amir G.

Subject: [PATCH RFC 19/30] ext4: snapshot control - reserve disk space for snapshot

From: Amir Goldstein <[email protected]>

Ensure there is enough disk space for future use by the snapshot file.
Reserve disk space on snapshot take based on the file system overhead
size, the number of directories and the number of blocks/inodes in use.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/balloc.c | 25 +++++++++++++++++++++++++
fs/ext4/ext4.h | 2 ++
fs/ext4/mballoc.c | 6 ++++++
fs/ext4/super.c | 16 +++++++++++++++-
4 files changed, 48 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 350f502..7d22e50 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -370,6 +370,8 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
static int ext4_has_free_blocks(struct ext4_sb_info *sbi, s64 nblocks)
{
s64 free_blocks, dirty_blocks, root_blocks;
+ ext4_fsblk_t snapshot_r_blocks;
+ handle_t *handle = journal_current_handle();
struct percpu_counter *fbc = &sbi->s_freeblocks_counter;
struct percpu_counter *dbc = &sbi->s_dirtyblocks_counter;

@@ -377,6 +379,29 @@ static int ext4_has_free_blocks(struct ext4_sb_info *sbi, s64 nblocks)
dirty_blocks = percpu_counter_read_positive(dbc);
root_blocks = ext4_r_blocks_count(sbi->s_es);

+ if (ext4_snapshot_active(sbi)) {
+ if (unlikely(free_blocks < (nblocks + dirty_blocks)))
+ /* sorry, but we're really out of space */
+ return 0;
+ if (handle && unlikely(IS_COWING(handle)))
+ /* any available space may be used by COWing task */
+ return 1;
+ /* reserve blocks for active snapshot */
+ snapshot_r_blocks =
+ le64_to_cpu(sbi->s_es->s_snapshot_r_blocks_count);
+ /*
+ * The last snapshot_r_blocks are reserved for active snapshot
+ * and may not be allocated even by root.
+ */
+ if (free_blocks < (nblocks + dirty_blocks + snapshot_r_blocks))
+ return 0;
+ /*
+ * Mortal users must reserve blocks for both snapshot and
+ * root user.
+ */
+ root_blocks += snapshot_r_blocks;
+ }
+
if (free_blocks - (nblocks + root_blocks + dirty_blocks) <
EXT4_FREEBLOCKS_WATERMARK) {
free_blocks = percpu_counter_sum_positive(fbc);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c04a031..884033f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1974,6 +1974,8 @@ extern __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 group,
struct ext4_group_desc *gdp);
extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
struct ext4_group_desc *gdp);
+struct kstatfs;
+extern int ext4_statfs_sb(struct super_block *sb, struct kstatfs *buf);

static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
{
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d43f493..4813b15 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4290,10 +4290,16 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
return 0;
}
reserv_blks = ar->len;
+ if (unlikely(ar->flags & EXT4_MB_HINT_COWING)) {
+ /* don't fail when allocating blocks for COW */
+ dquot_alloc_block_nofail(ar->inode, ar->len);
+ goto nofail;
+ }
while (ar->len && dquot_alloc_block(ar->inode, ar->len)) {
ar->flags |= EXT4_MB_HINT_NOPREALLOC;
ar->len--;
}
+nofail:
inquota = ar->len;
if (ar->len == 0) {
*errp = -EDQUOT;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d26831a..a768b63 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4464,7 +4464,11 @@ restore_opts:

static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
{
- struct super_block *sb = dentry->d_sb;
+ return ext4_statfs_sb(dentry->d_sb, buf);
+}
+
+int ext4_statfs_sb(struct super_block *sb, struct kstatfs *buf)
+{
struct ext4_sb_info *sbi = EXT4_SB(sb);
struct ext4_super_block *es = sbi->s_es;
u64 fsid;
@@ -4516,6 +4520,16 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
if (buf->f_bfree < ext4_r_blocks_count(es))
buf->f_bavail = 0;
+ if (ext4_snapshot_active(sbi)) {
+ if (buf->f_bfree < ext4_r_blocks_count(es) +
+ le64_to_cpu(es->s_snapshot_r_blocks_count))
+ buf->f_bavail = 0;
+ else
+ buf->f_bavail -=
+ le64_to_cpu(es->s_snapshot_r_blocks_count);
+ }
+ buf->f_spare[0] = percpu_counter_sum_positive(&sbi->s_dirs_counter);
+ buf->f_spare[1] = sbi->s_overhead_last;
buf->f_files = le32_to_cpu(es->s_inodes_count);
buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
buf->f_namelen = EXT4_NAME_LEN;
--
1.7.0.4
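
Illustration (not part of the patch): the reservation policy added to
ext4_has_free_blocks() above can be modelled as a pure function. This is a
sketch under assumptions: the per-cpu counter handling and the
EXT4_FREEBLOCKS_WATERMARK recount are left out, the last two checks
paraphrase the unchanged remainder of the function (which the diff does not
show), and struct space_state, has_free_blocks_sketch() and the `privileged`
parameter are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct space_state {
	int64_t free_blocks;
	int64_t dirty_blocks;		/* delayed allocation reservations */
	int64_t root_blocks;		/* reserved for root */
	int64_t snapshot_r_blocks;	/* reserved for the active snapshot */
	bool snapshot_active;
};

bool has_free_blocks_sketch(const struct space_state *s, int64_t nblocks,
			    bool cowing, bool privileged)
{
	int64_t root_blocks = s->root_blocks;

	if (s->snapshot_active) {
		/* really out of space: nobody may allocate */
		if (s->free_blocks < nblocks + s->dirty_blocks)
			return false;
		/* a COWing task may use any available space */
		if (cowing)
			return true;
		/* the snapshot reserve is off limits, even to root */
		if (s->free_blocks <
		    nblocks + s->dirty_blocks + s->snapshot_r_blocks)
			return false;
		/* ordinary users leave room for snapshot and root */
		root_blocks += s->snapshot_r_blocks;
	}

	/* assumed tail of the function: ordinary user check ... */
	if (s->free_blocks >= root_blocks + nblocks + s->dirty_blocks)
		return true;
	/* ... then the privileged user may dip into the root reserve */
	if (privileged)
		return s->free_blocks >= nblocks + s->dirty_blocks;
	return false;
}
```

With 100 free blocks, 10 dirty, 20 reserved for root and 30 reserved for the
snapshot, an ordinary user can allocate at most 40 blocks, a privileged user
at most 60, and a COWing task up to 90.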


2011-05-09 16:44:13

by Amir G.

Subject: [PATCH RFC 21/30] ext4: snapshot journaled - implement journal_release_buffer()

From: Amir Goldstein <[email protected]>

The API journal_release_buffer() is called to cancel a previous call
to journal_get_write_access() and to recall the used buffer credit.
The current implementation of journal_release_buffer() in JBD is empty,
since no buffer credits are used until the buffer is marked dirty.
However, since the resulting snapshot COW operation cannot be undone,
we try to extend the current transaction to compensate for the credits
used by the extra COW operation, so we don't run out of buffer
credits too soon.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 39 +++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 11 +++++------
fs/ext4/ialloc.c | 10 ++++++++--
fs/ext4/xattr.c | 4 +++-
4 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 015f727..e8287f4 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -127,6 +127,45 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
return err;
}

+int __ext4_handle_release_buffer(const char *where, handle_t *handle,
+ struct buffer_head *bh)
+{
+ struct super_block *sb;
+ int err = 0;
+
+ if (!ext4_handle_valid(handle))
+ return 0;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (!EXT4_SNAPSHOTS(sb) || IS_COWING(handle))
+ goto out;
+
+ /*
+ * Trying to cancel a previous call to get_write_access(), which may
+ * have resulted in a single COW operation. We don't need to add
+ * user credits, but if COW credits are too low we will try to
+ * extend the transaction to compensate for the buffer credits used
+ * by the extra COW operation.
+ */
+ err = ext4_journal_extend(handle, 0);
+ if (err > 0) {
+ /* well, we can't say we didn't try - now let's hope
+ * we have enough buffer credits to spare */
+ snapshot_debug(handle->h_buffer_credits < EXT4_MAX_TRANS_DATA
+ ? 1 : 2,
+ "%s: warning: couldn't extend transaction "
+ "from %s (credits=%d/%d)\n", __func__,
+ where, handle->h_buffer_credits,
+ handle->h_user_credits);
+ err = 0;
+ }
+ ext4_journal_trace(SNAP_WARN, where, handle, -1);
+out:
+ if (!err)
+ jbd2_journal_release_buffer(handle, bh);
+ return err;
+}
+
int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
handle_t *handle, struct inode *inode,
struct buffer_head *bh)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index e80402b..1e05e2c 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -264,12 +264,11 @@ static inline void ext4_handle_sync(handle_t *handle)
handle->h_sync = 1;
}

-static inline void ext4_handle_release_buffer(handle_t *handle,
- struct buffer_head *bh)
-{
- if (ext4_handle_valid(handle))
- jbd2_journal_release_buffer(handle, bh);
-}
+int __ext4_handle_release_buffer(const char *where, handle_t *handle,
+ struct buffer_head *bh);
+
+#define ext4_handle_release_buffer(handle, bh) \
+ __ext4_handle_release_buffer(__func__, (handle), (bh))

static inline int ext4_handle_is_aborted(handle_t *handle)
{
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ba928a7..c4d3512 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -916,8 +916,14 @@ repeat_in_this_group:
goto got;
}
/* we lost it */
- ext4_handle_release_buffer(handle, inode_bitmap_bh);
- ext4_handle_release_buffer(handle, group_desc_bh);
+ err = ext4_handle_release_buffer(handle,
+ inode_bitmap_bh);
+ if (err)
+ goto fail;
+ err = ext4_handle_release_buffer(handle,
+ group_desc_bh);
+ if (err)
+ goto fail;

if (++ino < EXT4_INODES_PER_GROUP(sb))
goto repeat_in_this_group;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index b545ca1..83f5f9d 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -735,7 +735,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
int offset = (char *)s->here - bs->bh->b_data;

unlock_buffer(bs->bh);
- ext4_handle_release_buffer(handle, bs->bh);
+ error = ext4_handle_release_buffer(handle, bs->bh);
+ if (error)
+ goto cleanup;
if (ce) {
mb_cache_entry_release(ce);
ce = NULL;
--
1.7.0.4


2011-05-09 16:44:08

by Amir G.

Subject: [PATCH RFC 20/30] ext4: snapshot journaled - increase transaction credits

From: Amir Goldstein <[email protected]>

Snapshot operations are journaled as part of the running transaction.
The number of requested credits is multiplied by a factor, to ensure
that enough buffer credits are reserved in the running transaction.
The new field h_base_credits stores the original credits request and
the new field h_user_credits counts the number of credits used by
non-COW operations. They are especially useful when extending a large
transaction that did not use the extra COW credits it requested.
In this case, only the missing extra credits are requested.
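
Illustration (not part of the patch): the multiplication factor comes from
the credit macros added to ext4_jbd2.h in the diff below. The standalone C
here is a minimal sketch that recomputes them, so the 21N+3 and 21N+6 totals
from the comment can be checked by hand. The enum and function names are
hypothetical stand-ins for the EXT4_* macros, and journal_bytes() assumes a
block size chosen by the caller.

```c
#include <assert.h>

enum {
	SK_WRITE_CREDITS = 1,	/* journal the written block itself */
	SK_ALLOC_CREDITS = 3,	/* block bitmap, exclude bitmap, gdb */
	/* alloc(dind + ind + cow) = 9 */
	SK_COW_BITMAP_CREDITS = 3 * SK_ALLOC_CREDITS,
	/* alloc(dind + ind + cow) + write(dind + ind) = 11 */
	SK_COW_BLOCK_CREDITS = 3 * SK_ALLOC_CREDITS + 2 * SK_WRITE_CREDITS,
	/* first COW operation in a block group: 9 + 11 = 20 */
	SK_COW_CREDITS = SK_COW_BLOCK_CREDITS + SK_COW_BITMAP_CREDITS,
	/* once per transaction: write(sb + inode + tind) = 3 */
	SK_SNAPSHOT_CREDITS = 3 * SK_WRITE_CREDITS
};

/* credits needed to carry on with n user credits: 21n + 3 */
int snapshot_trans_blocks(int n)
{
	return n * (1 + SK_COW_CREDITS) + SK_SNAPSHOT_CREDITS;
}

/* credits requested when starting a transaction: 21n + 6 */
int snapshot_start_trans_blocks(int n)
{
	return n * (1 + SK_COW_CREDITS) + 2 * SK_SNAPSHOT_CREDITS;
}

/* journal size in bytes for a given number of journal blocks */
long long journal_bytes(long long blocks, long long block_size)
{
	return blocks * block_size;
}
```

For example, a request for 8 user credits becomes a request for 174 buffer
credits at transaction start, and with 4K blocks the 32768-block minimum
journal is 128M while 24 times that is 3G.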

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 21 +++++++
fs/ext4/ext4_jbd2.h | 159 ++++++++++++++++++++++++++++++++++++++++++++++-----
fs/ext4/resize.c | 2 +-
fs/ext4/super.c | 38 ++++++++++++-
4 files changed, 202 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index c44c362..015f727 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -131,6 +131,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
handle_t *handle, struct inode *inode,
struct buffer_head *bh)
{
+ struct super_block *sb;
int err = 0;

if (ext4_handle_valid(handle)) {
@@ -138,6 +139,26 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ if (err)
+ return err;
+ sb = handle->h_transaction->t_journal->j_private;
+ if (EXT4_SNAPSHOTS(sb) && !IS_COWING(handle)) {
+ struct journal_head *jh = bh2jh(bh);
+ jbd_lock_bh_state(bh);
+ /*
+ * buffer_credits was decremented when buffer was
+ * modified for the first time in the current
+ * transaction, which may have been during a COW
+ * operation. We decrement user_credits and mark
+ * b_modified = 2, on the first time that the buffer
+ * is modified not during a COW operation (!h_cowing).
+ */
+ if (jh->b_modified == 1) {
+ jh->b_modified = 2;
+ handle->h_user_credits--;
+ }
+ jbd_unlock_bh_state(bh);
+ }
} else {
if (inode)
mark_buffer_dirty_inode(bh, inode);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 79b6594..e80402b 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -83,6 +83,62 @@
* one block, plus two quota updates. Quota allocations are not
* needed. */

+/* on block write we have to journal the block itself */
+#define EXT4_WRITE_CREDITS 1
+/* on snapshot block alloc we have to journal block group bitmap, exclude
+ bitmap and gdb */
+#define EXT4_ALLOC_CREDITS 3
+/* number of credits for COW bitmap operation (allocated blocks are not
+ journalled): alloc(dind+ind+cow) = 9 */
+#define EXT4_COW_BITMAP_CREDITS (3*EXT4_ALLOC_CREDITS)
+/* number of credits for other block COW operations:
+ alloc(dind+ind+cow)+write(dind+ind) = 11 */
+#define EXT4_COW_BLOCK_CREDITS (3*EXT4_ALLOC_CREDITS+2*EXT4_WRITE_CREDITS)
+/* number of credits for the first COW operation in the block group, which
+ * is not the first group in a flex group (alloc 2 dind blocks):
+ 9+11 = 20 */
+#define EXT4_COW_CREDITS (EXT4_COW_BLOCK_CREDITS + \
+ EXT4_COW_BITMAP_CREDITS)
+/* number of credits for snapshot operations counted once per transaction:
+ write(sb+inode+tind) = 3 */
+#define EXT4_SNAPSHOT_CREDITS (3*EXT4_WRITE_CREDITS)
+/*
+ * in total, for N COW operations, we may have to journal 20N+3 blocks,
+ * and we also want to reserve 20+3 credits for the last COW operation,
+ * so we add 20(N-1)+3+(20+3) to the requested N buffer credits
+ * and request 21N+6 buffer credits.
+ * that's a lot of extra credits and much more than needed for the common
+ * case, but what can we do?
+ *
+ * we are going to need a bigger journal to accommodate the
+ * extra snapshot credits.
+ * mke2fs -j uses the following default formula for fs-size above 1G:
+ * journal-size = MIN(128M, fs-size/32)
+ * mke2fs -j -J big uses the following formula:
+ * journal-size = MIN(3G, fs-size/32)
+ */
+#define EXT4_SNAPSHOT_TRANS_BLOCKS(n) \
+ ((n)*(1+EXT4_COW_CREDITS)+EXT4_SNAPSHOT_CREDITS)
+#define EXT4_SNAPSHOT_START_TRANS_BLOCKS(n) \
+ ((n)*(1+EXT4_COW_CREDITS)+2*EXT4_SNAPSHOT_CREDITS)
+
+/*
+ * check for sufficient buffer and COW credits
+ */
+#define EXT4_SNAPSHOT_HAS_TRANS_BLOCKS(handle, n) \
+ ((handle)->h_buffer_credits >= EXT4_SNAPSHOT_TRANS_BLOCKS(n) && \
+ (handle)->h_user_credits >= (n))
+
+#define EXT4_RESERVE_COW_CREDITS (EXT4_COW_CREDITS + \
+ EXT4_SNAPSHOT_CREDITS)
+
+/*
+ * Ext4 is not designed for filesystems under 4G with journal size < 128M
+ * Recommended journal size is 3G (created with 'mke2fs -j -J big')
+ */
+#define EXT4_MIN_JOURNAL_BLOCKS 32768U
+#define EXT4_BIG_JOURNAL_BLOCKS (24*EXT4_MIN_JOURNAL_BLOCKS)
+
#define EXT4_RESERVE_TRANS_BLOCKS 12U

#define EXT4_INDEX_EXTRA_TRANS_BLOCKS 8
@@ -176,7 +232,19 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define trace_cow_add(handle, name, num)
#define trace_cow_inc(handle, name)

-handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks);
+#define ext4_journal_trace(n, caller, handle, nblocks)
+
+handle_t *__ext4_journal_start(const char *where,
+ struct super_block *sb, int nblocks);
+
+#define ext4_journal_start_sb(sb, nblocks) \
+ __ext4_journal_start(__func__, \
+ (sb), (nblocks))
+
+#define ext4_journal_start(inode, nblocks) \
+ __ext4_journal_start(__func__, \
+ (inode)->i_sb, (nblocks))
+
int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);

#define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
@@ -212,16 +280,20 @@ static inline int ext4_handle_is_aborted(handle_t *handle)

static inline int ext4_handle_has_enough_credits(handle_t *handle, int needed)
{
- if (ext4_handle_valid(handle) && handle->h_buffer_credits < needed)
+ struct super_block *sb;
+
+ if (!ext4_handle_valid(handle))
+ return 1;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (EXT4_SNAPSHOTS(sb))
+ return EXT4_SNAPSHOT_HAS_TRANS_BLOCKS(handle, needed);
+ /* sb has no snapshot feature */
+ if (handle->h_buffer_credits < needed)
return 0;
return 1;
}

-static inline handle_t *ext4_journal_start(struct inode *inode, int nblocks)
-{
- return ext4_journal_start_sb(inode->i_sb, nblocks);
-}
-
#define ext4_journal_stop(handle) \
__ext4_journal_stop(__func__, __LINE__, (handle))

@@ -230,20 +302,77 @@ static inline handle_t *ext4_journal_current_handle(void)
return journal_current_handle();
}

-static inline int ext4_journal_extend(handle_t *handle, int nblocks)
+/*
+ * Ext4 wrapper for journal_extend()
+ * When transaction runs out of buffer credits it is possible to try and
+ * extend the buffer credits without restarting the transaction.
+ * Ext4 wrapper for journal_start() has increased the user requested buffer
+ * credits to include the extra credits for COW operations.
+ * This wrapper checks the remaining user credits and how many COW credits
+ * are missing and then tries to extend the transaction.
+ */
+static inline int __ext4_journal_extend(const char *where,
+ handle_t *handle, int nblocks)
{
- if (ext4_handle_valid(handle))
- return jbd2_journal_extend(handle, nblocks);
- return 0;
+ int credits = 0;
+ int err = 0;
+ struct super_block *sb;
+
+ if (!ext4_handle_valid((handle_t *)handle))
+ return 0;
+
+ credits = nblocks;
+ sb = handle->h_transaction->t_journal->j_private;
+ if (EXT4_SNAPSHOTS(sb)) {
+ /* extend transaction to valid buffer/user credits ratio */
+ credits = EXT4_SNAPSHOT_TRANS_BLOCKS(handle->h_user_credits +
+ nblocks) - handle->h_buffer_credits;
+ }
+ if (credits > 0)
+ err = jbd2_journal_extend((handle_t *)handle, credits);
+ if (EXT4_SNAPSHOTS(sb) && !err) {
+ /* update base/user credits for future extends */
+ handle->h_base_credits += nblocks;
+ handle->h_user_credits += nblocks;
+ ext4_journal_trace(SNAP_WARN, where, handle, nblocks);
+ }
+ return err;
}

-static inline int ext4_journal_restart(handle_t *handle, int nblocks)
+/*
+ * Ext4 wrapper for journal_restart()
+ * When transaction runs out of buffer credits and cannot be extended,
+ * the alternative is to restart it (start a new transaction).
+ * This wrapper increases the user requested buffer credits to include the
+ * extra credits for COW operations.
+ */
+static inline int __ext4_journal_restart(const char *where,
+ handle_t *handle, int nblocks)
{
- if (ext4_handle_valid(handle))
- return jbd2_journal_restart(handle, nblocks);
- return 0;
+ int err = 0;
+ int credits = 0;
+ struct super_block *sb;
+
+ if (!ext4_handle_valid((handle_t *)handle))
+ return 0;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ credits = EXT4_SNAPSHOTS(sb) ?
+ EXT4_SNAPSHOT_START_TRANS_BLOCKS(nblocks) : nblocks;
+ err = jbd2_journal_restart((handle_t *)handle, credits);
+ if (EXT4_SNAPSHOTS(sb) && !err) {
+ handle->h_base_credits = nblocks;
+ handle->h_user_credits = nblocks;
+ ext4_journal_trace(SNAP_WARN, where, handle, nblocks);
+ }
+ return err;
}

+#define ext4_journal_extend(handle, nblocks) \
+ __ext4_journal_extend(__func__, (handle), (nblocks))
+
+#define ext4_journal_restart(handle, nblocks) \
+ __ext4_journal_restart(__func__, (handle), (nblocks))
static inline int ext4_journal_blocks_per_page(struct inode *inode)
{
if (EXT4_JOURNAL(inode) != NULL)
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 06c11fd..dff9b5d 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -668,7 +668,7 @@ static void update_backups(struct super_block *sb,

/* Out of journal space, and can't get more - abort - so sad */
if (ext4_handle_valid(handle) &&
- handle->h_buffer_credits == 0 &&
+ !ext4_handle_has_enough_credits(handle, 1) &&
ext4_journal_extend(handle, EXT4_MAX_TRANS_DATA) &&
(err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA)))
break;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a768b63..0bde939 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -248,8 +248,10 @@ static void ext4_put_nojournal(handle_t *handle)
* ext4 prevents a new handle from being started by s_frozen, which
* is in an upper layer.
*/
-handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
+handle_t *__ext4_journal_start(const char *where,
+ struct super_block *sb, int nblocks)
{
+ int credits;
journal_t *journal;
handle_t *handle;

@@ -280,7 +282,18 @@ handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
ext4_abort(sb, "Detected aborted journal");
return ERR_PTR(-EROFS);
}
- return jbd2_journal_start(journal, nblocks);
+
+ credits = EXT4_SNAPSHOTS(sb) ?
+ EXT4_SNAPSHOT_START_TRANS_BLOCKS(nblocks) : nblocks;
+ handle = jbd2_journal_start(journal, credits);
+ if (EXT4_SNAPSHOTS(sb) && !IS_ERR(handle)) {
+ if (handle->h_ref == 1) {
+ handle->h_base_credits = nblocks;
+ handle->h_user_credits = nblocks;
+ }
+ ext4_journal_trace(SNAP_WARN, where, handle, nblocks);
+ }
+ return handle;
}

/*
@@ -3823,6 +3836,27 @@ static journal_t *ext4_get_journal(struct super_block *sb,
return NULL;
}

+ if (EXT4_SNAPSHOTS(sb) &&
+ (journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb)) <
+ EXT4_MIN_JOURNAL_BLOCKS) {
+ ext4_msg(sb, KERN_ERR,
+ "journal is too small (%lld < %u) for snapshots",
+ journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb),
+ EXT4_MIN_JOURNAL_BLOCKS);
+ iput(journal_inode);
+ return NULL;
+ }
+
+ if (EXT4_SNAPSHOTS(sb) &&
+ (journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb)) <
+ EXT4_BIG_JOURNAL_BLOCKS) {
+ snapshot_debug(1, "warning: journal is not big enough "
+ "(%lld < %u) - this might affect concurrent "
+ "filesystem writers performance!\n",
+ journal_inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb),
+ EXT4_BIG_JOURNAL_BLOCKS);
+ }
+
journal = jbd2_journal_init_inode(journal_inode);
if (!journal) {
ext4_msg(sb, KERN_ERR, "Could not load journal inode");
--
1.7.0.4
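
The credit arithmetic in the ext4_jbd2.h hunk above is easy to get wrong when reviewing, so here is a small userspace sketch that recomputes it. The constants are copied from the patch; the function names (snapshot_trans_blocks() and friends) and the journal_bytes() helper are hypothetical stand-ins for the kernel macros, and the 4K block size used in the comments is an assumption.

```c
#include <assert.h>

/* Values copied from the EXT4_*_CREDITS definitions in the patch. */
enum {
	WRITE_CREDITS      = 1,
	ALLOC_CREDITS      = 3,
	COW_BITMAP_CREDITS = 3 * ALLOC_CREDITS,                      /* 9  */
	COW_BLOCK_CREDITS  = 3 * ALLOC_CREDITS + 2 * WRITE_CREDITS,  /* 11 */
	COW_CREDITS        = COW_BLOCK_CREDITS + COW_BITMAP_CREDITS, /* 20 */
	SNAPSHOT_CREDITS   = 3 * WRITE_CREDITS,                      /* 3  */
	MIN_JOURNAL_BLOCKS = 32768,
	BIG_JOURNAL_BLOCKS = 24 * MIN_JOURNAL_BLOCKS
};

/* Mirrors EXT4_SNAPSHOT_TRANS_BLOCKS(n): 21n + 3. */
int snapshot_trans_blocks(int n)
{
	return n * (1 + COW_CREDITS) + SNAPSHOT_CREDITS;
}

/* Mirrors EXT4_SNAPSHOT_START_TRANS_BLOCKS(n): 21n + 6. */
int snapshot_start_trans_blocks(int n)
{
	return n * (1 + COW_CREDITS) + 2 * SNAPSHOT_CREDITS;
}

/*
 * Hypothetical helper: the number of credits __ext4_journal_extend()
 * would ask jbd2 for when snapshots are enabled. A result of 0 means
 * the handle already holds enough buffer credits.
 */
int snapshot_extend_credits(int buffer_credits, int user_credits, int nblocks)
{
	int credits = snapshot_trans_blocks(user_credits + nblocks) -
		      buffer_credits;

	return credits > 0 ? credits : 0;
}

/* Hypothetical helper: journal size in bytes for a given block size. */
unsigned long long journal_bytes(unsigned int blocks, unsigned int block_size)
{
	return (unsigned long long)blocks * block_size;
}

/* Sanity checks for the numbers quoted in the patch comments. */
void credits_self_check(void)
{
	assert(COW_CREDITS == 20);
	assert(snapshot_start_trans_blocks(1) == 21 + 6);
	/* 128M minimum and 3G "big" journal, assuming 4K blocks */
	assert(journal_bytes(MIN_JOURNAL_BLOCKS, 4096) == 128ULL << 20);
	assert(journal_bytes(BIG_JOURNAL_BLOCKS, 4096) == 3ULL << 30);
}
```

With these numbers a handle started for a single user block requests 27 buffer credits, and extending it by one more user block asks for 18 additional credits (45 needed for two user blocks, minus the 27 already held).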


2011-05-09 16:44:17

by Amir G.

Subject: [PATCH RFC 22/30] ext4: snapshot journaled - bypass to save credits

From: Amir Goldstein <[email protected]>

Don't journal COW bitmap indirect blocks to save journal credits.
On very few COW operations (i.e., first block group access after
snapshot take), there may be up to 3 extra blocks allocated for the
active snapshot (i.e., COW bitmap block and up to 2 indirect blocks).
Taking these 2 indirect blocks into account on every COW operation
would further increase the transaction's COW credits factor.
Instead, we choose to pay a small performance penalty on these few
COW bitmap operations and wait until they are synced to disk.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 31 +++++++++++++++++++++++++++----
1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4bc60f1..d23743a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -835,7 +835,8 @@ static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
branch[n].bh = bh;
lock_buffer(bh);
BUFFER_TRACE(bh, "call get_create_access");
- err = ext4_journal_get_create_access(handle, bh);
+ if (!SNAPMAP_ISSYNC(flags))
+ err = ext4_journal_get_create_access(handle, bh);
if (err) {
/* Don't brelse(bh) here; it's done in
* ext4_journal_forget() below */
@@ -862,7 +863,21 @@ static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
unlock_buffer(bh);

BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, inode, bh);
+ /*
+ * When accessing a block group for the first time, the
+ * block bitmap is the first block to be copied to the
+ * snapshot. We don't want to reserve journal credits for
+ * the indirect blocks that map the bitmap copy (the COW
+ * bitmap), so instead of writing through the journal, we
+ * sync the indirect blocks directly to disk. Of course,
+ * this is not good for performance but it only happens once
+ * per snapshot/blockgroup.
+ */
+ if (SNAPMAP_ISSYNC(flags)) {
+ mark_buffer_dirty(bh);
+ sync_dirty_buffer(bh);
+ } else
+ err = ext4_handle_dirty_metadata(handle, inode, bh);
if (err)
goto failed;
}
@@ -871,6 +886,9 @@ static int ext4_alloc_branch_cow(handle_t *handle, struct inode *inode,
failed:
/* Allocation failed, free what we already allocated */
ext4_free_blocks(handle, inode, NULL, new_blocks[0], 1, 0);
+ /* If we bypassed journal, we don't need to forget any block */
+ if (SNAPMAP_ISSYNC(flags))
+ n = 1;
for (i = 1; i <= n ; i++) {
/*
* branch[i].bh is newly allocated, so there is no
@@ -966,13 +984,18 @@ static int ext4_splice_branch_cow(handle_t *handle, struct inode *inode,

err_out:
for (i = 1; i <= num; i++) {
+ int forget = EXT4_FREE_BLOCKS_FORGET;
+
+ /* If we bypassed journal, we don't need to forget */
+ if (SNAPMAP_ISSYNC(flags))
+ forget = 0;
+
/*
* branch[i].bh is newly allocated, so there is no
* need to revoke the block, which is why we don't
* need to set EXT4_FREE_BLOCKS_METADATA.
*/
- ext4_free_blocks(handle, inode, where[i].bh, 0, 1,
- EXT4_FREE_BLOCKS_FORGET);
+ ext4_free_blocks(handle, inode, where[i].bh, 0, 1, forget);
}
if (SNAPMAP_ISMOVE(flags))
/* don't charge snapshot file owner if move failed */
--
1.7.0.4


2011-05-09 16:44:26

by Amir G.

Subject: [PATCH RFC 25/30] ext4: snapshot race conditions - concurrent COW operations

From: Amir Goldstein <[email protected]>

Wait for pending COW operations to complete.
When concurrent tasks try to COW the same buffer, the task that takes
the active snapshot i_data_sem is elected as the COWing task.
The COWing task allocates a new snapshot block and creates a buffer
cache entry with ref_count=1 for that new block. It then locks the
new buffer and marks it with the buffer_new flag. The rest of the
tasks wait (in msleep(1) loop), until the buffer_new flag is cleared.
The COWing task copies the source buffer into the 'new' buffer,
unlocks it, clears the buffer_new flag and drops its reference count.
On active snapshot readpage, the buffer cache is checked.
If a 'new' buffer entry is found, the reader task waits until the
buffer_new flag is cleared and then copies the 'new' buffer directly
into the snapshot file page.
The sleep loop method was copied from LVM snapshot code, which does
the same thing to deal with these (rare) races without wait queues.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/inode.c | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d23743a..794b29f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1049,6 +1049,7 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
int depth;
int count = 0;
ext4_fsblk_t first_block = 0;
+ struct buffer_head *sbh = NULL;

J_ASSERT(!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)));
J_ASSERT(handle != NULL || (flags & EXT4_GET_BLOCKS_CREATE) == 0);
@@ -1154,6 +1155,25 @@ static int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
if (err)
goto cleanup;

+ if (SNAPMAP_ISCOW(flags)) {
+ /*
+ * COWing block or creating COW bitmap.
+ * we now have exclusive access to the COW destination block
+ * and we are about to create the snapshot block mapping
+ * and make it public.
+ * grab the buffer cache entry and mark it new
+ * to indicate a pending COW operation.
+ * the refcount for the buffer cache will be released
+ * when the COW operation is either completed or canceled.
+ */
+ sbh = sb_getblk(inode->i_sb, le32_to_cpu(chain[depth-1].key));
+ if (!sbh) {
+ err = -EIO;
+ goto cleanup;
+ }
+ ext4_snapshot_start_pending_cow(sbh);
+ }
+
if (map->m_flags & EXT4_MAP_REMAP) {
map->m_len = count;
/* move old block to snapshot */
@@ -1197,6 +1217,12 @@ got_it:
/* Clean up and exit */
partial = chain + depth - 1; /* the whole chain */
cleanup:
+ /* cancel pending COW operation on failure to alloc snapshot block */
+ if (SNAPMAP_ISCOW(flags)) {
+ if (err < 0 && sbh)
+ ext4_snapshot_end_pending_cow(sbh);
+ brelse(sbh);
+ }
while (partial > chain) {
BUFFER_TRACE(partial->bh, "call brelse");
brelse(partial->bh);
--
1.7.0.4
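
The pending-COW handshake described in the commit message of patch 25/30 can be modelled in userspace as follows. This is only a sketch under assumptions: struct snap_buf is a hypothetical stand-in for a buffer_head, an atomic flag plays the role of buffer_new, nanosleep() replaces msleep(1), and the election of the COWing task by i_data_sem is not modelled at all.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

#define SNAP_BLOCK_SIZE 4096

/* Hypothetical stand-in for a buffer_head; not the kernel structure. */
struct snap_buf {
	atomic_bool pending_cow;	/* models the buffer_new flag */
	unsigned char data[SNAP_BLOCK_SIZE];
};

void snap_buf_init(struct snap_buf *sbh)
{
	atomic_init(&sbh->pending_cow, false);
	memset(sbh->data, 0, sizeof(sbh->data));
}

/* COWing task: announce the pending COW before the mapping goes public. */
void start_pending_cow(struct snap_buf *sbh)
{
	atomic_store(&sbh->pending_cow, true);
}

/* COWing task: the copy is complete (or canceled), release the waiters. */
void end_pending_cow(struct snap_buf *sbh)
{
	atomic_store(&sbh->pending_cow, false);
}

bool cow_is_pending(struct snap_buf *sbh)
{
	return atomic_load(&sbh->pending_cow);
}

/*
 * All other tasks: poll in 1 ms steps until the flag is cleared.
 * Returns the number of sleeps, which is 0 when no COW was pending.
 */
int wait_pending_cow(struct snap_buf *sbh)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
	int sleeps = 0;

	while (cow_is_pending(sbh)) {
		nanosleep(&one_ms, NULL);
		sleeps++;
	}
	return sleeps;
}

/* COW one block: mark it pending, copy the source, clear the flag. */
void cow_block(struct snap_buf *sbh, const unsigned char *src)
{
	assert(src != NULL);
	start_pending_cow(sbh);
	memcpy(sbh->data, src, SNAP_BLOCK_SIZE);
	end_pending_cow(sbh);
}
```

A reader of the active snapshot would call wait_pending_cow() before copying from sbh->data, which corresponds to the readpage path waiting for buffer_new to be cleared.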


2011-05-09 16:44:28

by Amir G.

Subject: [PATCH RFC 26/30] ext4: snapshot race conditions - tracked reads

From: Amir Goldstein <[email protected]>

Wait for pending read I/O requests to complete.
When a snapshot file readpage reads through to the block device,
the reading task increments the block tracked readers count.
Upon completion of the async read I/O request of the snapshot page,
the tracked readers count is decremented.
When a task is COWing a block with non-zero tracked readers count,
that task has to wait (in msleep(1) loop), until the block's tracked
readers count drops to zero, before the COW operation is completed.
After a pending COW operation has started, reader tasks have to wait
(again, in msleep(1) loop), until the pending COW operation is
completed, so the COWing task cannot be starved by reader tasks.
The sleep loop method was copied from LVM snapshot code, which does
the same thing to deal with these (rare) races without wait queues.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a7bb8ed..bf5aa4d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2249,12 +2249,18 @@ enum ext4_state_bits {
* now used by snapshot to do mow
*/
BH_Partial_Write, /* Buffer should be uptodate before write */
+ BH_Tracked_Read, /* Buffer read I/O is being tracked,
+ * to serialize write I/O to block device.
+ * that is, don't write over this block
+ * until I finished reading it.
+ */
};

BUFFER_FNS(Uninit, uninit)
TAS_BUFFER_FNS(Uninit, uninit)
BUFFER_FNS(Remap, remap)
BUFFER_FNS(Partial_Write, partial_write)
+BUFFER_FNS(Tracked_Read, tracked_read)

/*
* Add new method to test wether block and inode bitmaps are properly
--
1.7.0.4
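
The reader/COWer ordering described in the commit message of patch 26/30 can be sketched in userspace as below. The names (struct tracked_block, tracked_read_start(), cow_may_complete()) are hypothetical, and the use of C11 atomics is an assumption made for the sketch; the snapshot code itself lives in files that are not part of this series. The callers are expected to sleep and retry whenever a function returns false, which stands in for the msleep(1) loops.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-block state; not a kernel structure. */
struct tracked_block {
	atomic_int  readers;		/* in-flight read-through requests */
	atomic_bool pending_cow;	/* a COW of this block has started */
};

void tracked_block_init(struct tracked_block *b)
{
	atomic_init(&b->readers, 0);
	atomic_init(&b->pending_cow, false);
}

/*
 * Reader: register first, then check for a pending COW. If a COW has
 * already started, back off so the COWing task cannot be starved.
 * Returns false when the caller must sleep and retry.
 */
bool tracked_read_start(struct tracked_block *b)
{
	atomic_fetch_add(&b->readers, 1);
	if (atomic_load(&b->pending_cow)) {
		atomic_fetch_sub(&b->readers, 1);
		return false;
	}
	return true;
}

/* Reader: called when the async read I/O completes. */
void tracked_read_end(struct tracked_block *b)
{
	int prev = atomic_fetch_sub(&b->readers, 1);

	assert(prev > 0);
}

/* COWing task: from now on new readers are refused. */
void cow_start(struct tracked_block *b)
{
	atomic_store(&b->pending_cow, true);
}

/* COWing task: may finish only once all tracked readers are gone. */
bool cow_may_complete(struct tracked_block *b)
{
	return atomic_load(&b->readers) == 0;
}

void cow_end(struct tracked_block *b)
{
	atomic_store(&b->pending_cow, false);
}

int tracked_readers(struct tracked_block *b)
{
	return atomic_load(&b->readers);
}
```

The reader increments the count before testing the flag, and the COWing task sets the flag before testing the count, so with sequentially consistent atomics at least one side always notices the other.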


2011-05-09 16:44:30

by Amir G.

Subject: [PATCH RFC 27/30] ext4: snapshot exclude - the exclude bitmap

From: Amir Goldstein <[email protected]>

Mark all snapshot blocks excluded from COW (i.e., mark that
they do not need to be COWed). The excluded blocks appear as not
allocated inside the snapshot image (no snapshots of snapshot files).
Excluding snapshot file blocks is essential for efficient cleanup
of deleted snapshot files.
Excluding blocks is done by setting their bit in the exclude bitmap.
There is one exclude bitmap block per block group, which is allocated
on mkfs when setting the exclude_bitmap feature. The exclude bitmap
location is stored in the group descriptor.
The exclude_bitmap feature is backward compatible, but online resize
support with exclude_bitmap has not been implemented yet.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/balloc.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 12 ++++++-
fs/ext4/mballoc.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/resize.c | 6 +++
fs/ext4/super.c | 35 +++++++++++++++++++-
5 files changed, 230 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 7d22e50..37ea965 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -359,6 +359,100 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
return bh;
}

+/* Initializes an uninitialized exclude bitmap if given, and returns 0 */
+unsigned ext4_init_exclude_bitmap(struct super_block *sb,
+ struct buffer_head *bh,
+ ext4_group_t block_group,
+ struct ext4_group_desc *gdp)
+{
+ if (!bh)
+ /* we can return no. of blocks in exclude bitmap */
+ return 0;
+
+ J_ASSERT_BH(bh, buffer_locked(bh));
+ memset(bh->b_data, 0, sb->s_blocksize);
+ return 0;
+}
+
+/**
+ * ext4_read_exclude_bitmap()
+ * @sb: super block
+ * @block_group: given block group
+ *
+ * Read the exclude bitmap for a given block_group
+ *
+ * Return buffer_head on success or NULL in case of failure.
+ */
+struct buffer_head *
+ext4_read_exclude_bitmap(struct super_block *sb, ext4_group_t block_group)
+{
+ struct ext4_group_desc *desc;
+ struct buffer_head *bh = NULL;
+ ext4_fsblk_t bitmap_blk;
+
+ desc = ext4_get_group_desc(sb, block_group, NULL);
+ if (!desc)
+ return NULL;
+ bitmap_blk = ext4_exclude_bitmap(sb, desc);
+ if (!bitmap_blk)
+ return NULL;
+ bh = sb_getblk(sb, bitmap_blk);
+ if (unlikely(!bh)) {
+ ext4_error(sb, "Cannot read exclude bitmap - "
+ "block_group = %u, exclude_bitmap = %llu",
+ block_group, bitmap_blk);
+ return NULL;
+ }
+
+ if (bitmap_uptodate(bh))
+ return bh;
+
+ lock_buffer(bh);
+ if (bitmap_uptodate(bh)) {
+ unlock_buffer(bh);
+ return bh;
+ }
+
+ ext4_lock_group(sb, block_group);
+ if (desc->bg_flags & cpu_to_le16(EXT4_BG_EXCLUDE_UNINIT)) {
+ ext4_init_exclude_bitmap(sb, bh, block_group, desc);
+ set_bitmap_uptodate(bh);
+ set_buffer_uptodate(bh);
+ ext4_unlock_group(sb, block_group);
+ unlock_buffer(bh);
+ return bh;
+ }
+ ext4_unlock_group(sb, block_group);
+ if (buffer_uptodate(bh)) {
+ /*
+ * the group is not uninit, so if bh is uptodate,
+ * the bitmap is also uptodate
+ */
+ set_bitmap_uptodate(bh);
+ unlock_buffer(bh);
+ return bh;
+ }
+ /*
+ * submit the buffer_head for read. We can
+ * safely mark the bitmap as uptodate now.
+ * We do it here so the bitmap uptodate bit
+ * gets set with buffer lock held.
+ */
+ set_bitmap_uptodate(bh);
+ if (bh_submit_read(bh) < 0) {
+ put_bh(bh);
+ ext4_error(sb, "Cannot read exclude bitmap - "
+ "block_group = %u, exclude_bitmap = %llu",
+ block_group, bitmap_blk);
+ return NULL;
+ }
+ /*
+ * file system mounted not to panic on error,
+ * continue with corrupt bitmap
+ */
+ return bh;
+}
+
/**
* ext4_has_free_blocks()
* @sbi: in-core super block structure.
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf5aa4d..37c608b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -282,7 +282,8 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_lo;/* Free inodes count */
__le16 bg_used_dirs_count_lo; /* Directories count */
__le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
- __u32 bg_reserved[2]; /* Likely block/inode bitmap checksum */
+ __le32 bg_exclude_bitmap_lo; /* Exclude bitmap block */
+ __u32 bg_reserved[1]; /* Likely block/inode bitmap checksum */
__le16 bg_itable_unused_lo; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
@@ -292,7 +293,8 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
__le16 bg_used_dirs_count_hi; /* Directories count MSB */
__le16 bg_itable_unused_hi; /* Unused inodes count MSB */
- __u32 bg_reserved2[3];
+ __le32 bg_exclude_bitmap_hi; /* Exclude bitmap block MSB */
+ __u32 bg_reserved2[2];
};

/*
@@ -308,6 +310,7 @@ struct flex_groups {
#define EXT4_BG_INODE_UNINIT 0x0001 /* Inode table/bitmap not in use */
#define EXT4_BG_BLOCK_UNINIT 0x0002 /* Block bitmap not in use */
#define EXT4_BG_INODE_ZEROED 0x0004 /* On-disk itable initialized to zero */
+#define EXT4_BG_EXCLUDE_UNINIT 0x0008 /* Exclude bitmap not in use */

/*
* Macro-instructions used to manage group descriptors
@@ -1459,6 +1462,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_COMPAT_EXT_ATTR 0x0008
#define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010
#define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020
+#define EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP 0x0100 /* Has exclude bitmap */

#define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT4_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
@@ -1786,6 +1790,8 @@ extern unsigned ext4_init_block_bitmap(struct super_block *sb,
struct ext4_group_desc *desc);
#define ext4_free_blocks_after_init(sb, group, desc) \
ext4_init_block_bitmap(sb, NULL, group, desc)
+extern struct buffer_head *ext4_read_exclude_bitmap(struct super_block *sb,
+ unsigned int block_group);

/* dir.c */
extern int __ext4_check_dir_entry(const char *, unsigned int, struct inode *,
@@ -1947,6 +1953,8 @@ extern ext4_fsblk_t ext4_block_bitmap(struct super_block *sb,
struct ext4_group_desc *bg);
extern ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
struct ext4_group_desc *bg);
+extern ext4_fsblk_t ext4_exclude_bitmap(struct super_block *sb,
+ struct ext4_group_desc *bg);
extern ext4_fsblk_t ext4_inode_table(struct super_block *sb,
struct ext4_group_desc *bg);
extern __u32 ext4_free_blks_count(struct super_block *sb,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4813b15..46bfb7e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2753,6 +2753,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
struct super_block *sb;
ext4_fsblk_t block;
int err, len;
+ struct buffer_head *exclude_bitmap_bh = NULL;

BUG_ON(ac->ac_status != AC_STATUS_FOUND);
BUG_ON(ac->ac_b_ex.fe_len <= 0);
@@ -2761,6 +2762,22 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
sbi = EXT4_SB(sb);

err = -EIO;
+ if (EXT4_SNAPSHOTS(sb) && ext4_snapshot_excluded(ac->ac_inode)) {
+ /*
+ * try to read exclude bitmap
+ */
+ exclude_bitmap_bh = ext4_read_exclude_bitmap(sb,
+ ac->ac_b_ex.fe_group);
+ if (!exclude_bitmap_bh)
+ goto out_err;
+ err = ext4_journal_get_write_access_exclude(handle,
+ exclude_bitmap_bh);
+ if (err) {
+ brelse(exclude_bitmap_bh);
+ goto out_err;
+ }
+ }
+
bitmap_bh = ext4_read_block_bitmap(sb, ac->ac_b_ex.fe_group);
if (!bitmap_bh)
goto out_err;
@@ -2813,6 +2830,9 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
}
#endif
mb_set_bits(bitmap_bh->b_data, ac->ac_b_ex.fe_start,ac->ac_b_ex.fe_len);
+ if (exclude_bitmap_bh)
+ mb_set_bits(exclude_bitmap_bh->b_data, ac->ac_b_ex.fe_start,
+ ac->ac_b_ex.fe_len);
if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
ext4_free_blks_set(sb, gdp,
@@ -2840,6 +2860,9 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
}

err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
+ if (!err && exclude_bitmap_bh)
+ err = ext4_handle_dirty_metadata(handle, NULL,
+ exclude_bitmap_bh);
if (err)
goto out_err;
err = ext4_handle_dirty_metadata(handle, NULL, gdp_bh);
@@ -2847,6 +2870,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
out_err:
ext4_mark_super_dirty(sb);
brelse(bitmap_bh);
+ brelse(exclude_bitmap_bh);
return err;
}

@@ -4509,6 +4533,12 @@ void __ext4_free_blocks(const char *where, unsigned int line, handle_t *handle,
int err = 0;
int ret;
int maxblocks;
+ struct buffer_head *exclude_bitmap_bh = NULL;
+ int exclude_bitmap_dirty = 0;
+ /* excluded_block is determined by testing exclude bitmap */
+ int excluded_block = 0;
+ /* excluded_file is an attribute of the inode */
+ int excluded_file = ext4_snapshot_excluded(inode);

if (bh) {
if (block)
@@ -4628,6 +4658,26 @@ do_more:
err = ext4_journal_get_write_access(handle, gd_bh);
if (err)
goto error_return;
+ /*
+ * we may be freeing blocks of snapshot/excluded file
+ * which we would need to clear from exclude bitmap -
+ * try to read exclude bitmap and fail the free with -EIO
+ * if it cannot be read
+ */
+ if (EXT4_SNAPSHOTS(sb)) {
+ brelse(exclude_bitmap_bh);
+ exclude_bitmap_bh = ext4_read_exclude_bitmap(sb, block_group);
+ if (!exclude_bitmap_bh) {
+ err = -EIO;
+ goto error_return;
+ }
+ err = ext4_journal_get_write_access_exclude(handle,
+ exclude_bitmap_bh);
+ if (err)
+ goto error_return;
+ exclude_bitmap_dirty = 0;
+ }
+
#ifdef AGGRESSIVE_CHECK
{
int i;
@@ -4635,6 +4685,27 @@ do_more:
BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
}
#endif
+ if (exclude_bitmap_bh) {
+ int i;
+ if (excluded_file)
+ i = mb_find_next_zero_bit(exclude_bitmap_bh->b_data,
+ bit + count, bit) - bit;
+ else
+ i = mb_find_next_bit(exclude_bitmap_bh->b_data,
+ bit + count, bit) - bit;
+ if (i < count) {
+ EXT4_SET_FLAGS(sb, EXT4_FLAGS_FIX_EXCLUDE);
+ ext4_error(sb, "%sexcluded file (ino=%lu)"
+ " block [%d/%u] was %sexcluded!"
+ " - run fsck to fix exclude bitmap.\n",
+ excluded_file ? "" : "non-",
+ inode ? inode->i_ino : 0,
+ bit + i, block_group,
+ excluded_file ? "not " : "");
+ if (!excluded_file)
+ excluded_block = 1;
+ }
+ }
trace_ext4_mballoc_free(sb, inode, block_group, bit, count);

err = ext4_mb_load_buddy(sb, block_group, &e4b);
@@ -4669,6 +4740,14 @@ do_more:
mb_clear_bits(bitmap_bh->b_data, bit, count);
mb_free_blocks(inode, &e4b, bit, count);
}
+ /*
+ * A free block should never be excluded from snapshot, so we
+ * always clear exclude bitmap just to be on the safe side.
+ */
+ if (exclude_bitmap_bh && (excluded_file || excluded_block)) {
+ mb_clear_bits(exclude_bitmap_bh->b_data, bit, count);
+ exclude_bitmap_dirty = 1;
+ }

ret = ext4_free_blks_count(sb, gdp) + count;
ext4_free_blks_set(sb, gdp, ret);
@@ -4688,6 +4767,12 @@ do_more:
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
+ if (exclude_bitmap_bh && exclude_bitmap_dirty) {
+ ret = ext4_handle_dirty_metadata(handle, NULL,
+ exclude_bitmap_bh);
+ if (!err)
+ err = ret;
+ }

/* And the group descriptor block */
BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
@@ -4705,6 +4790,7 @@ do_more:
error_return:
if (freed)
dquot_free_block(inode, freed);
+ brelse(exclude_bitmap_bh);
brelse(bitmap_bh);
ext4_std_error(sb, err);
return;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index dff9b5d..57c421a 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -753,6 +753,12 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
gdb_num = input->group / EXT4_DESC_PER_BLOCK(sb);
gdb_off = input->group % EXT4_DESC_PER_BLOCK(sb);

+ if (EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP)) {
+ ext4_warning(sb, "Can't resize filesystem with exclude bitmap");
+ return -ENOTSUPP;
+ }
+
if (gdb_off == 0 && !EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER)) {
ext4_warning(sb, "Can't resize non-sparse filesystem further");
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 461653c..68c4d18 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -110,6 +110,18 @@ ext4_fsblk_t ext4_inode_bitmap(struct super_block *sb,
(ext4_fsblk_t)le32_to_cpu(bg->bg_inode_bitmap_hi) << 32 : 0);
}

+ext4_fsblk_t ext4_exclude_bitmap(struct super_block *sb,
+ struct ext4_group_desc *bg)
+{
+ if (!EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP))
+ return 0;
+
+ return le32_to_cpu(bg->bg_exclude_bitmap_lo) |
+ (EXT4_DESC_SIZE(sb) >= EXT4_MIN_DESC_SIZE_64BIT ?
+ (ext4_fsblk_t)le32_to_cpu(bg->bg_exclude_bitmap_hi) << 32 : 0);
+}
+
ext4_fsblk_t ext4_inode_table(struct super_block *sb,
struct ext4_group_desc *bg)
{
@@ -2054,6 +2066,7 @@ static int ext4_check_descriptors(struct super_block *sb,
ext4_fsblk_t block_bitmap;
ext4_fsblk_t inode_bitmap;
ext4_fsblk_t inode_table;
+ ext4_fsblk_t exclude_bitmap;
int flexbg_flag = 0;
ext4_group_t i, grp = sbi->s_groups_count;

@@ -2097,10 +2110,23 @@ static int ext4_check_descriptors(struct super_block *sb,
"(block %llu)!", i, inode_table);
return 0;
}
+ if (EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP)) {
+ exclude_bitmap = ext4_exclude_bitmap(sb, gdp);
+ if (exclude_bitmap < first_block ||
+ exclude_bitmap > last_block) {
+ ext4_msg(sb, KERN_ERR,
+ "ext4_check_descriptors: "
+ "Exclude bitmap for group %u "
+ "not in group (block %llu)!",
+ i, exclude_bitmap);
+ return 0;
+ }
+ }
ext4_lock_group(sb, i);
if (!ext4_group_desc_csum_verify(sbi, i, gdp)) {
ext4_msg(sb, KERN_ERR, "ext4_check_descriptors: "
- "Checksum for group %u failed (%u!=%u)",
+ "Checksum for group %u failed (%x!=%x)",
i, le16_to_cpu(ext4_group_desc_csum(sbi, i,
gdp)), le16_to_cpu(gdp->bg_checksum));
if (!(sb->s_flags & MS_RDONLY)) {
@@ -2640,6 +2666,13 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
"features: meta_bg, 64bit");
return 0;
}
+ if (!EXT4_HAS_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_COMPAT_EXCLUDE_BITMAP)) {
+ ext4_msg(sb, KERN_ERR,
+ "exclude_bitmap feature is required "
+ "for snapshots");
+ return 0;
+ }
if (EXT4_TEST_FLAGS(sb, EXT4_FLAGS_IS_SNAPSHOT)) {
ext4_msg(sb, KERN_ERR,
"A snapshot image must be mounted read-only. "
--
1.7.0.4
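
The exclude bitmap bookkeeping in patch 27/30 can be illustrated with a small userspace model. This is a sketch under assumptions: struct group_bitmaps, alloc_blocks(), free_blocks() and exclude_bitmap_block() are hypothetical names, the group is shrunk to 64 blocks, and journaling and locking are left out. It shows three things from the patch: blocks of an excluded file get their exclude bit set on allocation, freeing checks that the exclude bits match the file's excluded state and always leaves them clear, and the bitmap location is assembled from the lo/hi halves of the group descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_GROUP 64	/* tiny group, for illustration only */

/* Hypothetical model of one block group; not a kernel structure. */
struct group_bitmaps {
	uint8_t block[BLOCKS_PER_GROUP / 8];	/* allocated blocks */
	uint8_t exclude[BLOCKS_PER_GROUP / 8];	/* blocks excluded from COW */
};

bool bit_is_set(const uint8_t *map, unsigned int bit)
{
	return (map[bit / 8] >> (bit % 8)) & 1;
}

void set_bits(uint8_t *map, unsigned int start, unsigned int len)
{
	for (unsigned int i = start; i < start + len; i++)
		map[i / 8] |= (uint8_t)(1u << (i % 8));
}

void clear_bits(uint8_t *map, unsigned int start, unsigned int len)
{
	for (unsigned int i = start; i < start + len; i++)
		map[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* Allocation: blocks of an excluded (snapshot) file are also excluded. */
void alloc_blocks(struct group_bitmaps *g, unsigned int start,
		  unsigned int len, bool excluded_file)
{
	assert(start + len <= BLOCKS_PER_GROUP);
	set_bits(g->block, start, len);
	if (excluded_file)
		set_bits(g->exclude, start, len);
}

/*
 * Free: returns false if any exclude bit disagrees with the file's
 * excluded state (the case where the patch flags the fs for fsck).
 * A free block is never left excluded, so the bits are always cleared.
 */
bool free_blocks(struct group_bitmaps *g, unsigned int start,
		 unsigned int len, bool excluded_file)
{
	bool consistent = true;

	assert(start + len <= BLOCKS_PER_GROUP);
	for (unsigned int i = start; i < start + len; i++)
		if (bit_is_set(g->exclude, i) != excluded_file)
			consistent = false;
	clear_bits(g->block, start, len);
	clear_bits(g->exclude, start, len);
	return consistent;
}

/* Join the lo/hi descriptor fields; hi is used only with 64bit descs. */
uint64_t exclude_bitmap_block(uint32_t lo, uint32_t hi, bool desc_64bit)
{
	return (uint64_t)lo | (desc_64bit ? (uint64_t)hi << 32 : 0);
}
```

Clearing the exclude bits unconditionally on free has the same effect as the conditional clear in the patch, because the bits of a non-excluded file are expected to be zero already.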


2011-05-09 16:44:33

by Amir G.

Subject: [PATCH RFC 28/30] ext4: snapshot cleanup

From: Amir Goldstein <[email protected]>

Cleanup snapshots list and reclaim unused blocks of deleted snapshots.
Oldest snapshot can be removed from list and its blocks can be freed.
Non-oldest snapshots have to be shrunk and merged before they can be
removed from the list. All snapshot blocks must be excluded in order
to properly shrink/merge deleted old snapshots.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 16 ++++++++++++++++
fs/ext4/inode.c | 19 +++++++++++--------
2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 37c608b..7650515 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1748,6 +1748,12 @@ struct ext4_features {
struct completion f_kobj_unregister;
};

+typedef struct {
+ __le32 *p;
+ __le32 key;
+ struct buffer_head *bh;
+} Indirect;
+
/*
* Function prototypes
*/
@@ -1889,6 +1895,16 @@ extern void ext4_da_update_reserve_space(struct inode *inode,
/* snapshot_inode.c */
extern int ext4_snapshot_readpage(struct file *file, struct page *page);

+extern int ext4_block_to_path(struct inode *inode,
+ ext4_lblk_t i_block,
+ ext4_lblk_t offsets[4], int *boundary);
+extern Indirect *ext4_get_branch(struct inode *inode, int depth,
+ ext4_lblk_t *offsets,
+ Indirect chain[4], int *err);
+extern void ext4_free_branches(handle_t *handle, struct inode *inode,
+ struct buffer_head *parent_bh,
+ __le32 *first, __le32 *last,
+ int depth);
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 794b29f..d46da6a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -176,6 +176,14 @@ int ext4_truncate_restart_trans(handle_t *handle, struct inode *inode,
*/
BUG_ON(EXT4_JOURNAL(inode) == NULL);
jbd_debug(2, "restarting handle %p\n", handle);
+ /*
+ * Snapshot shrink/merge/clean do not take i_data_sem, so we cannot
+ * release it here. Luckily, snapshot files are not writable,
+ * so deadlock with ext4_map_blocks on writepage is impossible.
+ * Snapshot files also don't have preallocations.
+ */
+ if (ext4_snapshot_file(inode))
+ return ext4_journal_restart(handle, nblocks);
up_write(&EXT4_I(inode)->i_data_sem);
ret = ext4_journal_restart(handle, nblocks);
down_write(&EXT4_I(inode)->i_data_sem);
@@ -281,11 +289,6 @@ no_delete:
ext4_clear_inode(inode); /* We must guarantee clearing of inode... */
}

-typedef struct {
- __le32 *p;
- __le32 key;
- struct buffer_head *bh;
-} Indirect;

static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
{
@@ -324,7 +327,7 @@ static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
* get there at all.
*/

-static int ext4_block_to_path(struct inode *inode,
+int ext4_block_to_path(struct inode *inode,
ext4_lblk_t i_block,
ext4_lblk_t offsets[4], int *boundary)
{
@@ -440,7 +443,7 @@ static int __ext4_check_blockref(const char *function, unsigned int line,
* Need to be called with
* down_read(&EXT4_I(inode)->i_data_sem)
*/
-static Indirect *ext4_get_branch(struct inode *inode, int depth,
+Indirect *ext4_get_branch(struct inode *inode, int depth,
ext4_lblk_t *offsets,
Indirect chain[4], int *err)
{
@@ -4702,7 +4705,7 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
* stored as little-endian 32-bit) and updating @inode->i_blocks
* appropriately.
*/
-static void ext4_free_branches(handle_t *handle, struct inode *inode,
+void ext4_free_branches(handle_t *handle, struct inode *inode,
struct buffer_head *parent_bh,
__le32 *first, __le32 *last, int depth)
{
--
1.7.0.4
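
For readers unfamiliar with the helper that the patch above exports, here is a
minimal userspace sketch of what a block-to-path translation does for an
indirect-mapped file. It follows the classic layout (12 direct slots, then one
single, one double and one triple indirect slot). The function name
block_to_path, the enum names and the ptrs_bits parameter are illustrative
assumptions, not the kernel's signature, and the boundary output of the real
function is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Slot numbers in the inode's i_data[] array (classic indirect layout). */
enum { NDIR_BLOCKS = 12, IND_BLOCK = 12, DIND_BLOCK = 13, TIND_BLOCK = 14 };

/*
 * Translate a logical block number into a path of up to 4 offsets.
 * ptrs_bits is log2(pointers per block), e.g. 10 for 4K blocks.
 * Returns the path depth (1..4), or 0 if the block is out of range.
 */
int block_to_path(uint64_t i_block, unsigned ptrs_bits, uint32_t offsets[4])
{
	const uint64_t ptrs = 1ULL << ptrs_bits;
	int n = 0;

	assert(ptrs_bits > 0 && ptrs_bits < 16);

	if (i_block < NDIR_BLOCKS) {
		/* direct block: the offset is the slot itself */
		offsets[n++] = (uint32_t)i_block;
	} else if ((i_block -= NDIR_BLOCKS) < ptrs) {
		offsets[n++] = IND_BLOCK;
		offsets[n++] = (uint32_t)i_block;
	} else if ((i_block -= ptrs) < ptrs * ptrs) {
		offsets[n++] = DIND_BLOCK;
		offsets[n++] = (uint32_t)(i_block >> ptrs_bits);
		offsets[n++] = (uint32_t)(i_block & (ptrs - 1));
	} else if (((i_block -= ptrs * ptrs) >> (2 * ptrs_bits)) < ptrs) {
		offsets[n++] = TIND_BLOCK;
		offsets[n++] = (uint32_t)(i_block >> (2 * ptrs_bits));
		offsets[n++] = (uint32_t)((i_block >> ptrs_bits) & (ptrs - 1));
		offsets[n++] = (uint32_t)(i_block & (ptrs - 1));
	}
	return n;
}
```

With 4K blocks (ptrs_bits = 10), logical block 12 is the first block behind the
single indirect slot, so its path is {12, 0} with depth 2.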


2011-05-09 16:44:35

by Amir G.

Subject: [PATCH RFC 29/30] ext4: snapshot cleanup - shrink deleted snapshots

From: Amir Goldstein <[email protected]>

Free blocks of deleted snapshots that are not in use by an older
non-deleted snapshot. Shrinking helps reclaim disk space
while older snapshots are still in use (enabled).
We modify the indirect inode truncate helper functions so that they
can be used by the snapshot cleanup functions to free blocks
selectively according to a COW bitmap buffer.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 10 ++++++++++
fs/ext4/inode.c | 51 +++++++++++++++++++++++++++++++++++++--------------
2 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7650515..07629ce 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1905,6 +1905,16 @@ extern void ext4_free_branches(handle_t *handle, struct inode *inode,
struct buffer_head *parent_bh,
__le32 *first, __le32 *last,
int depth);
+extern void ext4_free_data_cow(handle_t *handle, struct inode *inode,
+ struct buffer_head *this_bh,
+ __le32 *first, __le32 *last,
+ const char *bitmap, int bit,
+ int *pfreed_blocks);
+
+#define ext4_free_data(handle, inode, bh, first, last) \
+ ext4_free_data_cow(handle, inode, bh, first, last, \
+ NULL, 0, NULL)
+
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d46da6a..e3bfee2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4546,11 +4546,15 @@ no_top:
* Return 0 on success, 1 on invalid block range
* and < 0 on fatal error.
*/
-static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
- struct buffer_head *bh,
- ext4_fsblk_t block_to_free,
- unsigned long count, __le32 *first,
- __le32 *last)
+/*
+ * ext4_clear_blocks_cow - Zero a number of block pointers (consult COW bitmap)
+ * @bitmap: COW bitmap to consult when shrinking deleted snapshot
+ * @bit: bit number representing the @first block
+ */
+static int ext4_clear_blocks_cow(handle_t *handle, struct inode *inode,
+ struct buffer_head *bh, ext4_fsblk_t block_to_free,
+ unsigned long count, __le32 *first, __le32 *last,
+ const char *bitmap, int bit)
{
__le32 *p;
int flags = EXT4_FREE_BLOCKS_FORGET | EXT4_FREE_BLOCKS_VALIDATED;
@@ -4590,8 +4594,12 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
}
}

- for (p = first; p < last; p++)
+ for (p = first; p < last; p++) {
+ if (*p && bitmap && ext4_test_bit(bit + (p - first), bitmap))
+ /* don't free block used by older snapshot */
+ continue;
*p = 0;
+ }

ext4_free_blocks(handle, inode, NULL, block_to_free, count, flags);
return 0;
@@ -4619,9 +4627,17 @@ out_err:
* @this_bh will be %NULL if @first and @last point into the inode's direct
* block pointers.
*/
-static void ext4_free_data(handle_t *handle, struct inode *inode,
+/*
+ * ext4_free_data_cow - free a list of data blocks (consult COW bitmap)
+ * @bitmap: COW bitmap to consult when shrinking deleted snapshot
+ * @bit: bit number representing the @first block
+ * @pfreed_blocks: return number of freed blocks
+ */
+void ext4_free_data_cow(handle_t *handle, struct inode *inode,
struct buffer_head *this_bh,
- __le32 *first, __le32 *last)
+ __le32 *first, __le32 *last,
+ const char *bitmap, int bit,
+ int *pfreed_blocks)
{
ext4_fsblk_t block_to_free = 0; /* Starting block # of a run */
unsigned long count = 0; /* Number of blocks in the run */
@@ -4645,6 +4661,11 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,

for (p = first; p < last; p++) {
nr = le32_to_cpu(*p);
+ if (nr && bitmap && ext4_test_bit(bit + (p - first), bitmap))
+ /* don't free block used by older snapshot */
+ nr = 0;
+ if (nr && pfreed_blocks)
+ ++(*pfreed_blocks);
if (nr) {
/* accumulate blocks to free if they're contiguous */
if (count == 0) {
@@ -4654,9 +4675,10 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
} else if (nr == block_to_free + count) {
count++;
} else {
- err = ext4_clear_blocks(handle, inode, this_bh,
- block_to_free, count,
- block_to_free_p, p);
+ err = ext4_clear_blocks_cow(handle, inode,
+ this_bh, block_to_free, count,
+ block_to_free_p, p, bitmap,
+ bit + (block_to_free_p - first));
if (err)
break;
block_to_free = nr;
@@ -4666,9 +4688,10 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
}
}

- if (!err && count > 0)
- err = ext4_clear_blocks(handle, inode, this_bh, block_to_free,
- count, block_to_free_p, p);
+ if (count > 0)
+ err = ext4_clear_blocks_cow(handle, inode, this_bh,
+ block_to_free, count, block_to_free_p, p,
+ bitmap, bit + (block_to_free_p - first));
if (err < 0)
/* fatal error */
return;
--
1.7.0.4
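
The selective freeing added by the patch above can be illustrated with a small
userspace sketch: walk an array of block pointers, skip the ones whose bit is
set in the COW bitmap, and zero the rest. The names free_data_cow_sketch and
test_bit_le are hypothetical; the sketch ignores on-disk endianness, journal
credits and the run coalescing that the real ext4_free_data_cow() performs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test bit nr in a byte-addressed, little-endian-ordered bitmap. */
static int test_bit_le(size_t nr, const unsigned char *bitmap)
{
	return (bitmap[nr >> 3] >> (nr & 7)) & 1;
}

/*
 * Zero every non-zero pointer in [first, last) unless its bit is set in
 * the bitmap. "bit" is the bitmap position that corresponds to *first.
 * A NULL bitmap means "free everything", like the plain truncate path.
 * Returns the number of pointers that were cleared.
 */
int free_data_cow_sketch(uint32_t *first, uint32_t *last,
			 const unsigned char *bitmap, size_t bit)
{
	int freed = 0;

	for (uint32_t *p = first; p < last; p++) {
		if (*p == 0)
			continue;	/* hole, nothing to free */
		if (bitmap && test_bit_le(bit + (size_t)(p - first), bitmap))
			continue;	/* still used by an older snapshot */
		*p = 0;
		freed++;
	}
	return freed;
}
```

Passing NULL for the bitmap mirrors how the patch keeps ext4_free_data() as a
thin wrapper around the COW-aware variant.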


2011-05-09 16:44:22

by Amir G.

Subject: [PATCH RFC 23/30] ext4: snapshot journaled - trace COW/buffer credits

From: Amir Goldstein <[email protected]>

Extra debug prints to trace snapshot usage of buffer credits.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4_jbd2.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 26 ++++++++++++++++
fs/ext4/super.c | 2 +
3 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index e8287f4..eb88564 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -23,6 +23,7 @@ int __ext4_handle_get_bitmap_access(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
}
return err;
}
@@ -40,6 +41,7 @@ int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__, bh,
handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
}
return err;
}
@@ -91,6 +93,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
return err;
}
return 0;
@@ -108,6 +111,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
"error %d when attempting revoke", err);
}
BUFFER_TRACE(bh, "exit");
+ ext4_journal_trace(SNAP_DEBUG, where, handle, 1);
return err;
}

@@ -123,6 +127,7 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ ext4_journal_trace(SNAP_DEBUG, where, handle, -1);
}
return err;
}
@@ -198,6 +203,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
}
jbd_unlock_bh_state(bh);
}
+ ext4_journal_trace(SNAP_DEBUG, where, handle, -1);
} else {
if (inode)
mark_buffer_dirty_inode(bh, inode);
@@ -236,3 +242,77 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
sb->s_dirt = 1;
return err;
}
+
+#ifdef CONFIG_JBD2_DEBUG
+static void ext4_journal_cow_stats(int n, handle_t *handle)
+{
+ snapshot_debug(n, "COW stats: moved/copied=%d/%d, "
+ "mapped/bitmap/cached=%d/%d/%d, "
+ "bitmaps/cleared=%d/%d\n", handle->h_cow_moved,
+ handle->h_cow_copied, handle->h_cow_ok_mapped,
+ handle->h_cow_ok_bitmap, handle->h_cow_ok_jh,
+ handle->h_cow_bitmaps, handle->h_cow_excluded);
+}
+#else
+#define ext4_journal_cow_stats(n, handle)
+#endif
+
+#ifdef CONFIG_EXT4_DEBUG
+void __ext4_journal_trace(int n, const char *fn, const char *caller,
+ handle_t *handle, int nblocks)
+{
+ int active_snapshot;
+ int upper;
+ int lower;
+ int final;
+ struct super_block *sb;
+
+ sb = handle->h_transaction->t_journal->j_private;
+ if (!EXT4_SNAPSHOTS(sb))
+ return;
+
+ active_snapshot = ext4_snapshot_active(EXT4_SB(sb));
+ upper = EXT4_SNAPSHOT_START_TRANS_BLOCKS(handle->h_base_credits);
+ lower = EXT4_SNAPSHOT_TRANS_BLOCKS(handle->h_user_credits);
+ final = (nblocks == 0 && handle->h_ref == 1 &&
+ !IS_COWING(handle));
+
+ switch (snapshot_enable_debug) {
+ case SNAP_INFO:
+ /* trace final journal_stop if any credits have been used */
+ if (final && (handle->h_buffer_credits < upper ||
+ handle->h_user_credits < handle->h_base_credits))
+ break;
+ case SNAP_WARN:
+ /*
+ * trace if buffer credits are too low - lower limit is only
+ * valid if there is an active snapshot and not during COW
+ */
+ if (handle->h_buffer_credits < lower &&
+ active_snapshot && !IS_COWING(handle))
+ break;
+ case SNAP_ERR:
+ /* trace if user credits are too low */
+ if (handle->h_user_credits < 0)
+ break;
+ case 0:
+ /* no trace */
+ return;
+
+ case SNAP_DEBUG:
+ default:
+ /* trace all calls */
+ break;
+ }
+
+ snapshot_debug_l(n, IS_COWING(handle), "%s(%d): credits=%d,"
+ " limit=%d/%d, user=%d/%d, ref=%d, caller=%s\n",
+ fn, nblocks, handle->h_buffer_credits, lower, upper,
+ handle->h_user_credits, handle->h_base_credits,
+ handle->h_ref, caller);
+ if (!final)
+ return;
+
+ ext4_journal_cow_stats(n, handle);
+}
+#endif
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 1e05e2c..ff8dcd8 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -229,10 +229,36 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define ext4_handle_dirty_super(handle, sb) \
__ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))

+/*
+ * macros for ext4 to update transaction COW statistics.
+ * if the kernel was compiled without CONFIG_JBD2_DEBUG
+ * then the h_cow_* fields are not allocated in handle objects.
+ */
+#ifdef CONFIG_JBD2_DEBUG
+#define trace_cow_add(handle, name, num) \
+ (handle)->h_cow_##name += (num)
+#define trace_cow_inc(handle, name) \
+ (handle)->h_cow_##name++
+
+#else
#define trace_cow_add(handle, name, num)
#define trace_cow_inc(handle, name)

+#endif
+#ifdef CONFIG_EXT4_DEBUG
+void __ext4_journal_trace(int debug, const char *fn, const char *caller,
+ handle_t *handle, int nblocks);
+
+#define ext4_journal_trace(n, caller, handle, nblocks) \
+ do { \
+ if ((n) <= snapshot_enable_debug) \
+ __ext4_journal_trace((n), __func__, (caller), \
+ (handle), (nblocks)); \
+ } while (0)
+
+#else
#define ext4_journal_trace(n, caller, handle, nblocks)
+#endif

handle_t *__ext4_journal_start(const char *where,
struct super_block *sb, int nblocks);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0bde939..9f68ed1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -308,6 +308,8 @@ int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
int err;
int rc;

+ ext4_journal_trace(SNAP_WARN, where, handle, 0);
+
if (!ext4_handle_valid(handle)) {
ext4_put_nojournal(handle);
return 0;
--
1.7.0.4
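
The tracing macro in the patch above follows a common pattern: a cheap level
check in the macro, the expensive formatting in an out-of-line function, and an
empty definition when the feature is compiled out. A minimal userspace sketch
of that pattern follows. All names here (trace_level, do_trace, journal_trace,
TRACE_COMPILED_OUT) are illustrative, and the numeric SNAP_* values are assumed
for the example only.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed ordering: a higher number means more verbose. */
enum { SNAP_ERR = 1, SNAP_WARN = 2, SNAP_INFO = 3, SNAP_DEBUG = 4 };

static int trace_level;		/* runtime knob, stands in for snapshot_enable_debug */
static int trace_emitted;	/* number of messages that passed the check */

static void do_trace(int level, const char *fn, int nblocks)
{
	trace_emitted++;
	fprintf(stderr, "[%d] %s(%d)\n", level, fn, nblocks);
}

#ifndef TRACE_COMPILED_OUT
/* do { } while (0) keeps the macro safe inside if/else without braces. */
#define journal_trace(n, nblocks)				\
	do {							\
		if ((n) <= trace_level)				\
			do_trace((n), __func__, (nblocks));	\
	} while (0)
#else
#define journal_trace(n, nblocks) do { } while (0)
#endif

/* Emit one WARN-level and one DEBUG-level trace; report how many got out. */
int demo_trace(int level)
{
	trace_level = level;
	trace_emitted = 0;
	journal_trace(SNAP_WARN, 0);
	journal_trace(SNAP_DEBUG, 1);
	return trace_emitted;
}
```

With the level set to SNAP_WARN only the first call is emitted; with SNAP_DEBUG
both are.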


2011-05-09 16:44:37

by Amir G.

Subject: [PATCH RFC 30/30] ext4: snapshot rocompat - enable rw mount

From: Amir Goldstein <[email protected]>

Enable read-write mount of a filesystem with the has_snapshot feature only
if ext4 was compiled with snapshot support.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 10 ++++++++++
2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 07629ce..143c607 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1494,6 +1494,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_INCOMPAT_FLEX_BG)
#define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
+ EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT| \
EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 68c4d18..b8ae94d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2680,6 +2680,16 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
"must run fsck -xy to make it writable.");
return 0;
}
+ } else if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT)) {
+ /*
+ * We get here when CONFIG_EXT4_FS_SNAPSHOT is not defined
+ * so EXT4_SNAPSHOTS(sb) is defined to (0)
+ */
+ ext4_msg(sb, KERN_ERR,
+ "Filesystem with has_snapshot feature cannot be "
+ "mounted RDWR without CONFIG_EXT4_FS_SNAPSHOT");
+ return 0;
}
return 1;
}
--
1.7.0.4
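
The mount decision made by the patch above can be summarized in a small sketch:
a read-only mount tolerates unknown RO_COMPAT bits, a read-write mount does
not, and has_snapshot additionally needs snapshot support built in. The
function name feature_set_ok_sketch and the snapshots_built parameter are
hypothetical, and the value used for the has_snapshot bit is an assumption made
for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RO_COMPAT_SPARSE_SUPER	0x0001u
#define RO_COMPAT_LARGE_FILE	0x0002u
#define RO_COMPAT_HAS_SNAPSHOT	0x0080u	/* assumed value, illustration only */

/* Bits this build knows how to keep consistent when writing. */
#define RO_COMPAT_SUPP	(RO_COMPAT_SPARSE_SUPER | RO_COMPAT_LARGE_FILE | \
			 RO_COMPAT_HAS_SNAPSHOT)

/* Returns 1 if the mount may proceed, 0 if it must be refused. */
int feature_set_ok_sketch(uint32_t ro_compat, int readonly, int snapshots_built)
{
	if (readonly)
		return 1;	/* RO_COMPAT bits never block a read-only mount */
	if (ro_compat & ~RO_COMPAT_SUPP)
		return 0;	/* unknown feature: writing could corrupt it */
	if ((ro_compat & RO_COMPAT_HAS_SNAPSHOT) && !snapshots_built)
		return 0;	/* bit is known, but the COW code is not built in */
	return 1;
}
```

The third check corresponds to the new else-if branch: the feature bit is in
the supported mask, so the build without CONFIG_EXT4_FS_SNAPSHOT has to reject
the read-write mount explicitly.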


2011-05-09 16:44:24

by Amir G.

Subject: [PATCH RFC 24/30] ext4: snapshot list support

From: Amir Goldstein <[email protected]>

Implementation of multiple incremental snapshots.
Snapshot inodes are chained on a list starting at the super block,
both on-disk and in-memory, similar to the orphan inodes.
Unlink and truncate of snapshot inodes on the list is not allowed,
so an inode can never be chained on both orphan and snapshot lists.
We make use of this fact to overload the in-memory inode field
ext4_inode_info.i_orphan for the chaining of snapshots.

Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 884033f..a7bb8ed 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1207,6 +1207,7 @@ struct ext4_sb_info {
struct block_device *journal_bdev;
struct mutex s_snapshot_mutex; /* protects 2 fields below: */
struct inode *s_active_snapshot; /* [ s_snapshot_mutex ] */
+ struct list_head s_snapshot_list; /* [ s_snapshot_mutex ] */
#ifdef CONFIG_JBD2_DEBUG
struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */
wait_queue_head_t ro_wait_queue; /* For people waiting for the fs to go read-only */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9f68ed1..461653c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3496,6 +3496,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

mutex_init(&sbi->s_snapshot_mutex);
sbi->s_active_snapshot = NULL;
+ INIT_LIST_HEAD(&sbi->s_snapshot_list); /* snapshot files */

needs_recovery = (es->s_last_orphan != 0 ||
EXT4_HAS_INCOMPAT_FEATURE(sb,
--
1.7.0.4
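
The in-memory chaining described in the patch above uses the kernel's intrusive
list_head. For readers who have not used it, here is a minimal userspace sketch
of the same idea: the list node lives inside the object, and the head lives in
the superblock info. All names (list_node, snap, snap_of and the helper
functions) are illustrative, and adding the newest snapshot at the head is an
assumption of this sketch, not something the patch shown here states.

```c
#include <assert.h>
#include <stddef.h>

/* Intrusive doubly linked list node, in the spirit of struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Stand-in for a snapshot inode: the list node is embedded in the object. */
struct snap {
	unsigned ino;
	struct list_node link;
};

/* Recover the containing struct snap from a pointer to its link member. */
#define snap_of(ptr) \
	((struct snap *)((char *)(ptr) - offsetof(struct snap, link)))

/* An empty list is a head that points at itself (cf. INIT_LIST_HEAD). */
void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert node right after head, so the most recent entry comes first. */
void list_add_head(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Inode number of the first entry, or 0 if the list is empty. */
unsigned newest_snapshot_ino(struct list_node *head)
{
	return head->next == head ? 0 : snap_of(head->next)->ino;
}

int snapshot_count(const struct list_node *head)
{
	int n = 0;

	for (const struct list_node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

Because an inode embeds only one such node, it can sit on one list at a time,
which is why the commit message stresses that a snapshot inode is never on the
orphan list while it is on the snapshot list.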


2011-06-02 11:47:20

by Amir G.

Subject: Re: [PATCH RFC 13/30] ext4: snapshot file - increase maximum file size limit to 16TB

Let it not be said that my patches don't get reviewed ;-)

Ted,

I have asked the GitHub support team to change the fork parent of my
ext4-snapshots repo to your ext4 repo.
When they do that, I can send you a "Pull Request" for the patch
series via GitHub.
You will then be able to review and comment on the patches inline (and online).
Until you merge my patches, you will be able to view the request
in your "Pull Requests" inbox, with all the comments you made on the patches
in the request.
Since you won't be merging the v1 series, you will have the v1 comments
as a reference in your inbox when I send a pull request for a
different v2 branch.

Let's try to give that workflow a chance.
Amir.


On Mon, May 9, 2011 at 7:41 PM, <[email protected]> wrote:
> From: Amir Goldstein <[email protected]>
>
> Files larger than 2TB use Ext4 huge_file flag to store i_blocks
> in file system blocks units, so the upper limit on snapshot actual
> size is increased from 512*2^32 = 2TB to 4K*2^32 = 16TB,
> which is also the upper limit on file system size.
> To map 2^32 logical blocks, 4 triple indirect blocks are used instead
> of just one. The extra 3 triple indirect blocks are stored in place
> of direct blocks, which are not in use by snapshot files.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> Signed-off-by: Yongqiang Yang <[email protected]>
> ---
>  fs/ext4/ext4.h  |   13 +++++++++++++
>  fs/ext4/inode.c |   44 ++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/super.c |    5 ++++-
>  3 files changed, 59 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4072036..8f59322 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -333,6 +333,19 @@ struct flex_groups {
>  #define EXT4_DIND_BLOCK                 (EXT4_IND_BLOCK + 1)
>  #define EXT4_TIND_BLOCK                 (EXT4_DIND_BLOCK + 1)
>  #define EXT4_N_BLOCKS                   (EXT4_TIND_BLOCK + 1)
> +/*
> + * Snapshot files have different indirection mapping that can map up to 2^32
> + * logical blocks, so they can cover the mapped filesystem block address space.
> + * Ext4 must use either 4K or 8K blocks (depending on PAGE_SIZE).
> + * With 8K blocks, 1 triple indirect block maps 2^33 logical blocks.
> + * With 4K blocks (the system default), each triple indirect block maps 2^30
> + * logical blocks, so 4 triple indirect blocks map 2^32 logical blocks.
> + * Snapshot files in small filesystems (<= 4G), use only 1 double indirect
> + * block to map the entire filesystem.
> + */
> +#define EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS 3
> +#define EXT4_SNAPSHOT_N_BLOCKS          (EXT4_TIND_BLOCK + 1 + \
> +                                        EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS)
>
>  /*
>   * Inode flags
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index db1706f..425dabb 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -335,6 +335,7 @@ static int ext4_block_to_path(struct inode *inode,
>                 double_blocks = (1 << (ptrs_bits * 2));
>         int n = 0;
>         int final = 0;
> +       int tind;
>
>         if (i_block < direct_blocks) {
>                 offsets[n++] = i_block;
> @@ -354,6 +355,18 @@ static int ext4_block_to_path(struct inode *inode,
>                 offsets[n++] = (i_block >> ptrs_bits) & (ptrs - 1);
>                 offsets[n++] = i_block & (ptrs - 1);
>                 final = ptrs;
> +       } else if (ext4_snapshot_file(inode) &&
> +                       (i_block >> (ptrs_bits * 3)) <
> +                       EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS + 1) {
> +               tind = i_block >> (ptrs_bits * 3);
> +               BUG_ON(tind == 0);
> +               /* use up to 4 triple indirect blocks to map 2^32 blocks */
> +               i_block -= (tind << (ptrs_bits * 3));
> +               offsets[n++] = (EXT4_TIND_BLOCK + tind) % EXT4_NDIR_BLOCKS;
> +               offsets[n++] = i_block >> (ptrs_bits * 2);
> +               offsets[n++] = (i_block >> ptrs_bits) & (ptrs - 1);
> +               offsets[n++] = i_block & (ptrs - 1);
> +               final = ptrs;
>         } else {
>                 ext4_warning(inode->i_sb, "block %lu > max in inode %lu",
>                              i_block + direct_blocks +
> @@ -4748,6 +4761,13 @@ void ext4_truncate(struct inode *inode)
>         ext4_lblk_t last_block, max_block;
>         unsigned blocksize = inode->i_sb->s_blocksize;
>
> +       /* prevent partial truncate of snapshot files */
> +       if (ext4_snapshot_file(inode) && inode->i_size != 0) {
> +               snapshot_debug(1, "snapshot file (%lu) cannot be partly "
> +                               "truncated!\n", inode->i_ino);
> +               return;
> +       }
> +
>         /* prevent truncate of files on snapshot list */
>         if (ext4_snapshot_list(inode)) {
>                 snapshot_debug(1, "snapshot (%u) cannot be truncated!\n",
> @@ -4861,6 +4881,10 @@ do_indirects:
>         /* Kill the remaining (whole) subtrees */
>         switch (offsets[0]) {
>         default:
> +               if (ext4_snapshot_file(inode) &&
> +                               offsets[0] < EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS)
> +                       /* Freeing snapshot extra tind branches */
> +                       break;
>                 nr = i_data[EXT4_IND_BLOCK];
>                 if (nr) {
>                         ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 1);
> @@ -4882,6 +4906,19 @@ do_indirects:
>                 ;
>         }
>
> +       if (ext4_snapshot_file(inode)) {
> +               int i;
> +
> +               /* Kill the remaining snapshot file triple indirect trees */
> +               for (i = 0; i < EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS; i++) {
> +                       nr = i_data[i];
> +                       if (!nr)
> +                               continue;
> +                       ext4_free_branches(handle, inode, NULL, &nr, &nr+1, 3);
> +                       i_data[i] = 0;
> +               }
> +       }
> +
>  out_unlock:
>         up_write(&ei->i_data_sem);
>         inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
> @@ -5114,7 +5151,8 @@ static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
>         struct super_block *sb = inode->i_sb;
>
>         if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
> -                               EXT4_FEATURE_RO_COMPAT_HUGE_FILE)) {
> +                               EXT4_FEATURE_RO_COMPAT_HUGE_FILE) ||
> +                       ext4_snapshot_file(inode)) {
>                 /* we are using combined 48 bit field */
>                 i_blocks = ((u64)le16_to_cpu(raw_inode->i_blocks_high)) << 32 |
>                                         le32_to_cpu(raw_inode->i_blocks_lo);
> @@ -5353,7 +5391,9 @@ static int ext4_inode_blocks_set(handle_t *handle,
>                 ext4_clear_inode_flag(inode, EXT4_INODE_HUGE_FILE);
>                 return 0;
>         }
> -       if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_HUGE_FILE))
> +       /* snapshot files may be represented as huge files */
> +       if (!EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_HUGE_FILE) &&
> +                       !ext4_snapshot_file(inode))
>                 return -EFBIG;
>
>         if (i_blocks <= 0xffffffffffffULL) {
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index e3ebd7d..d26831a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2316,7 +2316,7 @@ static loff_t ext4_max_bitmap_size(int bits, int has_huge_files)
>
>         res += 1LL << (bits-2);
>         res += 1LL << (2*(bits-2));
> -       res += 1LL << (3*(bits-2));
> +       res += (1LL + EXT4_SNAPSHOT_EXTRA_TIND_BLOCKS) << (3*(bits-2));

This is plain wrong.
s_bitmap_maxbytes should not be affected by snapshots.
Instead, the maximum size of snapshot files should be enforced by s_maxbytes
and not by s_bitmap_maxbytes.

>         res <<= bits;
>         if (res > upper_limit)
>                 res = upper_limit;
> @@ -3259,6 +3259,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
>         has_huge_files = EXT4_HAS_RO_COMPAT_FEATURE(sb,
>                                 EXT4_FEATURE_RO_COMPAT_HUGE_FILE);
> +       if (EXT4_SNAPSHOTS(sb))
> +               /* Snapshot files are huge files */
> +               has_huge_files = 1;

This should be moved below the "sbi->s_bitmap_maxbytes = ..." assignment,
so that it only affects s_maxbytes.

>         sbi->s_bitmap_maxbytes = ext4_max_bitmap_size(sb->s_blocksize_bits,
>                                                       has_huge_files);
>         sb->s_maxbytes = ext4_max_size(sb->s_blocksize_bits, has_huge_files);
> --
> 1.7.0.4
>
>
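
The size limits quoted in the commit message above can be checked with a few
lines of arithmetic. This is only a worked example, assuming 4-byte block
pointers and a 32-bit block counter; the helper names tind_blocks and
max_bytes_32bit_count are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Logical blocks mapped by one triple indirect block.
 * A block of 2^blocksize_bits bytes holds 2^(blocksize_bits - 2)
 * 4-byte pointers, and a triple indirect tree has three such levels.
 */
uint64_t tind_blocks(unsigned blocksize_bits)
{
	unsigned ptrs_bits = blocksize_bits - 2;

	return 1ULL << (3 * ptrs_bits);
}

/*
 * Largest size describable by a 32-bit count of units,
 * where one unit is 2^unit_bits bytes.
 */
uint64_t max_bytes_32bit_count(unsigned unit_bits)
{
	return (1ULL << 32) << unit_bits;
}
```

For 4K blocks, tind_blocks(12) is 2^30, so 4 triple indirect blocks map 2^32
logical blocks. Counting in 512-byte sectors gives 2^41 bytes (2TB), while
counting in 4K filesystem blocks gives 2^44 bytes (16TB).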

2011-06-02 11:52:39

by Amir G.

Subject: Re: [PATCH RFC 09/30] ext4: snapshot file

Naturally, I have already fixed this issue, but in case someone was
going to review this patch, I just wanted to point it out.

On Mon, May 9, 2011 at 7:41 PM, <[email protected]> wrote:
> From: Amir Goldstein <[email protected]>
>
> Ext4 snapshot implementation as a file inside the file system.
> Snapshot files are marked with the snapfile flag and have special
> read-only address space ops.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> Signed-off-by: Yongqiang Yang <[email protected]>
> ---
>  fs/ext4/ext4.h      |   83 +++++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/ext4/ext4_jbd2.h |    2 +
>  fs/ext4/ialloc.c    |    8 ++++-
>  fs/ext4/inode.c     |   29 ++++++++++++++++++
>  fs/ext4/super.c     |    9 +++++
>  5 files changed, 126 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 013eec2..4072036 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -361,17 +361,23 @@ struct flex_groups {
>  #define EXT4_EXTENTS_FL                 0x00080000 /* Inode uses extents */
>  #define EXT4_EA_INODE_FL                0x00200000 /* Inode used for large EA */
>  #define EXT4_EOFBLOCKS_FL               0x00400000 /* Blocks allocated beyond EOF */
> +/* snapshot persistent flags */
> +#define EXT4_SNAPFILE_FL                0x01000000 /* snapshot file */
> +#define EXT4_SNAPFILE_DELETED_FL        0x04000000 /* snapshot is deleted */
> +#define EXT4_SNAPFILE_SHRUNK_FL         0x08000000 /* snapshot was shrunk */
> +/* end of snapshot flags */
>  #define EXT4_RESERVED_FL                0x80000000 /* reserved for ext4 lib */
>
> -#define EXT4_FL_USER_VISIBLE            0x004BDFFF /* User visible flags */
> -#define EXT4_FL_USER_MODIFIABLE         0x004B80FF /* User modifiable flags */
> +
> +#define EXT4_FL_USER_VISIBLE            0x014BDFFF /* User visible flags */
> +#define EXT4_FL_USER_MODIFIABLE         0x014B80FF /* User modifiable flags */
>
>  /* Flags that should be inherited by new inodes from their parent. */
>  #define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
>                            EXT4_SYNC_FL | EXT4_IMMUTABLE_FL | EXT4_APPEND_FL |\
>                            EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
>                            EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
> -                          EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL)
> +                          EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL | EXT4_SNAPFILE_FL)
>
>  /* Flags that are appropriate for regular files (all but dir-specific ones). */
>  #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
> @@ -418,6 +424,9 @@ enum {
>         EXT4_INODE_EXTENTS      = 19,   /* Inode uses extents */
>         EXT4_INODE_EA_INODE     = 21,   /* Inode used for large EA */
>         EXT4_INODE_EOFBLOCKS    = 22,   /* Blocks allocated beyond EOF */
> +       EXT4_INODE_SNAPFILE     = 24,   /* Snapshot file/dir */
> +       EXT4_INODE_SNAPFILE_DELETED = 26,       /* Snapshot is deleted */
> +       EXT4_INODE_SNAPFILE_SHRUNK = 27,        /* Snapshot was shrunk */
>         EXT4_INODE_RESERVED     = 31,   /* reserved for ext4 lib */
>  };
>
> @@ -464,6 +473,9 @@ static inline void ext4_check_flag_values(void)
>         CHECK_FLAG_VALUE(EXTENTS);
>         CHECK_FLAG_VALUE(EA_INODE);
>         CHECK_FLAG_VALUE(EOFBLOCKS);
> +       CHECK_FLAG_VALUE(SNAPFILE);
> +       CHECK_FLAG_VALUE(SNAPFILE_DELETED);
> +       CHECK_FLAG_VALUE(SNAPFILE_SHRUNK);
>         CHECK_FLAG_VALUE(RESERVED);
>  }
>
> @@ -803,6 +815,14 @@ struct ext4_inode_info {
>         struct list_head i_orphan;      /* unlinked but open inodes */
>
>         /*
> +        * In-memory snapshot list overrides i_orphan to link snapshot inodes,
> +        * but unlike the real orphan list, the next snapshot inode number
> +        * is stored in i_next_snapshot_ino and not in i_dtime
> +        */
> +#define i_snaplist i_orphan
> +       __u32   i_next_snapshot_ino;
> +
> +       /*
>          * i_disksize keeps track of what the inode size is ON DISK, not
>          * in memory.  During truncate, i_size is set to the new size by
>          * the VFS prior to calling ext4_truncate(), but the filesystem won't
> @@ -1158,6 +1178,8 @@ struct ext4_sb_info {
>         u32 s_max_batch_time;
>         u32 s_min_batch_time;
>         struct block_device *journal_bdev;
> +       struct mutex s_snapshot_mutex;          /* protects 2 fields below: */
> +       struct inode *s_active_snapshot;        /* [ s_snapshot_mutex ] */
>  #ifdef CONFIG_JBD2_DEBUG
>         struct timer_list turn_ro_timer;        /* For turning read-only (crash simulation) */
>         wait_queue_head_t ro_wait_queue;        /* For people waiting for the fs to go read-only */
> @@ -1274,8 +1296,31 @@ enum {
>         EXT4_STATE_DIO_UNWRITTEN,       /* need convert on dio done*/
>         EXT4_STATE_NEWENTRY,            /* File just added to dir */
>         EXT4_STATE_DELALLOC_RESERVED,   /* blks already reserved for delalloc */
> +       EXT4_STATE_LAST
>  };
>
> +/*
> + * Snapshot dynamic state flags (starting at offset EXT4_STATE_LAST)
> + * These flags are read by GETSNAPFLAGS ioctl and interpreted by the lssnap
> + * utility.  Do not change these values.
> + */
> +enum {
> +       EXT4_SNAPSTATE_LIST = 0,        /* snapshot is on list (S) */
> +       EXT4_SNAPSTATE_ENABLED = 1,     /* snapshot is enabled (n) */
> +       EXT4_SNAPSTATE_ACTIVE = 2,      /* snapshot is active  (a) */
> +       EXT4_SNAPSTATE_INUSE = 3,       /* snapshot is in-use  (p) */
> +       EXT4_SNAPSTATE_DELETED = 4,     /* snapshot is deleted (s) */
> +       EXT4_SNAPSTATE_SHRUNK = 5,      /* snapshot was shrunk (h) */
> +       EXT4_SNAPSTATE_OPEN = 6,        /* snapshot is mounted (o) */
> +       EXT4_SNAPSTATE_TAGGED = 7,      /* snapshot is tagged  (t) */
> +       EXT4_SNAPSTATE_LAST
> +};
> +
> +#define EXT4_SNAPSTATE_MASK \
> +       ((1UL << EXT4_SNAPSTATE_LAST) - 1)
> +
> +
> +/* atomic single bit funcs */
>  #define EXT4_INODE_BIT_FNS(name, field, offset) \
>  static inline int ext4_test_inode_##name(struct inode *inode, int bit) \
>  { \
> @@ -1290,9 +1335,28 @@ static inline void ext4_clear_inode_##name(struct inode *inode, int bit) \
>         clear_bit(bit + (offset), &EXT4_I(inode)->i_##field); \
>  }
>
> +/* non-atomic multi bit funcs */
> +#define EXT4_INODE_FLAGS_FNS(name, field, offset) \
> +static inline int ext4_get_##name##_flags(struct inode *inode) \
> +{ \
> +       return EXT4_I(inode)->i_##field >> (offset); \
> +} \
> +static inline void ext4_set_##name##_flags(struct inode *inode, \
> +                                               unsigned long flags) \
> +{ \
> +       EXT4_I(inode)->i_##field |= (flags << (offset)); \
> +} \
> +static inline void ext4_clear_##name##_flags(struct inode *inode, \
> +                                               unsigned long flags) \
> +{ \
> +       EXT4_I(inode)->i_##field &= ~(flags << (offset)); \
> +}
> +

This was not a good idea!
We must not use non-atomic bit ops to set bits in the inode's flags
fields, which other tasks set with atomic bit ops.
We can do without the set and clear funcs, so remove them
and use the atomic bit ops funcs instead.
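
To illustrate the hazard, here is a minimal userspace sketch. It uses C11
atomics as a stand-in for the kernel's set_bit()/clear_bit(), so it is not
the code that went into the patches. The names flags_word, STATE_LAST and
the three helpers are hypothetical; only the snapshot state bit positions
are taken from the enum quoted above. Unlike the getter in the patch, this
one also masks off any bits above the snapshot state range.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumption: three ordinary state bits precede the snapshot state bits. */
enum { STATE_LAST = 3 };
enum {
	SNAPSTATE_LIST = 0,
	SNAPSTATE_ENABLED = 1,
	SNAPSTATE_ACTIVE = 2,
	SNAPSTATE_LAST = 8
};
#define SNAPSTATE_MASK ((1UL << SNAPSTATE_LAST) - 1)

/* Hypothetical stand-in for the shared per-inode state flags word. */
static _Atomic unsigned long flags_word;

/* Atomic single-bit set: a concurrent update of another bit is not lost. */
static void set_snapstate_bit(int bit)
{
	assert(bit >= 0 && bit < SNAPSTATE_LAST);
	atomic_fetch_or(&flags_word, 1UL << (bit + STATE_LAST));
}

/* Atomic single-bit clear, the counterpart of the setter above. */
static void clear_snapstate_bit(int bit)
{
	assert(bit >= 0 && bit < SNAPSTATE_LAST);
	atomic_fetch_and(&flags_word, ~(1UL << (bit + STATE_LAST)));
}

/* Reading all snapshot bits at once is a plain load, no read-modify-write. */
static unsigned long get_snapstate_flags(void)
{
	return (atomic_load(&flags_word) >> STATE_LAST) & SNAPSTATE_MASK;
}
```

A plain "word |= mask" is a separate load, OR and store, so a bit that
another task sets atomically between the load and the store can be
overwritten. That is why only the multi-bit getter is safe to keep.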

>  EXT4_INODE_BIT_FNS(flag, flags, 0)
>  #if (BITS_PER_LONG < 64)
>  EXT4_INODE_BIT_FNS(state, state_flags, 0)
> +EXT4_INODE_BIT_FNS(snapstate, state_flags, EXT4_STATE_LAST)
> +EXT4_INODE_FLAGS_FNS(snapstate, state_flags, EXT4_STATE_LAST)
>
>  static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>  {
> @@ -1300,6 +1364,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>  }
>  #else
>  EXT4_INODE_BIT_FNS(state, flags, 32)
> +EXT4_INODE_BIT_FNS(snapstate, flags, 32 + EXT4_STATE_LAST)
> +EXT4_INODE_FLAGS_FNS(snapstate, flags, 32 + EXT4_STATE_LAST)
>
>  static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>  {
> @@ -1314,6 +1380,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>  #endif
>
>  #define NEXT_ORPHAN(inode) EXT4_I(inode)->i_dtime
> +#define NEXT_SNAPSHOT(inode) (EXT4_I(inode)->i_next_snapshot_ino)
>
>  /*
>   * Codes for operating systems
> @@ -1781,6 +1848,10 @@ extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>  extern qsize_t *ext4_get_reserved_space(struct inode *inode);
>  extern void ext4_da_update_reserve_space(struct inode *inode,
>                                          int used, int quota_claim);
> +
> +/* snapshot_inode.c */
> +extern int ext4_snapshot_readpage(struct file *file, struct page *page);
> +
>  /* ioctl.c */
>  extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
>  extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
> @@ -2004,6 +2075,12 @@ struct ext4_group_info {
>         void            *bb_bitmap;
>  #endif
>         struct rw_semaphore alloc_sem;
> +       /*
> +        * bg_cow_bitmap is reset to zero on mount time and on every snapshot
> +        * take and initialized lazily on first block group write access.
> +        * bg_cow_bitmap is protected by sb_bgl_lock().
> +        */
> +       unsigned long bg_cow_bitmap;    /* COW bitmap cache */
>         ext4_grpblk_t   bb_counters[];  /* Nr of free power-of-two-block
>                                          * regions, index is order.
>                                          * bb_counters[3] = 5 means
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index ea3a0a0..e0fef0d 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -369,6 +369,8 @@ static inline int ext4_snapshot_should_move_data(struct inode *inode)
>                 return 0;
>         if (EXT4_JOURNAL(inode) == NULL)
>                 return 0;
> +       if (ext4_snapshot_excluded(inode))
> +               return 0;
>         /* when a data block is journaled, it is already COWed as metadata */
>         if (ext4_should_journal_data(inode))
>                 return 0;
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 831d49a..ba928a7 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1048,8 +1048,12 @@ got:
>                 goto fail_free_drop;
>
>         if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_EXTENTS)) {
> -               /* set extent flag only for directory, file and normal symlink*/
> -               if (S_ISDIR(mode) || S_ISREG(mode) || S_ISLNK(mode)) {
> +               /*
> +                * Set extent flag only for non-snapshot file, directory
> +                * and normal symlink
> +                */
> +               if ((S_ISREG(mode) && !ext4_snapshot_file(inode)) ||
> +                               S_ISDIR(mode) || S_ISLNK(mode)) {
>                         ext4_set_inode_flag(inode, EXT4_INODE_EXTENTS);
>                         ext4_ext_tree_init(handle, inode);
>                 }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 866ac36..4ec5f02 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4162,9 +4162,38 @@ static const struct address_space_operations ext4_da_aops = {
>         .is_partially_uptodate  = block_is_partially_uptodate,
>         .error_remove_page      = generic_error_remove_page,
>  };
> +static int ext4_no_writepage(struct page *page,
> +                               struct writeback_control *wbc)
> +{
> +       unlock_page(page);
> +       return -EIO;
> +}
> +
> +/*
> + * Snapshot file page operations:
> + * always readpage (by page) with buffer tracked read.
> + * user cannot writepage or direct_IO to a snapshot file.
> + *
> + * snapshot file pages are written to disk after a COW operation in "ordered"
> + * mode and are never changed after that again, so there is no data corruption
> + * risk when using "ordered" mode on snapshot files.
> + * some snapshot data pages are written to disk by sync_dirty_buffer(), namely
> + * the snapshot COW bitmaps and a few initial blocks copied on snapshot_take().
> + */
> +static const struct address_space_operations ext4_snapfile_aops = {
> +       .readpage               = ext4_readpage,
> +       .readpages              = ext4_readpages,
> +       .writepage              = ext4_no_writepage,
> +       .bmap                   = ext4_bmap,
> +       .invalidatepage         = ext4_invalidatepage,
> +       .releasepage            = ext4_releasepage,
> +};
>
>  void ext4_set_aops(struct inode *inode)
>  {
> +       if (ext4_snapshot_file(inode))
> +               inode->i_mapping->a_ops = &ext4_snapfile_aops;
> +       else
>         if (ext4_should_order_data(inode) &&
>                 test_opt(inode->i_sb, DELALLOC))
>                 inode->i_mapping->a_ops = &ext4_da_aops;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 2c345d1..e3ebd7d 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -745,6 +745,8 @@ static void ext4_put_super(struct super_block *sb)
>         destroy_workqueue(sbi->dio_unwritten_wq);
>
>         lock_super(sb);
> +       if (EXT4_SNAPSHOTS(sb))
> +               ext4_snapshot_destroy(sb);
>         if (sb->s_dirt)
>                 ext4_commit_super(sb, 1);
>
> @@ -3474,6 +3476,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
>         sb->s_root = NULL;
>
> +       mutex_init(&sbi->s_snapshot_mutex);
> +       sbi->s_active_snapshot = NULL;
> +
>         needs_recovery = (es->s_last_orphan != 0 ||
>                           EXT4_HAS_INCOMPAT_FEATURE(sb,
>                                     EXT4_FEATURE_INCOMPAT_RECOVER));
> @@ -3676,6 +3681,10 @@ no_journal:
>                 goto failed_mount4;
>         };
>
> +       if (EXT4_SNAPSHOTS(sb) &&
> +                       ext4_snapshot_load(sb, es, sb->s_flags & MS_RDONLY))
> +               /* XXX: how can we fail and force read-only at this point? */
> +               ext4_error(sb, "load snapshot failed\n");
>         EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
>         ext4_orphan_cleanup(sb, es);
>         EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
> --
> 1.7.0.4
>
>

2011-06-03 00:48:40

by Theodore Ts'o

Subject: Re: [PATCH RFC 13/30] ext4: snapshot file - increase maximum file size limit to 16TB

On Thu, Jun 02, 2011 at 02:47:19PM +0300, Amir G. wrote:
> Let it not be said that my patches don't get reviewed ;-)

Between the merge window work, and the fact that I'm in Yokohama this
week attending the Linux Con Japan conference, and in two weeks I will
be in Portland for Usenix, it's unlikely I will have any time to look
at your snapshot patches until the end of the month. Sorry, but
things have just been crazy as of late.

I had been hoping that other people would have had time to look at
your patches, but I suspect this has been a busy time for other folks
as well.

I'm certainly willing to try using the github workflow, although the
fact that it requires an on-line network connection is a little
inconvenient when one is on airplanes as much as I have been as of
late. But it may be that the upsides more than compensate for this
one disadvantage.

Please do send a pointer to the patches to be reviewed on github to
the ext4 list, so that hopefully others on this list will be able to
look at your patches.

Thanks,

- Ted

2011-06-03 04:45:37

by Amir G.

Subject: Re: [PATCH RFC 13/30] ext4: snapshot file - increase maximum file size limit to 16TB

On Fri, Jun 3, 2011 at 3:48 AM, Ted Ts'o <[email protected]> wrote:
> On Thu, Jun 02, 2011 at 02:47:19PM +0300, Amir G. wrote:
>> Let it not be said that my patches don't get reviewed ;-)
>
> Between the merge window work, and the fact that I'm in Yokohama this
> week attending the Linux Con Japan conference, and in two weeks I will
> be in Portland for Usenix, it's unlikely I will have any time to look
> at your snapshot patches until the end of the month.  Sorry, but
> things have just been crazy as of late.
>
> I had been hoping that other people would have had time to look at
> your patches, but I suspect this has been a busy time for other folks
> as well.
>
> I'm certainly willing to try using the github workflow, although the
> fact that it requires an on-line network connection is a little
> inconvenient when one is on airplanes as much as I have been as of
> late.  But it may be that the upsides more than compensate for this
> one disadvantage.
>
> Please do send a pointer to the patches to be reviewed on github to
> the ext4 list, so that hopefully others on this list will be able to
> look at your patches.
>
> Thanks,
>
> - Ted
>

Hi All,

Please see my patches online at:
https://github.com/amir73il/ext4-snapshots/commits/for-ext4/

You'd be surprised how easy (and addictive) it is to review patches this way.
Just click on one of the commits, scroll down and click a line to insert a
comment (which I get by email and can reply to by email, and our thread shows
up embedded in the commit on the website).

Feel free to just leave "XXX was here" so everyone can see how it works.

Thanks,
Amir.

2011-06-06 13:08:41

by Lukas Czerner

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Mon, 9 May 2011, [email protected] wrote:
>
> MERGING
> -------
> These patches are based on Ted's current master branch + alloc_semp removal
> patches. Although the alloc_semp removal is an independent (and in my eyes
> a good) change, it is also required by snapshot patches, to avoid circular
> locking dependency during COW allocations.
>
> Merging with Allison's punch holes patches should be straight forward, since
> the hard part, namely Yongqiang's split extent refactoring patches, was
> already merged by Ted.
>
> Merging with Ted's big alloc patches is going to be a bit more challenging,
> since big alloc patches make a lot of renaming and refactoring. However,
> since has_snapshots and big_alloc features will never work together,
> at least testing the code together is not a big concern.

Hi Amir,

what is the reason for the snapshots to never work with big_alloc ? Just
out of curiosity.

>
> TESTING
> -------
> Apart from the extensive testing for the snapshots feature functionality, we
> also ran xfstests with snapshots and while taking a snapshot every 1 minute.

Since a lot of the tests are actually shorter than one minute, you miss a
lot of possible concurrent operations. Is there any reason why not to do
it more often ? Let's say every second ?

> More importantly, we ran xfstests with snapshots support disabled in compile
> time and with snapshot support enabled but without has_snapshot feature.
> These xfstests were run with blocksize 1K and 4K and on X86 and X86_64.
> The 1K blocksize tests are important for the alloc_semp removal patches.
> No problems were found apart from one (test 225 hung), which is already
> existing in master branch.
>
> CREDITS
> -------
> The snapshots patches originate in my implementation of the Next3 filesystem
> for CTERA networks.
> The porting of the Next3 snapshot patches to ext4 patches is attributed to
> Aditya Dani, Shardul Mangade, Piyush Nimbalkar and Harshad Shirwadkar from
> the Pune Institute of Computer Technology (PICT).
> The implementation of extents move-on-write, delayed move-on-write and much
> of the cleanup work on these patches was carried out by Yongqiang Yang from
> the Institute of Computing Technology, Chinese Academy of Sciences.
>
>
> [PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)
> [PATCH RFC 02/30] ext4: snapshot debugging support
> [PATCH RFC 03/30] ext4: snapshot hooks - inside JBD hooks
> [PATCH RFC 04/30] ext4: snapshot hooks - block bitmap access
> [PATCH RFC 05/30] ext4: snapshot hooks - delete blocks
> [PATCH RFC 06/30] ext4: snapshot hooks - move data blocks
> [PATCH RFC 07/30] ext4: snapshot hooks - direct I/O
> [PATCH RFC 08/30] ext4: snapshot hooks - move extent file data blocks
> [PATCH RFC 09/30] ext4: snapshot file
> [PATCH RFC 10/30] ext4: snapshot file - read through to block device
> [PATCH RFC 11/30] ext4: snapshot file - permissions
> [PATCH RFC 12/30] ext4: snapshot file - store on disk
> [PATCH RFC 13/30] ext4: snapshot file - increase maximum file size limit to 16TB
> [PATCH RFC 14/30] ext4: snapshot block operations
> [PATCH RFC 15/30] ext4: snapshot block operation - copy blocks to snapshot
> [PATCH RFC 16/30] ext4: snapshot block operation - move blocks to snapshot
> [PATCH RFC 17/30] ext4: snapshot control
> [PATCH RFC 18/30] ext4: snapshot control - fix new snapshot
> [PATCH RFC 19/30] ext4: snapshot control - reserve disk space for snapshot
> [PATCH RFC 20/30] ext4: snapshot journaled - increase transaction credits
> [PATCH RFC 21/30] ext4: snapshot journaled - implement journal_release_buffer()
> [PATCH RFC 22/30] ext4: snapshot journaled - bypass to save credits
> [PATCH RFC 23/30] ext4: snapshot journaled - trace COW/buffer credits
> [PATCH RFC 24/30] ext4: snapshot list support
> [PATCH RFC 25/30] ext4: snapshot race conditions - concurrent COW operations
> [PATCH RFC 26/30] ext4: snapshot race conditions - tracked reads
> [PATCH RFC 27/30] ext4: snapshot exclude - the exclude bitmap
> [PATCH RFC 28/30] ext4: snapshot cleanup
> [PATCH RFC 29/30] ext4: snapshot cleanup - shrink deleted snapshots
> [PATCH RFC 30/30] ext4: snapshot rocompat - enable rw mount
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

--

2011-06-06 14:32:10

by Amir G.

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Mon, Jun 6, 2011 at 4:08 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 9 May 2011, [email protected] wrote:
>>
>> MERGING
>> -------
>> These patches are based on Ted's current master branch + alloc_semp removal
>> patches. Although the alloc_semp removal is an independent (and in my eyes
>> a good) change, it is also required by snapshot patches, to avoid circular
>> locking dependency during COW allocations.
>>
>> Merging with Allison's punch holes patches should be straight forward, since
>> the hard part, namely Yongqiang's split extent refactoring patches, was
>> already merged by Ted.
>>
>> Merging with Ted's big alloc patches is going to be a bit more challenging,
>> since big alloc patches make a lot of renaming and refactoring. However,
>> since has_snapshots and big_alloc features will never work together,
>> at least testing the code together is not a big concern.
>
> Hi Amir,
>
> what is the reason for the snapshots to never work with big_alloc ? Just
> out of curiosity.
>

For one thing, the snapshot file format is currently an indirect mapped file,
and big_alloc doesn't support indirect mapped files.
I am not saying it cannot be done, but if it is, there would be several
obstacles to cross.

>>
>> TESTING
>> -------
>> Apart from the extensive testing for the snapshots feature functionality, we
>> also ran xfstests with snapshots and while taking a snapshot every 1 minute.
>
> Since a lot of the tests are actually shorter than one minute, you miss a
> lot of possible concurrent operations.

Yes, you are right.
Actually, we ran the 1 snapshot per minute test in parallel with the Phoronix
Test Suite, which runs tests for several minutes on each iteration.
Mixing snapshot takes into xfstests was never done properly, because
a snapshot can only be taken while the filesystem is mounted.
There is a Google Summer of Code project that is going to focus on that task.

> Is there any reason why not to do
> it more often ? Let's say every second ?
>

Sure, it could be done, but a freeze_fs every second will kill most of the
tests and make them behave very differently than usual, so it would probably
be better to take snapshots less often than every second, but to randomize
the times of the snapshot takes, to try to hit different concurrent operations
on different iterations.
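
A randomized schedule along these lines could be sketched as follows. This
is only an illustration of the idea, not part of any test harness: the
function names and the seed/min/max parameters are hypothetical, and the
code that would actually trigger a snapshot take at each time is left out.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Pick the delay (in seconds) before the next snapshot take, roughly
 * uniformly in [min_s, max_s].  rand() is good enough for spreading
 * test timings; it is not meant to be a high quality generator.
 */
static unsigned int next_snapshot_delay(unsigned int min_s, unsigned int max_s)
{
	assert(min_s <= max_s);
	assert(max_s - min_s < RAND_MAX);
	return min_s + (unsigned int)rand() % (max_s - min_s + 1);
}

/*
 * Fill times[] with n cumulative take times.  Using a different seed on
 * each test iteration moves the takes to different points in the test,
 * while the same seed reproduces a schedule that exposed a problem.
 */
static void snapshot_schedule(unsigned int seed, unsigned int min_s,
			      unsigned int max_s, unsigned int *times,
			      size_t n)
{
	unsigned int t = 0;
	size_t i;

	srand(seed);
	for (i = 0; i < n; i++) {
		t += next_snapshot_delay(min_s, max_s);
		times[i] = t;
	}
}
```

Keeping the minimum delay at a few seconds bounds how often the filesystem
is frozen, while the random offset shifts the takes between iterations.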


>> More importantly, we ran xfstests with snapshots support disabled in compile
>> time and with snapshot support enabled but without has_snapshot feature.
>> These xfstests were run with blocksize 1K and 4K and on X86 and X86_64.
>> The 1K blocksize tests are important for the alloc_semp removal patches.
>> No problems were found apart from one (test 225 hung), which is already
>> existing in master branch.
>>
>> CREDITS
>> -------
>> The snapshots patches originate in my implementation of the Next3 filesystem
>> for CTERA networks.
>> The porting of the Next3 snapshot patches to ext4 patches is attributed to
>> Aditya Dani, Shardul Mangade, Piyush Nimbalkar and Harshad Shirwadkar from
>> the Pune Institute of Computer Technology (PICT).
>> The implementation of extents move-on-write, delayed move-on-write and much
>> of the cleanup work on these patches was carried out by Yongqiang Yang from
>> the Institute of Computing Technology, Chinese Academy of Sciences.
>>
>>
>> [PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)
>> [PATCH RFC 02/30] ext4: snapshot debugging support
>> [PATCH RFC 03/30] ext4: snapshot hooks - inside JBD hooks
>> [PATCH RFC 04/30] ext4: snapshot hooks - block bitmap access
>> [PATCH RFC 05/30] ext4: snapshot hooks - delete blocks
>> [PATCH RFC 06/30] ext4: snapshot hooks - move data blocks
>> [PATCH RFC 07/30] ext4: snapshot hooks - direct I/O
>> [PATCH RFC 08/30] ext4: snapshot hooks - move extent file data blocks
>> [PATCH RFC 09/30] ext4: snapshot file
>> [PATCH RFC 10/30] ext4: snapshot file - read through to block device
>> [PATCH RFC 11/30] ext4: snapshot file - permissions
>> [PATCH RFC 12/30] ext4: snapshot file - store on disk
>> [PATCH RFC 13/30] ext4: snapshot file - increase maximum file size limit to 16TB
>> [PATCH RFC 14/30] ext4: snapshot block operations
>> [PATCH RFC 15/30] ext4: snapshot block operation - copy blocks to snapshot
>> [PATCH RFC 16/30] ext4: snapshot block operation - move blocks to snapshot
>> [PATCH RFC 17/30] ext4: snapshot control
>> [PATCH RFC 18/30] ext4: snapshot control - fix new snapshot
>> [PATCH RFC 19/30] ext4: snapshot control - reserve disk space for snapshot
>> [PATCH RFC 20/30] ext4: snapshot journaled - increase transaction credits
>> [PATCH RFC 21/30] ext4: snapshot journaled - implement journal_release_buffer()
>> [PATCH RFC 22/30] ext4: snapshot journaled - bypass to save credits
>> [PATCH RFC 23/30] ext4: snapshot journaled - trace COW/buffer credits
>> [PATCH RFC 24/30] ext4: snapshot list support
>> [PATCH RFC 25/30] ext4: snapshot race conditions - concurrent COW operations
>> [PATCH RFC 26/30] ext4: snapshot race conditions - tracked reads
>> [PATCH RFC 27/30] ext4: snapshot exclude - the exclude bitmap
>> [PATCH RFC 28/30] ext4: snapshot cleanup
>> [PATCH RFC 29/30] ext4: snapshot cleanup - shrink deleted snapshots
>> [PATCH RFC 30/30] ext4: snapshot rocompat - enable rw mount
>
> --
>

2011-06-06 14:51:23

by Lukas Czerner

Subject: Re: [PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)

On Mon, 9 May 2011, [email protected] wrote:

> From: Amir Goldstein <[email protected]>
>
> Built-in snapshots support for ext4.
> Requires that the filesystem has the has_snapshot and exclude_bitmap
> features and that block size is equal to system page size.
> Snapshots are not supported with 64bit and meta_bg features and the
> filesystem must be mounted with ordered data mode.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> Signed-off-by: Yongqiang Yang <[email protected]>
> ---
> fs/ext4/Kconfig | 11 +++
> fs/ext4/Makefile | 2 +
> fs/ext4/balloc.c | 1 +
> fs/ext4/ext4.h | 15 ++++
> fs/ext4/ext4_jbd2.c | 3 +
> fs/ext4/ext4_jbd2.h | 25 ++++++
> fs/ext4/extents.c | 3 +
> fs/ext4/file.c | 1 +
> fs/ext4/ialloc.c | 1 +
> fs/ext4/inode.c | 3 +
> fs/ext4/ioctl.c | 3 +
> fs/ext4/mballoc.c | 1 +
> fs/ext4/namei.c | 1 +
> fs/ext4/resize.c | 1 +
> fs/ext4/snapshot.h | 193 ++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/super.c | 43 ++++++++++
> 16 files changed, 307 insertions(+), 0 deletions(-)
> create mode 100644 fs/ext4/snapshot.h
> create mode 100644 fs/ext4/snapshot_debug.h
>
> diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
> index 9ed1bb1..8970525 100644
> --- a/fs/ext4/Kconfig
> +++ b/fs/ext4/Kconfig
> @@ -83,3 +83,14 @@ config EXT4_DEBUG
>
> If you select Y here, then you will be able to turn on debugging
> with a command such as "echo 1 > /sys/kernel/debug/ext4/mballoc-debug"
> +
> +config EXT4_FS_SNAPSHOT
> + bool "EXT4 snapshots (Experimental)"
> + depends on EXT4_FS && EXPERIMENTAL
> + default n
> + help
> + Built-in snapshots support for ext4.
> + Requires that the filesystem has the has_snapshot and exclude_bitmap
> + features and that block size is equal to system page size.
> + Snapshots are not supported with 64bit and meta_bg features and the
> + filesystem must be mounted with ordered data mode.

What exactly do you mean by not supported with 64bit feature ? Maybe I
am being dense, but I do not get it.

> diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
> index 058b54d..16a779d 100644
> --- a/fs/ext4/Makefile
> +++ b/fs/ext4/Makefile
> @@ -19,3 +19,5 @@ ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o page-io.o \
> ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
> ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
> ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
> +ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot.o snapshot_ctl.o
> +ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_inode.o snapshot_buffer.o
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 1288f80..350f502 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -20,6 +20,7 @@
> #include "ext4.h"
> #include "ext4_jbd2.h"
> #include "mballoc.h"
> +#include "snapshot.h"
>
> /*
> * balloc.c contains the blocks allocation and deallocation routines
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index f495b22..ca25e67 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -886,6 +886,20 @@ struct ext4_inode_info {
> #define EXT2_FLAGS_SIGNED_HASH 0x0001 /* Signed dirhash in use */
> #define EXT2_FLAGS_UNSIGNED_HASH 0x0002 /* Unsigned dirhash in use */
> #define EXT2_FLAGS_TEST_FILESYS 0x0004 /* to test development code */
> +#define EXT4_FLAGS_IS_SNAPSHOT 0x0010 /* Is a snapshot image */
> +#define EXT4_FLAGS_FIX_SNAPSHOT 0x0020 /* Corrupted snapshot */
> +#define EXT4_FLAGS_FIX_EXCLUDE 0x0040 /* Bad exclude bitmap */

Wouldn't it be better to call them EXT4_FLAGS_BAD_SNAPSHOT and
EXT4_FLAGS_BAD_EXCLUDE_BMAP ?

> +
> +#define EXT4_SET_FLAGS(sb, mask) \
> + do { \
> + EXT4_SB(sb)->s_es->s_flags |= cpu_to_le32(mask); \
> + } while (0)
> +#define EXT4_CLEAR_FLAGS(sb, mask) \
> + do { \
> + EXT4_SB(sb)->s_es->s_flags &= ~cpu_to_le32(mask);\
> + } while (0)
> +#define EXT4_TEST_FLAGS(sb, mask) \
> + (EXT4_SB(sb)->s_es->s_flags & cpu_to_le32(mask))
>
> /*
> * Mount flags
> @@ -1351,6 +1365,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
> #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM 0x0010
> #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
> #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
> +#define EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT 0x0080 /* Ext4 has snapshots */
>
> #define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
> #define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 6e272ef..560020d 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -1,8 +1,11 @@
> /*
> * Interface between ext4 and JBD
> + *
> + * Snapshot metadata COW hooks, Amir Goldstein <[email protected]>, 2011
> */
>
> #include "ext4_jbd2.h"
> +#include "snapshot.h"
>
> #include <trace/events/ext4.h>
>
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index e25e99b..8ffffb1 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -10,6 +10,8 @@
> * option, any later version, incorporated herein by reference.
> *
> * Ext4-specific journaling extensions.
> + *
> + * Snapshot extra COW credits, Amir Goldstein <[email protected]>, 2011
> */
>
> #ifndef _EXT4_JBD2_H
> @@ -18,6 +20,7 @@
> #include <linux/fs.h>
> #include <linux/jbd2.h>
> #include "ext4.h"
> +#include "snapshot.h"
>
> #define EXT4_JOURNAL(inode) (EXT4_SB((inode)->i_sb)->s_journal)
>
> @@ -272,6 +275,11 @@ static inline int ext4_should_journal_data(struct inode *inode)
> return 0;
> if (!S_ISREG(inode->i_mode))
> return 1;
> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
> + if (EXT4_SNAPSHOTS(inode->i_sb))
> + /* snapshots enforce ordered data */
> + return 0;
> +#endif
> if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
> return 1;
> if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> @@ -285,6 +293,11 @@ static inline int ext4_should_order_data(struct inode *inode)
> return 0;
> if (!S_ISREG(inode->i_mode))
> return 0;
> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
> + if (EXT4_SNAPSHOTS(inode->i_sb))
> + /* snapshots enforce ordered data */
> + return 1;
> +#endif
> if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> return 0;
> if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
> @@ -298,6 +311,11 @@ static inline int ext4_should_writeback_data(struct inode *inode)
> return 0;
> if (EXT4_JOURNAL(inode) == NULL)
> return 1;
> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
> + if (EXT4_SNAPSHOTS(inode->i_sb))
> + /* snapshots enforce ordered data */
> + return 0;
> +#endif
> if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
> return 0;
> if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
> @@ -320,6 +338,11 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> return 0;
> if (!S_ISREG(inode->i_mode))
> return 0;
> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
> + if (EXT4_SNAPSHOTS(inode->i_sb))
> + /* XXX: should snapshots support dioread_nolock? */
> + return 0;
> +#endif
> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
> return 0;
> if (ext4_should_journal_data(inode))
> @@ -327,4 +350,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> return 1;
> }

Since EXT4_SNAPSHOTS(sb) returns 0 when not configured in, I do not
think those ifdefs are needed.
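
A minimal standalone sketch of that point, assuming the macro is defined to
(0) when the config option is off as the cover letter describes. The
sb_stub structure, the data_mode enum and should_order_data() are
hypothetical stand-ins, not the kernel's types:

```c
#include <assert.h>

enum data_mode { MODE_JOURNAL, MODE_ORDERED, MODE_WRITEBACK };

/* Hypothetical stand-in for struct super_block; only what the sketch needs. */
struct sb_stub {
	int has_snapshot_feature;
	enum data_mode mount_mode;
};

/* Build with -DCONFIG_EXT4_FS_SNAPSHOT to model a snapshot-enabled kernel. */
#ifdef CONFIG_EXT4_FS_SNAPSHOT
#define EXT4_SNAPSHOTS(sb) ((sb)->has_snapshot_feature)
#else
#define EXT4_SNAPSHOTS(sb) (0)
#endif

/*
 * No #ifdef is needed at the call site: with the (0) definition the
 * condition is a compile-time constant and the branch is dead code.
 */
static int should_order_data(const struct sb_stub *sb)
{
	if (EXT4_SNAPSHOTS(sb))
		return 1;	/* snapshots enforce ordered data */
	return sb->mount_mode == MODE_ORDERED;
}
```

The #ifdef then lives in one place (the header that defines the macro)
instead of being repeated in every helper.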

>
> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
> +#endif
> #endif /* _EXT4_JBD2_H */
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 7296cd1..0c3ea93 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -18,6 +18,8 @@
> * You should have received a copy of the GNU General Public Licens
> * along with this program; if not, write to the Free Software
> * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-
> + *
> + * Snapshot move-on-write (MOW), Yongqiang Yang <[email protected]>, 2011
> */
>
> /*
> @@ -43,6 +45,7 @@
> #include <linux/fiemap.h>
> #include "ext4_jbd2.h"
> #include "ext4_extents.h"
> +#include "snapshot.h"
>
> static int ext4_ext_truncate_extend_restart(handle_t *handle,
> struct inode *inode,
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 7b80d54..60b3b19 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -28,6 +28,7 @@
> #include "ext4_jbd2.h"
> #include "xattr.h"
> #include "acl.h"
> +#include "snapshot.h"
>
> /*
> * Called when an inode is released. Note that this is different
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 2fd3b0e..831d49a 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -28,6 +28,7 @@
> #include "ext4_jbd2.h"
> #include "xattr.h"
> #include "acl.h"
> +#include "snapshot.h"
>
> #include <trace/events/ext4.h>
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 4ccb6eb..a597ff1 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -20,6 +20,8 @@
> * ([email protected])
> *
> * Assorted race fixes, rewrite of ext4_get_block() by Al Viro, 2000
> + *
> + * Snapshot inode extensions, Amir Goldstein <[email protected]>, 2011
> */
>
> #include <linux/module.h>
> @@ -49,6 +51,7 @@
> #include "ext4_extents.h"
>
> #include <trace/events/ext4.h>
> +#include "snapshot.h"
>
> #define MPAGE_DA_EXTENT_TAIL 0x01
>
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index eb3bc2f..a426332 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -5,6 +5,8 @@
> * Remy Card ([email protected])
> * Laboratoire MASI - Institut Blaise Pascal
> * Universite Pierre et Marie Curie (Paris VI)
> + *
> + * Snapshot control API, Amir Goldstein <[email protected]>, 2011
> */
>
> #include <linux/fs.h>
> @@ -17,6 +19,7 @@
> #include <asm/uaccess.h>
> #include "ext4_jbd2.h"
> #include "ext4.h"
> +#include "snapshot.h"
>
> long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> {
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 2be85af..4952b7b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -25,6 +25,7 @@
> #include <linux/debugfs.h>
> #include <linux/slab.h>
> #include <trace/events/ext4.h>
> +#include "snapshot.h"
>
> /*
> * MUSTDO:
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index ad87584..b70fa13 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -39,6 +39,7 @@
>
> #include "xattr.h"
> #include "acl.h"
> +#include "snapshot.h"
>
> /*
> * define how far ahead to read directories while searching them.
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index a11c00a..ee9b999 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -15,6 +15,7 @@
> #include <linux/slab.h>
>
> #include "ext4_jbd2.h"
> +#include "snapshot.h"

It would be better for reviewers if you added those includes in the patches
where you start using them.

>
> #define outside(b, first, last) ((b) < (first) || (b) >= (last))
> #define inside(b, first, last) ((b) >= (first) && (b) < (last))
> diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
> new file mode 100644
> index 0000000..a927090
> --- /dev/null
> +++ b/fs/ext4/snapshot.h
> @@ -0,0 +1,193 @@
> +/*
> + * linux/fs/ext4/snapshot.h
> + *
> + * Written by Amir Goldstein <[email protected]>, 2008
> + *
> + * Copyright (C) 2008-2011 CTERA Networks
> + *
> + * This file is part of the Linux kernel and is made available under
> + * the terms of the GNU General Public License, version 2, or at your
> + * option, any later version, incorporated herein by reference.
> + *
> + * Ext4 snapshot extensions.

This is a great place to write more about the snapshot design and
implementation. If that is added later in a different file, then ignore
this :).

> + */
> +
> +#ifndef _LINUX_EXT4_SNAPSHOT_H
> +#define _LINUX_EXT4_SNAPSHOT_H
> +
> +#include <linux/version.h>
> +#include <linux/delay.h>
> +#include "ext4.h"
> +
> +
> +/*
> + * use signed 64bit for snapshot image addresses
> + * negative addresses are used to reference snapshot meta blocks
> + */
> +#define ext4_snapblk_t long long

Maybe use a typedef instead?
typedef signed long long int ext4_snapblk_t;

> +
> +/*
> + * We assert that file system block size == page size (on mount time)
> + * and that the first file system block is block 0 (on snapshot create).
> + * Snapshot inode direct blocks are reserved for snapshot meta blocks.
> + * Snapshot inode single indirect blocks are not used.
> + * Snapshot image starts at the first double indirect block, so all blocks in
> + * Snapshot image block group blocks are mapped by a single DIND block:
> + * 4k: 32k blocks_per_group = 32 IND (4k) blocks = 32 groups per DIND
> + * 8k: 64k blocks_per_group = 32 IND (8k) blocks = 64 groups per DIND
> + * 16k: 128k blocks_per_group = 32 IND (16k) blocks = 128 groups per DIND
> + */
> +#define SNAPSHOT_BLOCK_SIZE PAGE_SIZE
> +#define SNAPSHOT_BLOCK_SIZE_BITS PAGE_SHIFT
> +#define SNAPSHOT_ADDR_PER_BLOCK (SNAPSHOT_BLOCK_SIZE / sizeof(__u32))

> +#define SNAPSHOT_ADDR_PER_BLOCK_BITS (SNAPSHOT_BLOCK_SIZE_BITS - 2)

#define SNAPSHOT_ADDR_PER_BLOCK (1 << SNAPSHOT_ADDR_PER_BLOCK_BITS)

> +#define SNAPSHOT_DIR_BLOCKS EXT4_NDIR_BLOCKS
> +#define SNAPSHOT_IND_BLOCKS SNAPSHOT_ADDR_PER_BLOCK
> +
> +#define SNAPSHOT_BLOCKS_PER_GROUP_BITS (SNAPSHOT_BLOCK_SIZE_BITS + 3)
> +#define SNAPSHOT_BLOCKS_PER_GROUP \
> + (1<<SNAPSHOT_BLOCKS_PER_GROUP_BITS) /* 8*PAGE_SIZE */

> +#define SNAPSHOT_BLOCK_GROUP(block) \
> + ((block)>>SNAPSHOT_BLOCKS_PER_GROUP_BITS)
> +#define SNAPSHOT_BLOCK_GROUP_OFFSET(block) \
> + ((block)&(SNAPSHOT_BLOCKS_PER_GROUP-1))

The formatting is wrong here.

> +#define SNAPSHOT_BLOCK_TUPLE(block) \
> + (ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP_OFFSET(block), \
> + (ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP(block)

This is confusing, but if you're using it really often, so be it.

> +#define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS \
> + (SNAPSHOT_BLOCKS_PER_GROUP_BITS-SNAPSHOT_ADDR_PER_BLOCK_BITS)
> +#define SNAPSHOT_IND_PER_BLOCK_GROUP \
> + (1<<SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */
> +#define SNAPSHOT_DIND_BLOCK_GROUPS_BITS \
> + (SNAPSHOT_ADDR_PER_BLOCK_BITS-SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
> +#define SNAPSHOT_DIND_BLOCK_GROUPS \
> + (1<<SNAPSHOT_DIND_BLOCK_GROUPS_BITS)

Formatting.
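
To double-check the numbers in the layout comment, here is a small
userspace sketch of the same shift arithmetic. The struct and function
names are made up for illustration and are not part of the patch; it
assumes 32-bit block addresses and one bitmap block per group, as the
macros do.

```c
/* Hypothetical helper: derived snapshot layout for a given block-size shift. */
struct snap_layout {
	unsigned addr_per_block_bits;   /* log2(block addresses per block) */
	unsigned blocks_per_group_bits; /* log2(blocks per block group) */
	unsigned ind_per_group;         /* IND blocks needed to map one group */
	unsigned groups_per_dind;       /* block groups mapped by one DIND block */
};

static struct snap_layout snap_layout_for(unsigned block_size_bits)
{
	struct snap_layout l;
	unsigned ind_per_group_bits;

	/* Each block address is 32 bits wide, i.e. 4 bytes = 2 bits of shift. */
	l.addr_per_block_bits = block_size_bits - 2;
	/* One bitmap block covers 8 * block_size blocks. */
	l.blocks_per_group_bits = block_size_bits + 3;

	ind_per_group_bits = l.blocks_per_group_bits - l.addr_per_block_bits;
	l.ind_per_group = 1u << ind_per_group_bits;
	l.groups_per_dind = 1u << (l.addr_per_block_bits - ind_per_group_bits);
	return l;
}
```

For a 4K block this gives 32 IND blocks per group and 32 groups per DIND
block; for 8K and 16K it gives 64 and 128 groups per DIND block, matching
the table in the comment.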

> +
> +#define SNAPSHOT_BLOCK_OFFSET \
> + (SNAPSHOT_DIR_BLOCKS+SNAPSHOT_IND_BLOCKS)
> +#define SNAPSHOT_BLOCK(iblock) \
> + ((ext4_snapblk_t)(iblock) - SNAPSHOT_BLOCK_OFFSET)
> +#define SNAPSHOT_IBLOCK(block) \
> + (ext4_fsblk_t)((block) + SNAPSHOT_BLOCK_OFFSET)

I do not see SNAPSHOT_BLOCK() defined anywhere.

> +
> +
> +
> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
> +#define EXT4_SNAPSHOT_VERSION "ext4 snapshot v1.0.13-6 (2-May-2010)"
> +
> +#define SNAPSHOT_BYTES_OFFSET \
> + (SNAPSHOT_BLOCK_OFFSET << SNAPSHOT_BLOCK_SIZE_BITS)
> +#define SNAPSHOT_ISIZE(size) \
> + ((size) + SNAPSHOT_BYTES_OFFSET)
> +/* Snapshot block device size is recorded in i_disksize */
> +#define SNAPSHOT_SET_SIZE(inode, size) \
> + (EXT4_I(inode)->i_disksize = SNAPSHOT_ISIZE(size))

I do not have a clue what that means. To be honest, I am getting a bit
lost in the macros. Could you add some comments explaining what each one
is used for?
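
As far as I can tell from the macros, the snapshot image starts after the
direct and single-indirect ranges of the snapshot file, and i_disksize
stores the image size plus that offset. A userspace sketch of that
reading, assuming 4K pages and EXT4_NDIR_BLOCKS == 12 (all snap_* names
below are hypothetical, not from the patch):

```c
#include <stdint.h>

#define SNAP_BLOCK_SIZE_BITS 12  /* assumption: 4K pages */
#define SNAP_DIR_BLOCKS      12  /* mirrors EXT4_NDIR_BLOCKS */
#define SNAP_IND_BLOCKS      (1 << (SNAP_BLOCK_SIZE_BITS - 2))
#define SNAP_BLOCK_OFFSET    (SNAP_DIR_BLOCKS + SNAP_IND_BLOCKS)
#define SNAP_BYTES_OFFSET    ((int64_t)SNAP_BLOCK_OFFSET << SNAP_BLOCK_SIZE_BITS)

/* Signed, so that negative values can address snapshot meta blocks. */
typedef int64_t snapblk_t;

/* Logical block in the snapshot file -> block address in the image. */
static snapblk_t snap_block(uint64_t iblock)
{
	return (snapblk_t)iblock - SNAP_BLOCK_OFFSET;
}

/* Block address in the image -> logical block in the snapshot file. */
static uint64_t snap_iblock(snapblk_t block)
{
	return (uint64_t)(block + SNAP_BLOCK_OFFSET);
}

/* Value kept in i_disksize for an image of 'blocks' blocks. */
static int64_t snap_disksize_for_blocks(uint64_t blocks)
{
	return ((int64_t)blocks << SNAP_BLOCK_SIZE_BITS) + SNAP_BYTES_OFFSET;
}

/* Inverse of the above: image size in blocks from i_disksize. */
static uint64_t snap_blocks_from_disksize(int64_t disksize)
{
	return (uint64_t)((disksize - SNAP_BYTES_OFFSET) >> SNAP_BLOCK_SIZE_BITS);
}
```

Under these assumptions the image's block 0 lands at logical block 1036
of the snapshot file, and the logical blocks below that map to negative
image addresses.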

> +#define SNAPSHOT_SIZE(inode) \
> + (EXT4_I(inode)->i_disksize - SNAPSHOT_BYTES_OFFSET)
> +#define SNAPSHOT_SET_BLOCKS(inode, blocks) \
> + SNAPSHOT_SET_SIZE((inode), \
> + (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS)
> +#define SNAPSHOT_BLOCKS(inode) \
> + (ext4_fsblk_t)(SNAPSHOT_SIZE(inode) >> SNAPSHOT_BLOCK_SIZE_BITS)
> +/* Snapshot shrink/merge/clean progress is exported via i_size */
> +#define SNAPSHOT_PROGRESS(inode) \
> + (ext4_fsblk_t)((inode)->i_size >> SNAPSHOT_BLOCK_SIZE_BITS)
> +#define SNAPSHOT_SET_ENABLED(inode) \
> + i_size_write((inode), SNAPSHOT_SIZE(inode))
> +#define SNAPSHOT_SET_PROGRESS(inode, blocks) \
> + snapshot_size_extend((inode), (blocks))
> +/* Disabled/deleted snapshot i_size is 1 block, to allow read of super block */
> +#define SNAPSHOT_SET_DISABLED(inode) \
> + snapshot_size_truncate((inode), 1)
> +/* Removed snapshot i_size and i_disksize are 0, since all blocks were freed */
> +#define SNAPSHOT_SET_REMOVED(inode) \
> + do { \
> + EXT4_I(inode)->i_disksize = 0; \
> + snapshot_size_truncate((inode), 0); \
> + } while (0)
> +
> +static inline void snapshot_size_extend(struct inode *inode,
> + ext4_fsblk_t blocks)
> +{
> + i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
> +}
> +
> +static inline void snapshot_size_truncate(struct inode *inode,
> + ext4_fsblk_t blocks)
> +{
> + loff_t i_size = (loff_t)blocks << SNAPSHOT_BLOCK_SIZE_BITS;
> +
> + i_size_write(inode, i_size);
> + truncate_inode_pages(&inode->i_data, i_size);
> +}
> +
> +/* Is ext4 configured for snapshots support? */
> +static inline int EXT4_SNAPSHOTS(struct super_block *sb)
> +{
> + return EXT4_HAS_RO_COMPAT_FEATURE(sb,
> + EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT);
> +}
> +
> +#define ext4_snapshot_cow(handle, inode, block, bh, cow) 0
> +
> +#define ext4_snapshot_move(handle, inode, block, pcount, move) (0)
> +
> +/*
> + * Block access functions
> + */
> +
> +
> +
> +/* snapshot_ctl.c */
> +
> +
> +static inline int init_ext4_snapshot(void)
> +{
> + return 0;
> +}
> +
> +static inline void exit_ext4_snapshot(void)
> +{
> +}
> +
> +
> +
> +
> +
> +#else /* CONFIG_EXT4_FS_SNAPSHOT */
> +
> +/* Snapshot NOP macros */
> +#define EXT4_SNAPSHOTS(sb) (0)
> +#define SNAPMAP_ISCOW(cmd) (0)
> +#define SNAPMAP_ISMOVE(cmd) (0)
> +#define SNAPMAP_ISSYNC(cmd) (0)
> +#define IS_COWING(handle) (0)
> +
> +#define ext4_snapshot_load(sb, es, ro) (0)
> +#define ext4_snapshot_destroy(sb)
> +#define init_ext4_snapshot() (0)
> +#define exit_ext4_snapshot()
> +#define ext4_snapshot_active(sbi) (0)
> +#define ext4_snapshot_file(inode) (0)
> +#define ext4_snapshot_should_move_data(inode) (0)
> +#define ext4_snapshot_test_excluded(handle, inode, block_to_free, count) (0)
> +#define ext4_snapshot_list(inode) (0)
> +#define ext4_snapshot_get_flags(ei, filp)
> +#define ext4_snapshot_set_flags(handle, inode, flags) (0)
> +#define ext4_snapshot_take(inode) (0)
> +#define ext4_snapshot_update(inode_i_sb, cleanup, zero) (0)
> +#define ext4_snapshot_has_active(sb) (NULL)
> +#define ext4_snapshot_get_bitmap_access(handle, sb, grp, bh) (0)
> +#define ext4_snapshot_get_write_access(handle, inode, bh) (0)
> +#define ext4_snapshot_get_create_access(handle, bh) (0)
> +#define ext4_snapshot_excluded(ac_inode) (0)
> +#define ext4_snapshot_get_delete_access(handle, inode, block, pcount) (0)
> +
> +#define ext4_snapshot_get_move_access(handle, inode, block, pcount, move) (0)
> +#define ext4_snapshot_start_pending_cow(sbh)
> +#define ext4_snapshot_end_pending_cow(sbh)
> +#define ext4_snapshot_is_active(inode) (0)
> +#define ext4_snapshot_mow_in_tid(inode) (1)
> +
> +#endif /* CONFIG_EXT4_FS_SNAPSHOT */
> +#endif /* _LINUX_EXT4_SNAPSHOT_H */
> diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
> new file mode 100644
> index 0000000..e69de29
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 414167a..2c345d1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -48,6 +48,7 @@
> #include "xattr.h"
> #include "acl.h"
> #include "mballoc.h"
> +#include "snapshot.h"
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/ext4.h>
> @@ -2612,6 +2613,24 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
> return 0;
> }
> }
> + /* Enforce snapshots requirements: */
> + if (EXT4_SNAPSHOTS(sb)) {
> + if (EXT4_HAS_INCOMPAT_FEATURE(sb,
> + EXT4_FEATURE_INCOMPAT_META_BG|
> + EXT4_FEATURE_INCOMPAT_64BIT)) {
> + ext4_msg(sb, KERN_ERR,
> + "has_snapshot feature cannot be mixed with "
> + "features: meta_bg, 64bit");
> + return 0;

So what if the fs is mounted read-only and has those incompatible
features? Is that OK?

> + }
> + if (EXT4_TEST_FLAGS(sb, EXT4_FLAGS_IS_SNAPSHOT)) {
> + ext4_msg(sb, KERN_ERR,
> + "A snapshot image must be mounted read-only. "
> + "If this is an exported snapshot image, you "
> + "must run fsck -xy to make it writable.");
> + return 0;
> + }
> + }
> return 1;
> }
>
> @@ -3194,6 +3213,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
> blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
>
> + /* Enforce snapshots blocksize == pagesize */
> + if (EXT4_SNAPSHOTS(sb) && blocksize != PAGE_SIZE) {
> + ext4_msg(sb, KERN_ERR,
> + "snapshots require that filesystem blocksize "
> + "(%d) be equal to system page size (%lu)",
> + blocksize, PAGE_SIZE);
> + goto failed_mount;
> + }

I would rather see this test after the test for supported filesystem
blocksize. It is only logical to check for the superset first.
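
Something along these lines, as a standalone sketch of the ordering only.
The function name and return codes are stand-ins (1024 and 65536 mirror
EXT4_MIN_BLOCK_SIZE and EXT4_MAX_BLOCK_SIZE); the real code would keep
using ext4_msg() and goto failed_mount.

```c
enum blocksize_check {
	BLOCKSIZE_OK = 0,
	BLOCKSIZE_UNSUPPORTED,       /* outside what ext4 supports at all */
	BLOCKSIZE_SNAPSHOT_MISMATCH  /* supported, but != page size with snapshots */
};

static enum blocksize_check check_blocksize(unsigned long blocksize,
					    unsigned long page_size,
					    int has_snapshots)
{
	/* General constraint first: the superset of valid block sizes. */
	if (blocksize < 1024 || blocksize > 65536)
		return BLOCKSIZE_UNSUPPORTED;

	/* Then the narrower, snapshot-specific constraint. */
	if (has_snapshots && blocksize != page_size)
		return BLOCKSIZE_SNAPSHOT_MISMATCH;

	return BLOCKSIZE_OK;
}
```

With this order an out-of-range block size is always reported as
unsupported, and the snapshot message only appears for block sizes that
ext4 could otherwise mount.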

> +
> if (blocksize < EXT4_MIN_BLOCK_SIZE ||
> blocksize > EXT4_MAX_BLOCK_SIZE) {
> ext4_msg(sb, KERN_ERR,
> @@ -3540,6 +3568,15 @@ no_journal:
> goto failed_mount_wq;
> }
>
> + /* Enforce journal ordered mode with snapshots */
> + if (EXT4_SNAPSHOTS(sb) && !(sb->s_flags & MS_RDONLY) &&
> + (!EXT4_SB(sb)->s_journal ||
> + test_opt(sb, DATA_FLAGS) != EXT4_MOUNT_ORDERED_DATA)) {
> + ext4_msg(sb, KERN_ERR,
> + "snapshots require journal ordered mode");
> + goto failed_mount4;
> + }
> +
> /*
> * The jbd2_journal_load will have done any necessary log recovery,
> * so we can safely mount the rest of the filesystem now.
> @@ -4878,10 +4915,15 @@ static int __init ext4_init_fs(void)
> err = register_filesystem(&ext4_fs_type);
> if (err)
> goto out;
> + err = init_ext4_snapshot();
> + if (err)
> + goto out_fs;

Is it really necessary to init snapshots after the filesystem
registration? I do not see a reason why.
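
For illustration, the alternative ordering would look roughly like this.
It is a userspace mock with stub functions (none of the mock_* names
exist in ext4), and it only shows the error unwinding, not whether
snapshot init actually depends on registration.

```c
static int snapshot_ready;  /* set by the mock snapshot init */
static int fs_registered;   /* set by the mock registration */
static int fail_register;   /* test knob: make registration fail */

static int mock_init_snapshot(void)
{
	snapshot_ready = 1;
	return 0;
}

static void mock_exit_snapshot(void)
{
	snapshot_ready = 0;
}

static int mock_register_fs(void)
{
	if (fail_register)
		return -1;
	fs_registered = 1;
	return 0;
}

static int mock_init_fs(void)
{
	int err;

	/* Snapshot init first: nothing is registered yet, nothing to undo. */
	err = mock_init_snapshot();
	if (err)
		return err;

	/* Registration last, so the fs is only visible once fully set up. */
	err = mock_register_fs();
	if (err)
		goto out_snapshot;
	return 0;

out_snapshot:
	mock_exit_snapshot();
	return err;
}
```

The teardown in the exit path would then run in the opposite order:
unregister first, exit snapshots afterwards.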

>
> ext4_li_info = NULL;
> mutex_init(&ext4_li_mtx);
> return 0;
> +out_fs:
> + unregister_filesystem(&ext4_fs_type);
> out:
> unregister_as_ext2();
> unregister_as_ext3();
> @@ -4905,6 +4947,7 @@ out7:
>
> static void __exit ext4_exit_fs(void)
> {
> + exit_ext4_snapshot();
> ext4_destroy_lazyinit_thread();
> unregister_as_ext2();
> unregister_as_ext3();
>

Thanks!
-Lukas

2011-06-06 15:08:20

by Lukas Czerner

Subject: Re: [PATCH RFC 02/30] ext4: snapshot debugging support

On Mon, 9 May 2011, [email protected] wrote:

> From: Amir Goldstein <[email protected]>
>
> Control snapshot debug level via debugfs entry /ext4/snapshot-debug
> and induce delay tests via debugfs entries /ext4/test-XXX-delay-msec.

Wouldn't you rather consider adding fixed tracepoints? I think
tracepoints would be useful regardless of this debugfs interface.

>
> Signed-off-by: Amir Goldstein <[email protected]>
> Signed-off-by: Yongqiang Yang <[email protected]>
> ---
> fs/ext4/Makefile | 1 +
> fs/ext4/mballoc.c | 3 +
> fs/ext4/snapshot.h | 9 ++++
> fs/ext4/snapshot_debug.h | 105 ++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 118 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
> index 16a779d..9981306 100644
> --- a/fs/ext4/Makefile
> +++ b/fs/ext4/Makefile
> @@ -21,3 +21,4 @@ ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
> ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
> ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot.o snapshot_ctl.o
> ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_inode.o snapshot_buffer.o
> +ext4-$(CONFIG_EXT4_FS_SNAPSHOT) += snapshot_debug.o
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 4952b7b..42961bf 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2657,10 +2657,13 @@ static void __init ext4_create_debugfs_entry(void)
> S_IRUGO | S_IWUSR,
> debugfs_dir,
> &mb_enable_debug);
> + if (debugfs_dir)
> + ext4_snapshot_create_debugfs_entry(debugfs_dir);
> }
>
> static void ext4_remove_debugfs_entry(void)
> {
> + ext4_snapshot_remove_debugfs_entry();

I do not see it defined anywhere.

> debugfs_remove(debugfs_debug);
> debugfs_remove(debugfs_dir);
> }
> diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
> index a927090..52bfa52 100644
> --- a/fs/ext4/snapshot.h
> +++ b/fs/ext4/snapshot.h
> @@ -18,6 +18,7 @@
> #include <linux/version.h>
> #include <linux/delay.h>
> #include "ext4.h"
> +#include "snapshot_debug.h"
>
>
> /*
> @@ -109,6 +110,14 @@
> static inline void snapshot_size_extend(struct inode *inode,
> ext4_fsblk_t blocks)
> {
> +#ifdef CONFIG_EXT4_DEBUG
> + ext4_fsblk_t old_blocks = SNAPSHOT_PROGRESS(inode);
> + ext4_fsblk_t max_blocks = SNAPSHOT_BLOCKS(inode);
> +
> + /* sleep total of tunable delay unit over 100% progress */

What is this good for? It is not clear from the description.

> + snapshot_test_delay_progress(SNAPTEST_DELETE,
> + old_blocks, blocks, max_blocks);
> +#endif
> i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
> }
>
> diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
> index e69de29..f893eb1 100644
> --- a/fs/ext4/snapshot_debug.h
> +++ b/fs/ext4/snapshot_debug.h
> @@ -0,0 +1,105 @@
> +/*
> + * linux/fs/ext4/snapshot_debug.h
> + *
> + * Written by Amir Goldstein <[email protected]>, 2008
> + *
> + * Copyright (C) 2008-2011 CTERA Networks
> + *
> + * This file is part of the Linux kernel and is made available under
> + * the terms of the GNU General Public License, version 2, or at your
> + * option, any later version, incorporated herein by reference.
> + *
> + * Ext4 snapshot debugging.
> + */
> +
> +#ifndef _LINUX_EXT4_SNAPSHOT_DEBUG_H
> +#define _LINUX_EXT4_SNAPSHOT_DEBUG_H
> +
> +#if defined(CONFIG_EXT4_FS_SNAPSHOT) && defined(CONFIG_EXT4_DEBUG)
> +#include <linux/delay.h>
> +
> +#define SNAPSHOT_INDENT_MAX 4
> +#define SNAPSHOT_INDENT_STR "\t\t\t\t"
> +
> +#define SNAPTEST_TAKE 0
> +#define SNAPTEST_DELETE 1
> +#define SNAPTEST_COW 2
> +#define SNAPTEST_READ 3
> +#define SNAPTEST_BITMAP 4
> +#define SNAPSHOT_TESTS_NUM 5
> +
> +extern const char *snapshot_indent;
> +extern u8 snapshot_enable_debug;
> +extern u16 snapshot_enable_test[SNAPSHOT_TESTS_NUM];
> +extern u8 cow_cache_enabled;
> +
> +#define snapshot_test_delay(i) \
> + do { \
> + if (snapshot_enable_test[i]) \
> + msleep_interruptible(snapshot_enable_test[i]); \
> + } while (0)
> +
> +/*
> + * Sleep 1ms every 'blocks_per_ms', amounting to the total test delay
> + * over 100% of progress (when 'to' reaches 'max').
> + * snapshot_enable_test[i] (msec) is limited to 64K and max (blocks_count)
> + * is likely much more than 64K, so 'blocks_per_ms' is likely non zero.
> + */

Oh, here is a good place for explaining the purpose.
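
For reference, this is how I read the intent of the comment above: spread
the configured delay evenly over the whole progress range. A plain-C
userspace sketch (progress_delay_ms is a hypothetical name, and it
returns the delay instead of sleeping as the kernel macro does):

```c
/*
 * Milliseconds to sleep when progress advances from 'from' to 'to' blocks,
 * so that a full run from 0 to 'max' sleeps about 'delay_ms' in total.
 */
static unsigned long progress_delay_ms(unsigned long delay_ms,
				       unsigned long from,
				       unsigned long to,
				       unsigned long max)
{
	unsigned long blocks_per_ms;

	/* Same guards as the macro: test enabled, sane range, max > delay. */
	if (!delay_ms || max <= delay_ms || from > to || to > max)
		return 0;

	blocks_per_ms = max / delay_ms; /* >= 1 because max > delay_ms */

	/* Number of 1 ms "ticks" crossed between 'from' and 'to'. */
	return to / blocks_per_ms - from / blocks_per_ms;
}
```

With a 1000 ms delay over 100000 blocks, advancing from 0 to 50000 yields
500 ms. Note that the kernel's do_div(n, base) leaves the quotient in n
and returns the remainder, so the macro as quoted would need to take
blocks_per_ms from the divided variable rather than from the return
value.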

> +#define snapshot_test_delay_progress(i, from, to, max) \
> + do { \
> + if (snapshot_enable_test[i] && \
> + (max) > snapshot_enable_test[i] && \
> + (from) <= (to) && (to) <= (max)) { \
> + unsigned long blocks_per_ms = \
> + do_div((max), snapshot_enable_test[i]); \
> + unsigned long x = do_div((from), blocks_per_ms);\
> + unsigned long y = do_div((to), blocks_per_ms); \
> + if (y > x) \
> + msleep_interruptible(y - x); \
> + } \
> + } while (0)
> +
> +#define snapshot_debug_l(n, l, f, a...) \
> + do { \
> + if ((n) <= snapshot_enable_debug && \
> + (l) <= SNAPSHOT_INDENT_MAX) { \
> + printk(KERN_DEBUG "snapshot: %s" f, \
> + snapshot_indent - (l), \
> + ## a); \
> + } \
> + } while (0)

Maybe this could be done with tracepoints?

> +
> +#define snapshot_debug(n, f, a...) snapshot_debug_l(n, 0, f, ## a)
> +
> +#define snapshot_debug_once(n, f, a...) \

Formatting.

> + do { \
> + static bool __once; \
> + if (!__once) { \
> + snapshot_debug(n, f, ## a); \
> + __once = true; \
> + } \
> + } while (0)
> +
> +extern void ext4_snapshot_create_debugfs_entry(struct dentry *debugfs_dir);
> +extern void ext4_snapshot_remove_debugfs_entry(void);
> +
> +#else
> +#define snapshot_enable_debug (0)
> +#define snapshot_test_delay(i)
> +#define snapshot_test_delay_progress(i, from, to, max)
> +#define snapshot_debug(n, f, a...)
> +#define snapshot_debug_l(n, l, f, a...)
> +#define snapshot_debug_once(n, f, a...)
> +#define ext4_snapshot_create_debugfs_entry(d)
> +#define ext4_snapshot_remove_debugfs_entry()
> +#endif
> +
> +
> +/* debug levels */
> +#define SNAP_ERR 1 /* errors and summary */
> +#define SNAP_WARN 2 /* warnings */

It seems to me that those two levels should be displayed no matter what
via standard functions.

> +#define SNAP_INFO 3 /* info */
> +#define SNAP_DEBUG 4 /* debug */

And these two levels can be done with tracepoints.

> +#define SNAP_DUMP 5 /* dump snapshot file */

Via e2fsprogs debugfs, maybe?

> +
> +#endif /* _LINUX_EXT4_SNAPSHOT_DEBUG_H */
>


2011-06-06 15:31:53

by Eric Sandeen

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 6/6/11 9:32 AM, Amir G. wrote:
> On Mon, Jun 6, 2011 at 4:08 PM, Lukas Czerner <[email protected]> wrote:
>> On Mon, 9 May 2011, [email protected] wrote:
>>>
>>> MERGING
>>> -------
>>> These patches are based on Ted's current master branch + alloc_semp removal
>>> patches. Although the alloc_semp removal is an independent (and in my eyes
>>> a good) change, it is also required by snapshot patches, to avoid circular
>>> locking dependency during COW allocations.
>>>
>>> Merging with Allison's punch holes patches should be straight forward, since
>>> the hard part, namely Yongqiang's split extent refactoring patches, was
>>> already merged by Ted.
>>>
>>> Merging with Ted's big alloc patches is going to be a bit more challenging,
>>> since big alloc patches make a lot of renaming and refactoring. However,
>>> since has_snapshots and big_alloc features will never work together,
>>> at least testing the code together is not a big concern.
>>
>> Hi Amir,
>>
>> what is the reason for the snapshots to never work with big_alloc ? Just
>> out of curiosity.
>>
>
> For one reason, a snapshot file format is currently an indirect file and
> big_alloc doesn't support indirect mapped files.
> I am not saying it cannot be done, but if it does, there would be several
> obstacles to cross.

I know I'm kind of just throwing a bomb out here, but I am very concerned
about the ever-growing feature (in)compatibility matrix in ext4.

Take for example dioread_nolock caveats:

"However this does not work with nobh
option and the mount will fail. Nor does it work with
data journaling and dioread_nolock option will be
ignored with kernel warning. Note that dioread_nolock
code path is only used for extent-based files."

If ext4 matches the lifespan of ext3, in 10 years I fear that it will look
more like a collection of various individuals' pet projects, rather than
any kind of well-designed, cohesive project.

How long can we really keep adding features which are semi- or wholly-
incompatible with other features?

Consider this a cry in the wilderness for less rushed feature introduction,
and a more holistic approach to ext4 design...

-Eric

2011-06-06 15:53:46

by Lukas Czerner

Subject: Re: [PATCH RFC 03/30] ext4: snapshot hooks - inside JBD hooks

On Mon, 9 May 2011, [email protected] wrote:

> From: Amir Goldstein <[email protected]>
>
> Before every metadata buffer write, the journal API is called,
> namely, one of the ext4_journal_get_XXX_access() functions.
> We use these journal hooks to call the snapshot API, namely
> ext4_snapshot_get_XXX_access(), to COW the metadata buffer before
> it is modified for the first time.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> Signed-off-by: Yongqiang Yang <[email protected]>
> ---
> fs/ext4/ext4_jbd2.c | 9 +++++++--
> fs/ext4/ext4_jbd2.h | 15 +++++++++++----
> fs/ext4/extents.c | 3 ++-
> fs/ext4/inode.c | 22 +++++++++++++++-------
> fs/ext4/move_extent.c | 3 ++-
> 5 files changed, 37 insertions(+), 15 deletions(-)
>
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 560020d..833969b 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -23,13 +23,16 @@ int __ext4_journal_get_undo_access(const char *where, unsigned int line,
> return err;
> }
>
> -int __ext4_journal_get_write_access(const char *where, unsigned int line,
> - handle_t *handle, struct buffer_head *bh)
> +int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
> + handle_t *handle, struct inode *inode,
> + struct buffer_head *bh, int exclude)
> {
> int err = 0;
>
> if (ext4_handle_valid(handle)) {
> err = jbd2_journal_get_write_access(handle, bh);
> + if (!err && !exclude)
> + err = ext4_snapshot_get_write_access(handle, inode, bh);

Agh, this is not defined anywhere again. Actually, it is defined only if
snapshot is not configured in. It is quite painful to review when half
of the code is missing. Something like this will also break bisecting
for all of us. Is there really no other way?

Also, could you document the new parameters?

Anyway:
if (!err && !exclude && inode)
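
A userspace mock of the suggested guard, to make the three cases
explicit. The struct definitions and mock_* functions are stand-ins, not
the kernel types or the real journal/snapshot calls. Whether a NULL inode
should skip the snapshot hook here or be handled inside the hook itself
is for the patch author to say.

```c
#include <stddef.h>

struct inode { int unused; };        /* stand-in for the kernel struct */
struct buffer_head { int cow_calls; }; /* counts mock COW hook invocations */

static int mock_jbd2_get_write_access(struct buffer_head *bh)
{
	(void)bh;
	return 0; /* journal access always succeeds in this mock */
}

static int mock_snapshot_get_write_access(struct inode *inode,
					  struct buffer_head *bh)
{
	(void)inode;
	bh->cow_calls++;
	return 0;
}

static int get_write_access_inode(struct inode *inode,
				  struct buffer_head *bh, int exclude)
{
	int err = mock_jbd2_get_write_access(bh);

	/* Call the snapshot hook only for non-excluded buffers with an inode. */
	if (!err && !exclude && inode)
		err = mock_snapshot_get_write_access(inode, bh);
	return err;
}
```

With this guard the hook runs only for the _inode variant called with a
real inode; both the _exclude helper and the plain helper (which passes
NULL) bypass it.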

> if (err)
> ext4_journal_abort_handle(where, line, __func__, bh,
> handle, err);
> @@ -111,6 +114,8 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
>
> if (ext4_handle_valid(handle)) {
> err = jbd2_journal_get_create_access(handle, bh);
> + if (!err)
> + err = ext4_snapshot_get_create_access(handle, bh);
> if (err)
> ext4_journal_abort_handle(where, line, __func__,
> bh, handle, err);
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 8ffffb1..75662f7 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -132,9 +132,9 @@ void ext4_journal_abort_handle(const char *caller, unsigned int line,
> int __ext4_journal_get_undo_access(const char *where, unsigned int line,
> handle_t *handle, struct buffer_head *bh);
>
> -int __ext4_journal_get_write_access(const char *where, unsigned int line,
> - handle_t *handle, struct buffer_head *bh);
> -
> +int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
> + handle_t *handle, struct inode *inode,
> + struct buffer_head *bh, int exclude);
> int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
> int is_metadata, struct inode *inode,
> struct buffer_head *bh, ext4_fsblk_t blocknr);
> @@ -151,8 +151,15 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
>
> #define ext4_journal_get_undo_access(handle, bh) \
> __ext4_journal_get_undo_access(__func__, __LINE__, (handle), (bh))
> +#define ext4_journal_get_write_access_exclude(handle, bh) \
> + __ext4_journal_get_write_access_inode(__func__, __LINE__, \
> + (handle), NULL, (bh), 1)
> #define ext4_journal_get_write_access(handle, bh) \
> - __ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
> + __ext4_journal_get_write_access_inode(__func__, __LINE__, \
> + (handle), NULL, (bh), 0)
> +#define ext4_journal_get_write_access_inode(handle, inode, bh) \
> + __ext4_journal_get_write_access_inode(__func__, __LINE__, \
> + (handle), (inode), (bh), 0)

Could you add some comments so everyone knows when to use the _exclude
helper and when not?

> #define ext4_forget(handle, is_metadata, inode, bh, block_nr) \
> __ext4_forget(__func__, __LINE__, (handle), (is_metadata), (inode), \
> (bh), (block_nr))
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 0c3ea93..c8cab3d 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -77,7 +77,8 @@ static int ext4_ext_get_access(handle_t *handle, struct inode *inode,
> {
> if (path->p_bh) {
> /* path points to block */
> - return ext4_journal_get_write_access(handle, path->p_bh);
> + return ext4_journal_get_write_access_inode(handle,
> + inode, path->p_bh);
> }
> /* path points to leaf/index in inode body */
> /* we use in-core data, no need to protect them */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index a597ff1..b848072 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -874,7 +874,8 @@ static int ext4_splice_branch(handle_t *handle, struct inode *inode,
> */
> if (where->bh) {
> BUFFER_TRACE(where->bh, "get_write_access");
> - err = ext4_journal_get_write_access(handle, where->bh);
> + err = ext4_journal_get_write_access_inode(handle, inode,
> + where->bh);
> if (err)
> goto err_out;
> }
> @@ -4172,7 +4173,8 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
> goto out_err;
> if (bh) {
> BUFFER_TRACE(bh, "retaking write access");
> - err = ext4_journal_get_write_access(handle, bh);
> + err = ext4_journal_get_write_access_inode(handle,
> + inode, bh);
> if (unlikely(err))
> goto out_err;
> }
> @@ -4223,7 +4225,8 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
>
> if (this_bh) { /* For indirect block */
> BUFFER_TRACE(this_bh, "get_write_access");
> - err = ext4_journal_get_write_access(handle, this_bh);
> + err = ext4_journal_get_write_access_inode(handle, inode,
> + this_bh);
> /* Important: if we can't update the indirect pointers
> * to the blocks, we can't free them. */
> if (err)
> @@ -4386,8 +4389,8 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
> * pointed to by an indirect block: journal it
> */
> BUFFER_TRACE(parent_bh, "get_write_access");
> - if (!ext4_journal_get_write_access(handle,
> - parent_bh)){
> + if (!ext4_journal_get_write_access_inode(
> + handle, inode, parent_bh)){
> *p = 0;
> BUFFER_TRACE(parent_bh,
> "call ext4_handle_dirty_metadata");
> @@ -4759,9 +4762,14 @@ has_buffer:
>
> int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
> {
> - /* We have all inode data except xattrs in memory here. */
> - return __ext4_get_inode_loc(inode, iloc,
> + int in_mem = (!EXT4_SNAPSHOTS(inode->i_sb) &&
> !ext4_test_inode_state(inode, EXT4_STATE_XATTR));
> +
> + /*
> + * We have all inode's data except xattrs in memory here,
> + * but we must always read-in the entire inode block for COW.
> + */
> + return __ext4_get_inode_loc(inode, iloc, in_mem);
> }
>
> void ext4_set_inode_flags(struct inode *inode)
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index b9f3e78..ad5409a 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -421,7 +421,8 @@ mext_insert_extents(handle_t *handle, struct inode *orig_inode,
>
> if (depth) {
> /* Register to journal */
> - ret = ext4_journal_get_write_access(handle, orig_path->p_bh);
> + ret = ext4_journal_get_write_access_inode(handle,
> + orig_inode, orig_path->p_bh);
> if (ret)
> return ret;
> }
>

--

2011-06-06 16:05:38

by Lukas Czerner

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Mon, 6 Jun 2011, Eric Sandeen wrote:

> On 6/6/11 9:32 AM, Amir G. wrote:
> > On Mon, Jun 6, 2011 at 4:08 PM, Lukas Czerner <[email protected]> wrote:
> >> On Mon, 9 May 2011, [email protected] wrote:
> >>>
> >>> MERGING
> >>> -------
> >>> These patches are based on Ted's current master branch + alloc_semp removal
> >>> patches. Although the alloc_semp removal is an independent (and in my eyes
> >>> a good) change, it is also required by snapshot patches, to avoid circular
> >>> locking dependency during COW allocations.
> >>>
> >>> Merging with Allison's punch holes patches should be straight forward, since
> >>> the hard part, namely Yongqiang's split extent refactoring patches, was
> >>> already merged by Ted.
> >>>
> >>> Merging with Ted's big alloc patches is going to be a bit more challenging,
> >>> since big alloc patches make a lot of renaming and refactoring. However,
> >>> since has_snapshots and big_alloc features will never work together,
> >>> at least testing the code together is not a big concern.
> >>
> >> Hi Amir,
> >>
> >> what is the reason for the snapshots to never work with big_alloc ? Just
> >> out of curiosity.
> >>
> >
> > For one reason, a snapshot file format is currently an indirect file and
> > big_alloc doesn't support indirect mapped files.
> > I am not saying it cannot be done, but if it does, there would be several
> > obstacles to cross.
>
> I know I'm kind of just throwing a bomb out here, but I am very concerned
> about the ever-growing feature (in)compatibility matrix in ext4.
>
> Take for example dioread_nolock caveats:
>
> "However this does not work with nobh
> option and the mount will fail. Nor does it work with
> data journaling and dioread_nolock option will be
> ignored with kernel warning. Note that dioread_nolock
> code path is only used for extent-based files."
>
> If ext4 matches the lifespan of ext3, in 10 years I fear that it will look
> more like a collection of various individuals' pet projects, rather than
> any kind of well-designed, cohesive project.
>
> How long can we really keep adding features which are semi- or wholly-
> incompatible with other features?
>
> Consider this a cry in the wilderness for less rushed feature introduction,
> and a more holistic approach to ext4 design...

Well, we can also start ditching some unused features and tunables, or
make them the default and remove them from the documentation so people
will not use them and we can get rid of some of the options in the
future. For example:

orlov
oldalloc
bsddf
minixdf

seem like a good start at first glance...

Thanks!
-Lukas

>
> -Eric

2011-06-06 16:08:03

by Amir G.

Subject: Re: [PATCH RFC 03/30] ext4: snapshot hooks - inside JBD hooks

On Mon, Jun 6, 2011 at 6:53 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 9 May 2011, [email protected] wrote:
>
>> From: Amir Goldstein <[email protected]>
>>
>> Before every metadata buffer write, the journal API is called,
>> namely, one of the ext4_journal_get_XXX_access() functions.
>> We use these journal hooks to call the snapshot API, namely
>> ext4_snapshot_get_XXX_access(), to COW the metadata buffer before
>> it is modified for the first time.
>>
>> Signed-off-by: Amir Goldstein <[email protected]>
>> Signed-off-by: Yongqiang Yang <[email protected]>
>> ---
>>  fs/ext4/ext4_jbd2.c   |    9 +++++++--
>>  fs/ext4/ext4_jbd2.h   |   15 +++++++++++----
>>  fs/ext4/extents.c     |    3 ++-
>>  fs/ext4/inode.c       |   22 +++++++++++++++-------
>>  fs/ext4/move_extent.c |    3 ++-
>>  5 files changed, 37 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
>> index 560020d..833969b 100644
>> --- a/fs/ext4/ext4_jbd2.c
>> +++ b/fs/ext4/ext4_jbd2.c
>> @@ -23,13 +23,16 @@ int __ext4_journal_get_undo_access(const char *where, unsigned int line,
>>       return err;
>>  }
>>
>> -int __ext4_journal_get_write_access(const char *where, unsigned int line,
>> -                                 handle_t *handle, struct buffer_head *bh)
>> +int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
>> +                                      handle_t *handle, struct inode *inode,
>> +                                      struct buffer_head *bh, int exclude)
>>  {
>>       int err = 0;
>>
>>       if (ext4_handle_valid(handle)) {
>>               err = jbd2_journal_get_write_access(handle, bh);
>> +             if (!err && !exclude)
>> +                     err = ext4_snapshot_get_write_access(handle, inode, bh);
>
> Agh, this is not defined anywhere again. Actually, it is defined only if
> snapshot is not configured in. It is quite painful to review when half
> of the code is missing. Something like this will also break bisecting
> for all of us. Is there really no other way?

Hi Lukas,

Thanks for the review :-)
I only have time for one quick answer now.
I will address the rest of your comments later.

As you understand, this patch series was stripped of the snapshot
code, so it only compiles with EXT4_FS_SNAPSHOT not defined.

I have no problem, of course, sending out the full patch series,
but my intention was, as I wrote in the introduction, that the first
review would concentrate on the question:
Will these patches introduce significant changes when the snapshot
code is not compiled, or if the feature is off?

Would you like me to post the full (40) patches?
If you would rather try out the GitHub online review,
I can also post the patches there.

Thanks,
Amir.

>
> Also, could you document the new parameters?
>
> Anyway:
>        if (!err && !exclude && inode)
>
>>               if (err)
>>                       ext4_journal_abort_handle(where, line, __func__, bh,
>>                                                 handle, err);
>> @@ -111,6 +114,8 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
>>
>>       if (ext4_handle_valid(handle)) {
>>               err = jbd2_journal_get_create_access(handle, bh);
>> +             if (!err)
>> +                     err = ext4_snapshot_get_create_access(handle, bh);
>>               if (err)
>>                       ext4_journal_abort_handle(where, line, __func__,
>>                                                 bh, handle, err);
>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> index 8ffffb1..75662f7 100644
>> --- a/fs/ext4/ext4_jbd2.h
>> +++ b/fs/ext4/ext4_jbd2.h
>> @@ -132,9 +132,9 @@ void ext4_journal_abort_handle(const char *caller, unsigned int line,
>>  int __ext4_journal_get_undo_access(const char *where, unsigned int line,
>>                                  handle_t *handle, struct buffer_head *bh);
>>
>> -int __ext4_journal_get_write_access(const char *where, unsigned int line,
>> -                                 handle_t *handle, struct buffer_head *bh);
>> -
>> +int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
>> +                                      handle_t *handle, struct inode *inode,
>> +                                      struct buffer_head *bh, int exclude);
>>  int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
>>                 int is_metadata, struct inode *inode,
>>                 struct buffer_head *bh, ext4_fsblk_t blocknr);
>> @@ -151,8 +151,15 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
>>
>>  #define ext4_journal_get_undo_access(handle, bh) \
>>       __ext4_journal_get_undo_access(__func__, __LINE__, (handle), (bh))
>> +#define ext4_journal_get_write_access_exclude(handle, bh) \
>> +     __ext4_journal_get_write_access_inode(__func__, __LINE__, \
>> +                                              (handle), NULL, (bh), 1)
>>  #define ext4_journal_get_write_access(handle, bh) \
>> -     __ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
>> +     __ext4_journal_get_write_access_inode(__func__, __LINE__, \
>> +                                              (handle), NULL, (bh), 0)
>> +#define ext4_journal_get_write_access_inode(handle, inode, bh) \
>> +     __ext4_journal_get_write_access_inode(__func__, __LINE__, \
>> +                                             (handle), (inode), (bh), 0)
>
> Could you add some comments so everyone knows when to use the _exclude
> helper and when not?
>
>>  #define ext4_forget(handle, is_metadata, inode, bh, block_nr) \
>>       __ext4_forget(__func__, __LINE__, (handle), (is_metadata), (inode), \
>>                     (bh), (block_nr))
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 0c3ea93..c8cab3d 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -77,7 +77,8 @@ static int ext4_ext_get_access(handle_t *handle, struct inode *inode,
>>  {
>>       if (path->p_bh) {
>>               /* path points to block */
>> -             return ext4_journal_get_write_access(handle, path->p_bh);
>> +             return ext4_journal_get_write_access_inode(handle,
>> +                                     inode, path->p_bh);
>>       }
>>       /* path points to leaf/index in inode body */
>>       /* we use in-core data, no need to protect them */
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index a597ff1..b848072 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -874,7 +874,8 @@ static int ext4_splice_branch(handle_t *handle, struct inode *inode,
>>        */
>>       if (where->bh) {
>>               BUFFER_TRACE(where->bh, "get_write_access");
>> -             err = ext4_journal_get_write_access(handle, where->bh);
>> +             err = ext4_journal_get_write_access_inode(handle, inode,
>> +                                                        where->bh);
>>               if (err)
>>                       goto err_out;
>>       }
>> @@ -4172,7 +4173,8 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
>>                       goto out_err;
>>               if (bh) {
>>                       BUFFER_TRACE(bh, "retaking write access");
>> -                     err = ext4_journal_get_write_access(handle, bh);
>> +                     err = ext4_journal_get_write_access_inode(handle,
>> +                                                               inode, bh);
>>                       if (unlikely(err))
>>                               goto out_err;
>>               }
>> @@ -4223,7 +4225,8 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
>>
>>       if (this_bh) {                          /* For indirect block */
>>               BUFFER_TRACE(this_bh, "get_write_access");
>> -             err = ext4_journal_get_write_access(handle, this_bh);
>> +             err = ext4_journal_get_write_access_inode(handle, inode,
>> +                                                        this_bh);
>>               /* Important: if we can't update the indirect pointers
>>                * to the blocks, we can't free them. */
>>               if (err)
>> @@ -4386,8 +4389,8 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
>>                                * pointed to by an indirect block: journal it
>>                                */
>>                               BUFFER_TRACE(parent_bh, "get_write_access");
>> -                             if (!ext4_journal_get_write_access(handle,
>> -                                                                parent_bh)){
>> +                             if (!ext4_journal_get_write_access_inode(
>> +                                         handle, inode, parent_bh)){
>>                                       *p = 0;
>>                                       BUFFER_TRACE(parent_bh,
>>                                       "call ext4_handle_dirty_metadata");
>> @@ -4759,9 +4762,14 @@ has_buffer:
>>
>>  int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
>>  {
>> -     /* We have all inode data except xattrs in memory here. */
>> -     return __ext4_get_inode_loc(inode, iloc,
>> +     int in_mem = (!EXT4_SNAPSHOTS(inode->i_sb) &&
>>               !ext4_test_inode_state(inode, EXT4_STATE_XATTR));
>> +
>> +     /*
>> +      * We have all inode's data except xattrs in memory here,
>> +      * but we must always read-in the entire inode block for COW.
>> +      */
>> +     return __ext4_get_inode_loc(inode, iloc, in_mem);
>>  }
>>
>>  void ext4_set_inode_flags(struct inode *inode)
>> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
>> index b9f3e78..ad5409a 100644
>> --- a/fs/ext4/move_extent.c
>> +++ b/fs/ext4/move_extent.c
>> @@ -421,7 +421,8 @@ mext_insert_extents(handle_t *handle, struct inode *orig_inode,
>>
>>       if (depth) {
>>               /* Register to journal */
>> -             ret = ext4_journal_get_write_access(handle, orig_path->p_bh);
>> +             ret = ext4_journal_get_write_access_inode(handle,
>> +                                     orig_inode, orig_path->p_bh);
>>               if (ret)
>>                       return ret;
>>       }
>>
>
> --
>

2011-06-06 16:33:26

by Andreas Dilger

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 2011-06-06, at 9:31 AM, Eric Sandeen <[email protected]> wrote:
> On 6/6/11 9:32 AM, Amir G. wrote:
>>
>> For one reason, a snapshot file format is currently an indirect file
>> and big_alloc doesn't support indirect mapped files.
>> I am not saying it cannot be done, but if it does, there would be
>> several obstacles to cross.
>
> I know I'm kind of just throwing a bomb out here, but I am very concerned
> about the ever-growing feature (in)compatibility matrix in ext4.

I tend to agree. A new feature like this for ext4 that does not work with the default features of ext4 (extents) means that it will not be usable for the majority of users, but will make the code complex for all of the developers.

Has any thought gone into how this feature could be implemented for extent-mapped files? It seems that part of the problem is because the snapshot "file" needs to be able to map the whole filesystem, which neither indirect-mapped nor extent-mapped files can do without changes.

The current change is to allow indirect-mapped files to have an extra triple-indirect block, which works up to 2^32 blocks (the same limit as extent-mapped files) but this is not useful for filesystems over 2^32 blocks, which is another reason that ext4 was introduced.
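
To make the arithmetic concrete, here is a minimal sketch in C. It assumes 4-byte block pointers and the classic layout of 12 direct pointers, one single-indirect and one double-indirect block; the helper name ind_map_capacity() and the idea of passing the number of triple-indirect blocks as a parameter are made up for illustration and are not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of logical blocks addressable by an indirect-mapped file with
 * 12 direct pointers, one single-indirect block, one double-indirect
 * block and `ntind` triple-indirect blocks.
 * Hypothetical helper, for illustration only.
 */
uint64_t ind_map_capacity(uint32_t block_size, unsigned int ntind)
{
	uint64_t ptrs;

	assert(block_size >= 1024 && block_size % 4 == 0);
	ptrs = block_size / 4;	/* 4-byte block pointers per block */

	return 12 + ptrs + ptrs * ptrs + (uint64_t)ntind * ptrs * ptrs * ptrs;
}
```

With 4K blocks a single triple-indirect block covers 2^30 blocks, so the standard layout stays below 2^32, while four of them reach just past 2^32. Under these assumptions that matches the limit described above.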

So, it seems the reason the 64bit feature is unsupported is really for filesystems larger than the maximum file size, and not for any other reason. Is that correct? Would that mean Ted's bigalloc patches will avoid this problem, or do they not actually increase the maximum file size?

> Take for example dioread_nolock caveats:
>
> "However this does not work with nobh
> option and the mount will fail.

Does anyone actually use nobh? I recall it was a performance tweak for ext3, but I think it was eclipsed by other improvements in ext4. If nobody is using it anymore, we might consider removing it entirely, since it was only a mount-time option and did not affect the on-disk format.

Does smolt return the filesystem mount options?

> Nor does it work with
> data journaling and dioread_nolock option will be
> ignored with kernel warning. Note that dioread_nolock
> code path is only used for extent-based files."

Does this mean that dioread_nolock isn't needed for indirect-mapped files, or that it will work incorrectly on indirect-mapped files, or only that they will use some less efficient code path? I don't recall the details of this option, but it seems that it was related to unwritten extents, in which case it is irrelevant to indirect-mapped files.

> If ext4 matches the lifespan of ext3, in 10 years I fear that it will look
> more like a collection of various individuals' pet projects, rather than
> any kind of well-designed, cohesive project.
>
> How long can we really keep adding features which are semi- or wholly-
> incompatible with other features?
>
> Consider this a cry in the wilderness for less rushed feature introduction,
> and a more holistic approach to ext4 design...

I agree. I am far less concerned with new features that are only available to users of newly-formatted ext4 filesystems. What worries me is a feature that is NOT usable on new filesystems and may be dead code in a couple of years.

I'd be a lot more confident in its acceptance if there was at least a design for how to move forward with this feature for filesystems with extents and 64bit support. I'd be happy with some co-requirement that bigalloc is needed for filesystems larger than 2^32 blocks (for example), so that there is never a need to have a snapshot with more than 2^32 blocks.

Doing this design work may point out some other solution which does not require the 4*triple-indirect block hack in the first place, and will improve the code in the long run.

Cheers, Andreas

2011-06-06 17:07:36

by Eric Sandeen

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 6/6/11 11:33 AM, Andreas Dilger wrote:
> On 2011-06-06, at 9:31 AM, Eric Sandeen <[email protected]> wrote:
>> On 6/6/11 9:32 AM, Amir G. wrote:
>>>
>>> For one reason, a snapshot file format is currently an indirect file
>>> and big_alloc doesn't support indirect mapped files.
>>> I am not saying it cannot be done, but if it does, there would be
>>> several obstacles to cross.
>>
>> I know I'm kind of just throwing a bomb out here, but I am very concerned
>> about the ever-growing feature (in)compatibility matrix in ext4.
>
> I tend to agree. A new feature like this for ext4 that does not work
> with the default features of ext4 (extents) means that it will not be
> usable for the majority of users, but will make the code complex for
> all of the developers.

It's my understanding that the limitation is only on the snapshot file
itself, right? But that bigalloc can't work in the presence of any
non-extent-mapped files?

I guess this is indicative of problems with The Matrix if we are
already confusing ourselves. ;)

> Has any thought gone into how this feature could be implemented for
> extent-mapped files? It seems that part of the problem is because the
> snapshot "file" needs to be able to map the whole filesystem, which
> neither indirect-mapped nor extent-mapped files can do without
> changes.
>
> The current change is to allow indirect-mapped files to have an extra
> triple-indirect block, which works up to 2^32 blocks (the same limit
> as extent-mapped files) but this is not useful for filesystems over
> 2^32 blocks, which is another reason that ext4 was introduced.
>
> So, it seems the reason the 64bit feature is unsupported is really
> for filesystems larger than the maximum file size, and not for any
> other reason. Is that correct? Would that mean Ted's bigalloc patches
> will avoid this problem, or do they not actually increase the maximum
> file size?
>
>> Take for example dioread_nolock caveats:
>>
>> "However this does not work with nobh
>> option and the mount will fail.
>
> Does anyone actually use nobh? I recall it was a performance tweak
> for ext3, but I think it was eclipsed by other improvements in ext4.
> If nobody is using it anymore, we might consider removing it
> entirely, since it was only a mount-time option and did not affect
> the on-disk format.

As Lukas said, removing old cruft would help a fair bit, and this would
seem reasonable to remove.

> Does smolt return the filesystem mount options?

Not currently, no.

>> Nor does it work with
>> data journaling and dioread_nolock option will be
>> ignored with kernel warning. Note that dioread_nolock
>> code path is only used for extent-based files."
>
> Does this mean that dioread_nolock isn't needed for indirect-mapped
> files, or that it will work incorrectly on indirect-mapped files, or
> only that they will use some less efficient code path? I don't recall
> the details of this option, but it seems that it was related to
> unwritten extents, in which case it is irrelevant to indirect-mapped
> files.

It uses unwritten extents, which cannot exist on indirect-mapped files.
So, you must fall back to the old locking in that case.
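
As a rough illustration of that fallback, here is a minimal sketch in C. The types and the function pick_dio_read_path() are hypothetical stand-ins, not ext4 code; the only thing taken from the discussion is the rule that the lockless path applies to extent-mapped files only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in ext4. */
enum dio_read_path { DIO_READ_LOCKED, DIO_READ_NOLOCK };

struct file_sketch {
	bool extent_mapped;	/* false for indirect-mapped files */
};

/*
 * The lockless direct-I/O read path relies on unwritten extents, so it
 * is only chosen when the mount option is set AND the file is
 * extent-mapped. Everything else falls back to the old locking.
 */
enum dio_read_path pick_dio_read_path(bool dioread_nolock,
				      const struct file_sketch *f)
{
	assert(f != NULL);

	if (dioread_nolock && f->extent_mapped)
		return DIO_READ_NOLOCK;
	return DIO_READ_LOCKED;
}
```

The point of the sketch is only that the decision is per file, not per mount: an indirect-mapped file on a dioread_nolock mount still takes the locked path.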

>> If ext4 matches the lifespan of ext3, in 10 years I fear that it will look
>> more like a collection of various individuals' pet projects, rather than
>> any kind of well-designed, cohesive project.
>>
>> How long can we really keep adding features which are semi- or wholly-
>> incompatible with other features?
>>
>> Consider this a cry in the wilderness for less rushed feature introduction,
>> and a more holistic approach to ext4 design...
>
> I agree. I am far less concerned with new features that are only
> available to users of newly-formatted ext4 filesystems. What worries
> me is a feature that is NOT usable on new filesystems and may be dead
> code in a couple of years.
>
> I'd be a lot more confident in its acceptance if there was at least a
> design for how to move forward with this feature for filesystems with
> extents and 64bit support. I'd be happy with some co-requirement that
> bigalloc is needed for filesystems larger than 2^32 blocks (for
> example), so that there is never a need to have a snapshot with more
> than 2^32 blocks.

Yes, mutually exclusive (and well-planned) design points would be more
reasonable, I think.

> Doing this design work may point out some other solution which does
> not require the 4*triple-indirect block hack in the first place, and
> will improve the code in the long run.
>
> Cheers, Andreas


2011-06-06 18:25:50

by Amir G.

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Mon, Jun 6, 2011 at 7:33 PM, Andreas Dilger <[email protected]> wrote:
> On 2011-06-06, at 9:31 AM, Eric Sandeen <[email protected]> wrote:
>> On 6/6/11 9:32 AM, Amir G. wrote:
>>>
>>> For one reason, a snapshot file format is currently an indirect file
>>> and big_alloc doesn't support indirect mapped files.
>>> I am not saying it cannot be done, but if it does, there would be
>>> several obstacles to cross.
>>
>> I know I'm kind of just throwing a bomb out here, but I am very concerned
>> about the ever-growing feature (in)compatibility matrix in ext4.
>
> I tend to agree. A new feature like this for ext4 that does not work with the default features of ext4 (extents) means that it will not be usable for the majority of users, but will make the code complex for all of the developers.

Snapshots support extent mapped files, otherwise they would not be
considered for merging.
As Ted put it, as long as the feature supports the 'default'/'common'
configuration options, it could be merged.

>
> Has any thought gone into how this feature could be implemented for extent-mapped files? It seems that part of the problem is because the snapshot "file" needs to be able to map the whole filesystem, which neither indirect-mapped nor extent-mapped files can do without changes.
>
> The current change is to allow indirect-mapped files to have an extra triple-indirect block, which works up to 2^32 blocks (the same limit as extent-mapped files) but this is not useful for filesystems over 2^32 blocks, which is another reason that ext4 was introduced.
>
> So, it seems the reason the 64bit feature is unsupported is really for filesystems larger than the maximum file size, and not for any other reason. Is that correct?

That is correct, see:
http://sourceforge.net/apps/mediawiki/next3/index.php?title=Ext4_snapshots_TODO#Large_file_system_size_.2864bit_support.29

> Would that mean Ted's bigalloc patches will avoid this problem, or do they not actually increase the maximum file size?

They don't increase the maximum file size, because with big_alloc the
extents map still has blocksize (e.g. 4k) resolution.
Besides, one of the main principles of ext4 snapshots is Move-on-write,
which means there is no read IO for COW.
There is no way to implement COW efficiently if you have to read in
an entire (~1M) cluster when writing a single (~4K) block. You may as
well use LVM snapshots for that...
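
To put a number on that, here is a minimal sketch in C that computes how many bytes a copying COW would have to preserve for a given write, depending on the allocation granule. The function cow_span_bytes() is a hypothetical helper written for this illustration; it is not part of the snapshot patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes covered by all allocation granules that the write
 * [off, off + len) touches. A copying COW at that granularity has to
 * read and preserve this many bytes before the write may proceed.
 * Hypothetical helper, for illustration only.
 */
uint64_t cow_span_bytes(uint64_t off, uint64_t len, uint64_t granule)
{
	uint64_t start, end;

	assert(granule > 0 && len > 0);

	start = off - off % granule;			/* round down */
	end = off + len;
	end += (granule - end % granule) % granule;	/* round up */

	return end - start;
}
```

For an aligned 4K write this gives 4096 bytes at 4K granularity but 1048576 bytes at 1M granularity, a factor of 256. Move-on-write avoids the read for data blocks altogether, which is why the cluster-sized granule hurts so much here.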


>
>> Take for example dioread_nolock caveats:
>>
>>          "However this does not work with nobh
>>           option and the mount will fail.
>
> Does anyone actually use nobh? I recall it was a performance tweak for ext3, but I think it was eclipsed by other improvements in ext4. If nobody is using it anymore, we might consider removing it entirely, since it was only a mount-time option and did not affect the on-disk format.
>
> Does smolt return the filesystem mount options?
>
>> Nor does it work with
>>           data journaling and dioread_nolock option will be
>>           ignored with kernel warning. Note that dioread_nolock
>>           code path is only used for extent-based files."
>
> Does this mean that dioread_nolock isn't needed for indirect-mapped files, or that it will work incorrectly on indirect-mapped files, or only that they will use some less efficient code path? I don't recall the details of this option, but it seems that it was related to unwritten extents, in which case it is irrelevant to indirect-mapped files.
>
>> If ext4 matches the lifespan of ext3, in 10 years I fear that it will look
>> more like a collection of various individuals' pet projects, rather than
>> any kind of well-designed, cohesive project.
>>
>> How long can we really keep adding features which are semi- or wholly-
>> incompatible with other features?
>>
>> Consider this a cry in the wilderness for less rushed feature introduction,
>> and a more holistic approach to ext4 design...
>
> I agree. I am far less concerned with new features that are only available to users of newly-formatted ext4 filesystems. What worries me is a feature that is NOT usable on new filesystems and may be dead code in a couple of years.
>
> I'd be a lot more confident in its acceptance if there was at least a design for how to move forward with this feature for filesystems with extents and 64bit support. I'd be happy with some co-requirement that bigalloc is needed for filesystems larger than 2^32 blocks (for example), so that there is never a need to have a snapshot with more than 2^32 blocks.
>
> Doing this design work may point out some other solution which does not require the 4*triple-indirect block hack in the first place, and will improve the code in the long run.

The design in this case is quite one-way-to-go: define a new extent
format with 48-bit logical addresses.
There are 2 reasons I used the 4 tind blocks hack:
1. Historic - the patches come from next3, which needed 16TB volume support.
2. KISS - I don't know if you noticed, but the number of lines in this
hack is very small. Both for ext4 and for libext2, the blk_to_path logic
for indirect mapped files is very easy to modify, which makes the patch
very easy to review.
See for yourself:
https://github.com/amir73il/e2fsprogs-snapshots/commit/75025f02f099157794a75f22f86851707c1061b8
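
For readers who do not want to open the commit, the kind of change involved can be sketched in C as follows. This is not the code from the commit above: the function blk_to_path_sketch(), the slot numbering for the extra triple-indirect roots (14, 15, ...) and the ntind parameter are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR 12	/* direct pointers in the inode */

/*
 * Translate a logical block number into a path of offsets through the
 * indirect tree. Returns the path depth (1..4), or 0 if the block is
 * out of range. Slots NDIR and NDIR + 1 are the single- and
 * double-indirect roots; slots NDIR + 2 .. NDIR + 1 + ntind are
 * triple-indirect roots (hypothetical numbering).
 */
int blk_to_path_sketch(uint64_t blk, uint32_t ptrs, unsigned int ntind,
		       uint32_t offsets[4])
{
	uint64_t dind = (uint64_t)ptrs * ptrs;	/* blocks per dind tree */
	uint64_t tind = dind * ptrs;		/* blocks per tind tree */

	assert(ptrs > 0);

	if (blk < NDIR) {
		offsets[0] = (uint32_t)blk;
		return 1;
	}
	blk -= NDIR;
	if (blk < ptrs) {
		offsets[0] = NDIR;
		offsets[1] = (uint32_t)blk;
		return 2;
	}
	blk -= ptrs;
	if (blk < dind) {
		offsets[0] = NDIR + 1;
		offsets[1] = (uint32_t)(blk / ptrs);
		offsets[2] = (uint32_t)(blk % ptrs);
		return 3;
	}
	blk -= dind;
	if (blk / tind < ntind) {
		/* the only new step: pick one of several tind roots */
		offsets[0] = NDIR + 2 + (uint32_t)(blk / tind);
		blk %= tind;
		offsets[1] = (uint32_t)(blk / dind);
		offsets[2] = (uint32_t)((blk / ptrs) % ptrs);
		offsets[3] = (uint32_t)(blk % ptrs);
		return 4;
	}
	return 0;	/* beyond the mapped range */
}
```

Compared with a single triple-indirect root, only the last branch differs, which is roughly the sense in which the hack stays small.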

Amir.

2011-06-06 19:01:15

by Amir G.

Subject: Re: [PATCH RFC 03/30] ext4: snapshot hooks - inside JBD hooks

On Mon, Jun 6, 2011 at 6:53 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 9 May 2011, [email protected] wrote:
>
>> From: Amir Goldstein <[email protected]>
>>
>> Before every metadata buffer write, the journal API is called,
>> namely, one of the ext4_journal_get_XXX_access() functions.
>> We use these journal hooks to call the snapshot API, namely
>> ext4_snapshot_get_XXX_access(), to COW the metadata buffer before
>> it is modified for the first time.
>>
>> Signed-off-by: Amir Goldstein <[email protected]>
>> Signed-off-by: Yongqiang Yang <[email protected]>
>> ---
>>  fs/ext4/ext4_jbd2.c   |    9 +++++++--
>>  fs/ext4/ext4_jbd2.h   |   15 +++++++++++----
>>  fs/ext4/extents.c     |    3 ++-
>>  fs/ext4/inode.c       |   22 +++++++++++++++-------
>>  fs/ext4/move_extent.c |    3 ++-
>>  5 files changed, 37 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
>> index 560020d..833969b 100644
>> --- a/fs/ext4/ext4_jbd2.c
>> +++ b/fs/ext4/ext4_jbd2.c
>> @@ -23,13 +23,16 @@ int __ext4_journal_get_undo_access(const char *where, unsigned int line,
>>       return err;
>>  }
>>
>> -int __ext4_journal_get_write_access(const char *where, unsigned int line,
>> -                                 handle_t *handle, struct buffer_head *bh)
>> +int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
>> +                                      handle_t *handle, struct inode *inode,
>> +                                      struct buffer_head *bh, int exclude)
>>  {
>>       int err = 0;
>>
>>       if (ext4_handle_valid(handle)) {
>>               err = jbd2_journal_get_write_access(handle, bh);
>> +             if (!err && !exclude)
>> +                     err = ext4_snapshot_get_write_access(handle, inode, bh);
>
> Agh, this is not defined anywhere again. Actually it is defined only if
> snapshot is not configured in. It is quite painful to review when half
> of the code is missing really. And also something like this will break
> bisecting for all of us. Is there really no other way?

Oh, sorry for the pain, I was trying to make life easier for reviewers, but that
often results in the opposite outcome...
These 'core' patches are not meant for merging, just for initial review.
In the 'full' patches, ext4_snapshot_get_write_access() is defined for
both configurations.
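
For readers following along without the 'full' series, the pattern looks roughly like this. It is only a sketch under stated assumptions: SKETCH_FS_SNAPSHOT, SKETCH_SNAPSHOTS(), struct sb_sketch and the sketch_* functions are stand-ins invented here, modelled on the description in the cover letter (feature test defined to (0) and hooks defined to NOP macros when built without snapshots support).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the superblock feature flag. */
struct sb_sketch {
	unsigned int has_snapshot;
};

int cow_calls;	/* counts how often the "real" hook ran */

#ifdef SKETCH_FS_SNAPSHOT
/* Built with snapshots: test the feature flag at run time. */
#define SKETCH_SNAPSHOTS(sb)	((sb)->has_snapshot)

static int sketch_snapshot_get_write_access(const struct sb_sketch *sb)
{
	if (!SKETCH_SNAPSHOTS(sb))
		return 0;
	cow_calls++;	/* the real code would COW the buffer here */
	return 0;
}
#else
/* Built without snapshots: everything compiles away to constants. */
#define SKETCH_SNAPSHOTS(sb)			(0)
#define sketch_snapshot_get_write_access(sb)	(0)
#endif

/* Caller is identical in both configurations. */
int sketch_get_write_access(const struct sb_sketch *sb)
{
	int err = 0;

	assert(sb != NULL);
	/* the journal call would go here; then the snapshot hook */
	if (!err)
		err = sketch_snapshot_get_write_access(sb);
	return err;
}
```

Built without -DSKETCH_FS_SNAPSHOT, the hook expands to (0) and the caller keeps its old behaviour, which is the property the 'core' patches are meant to let reviewers check.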


>
> Also, could you document the new parameters?
>
> Anyway:
>        if (!err && !exclude && inode)
>

It's documented in the snapshot hooks that inode can be NULL, so in the 'full'
patches that should be clear. Sorry :-/
I promise to post the 'full' series soon.

>>               if (err)
>>                       ext4_journal_abort_handle(where, line, __func__, bh,
>>                                                 handle, err);
>> @@ -111,6 +114,8 @@ int __ext4_journal_get_create_access(const char *where, unsigned int line,
>>
>>       if (ext4_handle_valid(handle)) {
>>               err = jbd2_journal_get_create_access(handle, bh);
>> +             if (!err)
>> +                     err = ext4_snapshot_get_create_access(handle, bh);
>>               if (err)
>>                       ext4_journal_abort_handle(where, line, __func__,
>>                                                 bh, handle, err);
>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> index 8ffffb1..75662f7 100644
>> --- a/fs/ext4/ext4_jbd2.h
>> +++ b/fs/ext4/ext4_jbd2.h
>> @@ -132,9 +132,9 @@ void ext4_journal_abort_handle(const char *caller, unsigned int line,
>>  int __ext4_journal_get_undo_access(const char *where, unsigned int line,
>>                                  handle_t *handle, struct buffer_head *bh);
>>
>> -int __ext4_journal_get_write_access(const char *where, unsigned int line,
>> -                                 handle_t *handle, struct buffer_head *bh);
>> -
>> +int __ext4_journal_get_write_access_inode(const char *where, unsigned int line,
>> +                                      handle_t *handle, struct inode *inode,
>> +                                      struct buffer_head *bh, int exclude);
>>  int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
>>                 int is_metadata, struct inode *inode,
>>                 struct buffer_head *bh, ext4_fsblk_t blocknr);
>> @@ -151,8 +151,15 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
>>
>>  #define ext4_journal_get_undo_access(handle, bh) \
>>       __ext4_journal_get_undo_access(__func__, __LINE__, (handle), (bh))
>> +#define ext4_journal_get_write_access_exclude(handle, bh) \
>> +     __ext4_journal_get_write_access_inode(__func__, __LINE__, \
>> +                                              (handle), NULL, (bh), 1)
>>  #define ext4_journal_get_write_access(handle, bh) \
>> -     __ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
>> +     __ext4_journal_get_write_access_inode(__func__, __LINE__, \
>> +                                              (handle), NULL, (bh), 0)
>> +#define ext4_journal_get_write_access_inode(handle, inode, bh) \
>> +     __ext4_journal_get_write_access_inode(__func__, __LINE__, \
>> +                                             (handle), (inode), (bh), 0)
>
> Could you add some comments so everyone knows when to use the _exclude
> helper and when not?

Will do. Nobody should use it except for exclude bitmap write access.
I should change the name to get_exclude_bitmap_access to make that
clearer.
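
Until the comments are in the patch, the intended split between the three helpers can be sketched in C like this. All sketch_* names and the two struct types are hypothetical stand-ins; only the (inode, exclude) argument pattern mirrors the macros quoted above, and the name get_exclude_bitmap_access follows the rename suggested in this reply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct inode and struct buffer_head. */
struct inode_sketch { int ino; };
struct bh_sketch { int blocknr; };

int snapshot_hook_calls;	/* how often the snapshot hook ran */

/* Stand-in for the snapshot hook; inode may be NULL. */
int sketch_snapshot_get_write_access(struct inode_sketch *inode,
				     struct bh_sketch *bh)
{
	(void)inode;
	(void)bh;
	snapshot_hook_calls++;
	return 0;
}

/* Common helper: journal access first, then the hook unless excluded. */
int sketch_get_write_access_common(struct inode_sketch *inode,
				   struct bh_sketch *bh, int exclude)
{
	int err = 0;	/* the journal call would set this */

	assert(bh != NULL);
	if (!err && !exclude)
		err = sketch_snapshot_get_write_access(inode, bh);
	return err;
}

/* Metadata not owned by a particular inode: hook runs, inode is NULL. */
#define sketch_get_write_access(bh) \
	sketch_get_write_access_common(NULL, (bh), 0)
/* Metadata owned by an inode (indirect or extent blocks): pass it on. */
#define sketch_get_write_access_inode(inode, bh) \
	sketch_get_write_access_common((inode), (bh), 0)
/* Exclude bitmap blocks only: skip the snapshot hook entirely. */
#define sketch_get_exclude_bitmap_access(bh) \
	sketch_get_write_access_common(NULL, (bh), 1)
```

In this sketch the exclude variant is the only one that never reaches the hook, which is why it should stay reserved for the exclude bitmap.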

>
>>  #define ext4_forget(handle, is_metadata, inode, bh, block_nr) \
>>       __ext4_forget(__func__, __LINE__, (handle), (is_metadata), (inode), \
>>                     (bh), (block_nr))
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 0c3ea93..c8cab3d 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -77,7 +77,8 @@ static int ext4_ext_get_access(handle_t *handle, struct inode *inode,
>>  {
>>       if (path->p_bh) {
>>               /* path points to block */
>> -             return ext4_journal_get_write_access(handle, path->p_bh);
>> +             return ext4_journal_get_write_access_inode(handle,
>> +                                     inode, path->p_bh);
>>       }
>>       /* path points to leaf/index in inode body */
>>       /* we use in-core data, no need to protect them */
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index a597ff1..b848072 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -874,7 +874,8 @@ static int ext4_splice_branch(handle_t *handle, struct inode *inode,
>>        */
>>       if (where->bh) {
>>               BUFFER_TRACE(where->bh, "get_write_access");
>> -             err = ext4_journal_get_write_access(handle, where->bh);
>> +             err = ext4_journal_get_write_access_inode(handle, inode,
>> +                                                        where->bh);
>>               if (err)
>>                       goto err_out;
>>       }
>> @@ -4172,7 +4173,8 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
>>                       goto out_err;
>>               if (bh) {
>>                       BUFFER_TRACE(bh, "retaking write access");
>> -                     err = ext4_journal_get_write_access(handle, bh);
>> +                     err = ext4_journal_get_write_access_inode(handle,
>> +                                                               inode, bh);
>>                       if (unlikely(err))
>>                               goto out_err;
>>               }
>> @@ -4223,7 +4225,8 @@ static void ext4_free_data(handle_t *handle, struct inode *inode,
>>
>>       if (this_bh) {                          /* For indirect block */
>>               BUFFER_TRACE(this_bh, "get_write_access");
>> -             err = ext4_journal_get_write_access(handle, this_bh);
>> +             err = ext4_journal_get_write_access_inode(handle, inode,
>> +                                                        this_bh);
>>               /* Important: if we can't update the indirect pointers
>>                * to the blocks, we can't free them. */
>>               if (err)
>> @@ -4386,8 +4389,8 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
>>                                * pointed to by an indirect block: journal it
>>                                */
>>                               BUFFER_TRACE(parent_bh, "get_write_access");
>> -                             if (!ext4_journal_get_write_access(handle,
>> -                                                                parent_bh)){
>> +                             if (!ext4_journal_get_write_access_inode(
>> +                                         handle, inode, parent_bh)){
>>                                       *p = 0;
>>                                       BUFFER_TRACE(parent_bh,
>>                                       "call ext4_handle_dirty_metadata");
>> @@ -4759,9 +4762,14 @@ has_buffer:
>>
>>  int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
>>  {
>> -     /* We have all inode data except xattrs in memory here. */
>> -     return __ext4_get_inode_loc(inode, iloc,
>> +     int in_mem = (!EXT4_SNAPSHOTS(inode->i_sb) &&
>>               !ext4_test_inode_state(inode, EXT4_STATE_XATTR));
>> +
>> +     /*
>> +      * We have all inode's data except xattrs in memory here,
>> +      * but we must always read-in the entire inode block for COW.
>> +      */
>> +     return __ext4_get_inode_loc(inode, iloc, in_mem);
>>  }
>>
>>  void ext4_set_inode_flags(struct inode *inode)
>> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
>> index b9f3e78..ad5409a 100644
>> --- a/fs/ext4/move_extent.c
>> +++ b/fs/ext4/move_extent.c
>> @@ -421,7 +421,8 @@ mext_insert_extents(handle_t *handle, struct inode *orig_inode,
>>
>>       if (depth) {
>>               /* Register to journal */
>> -             ret = ext4_journal_get_write_access(handle, orig_path->p_bh);
>> +             ret = ext4_journal_get_write_access_inode(handle,
>> +                                     orig_inode, orig_path->p_bh);
>>               if (ret)
>>                       return ret;
>>       }
>>
>
> --
>

2011-06-06 19:58:49

by Lukáš Czerner

[permalink] [raw]
Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 6.6.2011 18:42, Eric Sandeen wrote:
>>> Take for example dioread_nolock caveats:
>>>
>>> "However this does not work with nobh
>>> option and the mount will fail.
>> Does anyone actually use nobh? I recall it was a performance tweak
>> for ext3, but I think it was eclipsed by other improvements in ext4.
>> If nobody is using it anymore, we might consider removing it
>> entirely, since it was only a mount-time option and did not affect
>> the on-disk format.
> As Lukas said, removing old cruft would help a fair bit, and this would
> seem reasonable to remove.
Nobh was turned into a no-op quite a long time ago by commit
206f7ab4f49a2021fcb8687f25395be77711ddee, so we should remove it from
the documentation as well.

And what about other options? Any idea what else we can remove? Most
of the options do not get any testing at all!

Thanks!
-Lukas

2011-06-06 20:40:33

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Mon, Jun 06, 2011 at 06:05:24PM +0200, Lukas Czerner wrote:
>
> Well, we can also start ditching some unused features and tunables, or
> make it default and remove it from documentation so people will not use
> it and we can get rid of some of the options in the future. For example
>
> orlov
> oldalloc
> bsddf
> minixdf

I tried deprecating bsddf/minixdf, and got a complaint from a user who
said they were using it. Linus's rule is, "thou shalt not break
backwards compatibility", so I gave up on it. Realistically, the
difference between these two is so small it's not really a big deal.

Dropping the old allocator is probably a good idea at this point. I
very much doubt anyone is using it.

- Ted

2011-06-06 20:55:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
> > For one reason, a snapshot file format is currently an indirect file
> > and big_alloc
> > doesn't support indirect mapped files.
> > I am not saying it cannot be done, but if it does, there would be
> > several obstacles
> > to cross.
>
> I know I'm kind of just throwing a bomb out here, but I am very concerned
> about the ever-growing feature (in)compatibility matrix in ext4.

bigalloc doesn't support indirect blocks mainly because it was faster
to get things working if I didn't have to worry about indirect blocks.
It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
I'll get around to it at some point.

dioread_nolock is something that I had hoped to clean up by now, by
making this the default way we do all buffered writebacks, for all
block sizes.

> Take for example dioread_nolock caveats:
>
> "However this does not work with nobh
> option and the mount will fail. Nor does it work with
> data journaling and dioread_nolock option will be
> ignored with kernel warning. Note that dioread_nolock
> code path is only used for extent-based files."

Hey, at least we got rid of nobh! :-)

> If ext4 matches the lifespan of ext3, in 10 years I fear that it will look
> more like a collection of various individuals' pet projects, rather than
> any kind of well-designed, cohesive project.
>
> How long can we really keep adding features which are semi- or wholly-
> incompatible with other features?
>
> Consider this a cry in the wilderness for less rushed feature introduction,
> and a more holistic approach to ext4 design...

It's something I do worry about; and I do share your concern. At the
same time, the reality is that we are a little like the Old Dutch
Masters, who had to take into account the preferences of their patrons
(i.e., in our case, those who pay our paychecks :-).

In the case of dioread_nolock, I allowed dioread_nolock in even though
it was not a complete solution since internally, we had a critical
business need for it, and in my judgement, (a) it wasn't that horrible
(most of the horrible code paths were already being used for AIO/DIO),
and (b) I had a plan for how to clean it up eventually. The
fs/ext4/page-io.c implementation was in fact the first part of my
cleanup plan, so we've made some progress; it's just not gone as fast
as I would like.

Snapshots are an example of a feature where I am very much worried
about taking on technical debt. On the other hand, there are a lot of
people who are quite excited about it as a feature, so I'm hoping we can
clean it up enough that we don't put a huge maintenance burden on
ourselves.

It should be possible to make snapshots work on bigalloc file systems,
once support is added for indirect blocks. The COW granularity will
have to be at the cluster level, of course, though. So from a
design perspective it should be possible to make things knit together.
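
A minimal sketch of what cluster-level COW granularity would mean in
practice, assuming a power-of-two cluster size. The names cluster_bits,
cow_range_start() and cow_range_len() are hypothetical and do not come
from the bigalloc or snapshot patches:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: first block of the cluster that contains 'block'.
 * cluster_bits is log2(blocks per cluster), e.g. 4 for 16 blocks.
 */
static inline uint64_t cow_range_start(uint64_t block, unsigned cluster_bits)
{
	assert(cluster_bits < 64);
	return block & ~((UINT64_C(1) << cluster_bits) - 1);
}

/* Hypothetical helper: number of blocks a single COW operation would copy. */
static inline uint64_t cow_range_len(unsigned cluster_bits)
{
	assert(cluster_bits < 64);
	return UINT64_C(1) << cluster_bits;
}
```

With 16 blocks per cluster, a write to block 37 would have to preserve
blocks 32..47 in the snapshot rather than block 37 alone.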

- Ted

2011-06-07 05:17:34

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 2011-06-06, at 2:55 PM, Ted Ts'o wrote:
> On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
>>> For one reason, a snapshot file format is currently an indirect file
>>> and big_alloc doesn't support indirect mapped files.
>>> I am not saying it cannot be done, but if it does, there would be
>>> several obstacles to cross.
>>
>> I know I'm kind of just throwing a bomb out here, but I am very concerned
>> about the ever-growing feature (in)compatibility matrix in ext4.
>
> bigalloc doesn't support indirect blocks mainly because it was faster
> to get things working if I didn't have to worry about indirect blocks.
> It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
> I'll get around to it at some point.

My main concern isn't about whether bigalloc grows support for indirect-
mapped files, but rather the opposite - that snapshots gain support for
extent-mapped files. In fact, since extent-mapped files can be 16TB in
size, it might make sense that the snapshots are _always_ extent-mapped
files, and we don't need to deal with the new block-mapped files with
4-triple-indirect blocks layout at all? Since snapshots are only going
into ext4, and ext4 + e2fsprogs already support extents, there wouldn't
be any issue about compatibility?
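
The 16TB figure can be checked with a little arithmetic. The sketch below
is illustrative only (the function names are made up); it assumes 4-byte
block pointers, as in indirect-mapped files, and counts the blocks that
one triple-indirect tree can address:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Blocks addressable through one triple-indirect block, assuming
 * 4-byte block pointers: (block_size / 4)^3.
 */
static inline uint64_t tind_blocks(uint64_t block_size)
{
	uint64_t addr_per_block = block_size / 4;

	assert(block_size >= 4);
	return addr_per_block * addr_per_block * addr_per_block;
}

/* Bytes addressable through 'ntind' triple-indirect trees. */
static inline uint64_t tind_bytes(unsigned ntind, uint64_t block_size)
{
	return (uint64_t)ntind * tind_blocks(block_size) * block_size;
}
```

With 4K blocks one tree covers 2^30 blocks (4TB), so four of them cover
2^32 blocks, i.e. 16TB, the same limit that 32-bit logical block numbers
give an extent-mapped file.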

The only concern might be that mapping fragmented files into extents is
more effort, which makes me wonder whether we should introduce the
"block-mapped extents" that I proposed in the past, to allow efficient
mapping of files (or parts thereof) that are highly fragmented, while still
keeping the benefits of extents (internal redundancy, 48-bit physical
block numbers). While we are adding a new extent format, it could also be
designed to add 48-bit logical block numbers.

In another email Amir G. wrote:
> Andreas Dilger wrote:

>> I'd be a lot more confident in its acceptance if there was at least a design for how to move forward with this feature for filesystems with extents and 64bit support. I'd be happy with some co-requirement that bigalloc is needed for filesystems larger than 2^32 blocks (for example), so that there is never a need to have a snapshot with more than 2^32 blocks.
>>
>> Doing this design work may point out some other solution which does not require the 4*triple-indirect block hack in the first place, and will improve the code in the long run.
>
> The design in this case is quite one-way-to-go - that is defining a
> new extent format with 48bit logical addresses.

Agreed. Is this something in your upcoming development plans, or just a
feature that might be implemented some day?

> There are 2 reasons I used the 4 tind blocks hack:
> 1. Historic - the patches come from next3 which needed 16TB volume support
> 2. KISS - I don't know if you noticed, but the amount of lines in this
> hack is very small. both for ext4 and for libext2, the blk_to_path logic
> for indirect mapped files is very easy to modify, which makes the patch
> very easy to review.

Cheers, Andreas

2011-06-07 05:58:08

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 7, 2011 at 8:17 AM, Andreas Dilger <[email protected]> wrote:
> On 2011-06-06, at 2:55 PM, Ted Ts'o wrote:
>> On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
>>>> For one reason, a snapshot file format is currently an indirect file
>>>> and big_alloc doesn't support indirect mapped files.
>>>> I am not saying it cannot be done, but if it does, there would be
>>>> several obstacles to cross.
>>>
>>> I know I'm kind of just throwing a bomb out here, but I am very concerned
>>> about the ever-growing feature (in)compatibility matrix in ext4.
>>
>> bigalloc doesn't support indirect blocks mainly because it was faster
>> to get things working if I didn't have to worry about indirect blocks.
>> It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
>> I'll get around to it at some point.
>
> My main concern isn't about whether bigalloc grows support for indirect-
> mapped files, but rather the opposite - that snapshots gain support for
> extent-mapped files.  In fact, since extent-mapped files can be 16TB in
> size, it might make sense that the snapshots are _always_ extent-mapped
> files, and we don't need to deal with the new block-mapped files with
> 4-triple-indirect blocks layout at all?  Since snapshots are only going
> into ext4, and ext4 + e2fsprogs already support extents, there wouldn't
> be any issue about compatibility?
>
> The only concern might be that mapping fragmented files into extents is
> more effort, which makes me wonder whether we should introduce the
> "block-mapped extents" that I proposed in the past, to allow efficient
> mapping of files (or parts thereof) that are highly fragmented, while still
> keeping the benefits of extents (internal redundancy, 48-bit physical
> block numbers). While we are adding a new extent format, it could also be
> designed to add 48-bit logical block numbers.
>

You are right that the snapshot file is a highly fragmented file by design,
so single-block mapping is an advantage. The downside is that deleting
an extent-mapped file requires mapping all of its blocks one by one to the
snapshot file, which is not efficient and makes deletes slow.
So having a format optimized for both single- and multi-block mapping would be
best.

The reason I DO NOT want to change the snapshot file format at this moment
is that it would make us lose all the stabilization that the snapshot feature
gained during 1 year in production as next3.
You see, ext4_free_blocks() does not care whether blocks are deleted from
indirect or extent-mapped files, and from there on, the code that maps those
blocks to the special snapshot file is the same in next3 and ext4.

> In another email Amir G. wrote:
>> Andreas Dilger wrote:
>
>>> I'd be a lot more confident in its acceptance if there was at least a design for how to move forward with this feature for filesystems with extents and 64bit support. I'd be happy with some co-requirement that bigalloc is needed for filesystems larger than 2^32 blocks (for example), so that there is never a need to have a snapshot with more than 2^32 blocks.
>>>
>>> Doing this design work may point out some other solution which does not require the 4*triple-indirect block hack in the first place, and will improve the code in the long run.
>>
>> The design in this case is quite one-way-to-go - that is defining a
>> new extent format with 48bit logical addresses.
>
> Agreed.  Is this something in your upcoming development plans, or just a
> feature that might be implemented some day?

To be honest, for me it was always in the 'some day' category, but my
patron has already asked me about supporting snapshots on >16TB volumes
(with the move to ext4), so that day may be coming after all.

>
>> There are 2 reasons I used the 4 tind blocks hack:
>> 1. Historic - the patches come from next3 which needed 16TB volume support
>> 2. KISS - I don't know if you noticed, but the amount of lines in this
>>    hack is very small. both for ext4 and for libext2, the blk_to_path logic
>>    for indirect mapped files is very easy to modify, which makes the patch
>>    very easy to review.
>
> Cheers, Andreas
>
>
>
>
>
>

2011-06-07 06:40:35

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Mon, Jun 6, 2011 at 11:55 PM, Ted Ts'o <[email protected]> wrote:
> On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
>> > For one reason, a snapshot file format is currently an indirect file
>> > and big_alloc
>> > doesn't support indirect mapped files.
>> > I am not saying it cannot be done, but if it does, there would be
>> > several obstacles
>> > to cross.
>>
>> I know I'm kind of just throwing a bomb out here, but I am very concerned
>> about the ever-growing feature (in)compatibility matrix in ext4.
>
> bigalloc doesn't support indirect blocks mainly because it was faster
> to get things working if I didn't have to worry about indirect blocks.
> It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
> I'll get around to it at some point.
>
> dioread_nolock is something that I had hoped to clean up by now, by
> making this the default way we do all buffered writebacks, for all
> block sizes.
>
>> Take for example dioread_nolock caveats:
>>
>>            "However this does not work with nobh
>>             option and the mount will fail. Nor does it work with
>>             data journaling and dioread_nolock option will be
>>             ignored with kernel warning. Note that dioread_nolock
>>             code path is only used for extent-based files."
>
> Hey, at least we got rid of nobh!  :-)
>
>> If ext4 matches the lifespan of ext3, in 10 years I fear that it will look
>> more like a collection of various individuals' pet projects, rather than
>> any kind of well-designed, cohesive project.
>>
>> How long can we really keep adding features which are semi- or wholly-
>> incompatible with other features?
>>
>> Consider this a cry in the wilderness for less rushed feature introduction,
>> and a more holistic approach to ext4 design...
>
> It's something I do worry about; and I do share your concern.  At the
> same time, the reality is that we are a little like the Old Dutch
> Masters, who had to take into account the preferences of their patrons
> (i.e., in our case, those who pay our paychecks :-).
>
> In the case of dioread_nolock, I allowed dioread_nolock in even though
> it was not a complete solution since internally, we had a critical
> business need for it, and in my judgement, (a) it wasn't that horrible
> (most of the horrible code paths were already being used for AIO/DIO),
> and (b) I had a plan for how to clean it up eventually.  The
> fs/ext4/page-io.c implementation was in fact the first part of my
> cleanup plan, so we've made some progress; it's just not gone as fast
> as I would like.
>
> Snapshots are an example of a feature where I am very much worried
> about taking on technical debt.  On the other hand, there are a lot of
> people who are quite excited about it as a feature, so I'm hoping we can
> clean it up enough that we don't put a huge maintenance burden on
> ourselves.
>
> It should be possible to make snapshots work on bigalloc file systems,
> once support is added for indirect blocks.  The COW granularity will
> have to be at the cluster level, of course, though.  So from a
> design perspective it should be possible to make things knit together.
>

The question is why *should* we knit those 2 features together?
There is a patron for snapshots and there is a patron for big_alloc,
but is there a patron for both mixed? I don't think so.
*This* is a maintenance burden, because none of the patrons will pay
us to support this configuration.

What's so bad about mutually excluding features, which are useful
on their own? For one thing, it keeps the test matrix smaller, so it is
actually realistic to follow it through.

I hear Eric's cry in the wilderness and I can relate to it.
Maybe it is the distros that need to take action and declare
only a reduced feature matrix 'supported'.
We can even try to write generic mutual exclusion rules, which
will be enforced by mke2fs/tune2fs and ext4.
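
One way to picture such generic rules is a small table of feature pairs
that must not be set together, checked by the tools and again at mount
time. This is only a sketch: the FEAT_* bits and feature_conflict() are
hypothetical and are neither the real ext4 feature flags nor any existing
mke2fs/tune2fs interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_SNAPSHOT	(1u << 0)
#define FEAT_BIGALLOC	(1u << 1)
#define FEAT_64BIT	(1u << 2)
#define FEAT_META_BG	(1u << 3)

struct excl_rule {
	uint32_t a;	/* this feature conflicts ... */
	uint32_t b;	/* ... with this one */
};

/* Pairs taken from this thread: snapshots vs bigalloc, 64bit, meta_bg. */
static const struct excl_rule excl_rules[] = {
	{ FEAT_SNAPSHOT, FEAT_BIGALLOC },
	{ FEAT_SNAPSHOT, FEAT_64BIT },
	{ FEAT_SNAPSHOT, FEAT_META_BG },
};

/* Return the index of the first violated rule, or -1 if the set is valid. */
static inline int feature_conflict(uint32_t features)
{
	size_t i;

	for (i = 0; i < sizeof(excl_rules) / sizeof(excl_rules[0]); i++)
		if ((features & excl_rules[i].a) && (features & excl_rules[i].b))
			return (int)i;
	return -1;
}
```

Keeping the rules in one table means the test matrix can be read straight
off it: any combination the table rejects does not need to be tested.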

Amir.

2011-06-07 09:28:14

by Amir G.

[permalink] [raw]
Subject: Re: [PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)

On Mon, Jun 6, 2011 at 5:50 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 9 May 2011, [email protected] wrote:
>
>> From: Amir Goldstein <[email protected]>
>>
>> Built-in snapshots support for ext4.
>> Requires that the filesystem has the has_snapshot and exclude_bitmap
>> features and that block size is equal to system page size.
>> Snapshots are not supported with 64bit and meta_bg features and the
>> filesystem must be mounted with ordered data mode.
>>
>> Signed-off-by: Amir Goldstein <[email protected]>
>> Signed-off-by: Yongqiang Yang <[email protected]>
>> ---
>>  fs/ext4/Kconfig          |   11 +++
>>  fs/ext4/Makefile         |    2 +
>>  fs/ext4/balloc.c         |    1 +
>>  fs/ext4/ext4.h           |   15 ++++
>>  fs/ext4/ext4_jbd2.c      |    3 +
>>  fs/ext4/ext4_jbd2.h      |   25 ++++++
>>  fs/ext4/extents.c        |    3 +
>>  fs/ext4/file.c           |    1 +
>>  fs/ext4/ialloc.c         |    1 +
>>  fs/ext4/inode.c          |    3 +
>>  fs/ext4/ioctl.c          |    3 +
>>  fs/ext4/mballoc.c        |    1 +
>>  fs/ext4/namei.c          |    1 +
>>  fs/ext4/resize.c         |    1 +
>>  fs/ext4/snapshot.h       |  193 ++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/ext4/super.c          |   43 ++++++++++
>>  16 files changed, 307 insertions(+), 0 deletions(-)
>>  create mode 100644 fs/ext4/snapshot.h
>>  create mode 100644 fs/ext4/snapshot_debug.h
>>
>> diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
>> index 9ed1bb1..8970525 100644
>> --- a/fs/ext4/Kconfig
>> +++ b/fs/ext4/Kconfig
>> @@ -83,3 +83,14 @@ config EXT4_DEBUG
>>
>>  	  If you select Y here, then you will be able to turn on debugging
>>  	  with a command such as "echo 1 > /sys/kernel/debug/ext4/mballoc-debug"
>> +
>> +config EXT4_FS_SNAPSHOT
>> +	bool "EXT4 snapshots (Experimental)"
>> +	depends on EXT4_FS && EXPERIMENTAL
>> +	default n
>> +	help
>> +	  Built-in snapshots support for ext4.
>> +	  Requires that the filesystem has the has_snapshot and exclude_bitmap
>> +	  features and that block size is equal to system page size.
>> +	  Snapshots are not supported with 64bit and meta_bg features and the
>> +	  filesystem must be mounted with ordered data mode.
>
> What exactly do you mean by not supported with 64bit feature ? Maybe I
> am being dense, but I do not get it.

I mean snapshots and 64bit features are mutually exclusive.
Is that what you got or do I need to make it more clear?

>
>> diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
>> index 058b54d..16a779d 100644
>> --- a/fs/ext4/Makefile
>> +++ b/fs/ext4/Makefile
>> @@ -19,3 +19,5 @@ ext4-y	:= balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o page-io.o \
>>  ext4-$(CONFIG_EXT4_FS_XATTR)		+= xattr.o xattr_user.o xattr_trusted.o
>>  ext4-$(CONFIG_EXT4_FS_POSIX_ACL)	+= acl.o
>>  ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
>> +ext4-$(CONFIG_EXT4_FS_SNAPSHOT)		+= snapshot.o snapshot_ctl.o
>> +ext4-$(CONFIG_EXT4_FS_SNAPSHOT)		+= snapshot_inode.o snapshot_buffer.o
>> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
>> index 1288f80..350f502 100644
>> --- a/fs/ext4/balloc.c
>> +++ b/fs/ext4/balloc.c
>> @@ -20,6 +20,7 @@
>>  #include "ext4.h"
>>  #include "ext4_jbd2.h"
>>  #include "mballoc.h"
>> +#include "snapshot.h"
>>
>>  /*
>>   * balloc.c contains the blocks allocation and deallocation routines
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index f495b22..ca25e67 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -886,6 +886,20 @@ struct ext4_inode_info {
>>  #define EXT2_FLAGS_SIGNED_HASH		0x0001  /* Signed dirhash in use */
>>  #define EXT2_FLAGS_UNSIGNED_HASH	0x0002  /* Unsigned dirhash in use */
>>  #define EXT2_FLAGS_TEST_FILESYS		0x0004  /* to test development code */
>> +#define EXT4_FLAGS_IS_SNAPSHOT		0x0010 /* Is a snapshot image */
>> +#define EXT4_FLAGS_FIX_SNAPSHOT		0x0020 /* Corrupted snapshot */
>> +#define EXT4_FLAGS_FIX_EXCLUDE		0x0040 /* Bad exclude bitmap */
>
> Would not it be better to call it EXT4_FLAGS_BAD_SNAPSHOT and
> EXT4_FLAGS_BAD_EXCLUDE_BMAP ?

Sure. Will do.

>
>> +
>> +#define EXT4_SET_FLAGS(sb, mask)				 \
>> +	do {							 \
>> +		EXT4_SB(sb)->s_es->s_flags |= cpu_to_le32(mask); \
>> +	} while (0)
>> +#define EXT4_CLEAR_FLAGS(sb, mask)				 \
>> +	do {							 \
>> +		EXT4_SB(sb)->s_es->s_flags &= ~cpu_to_le32(mask);\
>> +	} while (0)
>> +#define EXT4_TEST_FLAGS(sb, mask)				 \
>> +	(EXT4_SB(sb)->s_es->s_flags & cpu_to_le32(mask))
>>
>>  /*
>>   * Mount flags
>> @@ -1351,6 +1365,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>>  #define EXT4_FEATURE_RO_COMPAT_GDT_CSUM		0x0010
>>  #define EXT4_FEATURE_RO_COMPAT_DIR_NLINK	0x0020
>>  #define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE	0x0040
>> +#define EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT	0x0080 /* Ext4 has snapshots */
>>
>>  #define EXT4_FEATURE_INCOMPAT_COMPRESSION	0x0001
>>  #define EXT4_FEATURE_INCOMPAT_FILETYPE		0x0002
>> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
>> index 6e272ef..560020d 100644
>> --- a/fs/ext4/ext4_jbd2.c
>> +++ b/fs/ext4/ext4_jbd2.c
>> @@ -1,8 +1,11 @@
>>  /*
>>   * Interface between ext4 and JBD
>> + *
>> + * Snapshot metadata COW hooks, Amir Goldstein <[email protected]>, 2011
>>   */
>>
>>  #include "ext4_jbd2.h"
>> +#include "snapshot.h"
>>
>>  #include <trace/events/ext4.h>
>>
>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>> index e25e99b..8ffffb1 100644
>> --- a/fs/ext4/ext4_jbd2.h
>> +++ b/fs/ext4/ext4_jbd2.h
>> @@ -10,6 +10,8 @@
>>   * option, any later version, incorporated herein by reference.
>>   *
>>   * Ext4-specific journaling extensions.
>> + *
>> + * Snapshot extra COW credits, Amir Goldstein <[email protected]>, 2011
>>   */
>>
>>  #ifndef _EXT4_JBD2_H
>> @@ -18,6 +20,7 @@
>>  #include <linux/fs.h>
>>  #include <linux/jbd2.h>
>>  #include "ext4.h"
>> +#include "snapshot.h"
>>
>>  #define EXT4_JOURNAL(inode)	(EXT4_SB((inode)->i_sb)->s_journal)
>>
>> @@ -272,6 +275,11 @@ static inline int ext4_should_journal_data(struct inode *inode)
>>  		return 0;
>>  	if (!S_ISREG(inode->i_mode))
>>  		return 1;
>> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
>> +	if (EXT4_SNAPSHOTS(inode->i_sb))
>> +		/* snapshots enforce ordered data */
>> +		return 0;
>> +#endif
>>  	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
>>  		return 1;
>>  	if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
>> @@ -285,6 +293,11 @@ static inline int ext4_should_order_data(struct inode *inode)
>>  		return 0;
>>  	if (!S_ISREG(inode->i_mode))
>>  		return 0;
>> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
>> +	if (EXT4_SNAPSHOTS(inode->i_sb))
>> +		/* snapshots enforce ordered data */
>> +		return 1;
>> +#endif
>>  	if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
>>  		return 0;
>>  	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_ORDERED_DATA)
>> @@ -298,6 +311,11 @@ static inline int ext4_should_writeback_data(struct inode *inode)
>>  		return 0;
>>  	if (EXT4_JOURNAL(inode) == NULL)
>>  		return 1;
>> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
>> +	if (EXT4_SNAPSHOTS(inode->i_sb))
>> +		/* snapshots enforce ordered data */
>> +		return 0;
>> +#endif
>>  	if (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA))
>>  		return 0;
>>  	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
>> @@ -320,6 +338,11 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>>  		return 0;
>>  	if (!S_ISREG(inode->i_mode))
>>  		return 0;
>> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
>> +	if (EXT4_SNAPSHOTS(inode->i_sb))
>> +		/* XXX: should snapshots support dioread_nolock? */
>> +		return 0;
>> +#endif
>>  	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
>>  		return 0;
>>  	if (ext4_should_journal_data(inode))
>> @@ -327,4 +350,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
>>  	return 1;
>>  }
>
> Since EXT4_SNAPSHOTS(sb) returns 0 when not configured in, I do not
> think those ifdefs are needed.
>

true. will fix.
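
Why the ifdefs are redundant can be shown with a stand-alone sketch. It is
not the kernel code: struct sb, SNAPSHOTS() and should_order_data() are
simplified stand-ins, and CONFIG_SNAPSHOT_SKETCH is a made-up switch. The
only thing taken from the patches is that the macro expands to a constant
0 when snapshot support is not built in.

```c
#include <assert.h>

struct sb {
	int has_snapshot;	/* stand-in for the HAS_SNAPSHOT feature bit */
	int ordered_opt;	/* stand-in for the data=ordered mount option */
};

#ifdef CONFIG_SNAPSHOT_SKETCH
#define SNAPSHOTS(sb)	((sb)->has_snapshot)
#else
#define SNAPSHOTS(sb)	(0)	/* constant: the branch below is dead code */
#endif

static inline int should_order_data(const struct sb *sb)
{
	/* No #ifdef needed: the test itself is compiled away when disabled. */
	if (SNAPSHOTS(sb))
		return 1;	/* snapshots enforce ordered data */
	return sb->ordered_opt;
}
```

Built without CONFIG_SNAPSHOT_SKETCH, the condition is a literal 0, so the
compiler can drop the branch and the function behaves exactly as it did
before the patch.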

>>
>> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
>> +#endif
>>  #endif	/* _EXT4_JBD2_H */
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 7296cd1..0c3ea93 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -18,6 +18,8 @@
>>   * You should have received a copy of the GNU General Public Licens
>>   * along with this program; if not, write to the Free Software
>>   * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-
>> + *
>> + * Snapshot move-on-write (MOW), Yongqiang Yang <[email protected]>, 2011
>>   */
>>
>>  /*
>> @@ -43,6 +45,7 @@
>>  #include <linux/fiemap.h>
>>  #include "ext4_jbd2.h"
>>  #include "ext4_extents.h"
>> +#include "snapshot.h"
>>
>>  static int ext4_ext_truncate_extend_restart(handle_t *handle,
>>  					    struct inode *inode,
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 7b80d54..60b3b19 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -28,6 +28,7 @@
>>  #include "ext4_jbd2.h"
>>  #include "xattr.h"
>>  #include "acl.h"
>> +#include "snapshot.h"
>>
>>  /*
>>   * Called when an inode is released. Note that this is different
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index 2fd3b0e..831d49a 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -28,6 +28,7 @@
>>  #include "ext4_jbd2.h"
>>  #include "xattr.h"
>>  #include "acl.h"
>> +#include "snapshot.h"
>>
>>  #include <trace/events/ext4.h>
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 4ccb6eb..a597ff1 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -20,6 +20,8 @@
>>   *   ([email protected])
>>   *
>>   *  Assorted race fixes, rewrite of ext4_get_block() by Al Viro, 2000
>> + *
>> + *  Snapshot inode extensions, Amir Goldstein <[email protected]>, 2011
>>   */
>>
>>  #include <linux/module.h>
>> @@ -49,6 +51,7 @@
>>  #include "ext4_extents.h"
>>
>>  #include <trace/events/ext4.h>
>> +#include "snapshot.h"
>>
>>  #define MPAGE_DA_EXTENT_TAIL 0x01
>>
>> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
>> index eb3bc2f..a426332 100644
>> --- a/fs/ext4/ioctl.c
>> +++ b/fs/ext4/ioctl.c
>> @@ -5,6 +5,8 @@
>>   * Remy Card ([email protected])
>>   * Laboratoire MASI - Institut Blaise Pascal
>>   * Universite Pierre et Marie Curie (Paris VI)
>> + *
>> + * Snapshot control API, Amir Goldstein <[email protected]>, 2011
>>   */
>>
>>  #include <linux/fs.h>
>> @@ -17,6 +19,7 @@
>>  #include <asm/uaccess.h>
>>  #include "ext4_jbd2.h"
>>  #include "ext4.h"
>> +#include "snapshot.h"
>>
>>  long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>>  {
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 2be85af..4952b7b 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -25,6 +25,7 @@
>>  #include <linux/debugfs.h>
>>  #include <linux/slab.h>
>>  #include <trace/events/ext4.h>
>> +#include "snapshot.h"
>>
>>  /*
>>   * MUSTDO:
>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
>> index ad87584..b70fa13 100644
>> --- a/fs/ext4/namei.c
>> +++ b/fs/ext4/namei.c
>> @@ -39,6 +39,7 @@
>>
>>  #include "xattr.h"
>>  #include "acl.h"
>> +#include "snapshot.h"
>>
>>  /*
>>   * define how far ahead to read directories while searching them.
>> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
>> index a11c00a..ee9b999 100644
>> --- a/fs/ext4/resize.c
>> +++ b/fs/ext4/resize.c
>> @@ -15,6 +15,7 @@
>>  #include <linux/slab.h>
>>
>>  #include "ext4_jbd2.h"
>> +#include "snapshot.h"
>
> It would be better for reviewers if you added those includes when you
> start using them.
>

OK. I can do that.

>>
>>  #define outside(b, first, last)	((b) < (first) || (b) >= (last))
>>  #define inside(b, first, last)	((b) >= (first) && (b) < (last))
>> diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
>> new file mode 100644
>> index 0000000..a927090
>> --- /dev/null
>> +++ b/fs/ext4/snapshot.h
>> @@ -0,0 +1,193 @@
>> +/*
>> + * linux/fs/ext4/snapshot.h
>> + *
>> + * Written by Amir Goldstein <[email protected]>, 2008
>> + *
>> + * Copyright (C) 2008-2011 CTERA Networks
>> + *
>> + * This file is part of the Linux kernel and is made available under
>> + * the terms of the GNU General Public License, version 2, or at your
>> + * option, any later version, incorporated herein by reference.
>> + *
>> + * Ext4 snapshot extensions.
>
> This is great place to write more about snapshot design and
> implementation. If it is added later in a different file, then ignore it
> :).

The inline documentation is scattered among several patches.
I should probably also add Documentation/ext4_snapshots.txt
with some design and overview information.
And perhaps insert a short version of it here ;-)

>
>> + */
>> +
>> +#ifndef _LINUX_EXT4_SNAPSHOT_H
>> +#define _LINUX_EXT4_SNAPSHOT_H
>> +
>> +#include <linux/version.h>
>> +#include <linux/delay.h>
>> +#include "ext4.h"
>> +
>> +
>> +/*
>> + * use signed 64bit for snapshot image addresses
>> + * negative addresses are used to reference snapshot meta blocks
>> + */
>> +#define ext4_snapblk_t long long
>
> typedef signed long long int ext4_snapblk_t maybe ?

1. checkpatch doesn't like adding new typedefs to the kernel
2. I am in the process of removing that define altogether

>
>> +
>> +/*
>> + * We assert that file system block size == page size (on mount time)
>> + * and that the first file system block is block 0 (on snapshot create).
>> + * Snapshot inode direct blocks are reserved for snapshot meta blocks.
>> + * Snapshot inode single indirect blocks are not used.
>> + * Snapshot image starts at the first double indirect block, so all blocks in
>> + * Snapshot image block group blocks are mapped by a single DIND block:
>> + * 4k: 32k blocks_per_group = 32 IND (4k) blocks = 32 groups per DIND
>> + * 8k: 64k blocks_per_group = 32 IND (8k) blocks = 64 groups per DIND
>> + * 16k: 128k blocks_per_group = 32 IND (16k) blocks = 128 groups per DIND
>> + */
>> +#define SNAPSHOT_BLOCK_SIZE		PAGE_SIZE
>> +#define SNAPSHOT_BLOCK_SIZE_BITS	PAGE_SHIFT
>> +#define	SNAPSHOT_ADDR_PER_BLOCK		(SNAPSHOT_BLOCK_SIZE / sizeof(__u32))
>
>> +#define SNAPSHOT_ADDR_PER_BLOCK_BITS (SNAPSHOT_BLOCK_SIZE_BITS - 2)
>
> #define SNAPSHOT_ADDR_PER_BLOCK		(1 << SNAPSHOT_BLOCK_SIZE_BITS )

Thanks.
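
For the record, the shift form gives the same value as the division only
when it shifts by the address bits, not by the block size bits
(1 << SNAPSHOT_BLOCK_SIZE_BITS is the block size in bytes). A small check,
using local stand-ins for the kernel macros; the SK_ names are
hypothetical and a 4K page is assumed:

```c
#include <assert.h>
#include <stdint.h>

#define SK_BLOCK_SIZE_BITS	12	/* assumed PAGE_SHIFT for 4K pages */
#define SK_BLOCK_SIZE		(1UL << SK_BLOCK_SIZE_BITS)

/* Division form, as in the patch. */
#define SK_ADDR_PER_BLOCK_DIV	(SK_BLOCK_SIZE / sizeof(uint32_t))

/* Shift form: 4-byte addresses take 2 bits off the block size bits. */
#define SK_ADDR_PER_BLOCK_BITS	(SK_BLOCK_SIZE_BITS - 2)
#define SK_ADDR_PER_BLOCK_SHIFT	(1UL << SK_ADDR_PER_BLOCK_BITS)
```

Both forms evaluate to 1024 addresses per 4K block.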

>
>> +#define SNAPSHOT_DIR_BLOCKS		EXT4_NDIR_BLOCKS
>> +#define SNAPSHOT_IND_BLOCKS		SNAPSHOT_ADDR_PER_BLOCK
>> +
>> +#define SNAPSHOT_BLOCKS_PER_GROUP_BITS	(SNAPSHOT_BLOCK_SIZE_BITS + 3)
>> +#define SNAPSHOT_BLOCKS_PER_GROUP				\
>> +	(1<<SNAPSHOT_BLOCKS_PER_GROUP_BITS) /* 8*PAGE_SIZE */
>
>> +#define SNAPSHOT_BLOCK_GROUP(block)				\
>> +	((block)>>SNAPSHOT_BLOCKS_PER_GROUP_BITS)
>> +#define SNAPSHOT_BLOCK_GROUP_OFFSET(block)			\
>> +	((block)&(SNAPSHOT_BLOCKS_PER_GROUP-1))
>
> formatting is wrong.
>

what do you mean?
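
Formatting aside, what the two macros compute can be shown with a
stand-alone sketch. The sk_ names are hypothetical; it assumes 4K blocks,
8 * block size blocks per group, and that the first file system block is
block 0, as the comment in the patch states:

```c
#include <assert.h>
#include <stdint.h>

#define SK_BLOCK_SIZE_BITS		12	/* assumed 4K blocks */
#define SK_BLOCKS_PER_GROUP_BITS	(SK_BLOCK_SIZE_BITS + 3)
#define SK_BLOCKS_PER_GROUP		(UINT64_C(1) << SK_BLOCKS_PER_GROUP_BITS)

/* Group number: the high bits of the block number. */
static inline uint64_t sk_block_group(uint64_t block)
{
	return block >> SK_BLOCKS_PER_GROUP_BITS;
}

/* Offset inside the group: the low bits of the block number. */
static inline uint64_t sk_block_group_offset(uint64_t block)
{
	return block & (SK_BLOCKS_PER_GROUP - 1);
}
```

For example, block 100000 lands in group 3 at offset 1696, since
3 * 32768 = 98304.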

>> +#define SNAPSHOT_BLOCK_TUPLE(block)				\
>> +	(ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP_OFFSET(block),	\
>> +	(ext4_fsblk_t)SNAPSHOT_BLOCK_GROUP(block)
>
> This is confusing, but if you're using it really often, so be it.

I use it in debug prints.
If I turn them into fixed tracepoints, this define can go away.
checkpatch doesn't like it either...

>
>> +#define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS			\
>> +	(SNAPSHOT_BLOCKS_PER_GROUP_BITS-SNAPSHOT_ADDR_PER_BLOCK_BITS)
>> +#define SNAPSHOT_IND_PER_BLOCK_GROUP				\
>> +	(1<<SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */
>> +#define SNAPSHOT_DIND_BLOCK_GROUPS_BITS				\
>> +	(SNAPSHOT_ADDR_PER_BLOCK_BITS-SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
>> +#define SNAPSHOT_DIND_BLOCK_GROUPS				\
>> +	(1<<SNAPSHOT_DIND_BLOCK_GROUPS_BITS)
>
> formating

?

>
>> +
>> +#define SNAPSHOT_BLOCK_OFFSET					\
>> +	(SNAPSHOT_DIR_BLOCKS+SNAPSHOT_IND_BLOCKS)
>> +#define SNAPSHOT_BLOCK(iblock)					\
>> +	((ext4_snapblk_t)(iblock) - SNAPSHOT_BLOCK_OFFSET)
>> +#define SNAPSHOT_IBLOCK(block)					\
>> +	(ext4_fsblk_t)((block) + SNAPSHOT_BLOCK_OFFSET)
>
> I do not see SNAPSHOT_BLOCK() defined anywhere.
>

Do you mean you don't see it used anywhere?
It is used by later patches, but I do need to document its meaning here.
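
As a rough illustration of what SNAPSHOT_BLOCK() and SNAPSHOT_IBLOCK() compute, here is a userspace sketch of the two conversions. The function names and the page_shift parameter (standing in for PAGE_SHIFT) are made up for this example; the offset is the 12 direct blocks plus the blocks mapped by the single indirect block, as in SNAPSHOT_BLOCK_OFFSET.

```c
#include <assert.h>

#define NDIR_BLOCKS	12	/* value of EXT4_NDIR_BLOCKS */

/* Offset of the snapshot image inside the snapshot inode, in blocks. */
static long long snap_block_offset(unsigned page_shift)
{
	assert(page_shift >= 2);
	/* direct blocks + blocks mapped by the single indirect block */
	return NDIR_BLOCKS + (1LL << (page_shift - 2));
}

/* Like SNAPSHOT_BLOCK(): inode logical block -> snapshot image block.
 * The result is negative for the reserved meta blocks. */
static long long snap_block(long long iblock, unsigned page_shift)
{
	return iblock - snap_block_offset(page_shift);
}

/* Like SNAPSHOT_IBLOCK(): snapshot image block -> inode logical block. */
static long long snap_iblock(long long block, unsigned page_shift)
{
	return block + snap_block_offset(page_shift);
}
```

With 4k pages the offset is 12 + 1024 = 1036 blocks, so image block 0 lives at logical block 1036 of the snapshot inode, and the direct blocks map to negative image addresses.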

>> +
>> +
>> +
>> +#ifdef CONFIG_EXT4_FS_SNAPSHOT
>> +#define EXT4_SNAPSHOT_VERSION "ext4 snapshot v1.0.13-6 (2-May-2010)"
>> +
>> +#define SNAPSHOT_BYTES_OFFSET					\
>> +	(SNAPSHOT_BLOCK_OFFSET << SNAPSHOT_BLOCK_SIZE_BITS)
>> +#define SNAPSHOT_ISIZE(size)					\
>> +	((size) + SNAPSHOT_BYTES_OFFSET)
>> +/* Snapshot block device size is recorded in i_disksize */
>> +#define SNAPSHOT_SET_SIZE(inode, size)				\
>> +	(EXT4_I(inode)->i_disksize = SNAPSHOT_ISIZE(size))
>
> I do not have a clue what that means. And to be honest I am getting a
> bit lost in macros, could you add some comments explaining what it is
> used for ?

Sure, I'll do that.
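
Until those comments exist, here is one reading of the size macros in this part of the patch, written as a userspace sketch. The struct and function names are invented for the example and page_shift stands in for PAGE_SHIFT; the page cache truncation done by snapshot_size_truncate() is left out.

```c
#include <assert.h>

struct snap_inode {		/* stand-in for the two inode size fields */
	long long i_size;	/* visible size, also used to export progress */
	long long i_disksize;	/* snapshot device size plus the image offset */
};

/* Like SNAPSHOT_BYTES_OFFSET: (12 direct + IND-mapped blocks) in bytes. */
static long long snap_bytes_offset(unsigned page_shift)
{
	assert(page_shift >= 2);
	return (12 + (1LL << (page_shift - 2))) << page_shift;
}

/* Like SNAPSHOT_SET_BLOCKS(): record the device size (in blocks). */
static void snap_set_blocks(struct snap_inode *si, long long blocks,
			    unsigned page_shift)
{
	si->i_disksize = (blocks << page_shift) + snap_bytes_offset(page_shift);
}

/* Like SNAPSHOT_BLOCKS(): recover the device size in blocks. */
static long long snap_blocks(const struct snap_inode *si, unsigned page_shift)
{
	return (si->i_disksize - snap_bytes_offset(page_shift)) >> page_shift;
}

/* Like SNAPSHOT_SET_ENABLED(): i_size becomes the device size in bytes. */
static void snap_set_enabled(struct snap_inode *si, unsigned page_shift)
{
	si->i_size = si->i_disksize - snap_bytes_offset(page_shift);
}

/* Like SNAPSHOT_SET_DISABLED(): i_size shrinks to a single block. */
static void snap_set_disabled(struct snap_inode *si, unsigned page_shift)
{
	si->i_size = 1LL << page_shift;
}
```

The point of the split is that i_disksize always remembers how large the snapshot device is, while i_size can be moved around to enable, disable or report progress on a snapshot.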

>
>> +#define SNAPSHOT_SIZE(inode)					\
>> +	(EXT4_I(inode)->i_disksize - SNAPSHOT_BYTES_OFFSET)
>> +#define SNAPSHOT_SET_BLOCKS(inode, blocks)			\
>> +	SNAPSHOT_SET_SIZE((inode),				\
>> +			(loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS)
>> +#define SNAPSHOT_BLOCKS(inode)					\
>> +	(ext4_fsblk_t)(SNAPSHOT_SIZE(inode) >> SNAPSHOT_BLOCK_SIZE_BITS)
>> +/* Snapshot shrink/merge/clean progress is exported via i_size */
>> +#define SNAPSHOT_PROGRESS(inode)				\
>> +	(ext4_fsblk_t)((inode)->i_size >> SNAPSHOT_BLOCK_SIZE_BITS)
>> +#define SNAPSHOT_SET_ENABLED(inode)				\
>> +	i_size_write((inode), SNAPSHOT_SIZE(inode))
>> +#define SNAPSHOT_SET_PROGRESS(inode, blocks)			\
>> +	snapshot_size_extend((inode), (blocks))
>> +/* Disabled/deleted snapshot i_size is 1 block, to allow read of super block */
>> +#define SNAPSHOT_SET_DISABLED(inode)				\
>> +	snapshot_size_truncate((inode), 1)
>> +/* Removed snapshot i_size and i_disksize are 0, since all blocks were freed */
>> +#define SNAPSHOT_SET_REMOVED(inode)				\
>> +	do {							\
>> +		EXT4_I(inode)->i_disksize = 0;			\
>> +		snapshot_size_truncate((inode), 0);		\
>> +	} while (0)
>> +
>> +static inline void snapshot_size_extend(struct inode *inode,
>> +			ext4_fsblk_t blocks)
>> +{
>> +	i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
>> +}
>> +
>> +static inline void snapshot_size_truncate(struct inode *inode,
>> +			ext4_fsblk_t blocks)
>> +{
>> +	loff_t i_size = (loff_t)blocks << SNAPSHOT_BLOCK_SIZE_BITS;
>> +
>> +	i_size_write(inode, i_size);
>> +	truncate_inode_pages(&inode->i_data, i_size);
>> +}
>> +
>> +/* Is ext4 configured for snapshots support? */
>> +static inline int EXT4_SNAPSHOTS(struct super_block *sb)
>> +{
>> +	return EXT4_HAS_RO_COMPAT_FEATURE(sb,
>> +			EXT4_FEATURE_RO_COMPAT_HAS_SNAPSHOT);
>> +}
>> +
>> +#define ext4_snapshot_cow(handle, inode, block, bh, cow) 0
>> +
>> +#define ext4_snapshot_move(handle, inode, block, pcount, move) (0)
>> +
>> +/*
>> + * Block access functions
>> + */
>> +
>> +
>> +
>> +/* snapshot_ctl.c */
>> +
>> +
>> +static inline int init_ext4_snapshot(void)
>> +{
>> + ? ? return 0;
>> +}
>> +
>> +static inline void exit_ext4_snapshot(void)
>> +{
>> +}
>> +
>> +
>> +
>> +
>> +
>> +#else /* CONFIG_EXT4_FS_SNAPSHOT */
>> +
>> +/* Snapshot NOP macros */
>> +#define EXT4_SNAPSHOTS(sb) (0)
>> +#define SNAPMAP_ISCOW(cmd)	(0)
>> +#define SNAPMAP_ISMOVE(cmd)	(0)
>> +#define SNAPMAP_ISSYNC(cmd)	(0)
>> +#define IS_COWING(handle)	(0)
>> +
>> +#define ext4_snapshot_load(sb, es, ro) (0)
>> +#define ext4_snapshot_destroy(sb)
>> +#define init_ext4_snapshot() (0)
>> +#define exit_ext4_snapshot()
>> +#define ext4_snapshot_active(sbi) (0)
>> +#define ext4_snapshot_file(inode) (0)
>> +#define ext4_snapshot_should_move_data(inode) (0)
>> +#define ext4_snapshot_test_excluded(handle, inode, block_to_free, count) (0)
>> +#define ext4_snapshot_list(inode) (0)
>> +#define ext4_snapshot_get_flags(ei, filp)
>> +#define ext4_snapshot_set_flags(handle, inode, flags) (0)
>> +#define ext4_snapshot_take(inode) (0)
>> +#define ext4_snapshot_update(inode_i_sb, cleanup, zero) (0)
>> +#define ext4_snapshot_has_active(sb) (NULL)
>> +#define ext4_snapshot_get_bitmap_access(handle, sb, grp, bh) (0)
>> +#define ext4_snapshot_get_write_access(handle, inode, bh) (0)
>> +#define ext4_snapshot_get_create_access(handle, bh) (0)
>> +#define ext4_snapshot_excluded(ac_inode) (0)
>> +#define ext4_snapshot_get_delete_access(handle, inode, block, pcount) (0)
>> +
>> +#define ext4_snapshot_get_move_access(handle, inode, block, pcount, move) (0)
>> +#define ext4_snapshot_start_pending_cow(sbh)
>> +#define ext4_snapshot_end_pending_cow(sbh)
>> +#define ext4_snapshot_is_active(inode)		(0)
>> +#define ext4_snapshot_mow_in_tid(inode)		(1)
>> +
>> +#endif /* CONFIG_EXT4_FS_SNAPSHOT */
>> +#endif	/* _LINUX_EXT4_SNAPSHOT_H */
>> diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
>> new file mode 100644
>> index 0000000..e69de29
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 414167a..2c345d1 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -48,6 +48,7 @@
>>  #include "xattr.h"
>>  #include "acl.h"
>>  #include "mballoc.h"
>> +#include "snapshot.h"
>>
>>  #define CREATE_TRACE_POINTS
>>  #include <trace/events/ext4.h>
>> @@ -2612,6 +2613,24 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
>>  			return 0;
>>  		}
>>  	}
>> +	/* Enforce snapshots requirements: */
>> +	if (EXT4_SNAPSHOTS(sb)) {
>> +		if (EXT4_HAS_INCOMPAT_FEATURE(sb,
>> +					EXT4_FEATURE_INCOMPAT_META_BG|
>> +					EXT4_FEATURE_INCOMPAT_64BIT)) {
>> +			ext4_msg(sb, KERN_ERR,
>> +				"has_snapshot feature cannot be mixed with "
>> +				"features: meta_bg, 64bit");
>> +			return 0;
>
> So what if the fs is mounted as readonly and has those incompatible features
> ? Is it ok ?

It should be fine because:
1. you cannot take new snapshots when readonly
2. no changes to fs = no COW = no changes to snapshot

So if a future kernel supports mixing snapshots and 64bit (I hope it will),
then it should be OK to mount the 64bit fs with snapshots using this kernel.
Maybe the snapshots could not be read from, but there should be no problem
reading the fs.
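
To make the argument concrete, here is a small userspace sketch of the check as I understand it. It assumes the snapshot test sits in the read-write part of ext4_feature_set_ok(), after the early return for read-only mounts; the function name is invented and the DEMO_ bit values are placeholders for the real feature flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bits standing in for the META_BG and 64BIT incompat flags. */
#define DEMO_INCOMPAT_META_BG	0x0010u
#define DEMO_INCOMPAT_64BIT	0x0080u

/* Returns true when the mount may proceed as far as snapshots are concerned. */
static bool snapshot_features_ok(bool has_snapshot, unsigned incompat,
				 bool readonly)
{
	/* Read-only: no new snapshots and no COW, so nothing to enforce. */
	if (readonly)
		return true;
	/* Read-write: snapshots cannot be mixed with meta_bg or 64bit. */
	if (has_snapshot &&
	    (incompat & (DEMO_INCOMPAT_META_BG | DEMO_INCOMPAT_64BIT)))
		return false;
	return true;
}
```

So a filesystem with has_snapshot and 64bit is refused read-write but accepted read-only, which is the behaviour described above.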

>
>> +		}
>> +		if (EXT4_TEST_FLAGS(sb, EXT4_FLAGS_IS_SNAPSHOT)) {
>> +			ext4_msg(sb, KERN_ERR,
>> +				"A snapshot image must be mounted read-only. "
>> +				"If this is an exported snapshot image, you "
>> +				"must run fsck -xy to make it writable.");
>> +			return 0;
>> +		}
>> +	}
>>  	return 1;
>>  }
>>
>> @@ -3194,6 +3213,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>>
>>  	blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
>>
>> +	/* Enforce snapshots blocksize == pagesize */
>> +	if (EXT4_SNAPSHOTS(sb) && blocksize != PAGE_SIZE) {
>> +		ext4_msg(sb, KERN_ERR,
>> +				"snapshots require that filesystem blocksize "
>> +				"(%d) be equal to system page size (%lu)",
>> +				blocksize, PAGE_SIZE);
>> +		goto failed_mount;
>> +	}
>
> I would rather see this test after the test for supported filesystem
> blocksize. It is only logical to check for the superset first.

Right, that makes sense.
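
A userspace sketch of the suggested ordering, with the general range check first and the snapshot-specific check second. The enum and function names are invented for the example; 1024 and 65536 are the values of EXT4_MIN_BLOCK_SIZE and EXT4_MAX_BLOCK_SIZE.

```c
#include <assert.h>
#include <stdbool.h>

enum bs_check { BS_OK, BS_UNSUPPORTED, BS_SNAPSHOT_MISMATCH };

static enum bs_check check_blocksize(unsigned long blocksize,
				     unsigned long page_size,
				     bool has_snapshots)
{
	/* Page sizes are non-zero powers of two. */
	assert(page_size && (page_size & (page_size - 1)) == 0);

	/* General constraint first: is the block size supported at all? */
	if (blocksize < 1024 || blocksize > 65536)
		return BS_UNSUPPORTED;
	/* Narrower constraint second: snapshots need blocksize == page size. */
	if (has_snapshots && blocksize != page_size)
		return BS_SNAPSHOT_MISMATCH;
	return BS_OK;
}
```

With this order an out-of-range block size is always reported as unsupported, and the snapshot message only appears for block sizes that ext4 could otherwise mount.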

>
>> +
>>  	if (blocksize < EXT4_MIN_BLOCK_SIZE ||
>>  	    blocksize > EXT4_MAX_BLOCK_SIZE) {
>>  		ext4_msg(sb, KERN_ERR,
>> @@ -3540,6 +3568,15 @@ no_journal:
>>  		goto failed_mount_wq;
>>  	}
>>
>> +	/* Enforce journal ordered mode with snapshots */
>> +	if (EXT4_SNAPSHOTS(sb) && !(sb->s_flags & MS_RDONLY) &&
>> +		(!EXT4_SB(sb)->s_journal ||
>> +		 test_opt(sb, DATA_FLAGS) != EXT4_MOUNT_ORDERED_DATA)) {
>> +		ext4_msg(sb, KERN_ERR,
>> +				"snapshots require journal ordered mode");
>> +		goto failed_mount4;
>> +	}
>> +
>>  	/*
>>  	 * The jbd2_journal_load will have done any necessary log recovery,
>>  	 * so we can safely mount the rest of the filesystem now.
>> @@ -4878,10 +4915,15 @@ static int __init ext4_init_fs(void)
>>  	err = register_filesystem(&ext4_fs_type);
>>  	if (err)
>>  		goto out;
>> +	err = init_ext4_snapshot();
>> +	if (err)
>> +		goto out_fs;
>
> Is it really necessary to init snapshots after the filesystem
> registration ? I do not see reason why.

No, not really.
It actually only registers debugfs entries,
which can go somewhere else.
I will remove both init and exit funcs.

>
>>
>>  	ext4_li_info = NULL;
>>  	mutex_init(&ext4_li_mtx);
>>  	return 0;
>> +out_fs:
>> +	unregister_filesystem(&ext4_fs_type);
>>  out:
>>  	unregister_as_ext2();
>>  	unregister_as_ext3();
>> @@ -4905,6 +4947,7 @@ out7:
>>
>>  static void __exit ext4_exit_fs(void)
>>  {
>> +	exit_ext4_snapshot();
>>  	ext4_destroy_lazyinit_thread();
>>  	unregister_as_ext2();
>>  	unregister_as_ext3();
>>
>
> Thanks!
> -Lukas
>

I will fix some of the things you pointed out and post the 'full' series.
Thank you!

Amir.

2011-06-07 09:59:31

by Amir G.

Subject: Re: [PATCH RFC 02/30] ext4: snapshot debugging support

On Mon, Jun 6, 2011 at 6:08 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 9 May 2011, [email protected] wrote:
>
>> From: Amir Goldstein <[email protected]>
>>
>> Control snapshot debug level via debugfs entry /ext4/snapshot-debug
>> and induce delay tests via debugfs entries /ext4/test-XXX-delay-msec.
>
> Wouldn't you rather consider adding fixed tracepoints? I think
> tracepoints would be useful regardless of this debugfs interface.

I think you are right.
I was not aware of tracepoints back when I wrote this debugging interface.
Some of the debug prints should definitely be converted to tracepoints,
especially the ones in the functions XXX_trace_cow and XXX_journal_trace...

>
>>
>> Signed-off-by: Amir Goldstein <[email protected]>
>> Signed-off-by: Yongqiang Yang <[email protected]>
>> ---
>>  fs/ext4/Makefile         |    1 +
>>  fs/ext4/mballoc.c        |    3 +
>>  fs/ext4/snapshot.h       |    9 ++++
>>  fs/ext4/snapshot_debug.h |  105 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 118 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
>> index 16a779d..9981306 100644
>> --- a/fs/ext4/Makefile
>> +++ b/fs/ext4/Makefile
>> @@ -21,3 +21,4 @@ ext4-$(CONFIG_EXT4_FS_POSIX_ACL)	+= acl.o
>>  ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
>>  ext4-$(CONFIG_EXT4_FS_SNAPSHOT)		+= snapshot.o snapshot_ctl.o
>>  ext4-$(CONFIG_EXT4_FS_SNAPSHOT)		+= snapshot_inode.o snapshot_buffer.o
>> +ext4-$(CONFIG_EXT4_FS_SNAPSHOT)		+= snapshot_debug.o
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 4952b7b..42961bf 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2657,10 +2657,13 @@ static void __init ext4_create_debugfs_entry(void)
>>  						S_IRUGO | S_IWUSR,
>>  						debugfs_dir,
>>  						&mb_enable_debug);
>> +	if (debugfs_dir)
>> +		ext4_snapshot_create_debugfs_entry(debugfs_dir);
>>  }
>>
>>  static void ext4_remove_debugfs_entry(void)
>>  {
>> +	ext4_snapshot_remove_debugfs_entry();
>
> I do not see it defined anywhere.

There is a NOP macro for it in the #else branch of snapshot_debug.h (below).
The real function is in the omitted file snapshot_debug.c.
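
For readers not familiar with the pattern, here is a minimal userspace sketch of how such NOP macros let the call sites compile unchanged when the feature is configured out. SNAP_DEBUG_DEMO and the snap_* names are hypothetical stand-ins for the real config options and functions; the patch itself uses empty macro bodies, while this sketch uses `do { } while (0)` so the macro still behaves like a statement.

```c
#include <assert.h>

/* SNAP_DEBUG_DEMO is a made-up switch standing in for
 * CONFIG_EXT4_FS_SNAPSHOT && CONFIG_EXT4_DEBUG. */
#ifdef SNAP_DEBUG_DEMO
static int demo_debugfs_entries;
static void snap_create_debugfs_entry(void) { demo_debugfs_entries++; }
static void snap_remove_debugfs_entry(void) { demo_debugfs_entries--; }
#else
/* NOP macros: the callers below need no #ifdef of their own. */
#define snap_create_debugfs_entry()	do { } while (0)
#define snap_remove_debugfs_entry()	do { } while (0)
#endif

/* A caller that is identical in both configurations. */
static int demo_init_exit(void)
{
	snap_create_debugfs_entry();
	snap_remove_debugfs_entry();
	return 0;
}
```

Built without SNAP_DEBUG_DEMO, both calls expand to nothing and the caller compiles to a plain `return 0`.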

>
>>  	debugfs_remove(debugfs_debug);
>>  	debugfs_remove(debugfs_dir);
>>  }
>> diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
>> index a927090..52bfa52 100644
>> --- a/fs/ext4/snapshot.h
>> +++ b/fs/ext4/snapshot.h
>> @@ -18,6 +18,7 @@
>>  #include <linux/version.h>
>>  #include <linux/delay.h>
>>  #include "ext4.h"
>> +#include "snapshot_debug.h"
>>
>>
>>  /*
>> @@ -109,6 +110,14 @@
>>  static inline void snapshot_size_extend(struct inode *inode,
>>  			ext4_fsblk_t blocks)
>>  {
>> +#ifdef CONFIG_EXT4_DEBUG
>> +	ext4_fsblk_t old_blocks = SNAPSHOT_PROGRESS(inode);
>> +	ext4_fsblk_t max_blocks = SNAPSHOT_BLOCKS(inode);
>> +
>> +	/* sleep total of tunable delay unit over 100% progress */
>
> What is this good for, it is not clear from the description.

The delay tunables are used for debugging concurrent operations.
It is convenient to set the amount of delay for the entire operation
(like snapshot delete) and have these macros take care of
inserting small delay units during the delete process.
I will add an explanation.
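
Here is a userspace sketch of the arithmetic the comment on snapshot_test_delay_progress() describes: one total delay spread evenly over the progress of the operation. The function name is invented, and the sketch returns the number of milliseconds instead of sleeping. Note that the kernel's do_div(n, base) leaves the quotient in n and returns the remainder, so the sketch spells the quotients out with plain division rather than copying the macro's expressions.

```c
#include <assert.h>

/*
 * Milliseconds to sleep when progress advances from 'from' to 'to' blocks,
 * so that the sleeps add up to about total_ms once 'to' reaches 'max'.
 */
static unsigned long delay_for_progress(unsigned long total_ms,
					unsigned long long from,
					unsigned long long to,
					unsigned long long max)
{
	unsigned long long blocks_per_ms;

	/* Same guards as the macro: test enabled, max > delay, sane range. */
	if (!total_ms || max <= total_ms || from > to || to > max)
		return 0;
	blocks_per_ms = max / total_ms;	/* >= 1 because max > total_ms */
	assert(blocks_per_ms > 0);
	/* One millisecond for every blocks_per_ms boundary that was crossed. */
	return (unsigned long)(to / blocks_per_ms - from / blocks_per_ms);
}
```

For example, with a 100 ms total delay and a 1000 block snapshot, a step from block 0 to block 250 sleeps 25 ms and the following step to block 1000 sleeps the remaining 75 ms.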

>
>> +	snapshot_test_delay_progress(SNAPTEST_DELETE,
>> +			old_blocks, blocks, max_blocks);
>> +#endif
>>  	i_size_write((inode), (loff_t)(blocks) << SNAPSHOT_BLOCK_SIZE_BITS);
>>  }
>>
>> diff --git a/fs/ext4/snapshot_debug.h b/fs/ext4/snapshot_debug.h
>> index e69de29..f893eb1 100644
>> --- a/fs/ext4/snapshot_debug.h
>> +++ b/fs/ext4/snapshot_debug.h
>> @@ -0,0 +1,105 @@
>> +/*
>> + * linux/fs/ext4/snapshot_debug.h
>> + *
>> + * Written by Amir Goldstein <[email protected]>, 2008
>> + *
>> + * Copyright (C) 2008-2011 CTERA Networks
>> + *
>> + * This file is part of the Linux kernel and is made available under
>> + * the terms of the GNU General Public License, version 2, or at your
>> + * option, any later version, incorporated herein by reference.
>> + *
>> + * Ext4 snapshot debugging.
>> + */
>> +
>> +#ifndef _LINUX_EXT4_SNAPSHOT_DEBUG_H
>> +#define _LINUX_EXT4_SNAPSHOT_DEBUG_H
>> +
>> +#if defined(CONFIG_EXT4_FS_SNAPSHOT) && defined(CONFIG_EXT4_DEBUG)
>> +#include <linux/delay.h>
>> +
>> +#define SNAPSHOT_INDENT_MAX 4
>> +#define SNAPSHOT_INDENT_STR "\t\t\t\t"
>> +
>> +#define SNAPTEST_TAKE		0
>> +#define SNAPTEST_DELETE		1
>> +#define SNAPTEST_COW		2
>> +#define SNAPTEST_READ		3
>> +#define SNAPTEST_BITMAP		4
>> +#define SNAPSHOT_TESTS_NUM	5
>> +
>> +extern const char *snapshot_indent;
>> +extern u8 snapshot_enable_debug;
>> +extern u16 snapshot_enable_test[SNAPSHOT_TESTS_NUM];
>> +extern u8 cow_cache_enabled;
>> +
>> +#define snapshot_test_delay(i)						\
>> +	do {								\
>> +		if (snapshot_enable_test[i])				\
>> +			msleep_interruptible(snapshot_enable_test[i]);	\
>> +	} while (0)
>> +
>> +/*
>> + * Sleep 1ms every 'blocks_per_ms', amounting to the total test delay
>> + * over 100% of progress (when 'to' reaches 'max').
>> + * snapshot_enable_test[i] (msec) is limited to 64K and max (blocks_count)
>> + * is likely much more than 64K, so 'blocks_per_ms' is likely non zero.
>> + */
>
> Oh, here is a good place for explaining the purpose.
>
>> +#define snapshot_test_delay_progress(i, from, to, max)			\
>> +	do {								\
>> +		if (snapshot_enable_test[i] &&				\
>> +				(max) > snapshot_enable_test[i] &&	\
>> +				(from) <= (to) && (to) <= (max)) {	\
>> +			unsigned long blocks_per_ms =			\
>> +				do_div((max), snapshot_enable_test[i]);	\
>> +			unsigned long x = do_div((from), blocks_per_ms);\
>> +			unsigned long y = do_div((to), blocks_per_ms);	\
>> +			if (y > x)					\
>> +				msleep_interruptible(y - x);		\
>> +		}							\
>> +	} while (0)
>> +
>> +#define snapshot_debug_l(n, l, f, a...)					\
>> +	do {								\
>> +		if ((n) <= snapshot_enable_debug &&			\
>> +		    (l) <= SNAPSHOT_INDENT_MAX) {			\
>> +			printk(KERN_DEBUG "snapshot: %s" f,		\
>> +			       snapshot_indent - (l),			\
>> +				   ## a);				\
>> +		}							\
>> +	} while (0)
>
> This can be done by tracepoints maybe ?

can you add a generic string arg to a tracepoint?

>
>> +
>> +#define snapshot_debug(n, f, a...)	snapshot_debug_l(n, 0, f, ## a)
>> +
>> +#define snapshot_debug_once(n, f, a...)					\
>
> formating

what do you mean?

>
>> +	do {								\
>> +		static bool __once;					\
>> +		if (!__once) {						\
>> +			snapshot_debug(n, f, ## a);			\
>> +			__once = true;					\
>> +		}							\
>> +	} while (0)
>> +
>> +extern void ext4_snapshot_create_debugfs_entry(struct dentry *debugfs_dir);
>> +extern void ext4_snapshot_remove_debugfs_entry(void);
>> +
>> +#else
>> +#define snapshot_enable_debug (0)
>> +#define snapshot_test_delay(i)
>> +#define snapshot_test_delay_progress(i, from, to, max)
>> +#define snapshot_debug(n, f, a...)
>> +#define snapshot_debug_l(n, l, f, a...)
>> +#define snapshot_debug_once(n, f, a...)
>> +#define ext4_snapshot_create_debugfs_entry(d)
>> +#define ext4_snapshot_remove_debugfs_entry()
>> +#endif
>> +
>> +
>> +/* debug levels */
>> +#define SNAP_ERR	1 /* errors and summary */
>> +#define SNAP_WARN	2 /* warnings */
>
> It seems to me that those two levels should be displayed no matter what
> via standard functions.

I agree.

>
>> +#define SNAP_INFO	3 /* info */
>> +#define SNAP_DEBUG	4 /* debug */
>
> And this two levels can be done in tracepoints.

I will look into it.

>
>> +#define SNAP_DUMP	5 /* dump snapshot file */
>
> Via e2fsprogs debugfs maybe ?
>

Certainly. I was not intending to post the debug patch that uses this level.

>> +
>> +#endif ? ? ? /* _LINUX_EXT4_SNAPSHOT_DEBUG_H */
>>
>
>

2011-06-07 10:09:43

by Lukas Czerner

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, 7 Jun 2011, Amir G. wrote:

> On Tue, Jun 7, 2011 at 8:17 AM, Andreas Dilger <[email protected]> wrote:
> > On 2011-06-06, at 2:55 PM, Ted Ts'o wrote:
> >> On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
> >>>> For one reason, a snapshot file format is currently an indirect file
> >>>> and big_alloc doesn't support indirect mapped files.
> >>>> I am not saying it cannot be done, but if it does, there would be
> >>>> several obstacles to cross.
> >>>
> >>> I know I'm kind of just throwing a bomb out here, but I am very concerned
> >>> about the ever-growing feature (in)compatibility matrix in ext4.
> >>
> >> bigalloc doesn't support indirect blocks mainly because it was faster
> >> to get things working if I didn't have to worry about indirect blocks.
> >> It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
> >> I'll get around to it at some point.
> >
> > My main concern isn't about whether bigalloc grows support for indirect-
> > mapped files, but rather the opposite - that snapshots gain support for
> > extent-mapped files. In fact, since extent-mapped files can be 16TB in
> > size, it might make sense that the snapshots are _always_ extent-mapped
> > files, and we don't need to deal with the new block-mapped files with
> > 4-triple-indirect blocks layout at all? Since snapshots are only going
> > into ext4, and ext4 + e2fsprogs already support extents, there wouldn't
> > be any issue about compatibility?
> >
> > The only concern might be that mapping fragmented files into extents is
> > more effort, which makes me wonder about whether we should introduce the
> > "block-mapped extents" that I proposed in the past, to allow efficient
> > mapping of files (or parts thereof) that are highly fragmented, but still
> > keeping the benefits of extents (internal redundancy, 48-bit physical
> > block numbers). While we are adding a new extent format, it could also
> > be designed to add 48-bit logical block numbers.
> >
>
> You are right about snapshot file being a highly fragmented file by design,
> so single block mapping is an advantage. The down side is that deleting
> an extent mapped file, requires mapping all blocks one-by-one to snapshot
> file, which is not efficient and makes deletes slow.
> So having a format optimized for both single and multi block mapping would be
> best.
>
> The reason I DO NOT want to change the snapshot file format at this moment
> is that it will make us lose all the stabilization that snapshot feature gained
> during 1 year in production as next3.
> You see, ext4_free_blocks() cares not if blocks are deleted from indirect or
> extent mapped files and from there on, the code that maps those blocks to
> the special snapshot file is the same in next3 and ext4.
>

But the problem is that you will not be able to change it in the future,
or at least not without adding more incompatibility flags, which is
exactly the point of this thread. I just wonder if it would not be
better to do it now, because now is the right time. Although I do not
know how much work that will require.

Thanks!
-Lukas

2011-06-07 10:42:27

by Lukas Czerner

Subject: Re: [PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)

On Tue, 7 Jun 2011, Amir G. wrote:

> >> +config EXT4_FS_SNAPSHOT
> >> +	bool "EXT4 snapshots (Experimental)"
> >> +	depends on EXT4_FS && EXPERIMENTAL
> >> +	default n
> >> +	help
> >> +	  Built-in snapshots support for ext4.
> >> +	  Requires that the filesystem has the has_snapshot and exclude_bitmap
> >> +	  features and that block size is equal to system page size.
> >> +	  Snapshots are not supported with 64bit and meta_bg features and the
> >> +	  filesystem must be mounted with ordered data mode.
> >
> > What exactly do you mean by not supported with 64bit feature ? Maybe I
> > am being dense, but I do not get it.
>
> I mean snapshots and 64bit features are mutually exclusive.
> Is that what you got or do I need to make it more clear?

Oh, I did not notice that it belongs to the word "feature". That's
probably just my English :)

>
> >>
> >>  #define outside(b, first, last)	((b) < (first) || (b) >= (last))
> >>  #define inside(b, first, last)	((b) >= (first) && (b) < (last))
> >> diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
> >> new file mode 100644
> >> index 0000000..a927090
> >> --- /dev/null
> >> +++ b/fs/ext4/snapshot.h
> >> @@ -0,0 +1,193 @@
> >> +/*
> >> + * linux/fs/ext4/snapshot.h
> >> + *
> >> + * Written by Amir Goldstein <[email protected]>, 2008
> >> + *
> >> + * Copyright (C) 2008-2011 CTERA Networks
> >> + *
> >> + * This file is part of the Linux kernel and is made available under
> >> + * the terms of the GNU General Public License, version 2, or at your
> >> + * option, any later version, incorporated herein by reference.
> >> + *
> >> + * Ext4 snapshot extensions.
> >
> > This is great place to write more about snapshot design and
> > implementation. If it is added later in a different file, then ignore it
> > :).
>
> the inline documentation is scattered among several patches.
> I should probably also add Documentation/ext4_snapshots.txt
> with some design and overview information.
> And perhaps insert a short short version of it here ;-)

Documentation/filesystems/ext4_snapshots.txt would be the most welcome,
thanks.

>
> >
> >> + */
> >> +
> >> +#ifndef _LINUX_EXT4_SNAPSHOT_H
> >> +#define _LINUX_EXT4_SNAPSHOT_H
> >> +
> >> +#include <linux/version.h>
> >> +#include <linux/delay.h>
> >> +#include "ext4.h"
> >> +
> >> +
> >> +/*
> >> + * use signed 64bit for snapshot image addresses
> >> + * negative addresses are used to reference snapshot meta blocks
> >> + */
> >> +#define ext4_snapblk_t long long
> >
> > typedef signed long long int ext4_snapblk_t maybe ?
>
> 1. checkpatch doesn't like adding new typedef to the kernel

Yes, I suppose the reason is to keep people from adding new
typedefs like crazy, but when it is well reasoned I do not think it is a
problem.

> 2. I am in the process of removing that define altogether

And use what instead ? ext4 typedefs helped people to realize what type
to use for what operation and if this type is used often enough and does
make sense (which I do not know since I have not seen the whole series
yet), it can help as well.

> >
> >> +#define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS			\
> >> +	(SNAPSHOT_BLOCKS_PER_GROUP_BITS-SNAPSHOT_ADDR_PER_BLOCK_BITS)
> >> +#define SNAPSHOT_IND_PER_BLOCK_GROUP				\
> >> +	(1<<SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */
> >> +#define SNAPSHOT_DIND_BLOCK_GROUPS_BITS			\
> >> +	(SNAPSHOT_ADDR_PER_BLOCK_BITS-SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
> >> +#define SNAPSHOT_DIND_BLOCK_GROUPS				\
> >> +	(1<<SNAPSHOT_DIND_BLOCK_GROUPS_BITS)
> >
> > formating
>
> ?

#define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS \
(SNAPSHOT_BLOCKS_PER_GROUP_BITS - SNAPSHOT_ADDR_PER_BLOCK_BITS)
#define SNAPSHOT_IND_PER_BLOCK_GROUP \
(1 << SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */ <- 32 what ?
#define SNAPSHOT_DIND_BLOCK_GROUPS_BITS \
(SNAPSHOT_ADDR_PER_BLOCK_BITS - SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
#define SNAPSHOT_DIND_BLOCK_GROUPS \
(1 << SNAPSHOT_DIND_BLOCK_GROUPS_BITS)
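
The arithmetic behind these macros can be checked with a small userspace sketch. The struct and function names are invented for the example and page_shift stands in for PAGE_SHIFT; the formulas follow the macro definitions quoted above. It suggests that the "32" is the number of indirect blocks needed to map one block group.

```c
#include <assert.h>

struct snap_layout {
	unsigned addr_per_block_bits;	/* log2(addresses per indirect block) */
	unsigned blocks_per_group_bits;	/* log2(blocks per block group) */
	unsigned ind_per_group;		/* IND blocks needed to map one group */
	unsigned groups_per_dind;	/* block groups mapped by one DIND block */
};

static struct snap_layout snap_layout_for(unsigned page_shift)
{
	struct snap_layout l;
	unsigned ind_per_group_bits;

	assert(page_shift >= 12 && page_shift <= 16);
	l.addr_per_block_bits = page_shift - 2;		/* 4-byte addresses */
	l.blocks_per_group_bits = page_shift + 3;	/* 8 * block size */
	ind_per_group_bits = l.blocks_per_group_bits - l.addr_per_block_bits;
	l.ind_per_group = 1u << ind_per_group_bits;
	l.groups_per_dind = 1u << (l.addr_per_block_bits - ind_per_group_bits);
	return l;
}
```

For 4k, 8k and 16k pages this gives 32 IND blocks per group in every case, and 32, 64 and 128 groups per DIND block respectively, matching the table in the header comment.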

>
> >
> >> +
> >> +#define SNAPSHOT_BLOCK_OFFSET					\
> >> +	(SNAPSHOT_DIR_BLOCKS+SNAPSHOT_IND_BLOCKS)
> >> +#define SNAPSHOT_BLOCK(iblock)					\
> >> +	((ext4_snapblk_t)(iblock) - SNAPSHOT_BLOCK_OFFSET)
> >> +#define SNAPSHOT_IBLOCK(block)					\
> >> +	(ext4_fsblk_t)((block) + SNAPSHOT_BLOCK_OFFSET)
> >
> > I do not see SNAPSHOT_BLOCK() defined anywhere.
> >
>
> Do you mean you don't see it used anywhere?
> It is used by later patches, but I do need to document it's meaning here.

I missed the define before SNAPSHOT_IBLOCK, sorry.

Thanks!
-Lukas

2011-06-07 10:49:12

by Lukas Czerner

Subject: Re: [PATCH RFC 02/30] ext4: snapshot debugging support

On Tue, 7 Jun 2011, Amir G. wrote:

> >
> >> +#define snapshot_test_delay_progress(i, from, to, max)		\
> >> +	do {								\
> >> +		if (snapshot_enable_test[i] &&				\
> >> +				(max) > snapshot_enable_test[i] &&	\
> >> +				(from) <= (to) && (to) <= (max)) {	\
> >> +			unsigned long blocks_per_ms =			\
> >> +				do_div((max), snapshot_enable_test[i]);	\
> >> +			unsigned long x = do_div((from), blocks_per_ms);\
> >> +			unsigned long y = do_div((to), blocks_per_ms);	\
> >> +			if (y > x)					\
> >> +				msleep_interruptible(y - x);		\
> >> +		}							\
> >> +	} while (0)
> >> +
> >> +#define snapshot_debug_l(n, l, f, a...)				\
> >> +	do {								\
> >> +		if ((n) <= snapshot_enable_debug &&			\
> >> +		    (l) <= SNAPSHOT_INDENT_MAX) {			\
> >> +			printk(KERN_DEBUG "snapshot: %s" f,		\
> >> +			       snapshot_indent - (l),			\
> >> +				   ## a);				\
> >> +		}							\
> >> +	} while (0)
> >
> > This can be done by tracepoints maybe ?
>
> can you add a generic string arg to a tracepoint?

Yes, you can print whatever you want with tracepoint printk. See
include/trace/events/ext4.h, Documentation/trace/tracepoints.txt and
Documentation/trace/tracepoints-analysis.txt

Thanks!
-Lukas

2011-06-07 11:24:53

by Lukas Czerner

Subject: Re: [PATCH RFC 05/30] ext4: snapshot hooks - delete blocks

On Mon, 9 May 2011, [email protected] wrote:

> From: Amir Goldstein <[email protected]>
>
> Before deleting file blocks in ext4_free_blocks(),
> we call the snapshot API ext4_snapshot_get_delete_access(),
> to optionally move the blocks to the snapshot file instead of
> freeing them.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> Signed-off-by: Yongqiang Yang <[email protected]>
> ---
> fs/ext4/ext4.h | 10 +++++++---
> fs/ext4/mballoc.c | 30 +++++++++++++++++++++++++++---
> 2 files changed, 34 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index ca25e67..4e9e46a 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1729,9 +1729,13 @@ extern int ext4_mb_reserve_blocks(struct super_block *, int);
> extern void ext4_discard_preallocations(struct inode *);
> extern int __init ext4_init_mballoc(void);
> extern void ext4_exit_mballoc(void);
> -extern void ext4_free_blocks(handle_t *handle, struct inode *inode,
> - struct buffer_head *bh, ext4_fsblk_t block,
> - unsigned long count, int flags);
> +extern void __ext4_free_blocks(const char *where, unsigned int line,
> + handle_t *handle, struct inode *inode,
> + struct buffer_head *bh, ext4_fsblk_t block,
> + unsigned long count, int flags);
> +#define ext4_free_blocks(handle, inode, bh, block, count, flags) \
> + __ext4_free_blocks(__func__, __LINE__ , (handle), (inode), (bh), \
> + (block), (count), (flags))
> extern int ext4_mb_add_groupinfo(struct super_block *sb,
> ext4_group_t i, struct ext4_group_desc *desc);
> extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index e8bfd8d..3b1c6d1 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4445,9 +4445,9 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
> * @count: number of blocks to count
> * @metadata: Are these metadata blocks
> */
> -void ext4_free_blocks(handle_t *handle, struct inode *inode,
> - struct buffer_head *bh, ext4_fsblk_t block,
> - unsigned long count, int flags)
> +void __ext4_free_blocks(const char *where, unsigned int line, handle_t *handle,
> + struct inode *inode, struct buffer_head *bh,
> + ext4_fsblk_t block, unsigned long count, int flags)
> {
> struct buffer_head *bitmap_bh = NULL;
> struct super_block *sb = inode->i_sb;
> @@ -4461,6 +4461,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
> struct ext4_buddy e4b;
> int err = 0;
> int ret;
> + int maxblocks;
>
> if (bh) {
> if (block)
> @@ -4543,6 +4544,29 @@ do_more:
> goto error_return;
> }
>
> + maxblocks = count;
> + ret = ext4_snapshot_get_delete_access(handle, inode,
> + block, &maxblocks);

It would be nice to have this defined in the same commit, so I know what
it is doing.

> + if (ret < 0) {
> + ext4_journal_abort_handle(where, line, __func__,
> + NULL, handle, ret);
> + err = ret;
> + goto error_return;
> + }
> + if (ret > 0) {
> + /* 'ret' blocks were moved to snapshot - skip them */
> + block += maxblocks;
> + count -= maxblocks;
> + count += overflow;
> + cond_resched();
> + if (count > 0)
> + goto do_more;
> + /* no more blocks to free/move to snapshot */
> + ext4_mark_super_dirty(sb);
> + goto error_return;
> + }
> + overflow += count - maxblocks;
> + count = maxblocks;
> BUFFER_TRACE(bitmap_bh, "getting write access");
> err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
> if (err)
>
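
For readers following the ext4.h hunk above: the patch renames the function to __ext4_free_blocks() and wraps it in a macro that passes __func__ and __LINE__, so that ext4_journal_abort_handle() can report where the caller was. A minimal userspace sketch of that pattern is below. The names demo_free_blocks, demo_free_blocks_at and last_abort are made up for illustration and are not ext4 code.

```c
#include <stdio.h>

/* Hypothetical record of the most recent failure site (illustration only). */
struct demo_call_site {
	const char *where;
	unsigned int line;
};

static struct demo_call_site last_abort;

/* Worker: receives the caller's function name and line as explicit arguments. */
static int demo_free_blocks_at(const char *where, unsigned int line, long count)
{
	if (count < 0) {
		last_abort.where = where;
		last_abort.line = line;
		fprintf(stderr, "%s:%u: refusing to free %ld blocks\n",
			where, line, count);
		return -1;
	}
	return 0;
}

/* Wrapper macro: it expands at the call site, so __func__ names the caller. */
#define demo_free_blocks(count) \
	demo_free_blocks_at(__func__, __LINE__, (count))

/* Example caller that triggers the error path. */
static int demo_truncate(void)
{
	return demo_free_blocks(-1);
}
```

Calling demo_truncate() records "demo_truncate" as the failure site, which is the information the worker could not obtain on its own.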

2011-06-07 13:01:23

by Amir G.

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 7, 2011 at 1:09 PM, Lukas Czerner <[email protected]> wrote:
> On Tue, 7 Jun 2011, Amir G. wrote:
>
>> On Tue, Jun 7, 2011 at 8:17 AM, Andreas Dilger <[email protected]> wrote:
>> > On 2011-06-06, at 2:55 PM, Ted Ts'o wrote:
>> >> On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
>> >>>> For one reason, a snapshot file format is currently an indirect file
>> >>>> and big_alloc doesn't support indirect mapped files.
>> >>>> I am not saying it cannot be done, but if it does, there would be
>> >>>> several obstacles to cross.
>> >>>
>> >>> I know I'm kind of just throwing a bomb out here, but I am very concerned
>> >>> about the ever-growing feature (in)compatibility matrix in ext4.
>> >>
>> >> bigalloc doesn't support indirect blocks mainly because it was faster
>> >> to get things working if I didn't have to worry about indirect blocks.
>> >> It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
>> >> I'll get around to it at some point.
>> >
>> > My main concern isn't about whether bigalloc grows support for indirect-
>> > mapped files, but rather the opposite - that snapshots gain support for
>> > extent-mapped files.  In fact, since extent-mapped files can be 16TB in
>> > size, it might make sense that the snapshots are _always_ extent-mapped
>> > files, and we don't need to deal with the new block-mapped files with
>> > 4-triple-indirect blocks layout at all?  Since snapshots are only going
>> > into ext4, and ext4 + e2fsprogs already support extents, there wouldn't
>> > be any issue about compatibility?
>> >
>> > The only concern might be that mapping fragmented files into extents is
>> > more effort, which makes me wonder about whether we should introduce the
>> > "block-mapped extents" that I proposed in the past, to allow efficient
>> > mapping of files (or parts thereof) that are highly fragmented, but still
>> > keeping the benefits of extents (internal redundancy, 48-bit physical
>> > block numbers, and while we are adding a new extent format it could be
>> > designed to add 48-bit logical block numbers.
>> >
>>
>> You are right about snapshot file being a highly fragmented file by design,
>> so single block mapping is an advantage. The down side is that deleting
>> an extent mapped file, requires mapping all blocks one-by-one to snapshot
>> file, which is not efficient and makes deletes slow.
>> So having a format optimized for both single and multi block mapping would be
>> best.
>>
>> The reason I DO NOT want to change the snapshot file format at this moment
>> is that it will make us lose all the stabilization that snapshot feature gained
>> during 1 year in production as next3.
>> You see, ext4_free_blocks() cares not if blocks are deleted from indirect or
>> extent mapped files and from there on, the code that maps those blocks to
>> the special snapshot file is the same in next3 and ext4.
>>
>
> But the problem is, that you will not be able to change it in the future
> or at least not without adding more incompatibility flags, which is
> exactly the point of this thread. I just wonder if it would not be
> better to do it now, because now is the right time. Although I do not
> know how much work will that require.
>

There are no compatibility issues.
An ext4 fs is either 32bit or 64bit and you cannot convert between the 2 formats.
32bit ext4 has snapshots support with indirect mapped snapshot files.
64bit ext4 has no snapshots support.
If in the future, be it near or far, 64bit ext4 gains snapshots support with
a new snapshot file format, then the 64bit feature + snapshots feature will
prevent the present (i.e. next) kernel from mounting that fs rw,
which is exactly the same as an older kernel refusing to mount a 32bit ext4
with snapshots rw.

Amir.
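
A rough sketch of the mount-time rule described above, as I understand it: a kernel refuses read-write mounts when it sees a feature combination it does not implement. Everything here is hypothetical (the flag values, the struct names, and the assumption that has_snapshot behaves like a read-only-compatible feature); it only models the argument and is not the ext4 mount code.

```c
/* Possible outcomes of a mount attempt in this toy model. */
enum demo_mount { DEMO_MOUNT_RW, DEMO_MOUNT_RO_ONLY, DEMO_MOUNT_REFUSED };

/* Hypothetical flag values, chosen for the example only. */
#define DEMO_INCOMPAT_64BIT		0x1u
#define DEMO_RO_COMPAT_HAS_SNAPSHOT	0x1u

struct demo_features {
	unsigned int incompat;	/* unknown bits here: refuse the mount */
	unsigned int ro_compat;	/* unknown bits here: read-only mount only */
};

struct demo_kernel {
	struct demo_features known;	/* feature bits this kernel understands */
	int snapshots_on_64bit;		/* does it implement the future format? */
};

static enum demo_mount demo_check_mount(struct demo_features fs,
					struct demo_kernel k)
{
	if (fs.incompat & ~k.known.incompat)
		return DEMO_MOUNT_REFUSED;
	if (fs.ro_compat & ~k.known.ro_compat)
		return DEMO_MOUNT_RO_ONLY;
	/* Both features known, but the combination is not implemented. */
	if ((fs.incompat & DEMO_INCOMPAT_64BIT) &&
	    (fs.ro_compat & DEMO_RO_COMPAT_HAS_SNAPSHOT) &&
	    !k.snapshots_on_64bit)
		return DEMO_MOUNT_RO_ONLY;
	return DEMO_MOUNT_RW;
}
```

In this model an older kernel gets a read-only mount of a 32bit fs with snapshots, and the next kernel gets a read-only mount of a 64bit fs with snapshots, which is the symmetry the message points out.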

2011-06-07 13:20:54

by Amir G.

Subject: Re: [PATCH RFC 01/30] ext4: EXT4 snapshots (Experimental)

On Tue, Jun 7, 2011 at 1:42 PM, Lukas Czerner <[email protected]> wrote:
> On Tue, 7 Jun 2011, Amir G. wrote:
>
>> >> +config EXT4_FS_SNAPSHOT
>> >> +       bool "EXT4 snapshots (Experimental)"
>> >> +       depends on EXT4_FS && EXPERIMENTAL
>> >> +       default n
>> >> +       help
>> >> +         Built-in snapshots support for ext4.
>> >> +         Requires that the filesystem has the has_snapshot and exclude_bitmap
>> >> +         features and that block size is equal to system page size.
>> >> +         Snapshots are not supported with 64bit and meta_bg features and the
>> >> +         filesystem must be mounted with ordered data mode.
>> >
>> > What exactly do you mean by not supported with 64bit feature ? Maybe I
>> > am being dense, but I do not get it.
>>
>> I mean snapshots and 64bit features are mutually exclusive.
>> Is that what you got or do I need to make it more clear?
>
> Oh, I did not notice that it belongs to the "feature" word. That's
> probably just my English :)

Or a combination of our 'Englishes' ;-)

>
>>
>> >>
>> >>  #define outside(b, first, last)      ((b) < (first) || (b) >= (last))
>> >>  #define inside(b, first, last)       ((b) >= (first) && (b) < (last))
>> >> diff --git a/fs/ext4/snapshot.h b/fs/ext4/snapshot.h
>> >> new file mode 100644
>> >> index 0000000..a927090
>> >> --- /dev/null
>> >> +++ b/fs/ext4/snapshot.h
>> >> @@ -0,0 +1,193 @@
>> >> +/*
>> >> + * linux/fs/ext4/snapshot.h
>> >> + *
>> >> + * Written by Amir Goldstein <[email protected]>, 2008
>> >> + *
>> >> + * Copyright (C) 2008-2011 CTERA Networks
>> >> + *
>> >> + * This file is part of the Linux kernel and is made available under
>> >> + * the terms of the GNU General Public License, version 2, or at your
>> >> + * option, any later version, incorporated herein by reference.
>> >> + *
>> >> + * Ext4 snapshot extensions.
>> >
>> > This is great place to write more about snapshot design and
>> > implementation. If it is added later in a different file, then ignore it
>> > :).
>>
>> the inline documentation is scattered among several patches.
>> I should probably also add Documentation/ext4_snapshots.txt
>> with some design and overview information.
>> And perhaps insert a short short version of it here ;-)
>
> Documentation/filesystems/ext4_snapshots.txt would be the most welcome,
> thanks.

I thought I'd just drop the Technical_Overview wiki in as ext4_snapshots.txt:
http://sourceforge.net/apps/mediawiki/next3/index.php?title=Technical_overview
It seems like a good start, which can be completed by diving into the code.
Would you agree with that statement?

>
>>
>> >
>> >> + */
>> >> +
>> >> +#ifndef _LINUX_EXT4_SNAPSHOT_H
>> >> +#define _LINUX_EXT4_SNAPSHOT_H
>> >> +
>> >> +#include <linux/version.h>
>> >> +#include <linux/delay.h>
>> >> +#include "ext4.h"
>> >> +
>> >> +
>> >> +/*
>> >> + * use signed 64bit for snapshot image addresses
>> >> + * negative addresses are used to reference snapshot meta blocks
>> >> + */
>> >> +#define ext4_snapblk_t long long
>> >
>> > typedef signed long long int ext4_snapblk_t maybe ?
>>
>> 1. checkpatch doesn't like adding new typedef to the kernel
>
> Yes, I suppose that the reason is so that people do not add new
> typedefs like crazy, but when it is well reasoned I do not think it is a
> problem.
>
>> 2. I am in the process of removing that define altogether
>
> And use what instead ? ext4 typedefs helped people to realize what type
> to use for what operation and if this type is used often enough and does
> make sense (which I do not know since I have not seen the whole series
> yet), it can help as well.
>

I found a bug last week when accessing the last 4M of a 16TB snapshot file.
It was caused by the conversion from ext4_snapblk_t to ext4_lblk_t in
ext4_blk_to_path(), so I am dropping the different offset type approach and
am going to handle the snapshot file offset translation inside
ext4_blk_to_path().
I won't get into it here. You will see it in the next (full) patch
series I will post.
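
To illustrate the kind of truncation involved (a sketch under stated assumptions, not the actual ext4_blk_to_path() code): a 16TB file with 4K blocks has 2^32 blocks, so adding any positive offset to a block number near the end no longer fits in a 32-bit logical block type such as ext4_lblk_t. The offset value 1036 below is an assumption (12 direct slots plus 1024 indirect slots), picked because 1036 blocks of 4K is roughly 4M. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed offset of the snapshot image inside the snapshot file, in blocks. */
#define DEMO_BLOCK_OFFSET 1036u

/* Translation done in a 64-bit type: no information is lost. */
static uint64_t demo_iblock_wide(uint64_t snapshot_block)
{
	return snapshot_block + DEMO_BLOCK_OFFSET;
}

/* Same translation squeezed into 32 bits: wraps for the last 1036 blocks. */
static uint32_t demo_iblock_narrow(uint64_t snapshot_block)
{
	return (uint32_t)(snapshot_block + DEMO_BLOCK_OFFSET);
}
```

With these assumptions, every block at or above 2^32 - 1036 wraps around to a small logical block number, so only the tail of a full-size snapshot file is affected.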

>> >
>> >> +#define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS                    \
>> >> +       (SNAPSHOT_BLOCKS_PER_GROUP_BITS-SNAPSHOT_ADDR_PER_BLOCK_BITS)
>> >> +#define SNAPSHOT_IND_PER_BLOCK_GROUP                         \
>> >> +       (1<<SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */
>> >> +#define SNAPSHOT_DIND_BLOCK_GROUPS_BITS                      \
>> >> +       (SNAPSHOT_ADDR_PER_BLOCK_BITS-SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
>> >> +#define SNAPSHOT_DIND_BLOCK_GROUPS                           \
>> >> +       (1<<SNAPSHOT_DIND_BLOCK_GROUPS_BITS)
>> >
>> > formatting
>>
>> ?
>
> #define SNAPSHOT_IND_PER_BLOCK_GROUP_BITS                               \
>         (SNAPSHOT_BLOCKS_PER_GROUP_BITS - SNAPSHOT_ADDR_PER_BLOCK_BITS)
> #define SNAPSHOT_IND_PER_BLOCK_GROUP                                    \
>         (1 << SNAPSHOT_IND_PER_BLOCK_GROUP_BITS) /* 32 */ <- 32 what ?
> #define SNAPSHOT_DIND_BLOCK_GROUPS_BITS                                 \
>         (SNAPSHOT_ADDR_PER_BLOCK_BITS - SNAPSHOT_IND_PER_BLOCK_GROUP_BITS)
> #define SNAPSHOT_DIND_BLOCK_GROUPS                                      \
>         (1 << SNAPSHOT_DIND_BLOCK_GROUPS_BITS)
>

OK. thanks.
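
On the "32 what?" question, a worked example may help. The sketch below assumes a 4K block size, 4-byte block pointers (1024 per block) and 8 * blocksize blocks per group (32768). Those base values and the DEMO_ names are assumptions for illustration; only the two subtractions mirror the macros quoted above.

```c
/* Assumed geometry for a 4K block size (not taken from the patch). */
#define DEMO_BLOCK_SIZE_BITS		12
#define DEMO_ADDR_PER_BLOCK_BITS	(DEMO_BLOCK_SIZE_BITS - 2) /* 1024 pointers */
#define DEMO_BLOCKS_PER_GROUP_BITS	(DEMO_BLOCK_SIZE_BITS + 3) /* 32768 blocks */

/* Indirect blocks needed to map all the blocks of one block group. */
#define DEMO_IND_PER_BLOCK_GROUP_BITS \
	(DEMO_BLOCKS_PER_GROUP_BITS - DEMO_ADDR_PER_BLOCK_BITS)
#define DEMO_IND_PER_BLOCK_GROUP \
	(1 << DEMO_IND_PER_BLOCK_GROUP_BITS)

/* Block groups whose indirect blocks fit under one double-indirect block. */
#define DEMO_DIND_BLOCK_GROUPS_BITS \
	(DEMO_ADDR_PER_BLOCK_BITS - DEMO_IND_PER_BLOCK_GROUP_BITS)
#define DEMO_DIND_BLOCK_GROUPS \
	(1 << DEMO_DIND_BLOCK_GROUPS_BITS)
```

Under these assumptions 32768 / 1024 = 32 indirect blocks map one block group, and 1024 / 32 = 32 block groups are covered by one double-indirect block, so a comment such as /* 32 indirect blocks per group */ would answer the question.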

>>
>> >
>> >> +
>> >> +#define SNAPSHOT_BLOCK_OFFSET                                \
>> >> +       (SNAPSHOT_DIR_BLOCKS+SNAPSHOT_IND_BLOCKS)
>> >> +#define SNAPSHOT_BLOCK(iblock)                               \
>> >> +       ((ext4_snapblk_t)(iblock) - SNAPSHOT_BLOCK_OFFSET)
>> >> +#define SNAPSHOT_IBLOCK(block)                               \
>> >> +       (ext4_fsblk_t)((block) + SNAPSHOT_BLOCK_OFFSET)
>> >
>> > I do not see SNAPSHOT_BLOCK() defined anywhere.
>> >
>>
>> Do you mean you don't see it used anywhere?
>> It is used by later patches, but I do need to document its meaning here.
>
> I have missed the define before SNAPSHOT_IBLOCK sorry.
>
> Thanks!
> -Lukas

2011-06-07 13:24:45

by Amir G.

Subject: Re: [PATCH RFC 05/30] ext4: snapshot hooks - delete blocks

On Tue, Jun 7, 2011 at 2:24 PM, Lukas Czerner <[email protected]> wrote:
> On Mon, 9 May 2011, [email protected] wrote:
>
>> From: Amir Goldstein <[email protected]>
>>
>> Before deleting file blocks in ext4_free_blocks(),
>> we call the snapshot API ext4_snapshot_get_delete_access(),
>> to optionally move the block to the snapshot file instead of
>> freeing them.
>>
>> Signed-off-by: Amir Goldstein <[email protected]>
>> Signed-off-by: Yongqiang Yang <[email protected]>
>> ---
>>  fs/ext4/ext4.h    |   10 +++++++---
>>  fs/ext4/mballoc.c |   30 +++++++++++++++++++++++++++---
>>  2 files changed, 34 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index ca25e67..4e9e46a 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1729,9 +1729,13 @@ extern int ext4_mb_reserve_blocks(struct super_block *, int);
>>  extern void ext4_discard_preallocations(struct inode *);
>>  extern int __init ext4_init_mballoc(void);
>>  extern void ext4_exit_mballoc(void);
>> -extern void ext4_free_blocks(handle_t *handle, struct inode *inode,
>> -                            struct buffer_head *bh, ext4_fsblk_t block,
>> -                            unsigned long count, int flags);
>> +extern void __ext4_free_blocks(const char *where, unsigned int line,
>> +                              handle_t *handle, struct inode *inode,
>> +                              struct buffer_head *bh, ext4_fsblk_t block,
>> +                              unsigned long count, int flags);
>> +#define ext4_free_blocks(handle, inode, bh, block, count, flags) \
>> +       __ext4_free_blocks(__func__, __LINE__ , (handle), (inode), (bh), \
>> +                          (block), (count), (flags))
>>  extern int ext4_mb_add_groupinfo(struct super_block *sb,
>>                 ext4_group_t i, struct ext4_group_desc *desc);
>>  extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index e8bfd8d..3b1c6d1 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -4445,9 +4445,9 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
>>   * @count:             number of blocks to count
>>   * @metadata:          Are these metadata blocks
>>   */
>> -void ext4_free_blocks(handle_t *handle, struct inode *inode,
>> -                     struct buffer_head *bh, ext4_fsblk_t block,
>> -                     unsigned long count, int flags)
>> +void __ext4_free_blocks(const char *where, unsigned int line, handle_t *handle,
>> +                       struct inode *inode, struct buffer_head *bh,
>> +                       ext4_fsblk_t block, unsigned long count, int flags)
>>  {
>>         struct buffer_head *bitmap_bh = NULL;
>>         struct super_block *sb = inode->i_sb;
>> @@ -4461,6 +4461,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>>         struct ext4_buddy e4b;
>>         int err = 0;
>>         int ret;
>> +       int maxblocks;
>>
>>         if (bh) {
>>                 if (block)
>> @@ -4543,6 +4544,29 @@ do_more:
>>                 goto error_return;
>>         }
>>
>> +       maxblocks = count;
>> +       ret = ext4_snapshot_get_delete_access(handle, inode,
>> +                                             block, &maxblocks);
>
> It would be nice to have this defined in the same commit, so I know what
> is it doing.

Sorry, full patches again :-/
I will post them soon without any fixes, so you can have a look.

>
>> +       if (ret < 0) {
>> +               ext4_journal_abort_handle(where, line, __func__,
>> +                                         NULL, handle, ret);
>> +               err = ret;
>> +               goto error_return;
>> +       }
>> +       if (ret > 0) {
>> +               /* 'ret' blocks were moved to snapshot - skip them */
>> +               block += maxblocks;
>> +               count -= maxblocks;
>> +               count += overflow;
>> +               cond_resched();
>> +               if (count > 0)
>> +                       goto do_more;
>> +               /* no more blocks to free/move to snapshot */
>> +               ext4_mark_super_dirty(sb);
>> +               goto error_return;
>> +       }
>> +       overflow += count - maxblocks;
>> +       count = maxblocks;
>>         BUFFER_TRACE(bitmap_bh, "getting write access");
>>         err = ext4_handle_get_bitmap_access(handle, sb, block_group, bitmap_bh);
>>         if (err)
>>
>
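
Since ext4_snapshot_get_delete_access() is not part of the posted hunks, here is a small userspace model of how the maxblocks/overflow bookkeeping in the hunk above appears to work. The stub demo_get_delete_access() is hypothetical: it is assumed to return a positive count when the leading run of blocks was moved to the snapshot, and 0 when the leading run can be freed normally, in both cases shrinking *maxblocks to the length of that run. The retry after freeing (continue with the overflow) is assumed from the rest of the function and is not shown in the quoted diff.

```c
#define DEMO_NBLOCKS 16

/* Hypothetical state: 1 = block must be preserved in the snapshot. */
static const int demo_needed_by_snapshot[DEMO_NBLOCKS] = {
	0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0
};

/* Stub: classify the leading run of blocks and report its length. */
static int demo_get_delete_access(unsigned long block, int *maxblocks)
{
	int want = demo_needed_by_snapshot[block];
	int n = 0;

	while (n < *maxblocks && demo_needed_by_snapshot[block + n] == want)
		n++;
	*maxblocks = n;
	return want ? n : 0;	/* > 0 means "moved to snapshot" */
}

struct demo_result {
	unsigned long freed;
	unsigned long moved;
};

/* Caller must keep block + count within DEMO_NBLOCKS. */
static struct demo_result demo_free_blocks_range(unsigned long block,
						 unsigned long count)
{
	struct demo_result res = { 0, 0 };
	unsigned long overflow;
	int maxblocks, ret;

do_more:
	overflow = 0;
	maxblocks = (int)count;
	ret = demo_get_delete_access(block, &maxblocks);
	if (ret > 0) {
		/* run was moved to the snapshot: skip it and retry the rest */
		res.moved += maxblocks;
		block += maxblocks;
		count -= maxblocks;
		count += overflow;
		if (count > 0)
			goto do_more;
		return res;
	}
	/* run can be freed: postpone whatever lies beyond it */
	overflow += count - maxblocks;
	count = maxblocks;
	res.freed += count;
	if (overflow) {
		block += count;
		count = overflow;
		goto do_more;
	}
	return res;
}
```

In this model, freeing blocks 0..9 frees the six blocks that the snapshot does not need and moves the other four, one run at a time.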

2011-06-07 13:32:49

by Lukas Czerner

Subject: Re: [PATCH RFC 05/30] ext4: snapshot hooks - delete blocks

On Tue, 7 Jun 2011, Amir G. wrote:

> >> @@ -4543,6 +4544,29 @@ do_more:
> >>                 goto error_return;
> >>         }
> >>
> >> +       maxblocks = count;
> >> +       ret = ext4_snapshot_get_delete_access(handle, inode,
> >> +                                             block, &maxblocks);
> >
> > It would be nice to have this defined in the same commit, so I know what
> > is it doing.
>
> Sorry, full patches again :-/
> I will post them soon without any fixes, so you can have a look.
>

No problem, I'll wait for the full patches before proceeding. Fix
whatever you need to and repost.

Thanks!
-Lukas

2011-06-07 13:51:05

by Ric Wheeler

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 06/07/2011 09:01 AM, Amir G. wrote:
> On Tue, Jun 7, 2011 at 1:09 PM, Lukas Czerner<[email protected]> wrote:
>> On Tue, 7 Jun 2011, Amir G. wrote:
>>
>>> On Tue, Jun 7, 2011 at 8:17 AM, Andreas Dilger<[email protected]> wrote:
>>>> On 2011-06-06, at 2:55 PM, Ted Ts'o wrote:
>>>>> On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
>>>>>>> For one reason, a snapshot file format is currently an indirect file
>>>>>>> and big_alloc doesn't support indirect mapped files.
>>>>>>> I am not saying it cannot be done, but if it does, there would be
>>>>>>> several obstacles to cross.
>>>>>> I know I'm kind of just throwing a bomb out here, but I am very concerned
>>>>>> about the ever-growing feature (in)compatibility matrix in ext4.
>>>>> bigalloc doesn't support indirect blocks mainly because it was faster
>>>>> to get things working if I didn't have to worry about indirect blocks.
>>>>> It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
>>>>> I'll get around to it at some point.
>>>> My main concern isn't about whether bigalloc grows support for indirect-
>>>> mapped files, but rather the opposite - that snapshots gain support for
>>>> extent-mapped files. In fact, since extent-mapped files can be 16TB in
>>>> size, it might make sense that the snapshots are _always_ extent-mapped
>>>> files, and we don't need to deal with the new block-mapped files with
>>>> 4-triple-indirect blocks layout at all? Since snapshots are only going
>>>> into ext4, and ext4 + e2fsprogs already support extents, there wouldn't
>>>> be any issue about compatibility?
>>>>
>>>> The only concern might be that mapping fragmented files into extents is
>>>> more effort, which makes me wonder about whether we should introduce the
>>>> "block-mapped extents" that I proposed in the past, to allow efficient
>>>> mapping of files (or parts thereof) that are highly fragmented, but still
>>>> keeping the benefits of extents (internal redundancy, 48-bit physical
>>>> block numbers, and while we are adding a new extent format it could be
>>>> designed to add 48-bit logical block numbers.
>>>>
>>> You are right about snapshot file being a highly fragmented file by design,
>>> so single block mapping is an advantage. The down side is that deleting
>>> an extent mapped file, requires mapping all blocks one-by-one to snapshot
>>> file, which is not efficient and makes deletes slow.
>>> So having a format optimized for both single and multi block mapping would be
>>> best.
>>>
>>> The reason I DO NOT want to change the snapshot file format at this moment
>>> is that it will make us lose all the stabilization that snapshot feature gained
>>> during 1 year in production as next3.
>>> You see, ext4_free_blocks() cares not if blocks are deleted from indirect or
>>> extent mapped files and from there on, the code that maps those blocks to
>>> the special snapshot file is the same in next3 and ext4.
>>>
>> But the problem is, that you will not be able to change it in the future
>> or at least not without adding more incompatibility flags, which is
>> exactly the point of this thread. I just wonder if it would not be
>> better to do it now, because now is the right time. Although I do not
>> know how much work will that require.
>>
> There are no compatibility issues.
> ext4 fs is either 32bit or 64bit and you cannot convert between the 2 formats.
> 32bit ext4 has snapshots support with indirect mapped snapshot files.
> 64bit ext4 has no snapshots support.
> if in the future, be it near or far, 64bit ext4 will have snapshots support with
> a new snapshot file format, then 64bit feature + snapshots feature will
> prevent the present (i.e. next) kernel from mounting that fs rw.
> which is exactly the same as older kernel will prevent mounting a 32bit ext4
> with snapshots rw.
>
> Amir.

Hi Amir,

I really am not comfortable with having two formats for snapshots.

Why not just do one 64 bit format and skip the 32 bit one?

This seems like a recipe for end user confusion and pain :)

thanks!

Ric


2011-06-07 13:59:46

by Ric Wheeler

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 06/06/2011 04:40 PM, Ted Ts'o wrote:
> On Mon, Jun 06, 2011 at 06:05:24PM +0200, Lukas Czerner wrote:
>> Well, we can also start ditching some unused features and tunables, or
>> make it default and remove it from documentation so people will not use
>> it and we can get rid of some of the options in the future. For example
>>
>> orlov
>> oldalloc
>> bsddf
>> minixdf
> I tried deprecating bsddf/minixdf, and got a complaint from a user who
> said they were using it. Linus's rule is, "thou shalt not break
> backwards compatibility", so I gave up on it. Realistically, the
> difference between these two is so small it's not really a big deal.
>
> Dropping the old allocator is probably a good idea at this point. I
> very much doubt anyone is using it.
>
> - Ted
>

Can we argue that the user base that uses these ancient options can simply
continue to use them in ext3 and use that fig leaf to kill them in ext4 going
forwards?

Ric



2011-06-07 14:39:06

by Amir G.

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 7, 2011 at 4:50 PM, Ric Wheeler <[email protected]> wrote:
> On 06/07/2011 09:01 AM, Amir G. wrote:
>>
>> On Tue, Jun 7, 2011 at 1:09 PM, Lukas Czerner<[email protected]> wrote:
>>>
>>> On Tue, 7 Jun 2011, Amir G. wrote:
>>>
>>>> On Tue, Jun 7, 2011 at 8:17 AM, Andreas Dilger<[email protected]>
>>>> wrote:
>>>>>
>>>>> On 2011-06-06, at 2:55 PM, Ted Ts'o wrote:
>>>>>>
>>>>>> On Mon, Jun 06, 2011 at 10:31:33AM -0500, Eric Sandeen wrote:
>>>>>>>>
>>>>>>>> For one reason, a snapshot file format is currently an indirect file
>>>>>>>> and big_alloc doesn't support indirect mapped files.
>>>>>>>> I am not saying it cannot be done, but if it does, there would be
>>>>>>>> several obstacles to cross.
>>>>>>>
>>>>>>> I know I'm kind of just throwing a bomb out here, but I am very
>>>>>>> concerned
>>>>>>> about the ever-growing feature (in)compatibility matrix in ext4.
>>>>>>
>>>>>> bigalloc doesn't support indirect blocks mainly because it was faster
>>>>>> to get things working if I didn't have to worry about indirect blocks.
>>>>>> It wouldn't be _that_ hard to make bigalloc work on indirect blocks.
>>>>>> I'll get around to it at some point.
>>>>>
>>>>> My main concern isn't about whether bigalloc grows support for
>>>>> indirect-
>>>>> mapped files, but rather the opposite - that snapshots gain support for
>>>>> extent-mapped files.  In fact, since extent-mapped files can be 16TB in
>>>>> size, it might make sense that the snapshots are _always_ extent-mapped
>>>>> files, and we don't need to deal with the new block-mapped files with
>>>>> 4-triple-indirect blocks layout at all?  Since snapshots are only going
>>>>> into ext4, and ext4 + e2fsprogs already support extents, there wouldn't
>>>>> be any issue about compatibility?
>>>>>
>>>>> The only concern might be that mapping fragmented files into extents is
>>>>> more effort, which makes me wonder about whether we should introduce
>>>>> the
>>>>> "block-mapped extents" that I proposed in the past, to allow efficient
>>>>> mapping of files (or parts thereof) that are highly fragmented, but
>>>>> still
>>>>> keeping the benefits of extents (internal redundancy, 48-bit physical
>>>>> block numbers, and while we are adding a new extent format it could be
>>>>> designed to add 48-bit logical block numbers.
>>>>>
>>>> You are right about snapshot file being a highly fragmented file by
>>>> design,
>>>> so single block mapping is an advantage. The down side is that deleting
>>>> an extent mapped file, requires mapping all blocks one-by-one to
>>>> snapshot
>>>> file, which is not efficient and makes deletes slow.
>>>> So having a format optimized for both single and multi block mapping
>>>> would be
>>>> best.
>>>>
>>>> The reason I DO NOT want to change the snapshot file format at this
>>>> moment
>>>> is that it will make us lose all the stabilization that snapshot feature
>>>> gained
>>>> during 1 year in production as next3.
>>>> You see, ext4_free_blocks() cares not if blocks are deleted from
>>>> indirect or
>>>> extent mapped files and from there on, the code that maps those blocks
>>>> to
>>>> the special snapshot file is the same in next3 and ext4.
>>>>
>>> But the problem is, that you will not be able to change it in the future
>>> or at least not without adding more incompatibility flags, which is
>>> exactly the point of this thread. I just wonder if it would not be
>>> better to do it now, because now is the right time. Although I do not
>>> know how much work will that require.
>>>
>> There are no compatibility issues.
>> ext4 fs is either 32bit or 64bit and you cannot convert between the 2
>> formats.
>> 32bit ext4 has snapshots support with indirect mapped snapshot files.
>> 64bit ext4 has no snapshots support.
>> if in the future, be it near or far, 64bit ext4 will have snapshots
>> support with
>> a new snapshot file format, then 64bit feature + snapshots feature will
>> prevent the present (i.e. next) kernel from mounting that fs rw.
>> which is exactly the same as older kernel will prevent mounting a 32bit
>> ext4
>> with snapshots rw.
>>
>> Amir.
>
> Hi Amir,
>
> I really am not comfortable with having two formats for snapshots.
>
> Why not just do one 64 bit format and skip the 32 bit one?

Well, for 2 reasons mainly:
1. Something like that could hold the feature back even further,
maybe even for eternity, and some people do want to use it
in this lifetime.
2. There are performance implications that need to be studied.
An indirect format gives me the ability to map blocks of different
block groups without taking a global lock (not doing that yet).
With the extent tree format, a global lock is needed for re-balancing
the tree, so concurrent COW operations on different blocks
in different block groups are bound to contend for the same global
lock, which is something I am trying to see if it can be avoided.
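
The locking argument in point 2 can be pictured with a little index arithmetic. This is only a sketch under assumed geometry (4K blocks, 1024 pointers per indirect block, 32768 blocks per group) and with made-up helper names. It ignores the direct-block offset and the double/triple indirect levels, which are shared by more than one group, so it is a simplification and not the snapshot code.

```c
#include <stdint.h>

#define DEMO_ADDR_PER_BLOCK_BITS	10	/* assumed: 1024 pointers per block */
#define DEMO_BLOCKS_PER_GROUP_BITS	15	/* assumed: 32768 blocks per group */

/* Index of the leaf indirect block that holds the mapping of 'block'. */
static uint64_t demo_ind_index(uint64_t block)
{
	return block >> DEMO_ADDR_PER_BLOCK_BITS;
}

/* Block group that 'block' belongs to. */
static uint64_t demo_block_group(uint64_t block)
{
	return block >> DEMO_BLOCKS_PER_GROUP_BITS;
}

/* Two COW operations touch the same leaf only if they share an index. */
static int demo_may_contend(uint64_t a, uint64_t b)
{
	return demo_ind_index(a) == demo_ind_index(b);
}
```

Because the group number is just the leaf index shifted right by 5 bits here, two blocks from different block groups never share a leaf indirect block, so a per-group lock could in principle cover leaf updates. An extent tree has no such fixed partitioning, which is the contention concern raised above.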


>
> This seems like a recipe for end user confusion and pain :)
>

I honestly don't see how the internal format of a snapshot file
affects the end user in any way.
What happens in 32bit ext4 stays in 32bit ext4.
There is no migration of formats whatsoever to 64bit ext4.
The only pain caused by 2 formats is having to maintain
the code for 2 formats.
But the fact of the matter is that indirect mapped file code
is there to stay, so having the snapshot file use it for now,
is not much of a maintenance burden later.
All it takes is an EXTENT_FL flag to distinguish between
an indirect mapped snapshot and a future extent mapped (v2)
snapshot.

> thanks!
>
> Ric
>
>

2011-06-07 15:26:57

by Josef Bacik

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 05/09/2011 12:41 PM, [email protected] wrote:
> The following patch series includes all the changes to core ext4 files,
> which are needed for snapshots support. It adds some ~2K lines of code,
> which will never be executed unless the following 2 conditions apply:
> 1. ext4 is built with CONFIG_EXT4_FS_SNAPSHOT
> 2. HAS_SNAPSHOT and EXCLUDE_BITMAP features are set by mke2fs/tune2fs
>
> The remaining ~5K lines of code, added in new snapshot* files, were omitted
> from this series to simplify the review and because they are not needed
> when building ext4 without CONFIG_EXT4_FS_SNAPSHOT.
> The full patches will be posted soon after I receive some comments.
>
> Ted has concluded my ext4 snapshots talk on LPC 2010 with the statement that
> as long as the snapshot patches don't break anything when snapshot support
> is disabled, he will pull them, so the main goal when reviewing this series
> should be to prove that it is safe to pull the patches.
>
> REVIEWING
> ---------
> To make it easy for reviewers, I will provide some pointers:
> - EXT4_SNAPSHOTS(sb) is defined to (0) (in snapshot.h) when ext4 is built
> without snapshots support.
> - EXT4_SNAPSHOTS(sb) is defined to test the HAS_SNAPSHOT feature when ext4
> is built with snapshots support.
> - All the ext4_snapshot_XXX function added by the patches, are defined to
> NOP macros in snapshot.h when ext4 is built without snapshots support.
> - Various flags defined by the patches (like EXT4_MB_HINT_COWING) will never
> get set if EXT4_SNAPSHOTS(sb) is false, so testing them will also be false.
>
> MERGING
> -------
> These patches are based on Ted's current master branch + alloc_semp removal
> patches. Although the alloc_semp removal is an independent (and in my eyes
> a good) change, it is also required by snapshot patches, to avoid circular
> locking dependency during COW allocations.
>
> Merging with Allison's punch holes patches should be straight forward, since
> the hard part, namely Yongqiang's split extent refactoring patches, was
> already merged by Ted.
>
> Merging with Ted's big alloc patches is going to be a bit more challenging,
> since big alloc patches make a lot of renaming and refactoring. However,
> since has_snapshots and big_alloc features will never work together,
> at least testing the code together is not a big concern.
>
> TESTING
> -------
> Apart from the extensive testing for the snapshots feature functionality, we
> also ran xfstests with snapshots and while taking a snapshot every 1 minute.
> More importantly, we ran xfstests with snapshots support disabled in compile
> time and with snapshot support enabled but without has_snapshot feature.
> These xfstests were run with blocksize 1K and 4K and on X86 and X86_64.
> The 1K blocksize tests are important for the alloc_semp removal patches.
> No problems were found apart from one (test 225 hung), which is already
> existing in master branch.
>
> CREDITS
> -------
> The snapshots patches originate in my implementation of the Next3 filesystem
> for CTERA networks.
> The porting of the Next3 snapshot patches to ext4 patches is attributed to
> Aditya Dani, Shardul Mangade, Piyush Nimbalkar and Harshad Shirwadkar from
> the Pune Institute of Computer Technology (PICT).
> The implementation of extents move-on-write, delayed move-on-write and much
> of the cleanup work on these patches was carried out by Yongqiang Yang from
> the Institute of Computing Technology, Chinese Academy of Sciences.
>

I probably should have brought this up before, but why put all this
effort into shoehorning such a big and invasive feature into ext4 when
btrfs does all this already? Why not put your efforts into helping
btrfs become stable and ready and then use that, instead of having to
come up with a bunch of hacks to get around the myriad of weird feature
combinations you can get with ext4?

The wonderful thing about ext4 is it's a nice basic fs. If we're going
to start doing lots of crazy things, why not do them to the fs that
isn't yet in wide use and can afford to have crazy things done to it
without screwing a bunch of users who already depend on ext4's
stability? Thanks,

Josef

2011-06-07 15:37:14

by Theodore Ts'o

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 07, 2011 at 09:59:18AM -0400, Ric Wheeler wrote:
>
> Can we argue that the user base that uses these ancient options can
> simply continue to use them in ext3 and use that fig leaf to kill
> them in ext4 going forwards?

Unfortunately, I tried marking it as deprecated, with a printk that it would
go away, and someone complained saying they depended on it. So they
were using ext4, and according to the standard which Linus set out, we
have to keep it around. Honestly, the whole minixdf really isn't
worth worrying about...

- Ted

2011-06-07 16:46:06

by Amir G.

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 7, 2011 at 6:26 PM, Josef Bacik wrote:
> On 05/09/2011 12:41 PM, [email protected] wrote:
>> The following patch series includes all the changes to core ext4 files,
>> which are needed for snapshots support. It adds some ~2K lines of code,
>> which will never be executed unless the following 2 conditions apply:
>> 1. ext4 is built with CONFIG_EXT4_FS_SNAPSHOT
>> 2. HAS_SNAPSHOT and EXCLUDE_BITMAP features are set by mke2fs/tune2fs
>>
>> The remaining ~5K lines of code, added in new snapshot* files, were omitted
>> from this series to simplify the review and becasue they are not needed
>> when building ext4 without CONFIG_EXT4_FS_SNAPSHOT.
>> the full patches will be posted soon after I recieve some comments.
>>
>> Ted has concluded my ext4 snapshots talk on LPC 2010 with the statement that
>> as long as the snapshot patches don't break anything when snapshot support
>> is disabled, he will pull them, so the main goal when reviewing this series
>> should be to prove that it is safe to pull the patches.
>>
>> REVIEWING
>> ---------
>> To make it easy for reviewers, I will provide some pointers:
>> - EXT4_SNAPSHOTS(sb) is defined to (0) (in snapshot.h) when ext4 is built
>>   without snapshots support.
>> - EXT4_SNAPSHOTS(sb) is defined to test the HAS_SNAPSHOT feature when ext4
>>   is built with snapshots support.
>> - All the ext4_snapshot_XXX function added by the patches, are defined to
>>   NOP macros in snapshot.h when ext4 is built without snapshots support.
>> - Various flags defined by the patches (like EXT4_MB_HINT_COWING) will never
>>   get set if EXT4_SNAPSHOTS(sb) is false, so testing them will also be false.
>>
>> MERGING
>> -------
>> These patches are based on Ted's current master branch + alloc_semp removal
>> patches. Although the alloc_semp removal is an independent (and in my eyes
>> a good) change, it is also required by snapshot patches, to avoid circular
>> locking dependency during COW allocations.
>>
>> Merging with Allison's punch holes patches should be straight forward, since
>> the hard part, namely Yongqiang's split extent refactoring patches, was
>> already merged by Ted.
>>
>> Merging with Ted's big alloc patches is going to be a bit more challenging,
>> since big alloc patches make a lot of renaming and refactoring. However,
>> since has_snapshots and big_alloc features will never work together,
>> at least testing the code together is not a big concern.
>>
>> TESTING
>> -------
>> Apart from the extensive testing for the snapshots feature functionality, we
>> also ran xfstests with snapshots and while taking a snapshot every 1 minute.
>> More importantly, we ran xfstests with snapshots support disabled in compile
>> time and with snapshot support enabled but without has_snapshot feature.
>> These xfstests were run with blocksize 1K and 4K and on X86 and X86_64.
>> The 1K blocksize tests are important for the alloc_semp removal patches.
>> No problems were found apart from one (test 225 hung), which is already
>> existing in master branch.
>>
>> CREDITS
>> -------
>> The snapshots patches originate in my implementation of the Next3 filesystem
>> for CTERA networks.
>> The porting of the Next3 snapshot patches to ext4 patches is attributed to
>> Aditya Dani, Shardul Mangade, Piyush Nimbalkar and Harshad Shirwadkar from
>> the Pune Institute of Computer Technology (PICT).
>> The implementation of extents move-on-write, delayed move-on-write and much
>> of the cleanup work on these patches was carried out by Yongqiang Yang from
>> the Institute of Computing Technology, Chinese Academy of Sciences.
>>
>
> I probably should have brought this up before, but why put all this
> effort into shoehorning in such a big an invasive feature to ext4 when
> btrfs does this all already? Why not put your efforts into helping
> btrfs become stable and ready and then use that, instead of having to
> come up with a bunch of hacks to get around the myriad of weird feature
> combinations you can get with ext4?

Hi Josef,

I understand the bitterness in the btrfs community regarding the ext4
snapshot feature. You might say the same things about the ext4 64bit
feature. I think it is not up to us to decide how it rolls. It's the
users and companies involved that dictate where the development happens.

I like the answer that Ted once gave to the old btrfs vs. ext4 question:
competition is good because it makes us modest.

I believe there is room in the future for both fs's, even with
similar features in both.



>
> The wonderful thing about ext4 is its a nice basic fs. If we're going
> to start doing lots of crazy things, why not do them to the fs that
> isn't yet in wide use and can afford to have crazy things done to it
> without screwing a bunch of users who already depend on ext4's
> stability? Thanks,
>

As I see it, stability is the *only* advantage of ext4 snapshots over btrfs.
Even though the snapshot feature is new and not stable, you still
have the good old e2fsprogs tools that can get you out of any mess.
Specifically, fsck -x will discard all snapshot files and make your ext4
fs clean and stable again.

The repair tool is one thing that btrfs is still lacking, so I back CTERA's
decision to progress to ext4 with snapshots and not to btrfs on a
production system.

Cheers,
Amir.
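
A minimal sketch of the compile-time gating described in the REVIEWING
notes quoted above, for readers who want to see why the core changes are
dead code when snapshot support is off. Every name here (DEMO_CONFIG_SNAPSHOT,
DEMO_SNAPSHOTS, demo_snapshot_hook, struct demo_sb, the feature bit value) is
a hypothetical stand-in, not taken from the actual patches; only the pattern
(a test macro that is (0) and hooks that are NOP macros when the config
option is unset) follows the description.

```c
/* Hypothetical stand-ins for illustration; NOT the real ext4 definitions. */
#define DEMO_FEATURE_HAS_SNAPSHOT 0x0080u   /* made-up feature bit */

struct demo_sb {
	unsigned int features;   /* stand-in for on-disk feature flags */
	unsigned int writes;     /* counts calls to demo_write_path() */
};

#ifdef DEMO_CONFIG_SNAPSHOT
/* Built with snapshot support: test the feature flag at run time. */
#define DEMO_SNAPSHOTS(sb) (((sb)->features & DEMO_FEATURE_HAS_SNAPSHOT) != 0)
static inline int demo_snapshot_hook(struct demo_sb *sb)
{
	(void)sb;
	return 1;   /* pretend one COW operation happened */
}
#else
/* Built without snapshot support: everything folds to a constant. */
#define DEMO_SNAPSHOTS(sb) (0)
#define demo_snapshot_hook(sb) (0)
#endif

/* A "core" code path: the snapshot branch is dead code when disabled. */
int demo_write_path(struct demo_sb *sb)
{
	int cow_ops = 0;

	sb->writes++;
	if (DEMO_SNAPSHOTS(sb))
		cow_ops += demo_snapshot_hook(sb);
	return cow_ops;
}
```

Built without -DDEMO_CONFIG_SNAPSHOT, the condition is the constant 0, so
the hook is never reached even if the feature bit is set on disk. Built
with it, the branch is taken only when the feature bit is set.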

2011-06-07 16:54:42

by Josef Bacik

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 06/07/2011 12:46 PM, Amir G. wrote:
> On Tue, Jun 7, 2011 at 6:26 PM, Josef Bacik wrote:
>> On 05/09/2011 12:41 PM, [email protected] wrote:
>>> The following patch series includes all the changes to core ext4 files,
>>> which are needed for snapshots support. It adds some ~2K lines of code,
>>> which will never be executed unless the following 2 conditions apply:
>>> 1. ext4 is built with CONFIG_EXT4_FS_SNAPSHOT
>>> 2. HAS_SNAPSHOT and EXCLUDE_BITMAP features are set by mke2fs/tune2fs
>>>
>>> The remaining ~5K lines of code, added in new snapshot* files, were omitted
>>> from this series to simplify the review and becasue they are not needed
>>> when building ext4 without CONFIG_EXT4_FS_SNAPSHOT.
>>> the full patches will be posted soon after I recieve some comments.
>>>
>>> Ted has concluded my ext4 snapshots talk on LPC 2010 with the statement that
>>> as long as the snapshot patches don't break anything when snapshot support
>>> is disabled, he will pull them, so the main goal when reviewing this series
>>> should be to prove that it is safe to pull the patches.
>>>
>>> REVIEWING
>>> ---------
>>> To make it easy for reviewers, I will provide some pointers:
>>> - EXT4_SNAPSHOTS(sb) is defined to (0) (in snapshot.h) when ext4 is built
>>> without snapshots support.
>>> - EXT4_SNAPSHOTS(sb) is defined to test the HAS_SNAPSHOT feature when ext4
>>> is built with snapshots support.
>>> - All the ext4_snapshot_XXX function added by the patches, are defined to
>>> NOP macros in snapshot.h when ext4 is built without snapshots support.
>>> - Various flags defined by the patches (like EXT4_MB_HINT_COWING) will never
>>> get set if EXT4_SNAPSHOTS(sb) is false, so testing them will also be false.
>>>
>>> MERGING
>>> -------
>>> These patches are based on Ted's current master branch + alloc_semp removal
>>> patches. Although the alloc_semp removal is an independent (and in my eyes
>>> a good) change, it is also required by snapshot patches, to avoid circular
>>> locking dependency during COW allocations.
>>>
>>> Merging with Allison's punch holes patches should be straight forward, since
>>> the hard part, namely Yongqiang's split extent refactoring patches, was
>>> already merged by Ted.
>>>
>>> Merging with Ted's big alloc patches is going to be a bit more challenging,
>>> since big alloc patches make a lot of renaming and refactoring. However,
>>> since has_snapshots and big_alloc features will never work together,
>>> at least testing the code together is not a big concern.
>>>
>>> TESTING
>>> -------
>>> Apart from the extensive testing for the snapshots feature functionality, we
>>> also ran xfstests with snapshots and while taking a snapshot every 1 minute.
>>> More importantly, we ran xfstests with snapshots support disabled in compile
>>> time and with snapshot support enabled but without has_snapshot feature.
>>> These xfstests were run with blocksize 1K and 4K and on X86 and X86_64.
>>> The 1K blocksize tests are important for the alloc_semp removal patches.
>>> No problems were found apart from one (test 225 hung), which is already
>>> existing in master branch.
>>>
>>> CREDITS
>>> -------
>>> The snapshots patches originate in my implementation of the Next3 filesystem
>>> for CTERA networks.
>>> The porting of the Next3 snapshot patches to ext4 patches is attributed to
>>> Aditya Dani, Shardul Mangade, Piyush Nimbalkar and Harshad Shirwadkar from
>>> the Pune Institute of Computer Technology (PICT).
>>> The implementation of extents move-on-write, delayed move-on-write and much
>>> of the cleanup work on these patches was carried out by Yongqiang Yang from
>>> the Institute of Computing Technology, Chinese Academy of Sciences.
>>>
>>
>> I probably should have brought this up before, but why put all this
>> effort into shoehorning in such a big an invasive feature to ext4 when
>> btrfs does this all already? Why not put your efforts into helping
>> btrfs become stable and ready and then use that, instead of having to
>> come up with a bunch of hacks to get around the myriad of weird feature
>> combinations you can get with ext4?
>
> Hi Josef,
>
> I understand the bitterness in btrfs community regarding ext4 snapshot
> feature. You might say the same things about ext4 64bit feature.
> I think it is not up to us to decide how it rolls. it's the users
> and companies involved that dictate where the development happens.
>

Oh don't misunderstand me, I'm not bitter. It just seems like this is a
lot of work for something you get for free with btrfs. A lot of work
which I don't really think is justified when it comes to ext4.

> I like the answer that Ted once replied to the old btrfs vs. ext4 question:
> competition is good because it makes us modest.
>
> I believe there is room in the future for both fs's, even with
> similar features in both.
>
>
>
>>
>> The wonderful thing about ext4 is its a nice basic fs. If we're going
>> to start doing lots of crazy things, why not do them to the fs that
>> isn't yet in wide use and can afford to have crazy things done to it
>> without screwing a bunch of users who already depend on ext4's
>> stability? Thanks,
>>
>
> As I see it, stability is the *only* advantage of ext4 snapshots over btrfs
> even though the snapshot feature is new and not stable, you still
> have the good olf e2fsprogs tools that can get you out of any mess.
> specifically, fsck -x will discard all snapshot files and make your ext4
> fs clean and stable again.
>
> The repair tool is one thing that btrfs is still lacking, so I back CTERA's
> decision to progress to ext4 with snapshots and not to btrfs on a
> production system.
>

Sure, if you had spent time on a fsck tool for btrfs you would be done
by now ;). I feel that ext4 is becoming a dumping ground for everyone's
pet project, which is resulting in this weird Frankenstein-like fs that
is growing organically, rather than a great, stable and all-around
useful file system. Rather than cramming more crap into it, maybe we
should evaluate whether the work is useful in the first place when
things like btrfs or the dm snapshotting stuff exist. Thanks,

Josef

2011-06-07 17:14:57

by Sunil Mushran

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On 06/07/2011 09:46 AM, Amir G. wrote:
> I understand the bitterness in btrfs community regarding ext4 snapshot
> feature. You might say the same things about ext4 64bit feature.
> I think it is not up to us to decide how it rolls. it's the users
> and companies involved that dictate where the development happens.

Bitterness is not the issue. The issue is what happens when your
_patron_ has had enough of the project and decides to stop funding.

Are you going to spend your free time maintaining the entire file system?

2011-06-07 17:30:56

by Theodore Ts'o

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 07, 2011 at 10:14:27AM -0700, Sunil Mushran wrote:
> On 06/07/2011 09:46 AM, Amir G. wrote:
> >I understand the bitterness in btrfs community regarding ext4 snapshot
> >feature. You might say the same things about ext4 64bit feature.
> >I think it is not up to us to decide how it rolls. it's the users
> >and companies involved that dictate where the development happens.
>
> Bitterness is not the issue. The issue is what happens when your
> _patron_ has had enough of the project and decides to stop funding.

Yep. On the other hand, the question is that if you move too slowly,
the patron is just as likely to find another solution to his/her
business problems. Sometimes perfect is the enemy of the good, and
the best technology is not necessarily what carries the day.
Practical issues of what is available and what works "good enough" are
just as important, if not more so.

The philosophical questions have been discussed before in the "Worse
is Better" dialectic. See: http://dreamsongs.com/WorseIsBetter.html

Or, this set of slides by the same author, "Models of Software
Acceptance: How Winners Win": http://dreamsongs.com/Files/AcceptanceModels.pdf

Back to solid ground --- I'm not going to insist on perfection, but at
the same time, the maintenance burden is something that has to be
acceptable, and there needs to be a plan towards making it better. If I
thought Amir was going to disappear the moment the snapshot patches
got accepted, there's no way I'd be accepting them.

- Ted

2011-06-07 17:54:25

by Amir G.

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 7, 2011 at 8:14 PM, Sunil Mushran wrote:
> On 06/07/2011 09:46 AM, Amir G. wrote:
>>
>> I understand the bitterness in btrfs community regarding ext4 snapshot
>> feature. You might say the same things about ext4 64bit feature.
>> I think it is not up to us to decide how it rolls. it's the users
>> and companies involved that dictate where the development happens.
>
> Bitterness is not the issue. The issue is what happens when your
> _patron_ has had enough of the project and decides to stop funding.
>

My patron is not funding the project for the good of humanity.
It has customers using the product and relying on the snapshot
feature.
This is what I meant by market forces driving the development
decisions.

> Are you going to spend your free time maintaining the entire file system?
>

Nope. But if we accept the premise that the ext4 snapshots feature
is something that can drive a business to more income, then I should
have no problem finding a new patron if it comes to that.

Cheers,
Amir.

2011-06-07 18:22:04

by Amir G.

Subject: Re: [PATCH RFC 00/30] Ext4 snapshots - core patches

On Tue, Jun 7, 2011 at 7:54 PM, Josef Bacik wrote:
> On 06/07/2011 12:46 PM, Amir G. wrote:
>> On Tue, Jun 7, 2011 at 6:26 PM, Josef Bacik wrote:
>>>
>>> I probably should have brought this up before, but why put all this
>>> effort into shoehorning in such a big an invasive feature to ext4 when
>>> btrfs does this all already? Why not put your efforts into helping
>>> btrfs become stable and ready and then use that, instead of having to
>>> come up with a bunch of hacks to get around the myriad of weird feature
>>> combinations you can get with ext4?
>>
>> Hi Josef,
>>
>> I understand the bitterness in btrfs community regarding ext4 snapshot
>> feature. You might say the same things about ext4 64bit feature.
>> I think it is not up to us to decide how it rolls. it's the users
>> and companies involved that dictate where the development happens.
>>
>
> Oh don't misunderstand me, I'm not bitter. It just seems like this is a
> lot of work for something you get for free with btrfs. A lot of work
> which I don't really think is justified when it comes to ext4.
>

Bitterness was a poor choice of expression.
Let me rephrase myself: I understand the wish of the btrfs community
that ext4 development would be focused on stabilization only
and that more developers would invest their time on stabilizing and
enhancing btrfs.

But there's the perfect world where everyone has migrated to btrfs
and there's real life, where sys admins are still hanging onto
ext3...

>> I like the answer that Ted once replied to the old btrfs vs. ext4 question:
>> competition is good because it makes us modest.
>>
>> I believe there is room in the future for both fs's, even with
>> similar features in both.
>>
>>
>>
>>>
>>> The wonderful thing about ext4 is its a nice basic fs. If we're going
>>> to start doing lots of crazy things, why not do them to the fs that
>>> isn't yet in wide use and can afford to have crazy things done to it
>>> without screwing a bunch of users who already depend on ext4's
>>> stability? Thanks,
>>>
>>
>> As I see it, stability is the *only* advantage of ext4 snapshots over btrfs
>> even though the snapshot feature is new and not stable, you still
>> have the good olf e2fsprogs tools that can get you out of any mess.
>> specifically, fsck -x will discard all snapshot files and make your ext4
>> fs clean and stable again.
>>
>> The repair tool is one thing that btrfs is still lacking, so I back CTERA's
>> decision to progress to ext4 with snapshots and not to btrfs on a
>> production system.
>>
>
> Sure, if you had spent time on a fsck tool for btrfs you would be done
> by now ;).

Touche ;-)

However, one cannot fast forward 20(?) years of stabilization
of the extended fs on-disk format checking tools.
Moreover, I think there is a reason, beyond not finding one
developer, why btrfs repair tools have not been written yet.
The few degrees of freedom in the rigid extended fs format allow fsck
to be very effective in rescuing most of the fs.
Btrfs, being the much bigger hammer that it is, with everything being
a tree and all, has more degrees of freedom, which makes it hard for
any repair tool (or sysadmin) to make the right decision.
And the fact that ext4 snapshots (almost) don't change the extended
on-disk format, because snapshot files pose as regular sparse
files, is their strongest (well, only) advantage over btrfs at the
moment.

> I feel that ext4 is becoming a dumping ground for every ones
> pet project which is resulting in this weird frankenstien like fs that
> is growing organically, rather than a great, stable and all around
> useful file system. Rather than cramming more crap into it, maybe we
> should evaluate whether the work is useful in the first place with
> things like btrfs or the dm snapshotting stuff exist. Thanks,
>

And how exactly will we be making this evaluation?
There is no clear value for 'stable' that can be used to compare
the alternatives.

Cheers,
Amir.