Thanks everyone for the feedback. Here's [PATCH v2], with these notable
changes since [PATCH 0/5 resend]:
- made bisect-safe "Freeze and thaw the journal on ext4 freeze"
- functional change: allow read-only quota subcmds when frozen
- added "VFS: Rename and refactor writeback_inodes_sb_if_idle"
- split-out "VFS: Avoid read-write deadlock in try_to_writeback_inodes_sb"
- split-out "VFS: Document s_frozen state through freeze_super"
- dropped "VFS: Rename vfs_check_frozen()"
-----
This set of filesystem freeze/thaw deadlock patches is submitted on
behalf of the authors, Kamal Mostafa, Surbhi Palande and Valerie Aurora.
The patches fix the bug:
BugLink: https://bugs.launchpad.net/bugs/897421
Kamal Mostafa (1):
VFS: Rename and refactor writeback_inodes_sb_if_idle
Surbhi Palande (2):
Adding support to freeze and unfreeze a journal
Freeze and thaw the journal on ext4 freeze
Valerie Aurora (4):
VFS: Fix s_umount thaw/write deadlock
VFS: Avoid read-write deadlock in try_to_writeback_inodes_sb
VFS: Document s_frozen state through freeze_super
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
Documentation/filesystems/Locking | 4 +-
fs/btrfs/extent-tree.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/ext4/super.c | 13 +++++-----
fs/fs-writeback.c | 46 +++++++++++++++++++++---------------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 42 +++++++++++++++++++++++++++++++++
fs/quota/quota.c | 21 ++++++++++++++++-
fs/super.c | 26 +++++++++++++++++++++
fs/sync.c | 4 +-
include/linux/fs.h | 7 +++++-
include/linux/jbd2.h | 7 +++++
include/linux/writeback.h | 4 +-
13 files changed, 143 insertions(+), 36 deletions(-)
From: Surbhi Palande <[email protected]>
Freeze and thaw the journal when you freeze and thaw the filesystem.
BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Surbhi Palande <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/ext4/super.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3858767..751908b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4330,14 +4330,11 @@ static int ext4_freeze(struct super_block *sb)
journal = EXT4_SB(sb)->s_journal;
- /* Now we set up the journal barrier. */
- jbd2_journal_lock_updates(journal);
-
+ error = jbd2_journal_freeze(journal);
/*
- * Don't clear the needs_recovery flag if we failed to flush
+ * Don't clear the needs_recovery flag if we failed to freeze
* the journal.
*/
- error = jbd2_journal_flush(journal);
if (error < 0)
goto out;
@@ -4345,8 +4342,6 @@ static int ext4_freeze(struct super_block *sb)
EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
error = ext4_commit_super(sb, 1);
out:
- /* we rely on s_frozen to stop further updates */
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
return error;
}
@@ -4356,10 +4351,14 @@ out:
*/
static int ext4_unfreeze(struct super_block *sb)
{
+ journal_t *journal;
if (sb->s_flags & MS_RDONLY)
return 0;
lock_super(sb);
+ journal = EXT4_SB(sb)->s_journal;
+
+ jbd2_journal_thaw(journal);
/* Reset the needs_recovery flag before the fs is unlocked. */
EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
ext4_commit_super(sb, 1);
--
1.7.5.4
From: Valerie Aurora <[email protected]>
File system freeze/thaw require the superblock's s_umount lock.
However, we write to the file system while holding the s_umount lock
in several places (e.g., writeback and quotas). Any of these can
block on a frozen file system while holding s_umount, preventing any
later thaw from occurring, since thaw requires s_umount.
The solution is to add a check, vfs_is_frozen(), to all code paths
that write while holding sb->s_umount and return without writing if it
is true.
BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
[[email protected]: minor changes; patch restructure]
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/fs-writeback.c | 8 ++++++++
fs/quota/quota.c | 21 ++++++++++++++++++++-
fs/super.c | 8 ++++++++
fs/sync.c | 4 ++--
include/linux/fs.h | 7 ++++++-
5 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 73c3992..eee01cd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -554,6 +554,14 @@ static long writeback_sb_inodes(struct super_block *sb,
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
+ if (inode->i_sb == sb && vfs_is_frozen(sb)) {
+ /*
+ * Inode is on a frozen superblock; skip it for now.
+ */
+ redirty_tail(inode, wb);
+ continue;
+ }
+
if (inode->i_sb != sb) {
if (work->sb) {
/*
diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 35f4b0e..1d770f2 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -47,7 +47,7 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,
static void quota_sync_one(struct super_block *sb, void *arg)
{
- if (sb->s_qcop && sb->s_qcop->quota_sync)
+ if (sb->s_qcop && sb->s_qcop->quota_sync && !vfs_is_frozen(sb))
sb->s_qcop->quota_sync(sb, *(int *)arg, 1);
}
@@ -368,8 +368,27 @@ SYSCALL_DEFINE4(quotactl, unsigned int, cmd, const char __user *, special,
goto out;
}
+ /*
+ * If the fs is frozen, only allow read-only quota subcmds.
+ */
+ if (vfs_is_frozen(sb)) {
+ switch (cmd) {
+ case Q_GETFMT:
+ case Q_GETINFO:
+ case Q_XGETQSTAT:
+ ret = 0;
+ break;
+ default:
+ ret = -EBUSY;
+ break;
+ }
+ if ( ret != 0 )
+ goto out_drop_super;
+ }
+
ret = do_quotactl(sb, type, cmds, id, addr, pathp);
+out_drop_super:
drop_super(sb);
out:
if (pathp && !IS_ERR(pathp))
diff --git a/fs/super.c b/fs/super.c
index afd0f1a..5629d06 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -526,6 +526,10 @@ void sync_supers(void)
*
* Scans the superblock list and calls given function, passing it
* locked superblock and given argument.
+ *
+ * Note: If @f is going to write to the file system, it must first
+ * check if the file system is frozen (via vfs_is_frozen(sb)) and
+ * refuse to write if so.
*/
void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
{
@@ -595,6 +599,10 @@ EXPORT_SYMBOL(iterate_supers_type);
*
* Scans the superblock list and finds the superblock of the file system
* mounted on the device given. %NULL is returned if no match is found.
+ *
+ * Note: If caller is going to write to the superblock, it must first
+ * check if the file system is frozen (via vfs_is_frozen(sb)) and
+ * refuse to write if so.
*/
struct super_block *get_super(struct block_device *bdev)
diff --git a/fs/sync.c b/fs/sync.c
index 101b8ef..e8166db 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -68,7 +68,7 @@ int sync_filesystem(struct super_block *sb)
/*
* No point in syncing out anything if the filesystem is read-only.
*/
- if (sb->s_flags & MS_RDONLY)
+ if ((sb->s_flags & MS_RDONLY) || vfs_is_frozen(sb))
return 0;
ret = __sync_filesystem(sb, 0);
@@ -80,7 +80,7 @@ EXPORT_SYMBOL_GPL(sync_filesystem);
static void sync_one_sb(struct super_block *sb, void *arg)
{
- if (!(sb->s_flags & MS_RDONLY))
+ if (!(sb->s_flags & MS_RDONLY) && !vfs_is_frozen(sb))
__sync_filesystem(sb, *(int *)arg);
}
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 019dc55..ec33b4c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1490,7 +1490,7 @@ extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
extern struct timespec current_fs_time(struct super_block *sb);
/*
- * Snapshotting support.
+ * Snapshotting support. See freeze_super() for documentation.
*/
enum {
SB_UNFROZEN = 0,
@@ -1501,6 +1501,11 @@ enum {
#define vfs_check_frozen(sb, level) \
wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
+static inline int vfs_is_frozen(struct super_block *sb)
+{
+ return sb->s_frozen == SB_FREEZE_TRANS;
+}
+
/*
* until VFS tracks user namespaces for inodes, just make all files
* belong to init_user_ns
--
1.7.5.4
From: Valerie Aurora <[email protected]>
freeze_fs/unfreeze_fs ops are called with s_umount held for write, not read.
BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
Documentation/filesystems/Locking | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index d819ba1..7e46a94 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -134,8 +134,8 @@ evict_inode:
put_super: write
write_super: read
sync_fs: read
-freeze_fs: read
-unfreeze_fs: read
+freeze_fs: write
+unfreeze_fs: write
statfs: maybe(read) (see below)
remount_fs: write
umount_begin: no
--
1.7.5.4
From: Valerie Aurora <[email protected]>
BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
[[email protected]: patch restructure]
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/super.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index 5629d06..a56696b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1140,6 +1140,24 @@ out:
* Syncs the super to make sure the filesystem is consistent and calls the fs's
* freeze_fs. Subsequent calls to this without first thawing the fs will return
* -EBUSY.
+ *
+ * During this function, sb->s_frozen goes through these values:
+ *
+ * SB_UNFROZEN: File system is normal, all writes progress as usual.
+ *
+ * SB_FREEZE_WRITE: The file system is in the process of being frozen
+ * and any remaining out-standing writes are being synced. Writes
+ * that complete in-process writes should be permitted but new ones
+ * should be blocked.
+ *
+ * SB_FREEZE_TRANS: The file system is frozen. The ->freeze_fs and
+ * ->unfreeze_fs ops are the only operations permitted to write to the
+ * file system in this state.
+ *
+ * sb->s_frozen is protected by sb->s_umount. Additionally,
+ * SB_FREEZE_WRITE is only temporarily set during freeze/thaw while
+ * holding sb->s_umount for writing, so any other callers holding
+ * sb->s_umount will never see this state.
*/
int freeze_super(struct super_block *sb)
{
--
1.7.5.4
From: Valerie Aurora <[email protected]>
Use trylock in try_to_writeback_inodes_sb to avoid read-write
deadlocks that could be triggered by freeze.
BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
[[email protected]: patch restructure]
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/fs-writeback.c | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ea89b3f..3a80f1b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1274,8 +1274,9 @@ EXPORT_SYMBOL(writeback_inodes_sb);
* try_to_writeback_inodes_sb - start writeback if none underway
* @sb: the superblock
*
- * Invoke writeback_inodes_sb if no writeback is currently underway.
- * Returns 1 if writeback was started, 0 if not.
+ * Invoke writeback_inodes_sb if no writeback is currently underway
+ * and no one else holds the s_umount lock. Returns 1 if writeback
+ * was started, 0 if not.
*/
int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
{
@@ -1288,15 +1289,17 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
* @sb: the superblock
* @nr: the number of pages to write
*
- * Invoke writeback_inodes_sb if no writeback is currently underway.
- * Returns 1 if writeback was started, 0 if not.
+ * Invoke writeback_inodes_sb if no writeback is currently underway
+ * and no one else holds the s_umount lock. Returns 1 if writeback
+ * was started, 0 if not.
*/
int try_to_writeback_inodes_sb_nr(struct super_block *sb,
unsigned long nr,
enum wb_reason reason)
{
if (!writeback_in_progress(sb->s_bdi)) {
- down_read(&sb->s_umount);
+ if (!down_read_trylock(&sb->s_umount))
+ return 0;
if (nr == 0)
writeback_inodes_sb(sb, reason);
else
--
1.7.5.4
Rename writeback_inodes_sb{_nr}_if_idle to try_to_writeback_inodes_sb{_nr}
and refactor to avoid duplicating logic.
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/btrfs/extent-tree.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/fs-writeback.c | 25 +++++++++++--------------
include/linux/writeback.h | 4 ++--
4 files changed, 15 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f0d5718..6f6fe2b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3416,7 +3416,7 @@ static int shrink_delalloc(struct btrfs_root *root, u64 to_reclaim,
smp_mb();
nr_pages = min_t(unsigned long, nr_pages,
root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT);
- writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages,
+ try_to_writeback_inodes_sb_nr(root->fs_info->sb, nr_pages,
WB_REASON_FS_FREE_SPACE);
spin_lock(&space_info->lock);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 848f436..0da75cd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2373,7 +2373,7 @@ static int ext4_nonda_switch(struct super_block *sb)
* start pushing delalloc when 1/2 of free blocks are dirty.
*/
if (free_blocks < 2 * dirty_blocks)
- writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
+ try_to_writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
return 0;
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index eee01cd..ea89b3f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1271,45 +1271,42 @@ void writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
EXPORT_SYMBOL(writeback_inodes_sb);
/**
- * writeback_inodes_sb_if_idle - start writeback if none underway
+ * try_to_writeback_inodes_sb - start writeback if none underway
* @sb: the superblock
*
* Invoke writeback_inodes_sb if no writeback is currently underway.
* Returns 1 if writeback was started, 0 if not.
*/
-int writeback_inodes_sb_if_idle(struct super_block *sb, enum wb_reason reason)
+int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
{
- if (!writeback_in_progress(sb->s_bdi)) {
- down_read(&sb->s_umount);
- writeback_inodes_sb(sb, reason);
- up_read(&sb->s_umount);
- return 1;
- } else
- return 0;
+ return try_to_writeback_inodes_sb_nr(sb, 0, reason);
}
-EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
+EXPORT_SYMBOL(try_to_writeback_inodes_sb);
/**
- * writeback_inodes_sb_if_idle - start writeback if none underway
+ * try_to_writeback_inodes_sb_nr - start writeback if none underway
* @sb: the superblock
* @nr: the number of pages to write
*
* Invoke writeback_inodes_sb if no writeback is currently underway.
* Returns 1 if writeback was started, 0 if not.
*/
-int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
+int try_to_writeback_inodes_sb_nr(struct super_block *sb,
unsigned long nr,
enum wb_reason reason)
{
if (!writeback_in_progress(sb->s_bdi)) {
down_read(&sb->s_umount);
- writeback_inodes_sb_nr(sb, nr, reason);
+ if (nr == 0)
+ writeback_inodes_sb(sb, reason);
+ else
+ writeback_inodes_sb_nr(sb, nr, reason);
up_read(&sb->s_umount);
return 1;
} else
return 0;
}
-EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
+EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
/**
* sync_inodes_sb - sync sb inode pages
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a378c29..e824225 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -89,8 +89,8 @@ int inode_wait(void *);
void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
-int writeback_inodes_sb_if_idle(struct super_block *, enum wb_reason reason);
-int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr,
+int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
+int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
void sync_inodes_sb(struct super_block *);
long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
--
1.7.5.4
From: Surbhi Palande <[email protected]>
The journal should be frozen when a filesystem freezes. What this means is
that until the filesystem is thawed again, no new transactions should be
accepted by the journal. When the filesystem thaws, inturn it should thaw the
journal and this should allow the journal to resume accepting new
transactions. While the the filesystem has frozen the journal, the clients of
the journal on calling jbd2_journal_start() will sleep on a wait queue.
Thawing the journal will wake up the sleeping clients and journalling can
progress normally.
An example of the race condition that can happen without this patch is as
follows:
Say the filesystem is thawed when we begin. Let tx be the time at unit x
P1: Process doing an aio write
t1) ext4_file_write()
t2) generic_file_aio_write()
t3) __generic_file_aio_write()
// filesystem is not frozen, so we do not block in the next check.
t4) vfs_check_frozen()
t5) generic_write_checks()
----------------- Prempted------------------
P2: Process that does filesystem freeze
t6) freeze_super()
t7) sync_filesystem()
t8) sync_blockdev()
t9) sb->s_op->freeze_fs() (= ext4_freeze)
t10) jbd2_journal_lock_updates()
t11) jbd2_journal_flush()
// Need to unlock the journal before returning to user space.
t12) jbd2_journal_unlock_updates()
// Journal is unlocked and so we can start accepting new transactions now.
// freezing process completes execution. Page cache is now clean and should
// remain clean till the filesystem is frozen.
--------------------------------------------
P1: writing process gets the control back
t13) generic_file_buffered_write()
t14) generic_perform_write()
t15) a_ops->write_begin() (= ext4_write_begin)
t16) ext4_journal_start()
// New handle is started. We do not block here! Write continues
// dirtying the page cache while the filesystem is frozen!
BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Surbhi Palande <[email protected]>
Acked-by: Jan Kara <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 7 +++++++
3 files changed, 50 insertions(+), 0 deletions(-)
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0fa0123..f0170cc 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -894,6 +894,7 @@ static journal_t * journal_init_common (void)
init_waitqueue_head(&journal->j_wait_checkpoint);
init_waitqueue_head(&journal->j_wait_commit);
init_waitqueue_head(&journal->j_wait_updates);
+ init_waitqueue_head(&journal->j_wait_frozen);
mutex_init(&journal->j_barrier);
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a0e41a4..340ee35 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -173,6 +173,17 @@ repeat:
journal->j_barrier_count == 0);
goto repeat;
}
+ /* Don't let a new handle start when a journal is frozen.
+ * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+ * the j_flags indicate that the journal is frozen. So if the
+ * j_barrier_count is 0, then check if this was made 0 by the freezing
+ * process
+ */
+ if (journal->j_flags & JBD2_FROZEN) {
+ read_unlock(&journal->j_state_lock);
+ wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN));
+ goto repeat;
+ }
if (!journal->j_running_transaction) {
read_unlock(&journal->j_state_lock);
@@ -492,6 +503,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
}
EXPORT_SYMBOL(jbd2_journal_restart);
+int jbd2_journal_freeze(journal_t *journal)
+{
+ int error = 0;
+ /* Now we set up the journal barrier. */
+ jbd2_journal_lock_updates(journal);
+
+ /*
+ * Don't clear the needs_recovery flag if we failed to flush
+ * the journal.
+ */
+ error = jbd2_journal_flush(journal);
+ if (error >= 0) {
+ write_lock(&journal->j_state_lock);
+ journal->j_flags |= JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ }
+ jbd2_journal_unlock_updates(journal);
+ return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+ write_lock(&journal->j_state_lock);
+ journal->j_flags &= ~JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2092ea2..bfa0752 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -658,6 +658,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
* @j_wait_checkpoint: Wait queue to trigger checkpointing
* @j_wait_commit: Wait queue to trigger commit
* @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
* @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
* @j_head: Journal head - identifies the first unused block in the journal
* @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -775,6 +776,9 @@ struct journal_s
/* Wait queue to wait for updates to complete */
wait_queue_head_t j_wait_updates;
+ /* Wait queue to wait for journal to thaw*/
+ wait_queue_head_t j_wait_frozen;
+
/* Semaphore for locking against concurrent checkpoints */
struct mutex j_checkpoint_mutex;
@@ -953,6 +957,7 @@ struct journal_s
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_FROZEN 0x080 /* Journal thread frozen along with filesystem */
/*
* Function declarations for the journaling transaction and buffer
@@ -1060,6 +1065,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned long);
extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
extern int jbd2_journal_stop(handle_t *);
+extern int jbd2_journal_freeze(journal_t *);
+extern void jbd2_journal_thaw(journal_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
--
1.7.5.4
On thu, 8 Dec 2011 10:04:34 -0800, Kamal Mostafa wrote:
> /**
> - * writeback_inodes_sb_if_idle - start writeback if none underway
> + * try_to_writeback_inodes_sb_nr - start writeback if none underway
> * @sb: the superblock
> * @nr: the number of pages to write
> *
> * Invoke writeback_inodes_sb if no writeback is currently underway.
> * Returns 1 if writeback was started, 0 if not.
> */
> -int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
> +int try_to_writeback_inodes_sb_nr(struct super_block *sb,
> unsigned long nr,
> enum wb_reason reason)
> {
> if (!writeback_in_progress(sb->s_bdi)) {
> down_read(&sb->s_umount);
> - writeback_inodes_sb_nr(sb, nr, reason);
> + if (nr == 0)
> + writeback_inodes_sb(sb, reason);
> + else
> + writeback_inodes_sb_nr(sb, nr, reason);
> up_read(&sb->s_umount);
> return 1;
> } else
> return 0;
The comment said "Returns 1 if writeback was started", so if writeback_in_progress()
return true, I think this function also should return 1.
BTW: Does anyone know when this patchset will be merged into the main tree?
Thanks
Miao
Ping...
On tue, 13 Dec 2011 11:34:05 +0800, Miao Xie wrote:
> On thu, 8 Dec 2011 10:04:34 -0800, Kamal Mostafa wrote:
>> /**
>> - * writeback_inodes_sb_if_idle - start writeback if none underway
>> + * try_to_writeback_inodes_sb_nr - start writeback if none underway
>> * @sb: the superblock
>> * @nr: the number of pages to write
>> *
>> * Invoke writeback_inodes_sb if no writeback is currently underway.
>> * Returns 1 if writeback was started, 0 if not.
>> */
>> -int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>> +int try_to_writeback_inodes_sb_nr(struct super_block *sb,
>> unsigned long nr,
>> enum wb_reason reason)
>> {
>> if (!writeback_in_progress(sb->s_bdi)) {
>> down_read(&sb->s_umount);
>> - writeback_inodes_sb_nr(sb, nr, reason);
>> + if (nr == 0)
>> + writeback_inodes_sb(sb, reason);
>> + else
>> + writeback_inodes_sb_nr(sb, nr, reason);
>> up_read(&sb->s_umount);
>> return 1;
>> } else
>> return 0;
>
> The comment said "Returns 1 if writeback was started", so if writeback_in_progress()
> return true, I think this function also should return 1.
>
> BTW: Does anyone know when this patchset will be merged into the main tree?
>
> Thanks
> Miao
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Tue, 2011-12-13 at 11:34 +0800, Miao Xie wrote:
> On thu, 8 Dec 2011 10:04:34 -0800, Kamal Mostafa wrote:
> > /**
> > - * writeback_inodes_sb_if_idle - start writeback if none underway
> > + * try_to_writeback_inodes_sb_nr - start writeback if none underway
> > * @sb: the superblock
> > * @nr: the number of pages to write
> > *
> > * Invoke writeback_inodes_sb if no writeback is currently underway.
> > * Returns 1 if writeback was started, 0 if not.
> > */
> > -int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
> > +int try_to_writeback_inodes_sb_nr(struct super_block *sb,
> > unsigned long nr,
> > enum wb_reason reason)
> > {
> > if (!writeback_in_progress(sb->s_bdi)) {
> > down_read(&sb->s_umount);
> > - writeback_inodes_sb_nr(sb, nr, reason);
> > + if (nr == 0)
> > + writeback_inodes_sb(sb, reason);
> > + else
> > + writeback_inodes_sb_nr(sb, nr, reason);
> > up_read(&sb->s_umount);
> > return 1;
> > } else
> > return 0;
>
> The comment said "Returns 1 if writeback was started", so if writeback_in_progress()
> return true, I think this function also should return 1.
My interpretation of that comment is that it will return 1 only if this
call results in a new writeback being started (not if a writeback was
already in progress).
This patch [4/7] intentionally does not introduce any functional
changes.
> BTW: Does anyone know when this patchset will be merged into the main tree?
I also eagerly await the merge of this patch set.
-Kamal
On Thu 08-12-11 10:04:32, Kamal Mostafa wrote:
> From: Surbhi Palande <[email protected]>
>
> Freeze and thaw the journal when you freeze and thaw the filesystem.
>
> BugLink: https://bugs.launchpad.net/bugs/897421
> Signed-off-by: Surbhi Palande <[email protected]>
> Cc: Kamal Mostafa <[email protected]>
> Tested-by: Peter M. Petrakis <[email protected]>
> Signed-off-by: Kamal Mostafa <[email protected]>
The patch looks good. You can add:
Acked-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/super.c | 13 ++++++-------
> 1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3858767..751908b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4330,14 +4330,11 @@ static int ext4_freeze(struct super_block *sb)
>
> journal = EXT4_SB(sb)->s_journal;
>
> - /* Now we set up the journal barrier. */
> - jbd2_journal_lock_updates(journal);
> -
> + error = jbd2_journal_freeze(journal);
> /*
> - * Don't clear the needs_recovery flag if we failed to flush
> + * Don't clear the needs_recovery flag if we failed to freeze
> * the journal.
> */
> - error = jbd2_journal_flush(journal);
> if (error < 0)
> goto out;
>
> @@ -4345,8 +4342,6 @@ static int ext4_freeze(struct super_block *sb)
> EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> error = ext4_commit_super(sb, 1);
> out:
> - /* we rely on s_frozen to stop further updates */
> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> return error;
> }
>
> @@ -4356,10 +4351,14 @@ out:
> */
> static int ext4_unfreeze(struct super_block *sb)
> {
> + journal_t *journal;
> if (sb->s_flags & MS_RDONLY)
> return 0;
>
> lock_super(sb);
> + journal = EXT4_SB(sb)->s_journal;
> +
> + jbd2_journal_thaw(journal);
> /* Reset the needs_recovery flag before the fs is unlocked. */
> EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> ext4_commit_super(sb, 1);
> --
> 1.7.5.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 08-12-11 10:04:34, Kamal Mostafa wrote:
> Rename writeback_inodes_sb{_nr}_if_idle to try_to_writeback_inodes_sb{_nr}
> and refactor to avoid duplicating logic.
Looks good to me. You can add:
Acked-by: Jan Kara <[email protected]>
Honza
>
> Signed-off-by: Kamal Mostafa <[email protected]>
> ---
> fs/btrfs/extent-tree.c | 2 +-
> fs/ext4/inode.c | 2 +-
> fs/fs-writeback.c | 25 +++++++++++--------------
> include/linux/writeback.h | 4 ++--
> 4 files changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f0d5718..6f6fe2b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3416,7 +3416,7 @@ static int shrink_delalloc(struct btrfs_root *root, u64 to_reclaim,
> smp_mb();
> nr_pages = min_t(unsigned long, nr_pages,
> root->fs_info->delalloc_bytes >> PAGE_CACHE_SHIFT);
> - writeback_inodes_sb_nr_if_idle(root->fs_info->sb, nr_pages,
> + try_to_writeback_inodes_sb_nr(root->fs_info->sb, nr_pages,
> WB_REASON_FS_FREE_SPACE);
>
> spin_lock(&space_info->lock);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 848f436..0da75cd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2373,7 +2373,7 @@ static int ext4_nonda_switch(struct super_block *sb)
> * start pushing delalloc when 1/2 of free blocks are dirty.
> */
> if (free_blocks < 2 * dirty_blocks)
> - writeback_inodes_sb_if_idle(sb, WB_REASON_FS_FREE_SPACE);
> + try_to_writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
>
> return 0;
> }
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index eee01cd..ea89b3f 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1271,45 +1271,42 @@ void writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
> EXPORT_SYMBOL(writeback_inodes_sb);
>
> /**
> - * writeback_inodes_sb_if_idle - start writeback if none underway
> + * try_to_writeback_inodes_sb - start writeback if none underway
> * @sb: the superblock
> *
> * Invoke writeback_inodes_sb if no writeback is currently underway.
> * Returns 1 if writeback was started, 0 if not.
> */
> -int writeback_inodes_sb_if_idle(struct super_block *sb, enum wb_reason reason)
> +int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
> {
> - if (!writeback_in_progress(sb->s_bdi)) {
> - down_read(&sb->s_umount);
> - writeback_inodes_sb(sb, reason);
> - up_read(&sb->s_umount);
> - return 1;
> - } else
> - return 0;
> + return try_to_writeback_inodes_sb_nr(sb, 0, reason);
> }
> -EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
> +EXPORT_SYMBOL(try_to_writeback_inodes_sb);
>
> /**
> - * writeback_inodes_sb_if_idle - start writeback if none underway
> + * try_to_writeback_inodes_sb_nr - start writeback if none underway
> * @sb: the superblock
> * @nr: the number of pages to write
> *
> * Invoke writeback_inodes_sb if no writeback is currently underway.
> * Returns 1 if writeback was started, 0 if not.
> */
> -int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
> +int try_to_writeback_inodes_sb_nr(struct super_block *sb,
> unsigned long nr,
> enum wb_reason reason)
> {
> if (!writeback_in_progress(sb->s_bdi)) {
> down_read(&sb->s_umount);
> - writeback_inodes_sb_nr(sb, nr, reason);
> + if (nr == 0)
> + writeback_inodes_sb(sb, reason);
> + else
> + writeback_inodes_sb_nr(sb, nr, reason);
> up_read(&sb->s_umount);
> return 1;
> } else
> return 0;
> }
> -EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
> +EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
>
> /**
> * sync_inodes_sb - sync sb inode pages
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a378c29..e824225 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -89,8 +89,8 @@ int inode_wait(void *);
> void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
> void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
> enum wb_reason reason);
> -int writeback_inodes_sb_if_idle(struct super_block *, enum wb_reason reason);
> -int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr,
> +int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
> +int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
> enum wb_reason reason);
> void sync_inodes_sb(struct super_block *);
> long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
> --
> 1.7.5.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 08-12-11 10:04:35, Kamal Mostafa wrote:
> From: Valerie Aurora <[email protected]>
>
> Use trylock in try_to_writeback_inodes_sb to avoid read-write
> deadlocks that could be triggered by freeze.
Christoph asked about what is the exact deadlock this patch tries to fix.
I don't think you answered that. So can you elaborate please? Is it somehow
connected with the fact that ext4 calls try_to_writeback_inodes_sb() with
i_mutex held?
Honza
> BugLink: https://bugs.launchpad.net/bugs/897421
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: Kamal Mostafa <[email protected]>
> Tested-by: Peter M. Petrakis <[email protected]>
> [[email protected]: patch restructure]
> Signed-off-by: Kamal Mostafa <[email protected]>
> ---
> fs/fs-writeback.c | 13 ++++++++-----
> 1 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index ea89b3f..3a80f1b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1274,8 +1274,9 @@ EXPORT_SYMBOL(writeback_inodes_sb);
> * try_to_writeback_inodes_sb - start writeback if none underway
> * @sb: the superblock
> *
> - * Invoke writeback_inodes_sb if no writeback is currently underway.
> - * Returns 1 if writeback was started, 0 if not.
> + * Invoke writeback_inodes_sb if no writeback is currently underway
> + * and no one else holds the s_umount lock. Returns 1 if writeback
> + * was started, 0 if not.
> */
> int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
> {
> @@ -1288,15 +1289,17 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
> * @sb: the superblock
> * @nr: the number of pages to write
> *
> - * Invoke writeback_inodes_sb if no writeback is currently underway.
> - * Returns 1 if writeback was started, 0 if not.
> + * Invoke writeback_inodes_sb if no writeback is currently underway
> + * and no one else holds the s_umount lock. Returns 1 if writeback
> + * was started, 0 if not.
> */
> int try_to_writeback_inodes_sb_nr(struct super_block *sb,
> unsigned long nr,
> enum wb_reason reason)
> {
> if (!writeback_in_progress(sb->s_bdi)) {
> - down_read(&sb->s_umount);
> + if (!down_read_trylock(&sb->s_umount))
> + return 0;
> if (nr == 0)
> writeback_inodes_sb(sb, reason);
> else
> --
> 1.7.5.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 08-12-11 10:04:36, Kamal Mostafa wrote:
> From: Valerie Aurora <[email protected]>
Looks good. You can add:
Acked-by: Jan Kara <[email protected]>
Honza
> BugLink: https://bugs.launchpad.net/bugs/897421
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: Kamal Mostafa <[email protected]>
> Tested-by: Peter M. Petrakis <[email protected]>
> [[email protected]: patch restructure]
> Signed-off-by: Kamal Mostafa <[email protected]>
> ---
> fs/super.c | 18 ++++++++++++++++++
> 1 files changed, 18 insertions(+), 0 deletions(-)
>
> diff --git a/fs/super.c b/fs/super.c
> index 5629d06..a56696b 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -1140,6 +1140,24 @@ out:
> * Syncs the super to make sure the filesystem is consistent and calls the fs's
> * freeze_fs. Subsequent calls to this without first thawing the fs will return
> * -EBUSY.
> + *
> + * During this function, sb->s_frozen goes through these values:
> + *
> + * SB_UNFROZEN: File system is normal, all writes progress as usual.
> + *
> + * SB_FREEZE_WRITE: The file system is in the process of being frozen
> + * and any remaining out-standing writes are being synced. Writes
> + * that complete in-process writes should be permitted but new ones
> + * should be blocked.
> + *
> + * SB_FREEZE_TRANS: The file system is frozen. The ->freeze_fs and
> + * ->unfreeze_fs ops are the only operations permitted to write to the
> + * file system in this state.
> + *
> + * sb->s_frozen is protected by sb->s_umount. Additionally,
> + * SB_FREEZE_WRITE is only temporarily set during freeze/thaw while
> + * holding sb->s_umount for writing, so any other callers holding
> + * sb->s_umount will never see this state.
> */
> int freeze_super(struct super_block *sb)
> {
> --
> 1.7.5.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 08-12-11 10:04:37, Kamal Mostafa wrote:
> From: Valerie Aurora <[email protected]>
>
> freeze_fs/unfreeze_fs ops are called with s_umount held for write, not read.
Looks good. You can add:
Acked-by: Jan Kara <[email protected]>
Honza
>
> BugLink: https://bugs.launchpad.net/bugs/897421
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: Kamal Mostafa <[email protected]>
> Tested-by: Peter M. Petrakis <[email protected]>
> Signed-off-by: Kamal Mostafa <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> Documentation/filesystems/Locking | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index d819ba1..7e46a94 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -134,8 +134,8 @@ evict_inode:
> put_super: write
> write_super: read
> sync_fs: read
> -freeze_fs: read
> -unfreeze_fs: read
> +freeze_fs: write
> +unfreeze_fs: write
> statfs: maybe(read) (see below)
> remount_fs: write
> umount_begin: no
> --
> 1.7.5.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hello,
On Thu 08-12-11 10:04:33, Kamal Mostafa wrote:
> From: Valerie Aurora <[email protected]>
>
> File system freeze/thaw require the superblock's s_umount lock.
> However, we write to the file system while holding the s_umount lock
> in several places (e.g., writeback and quotas). Any of these can
> block on a frozen file system while holding s_umount, preventing any
> later thaw from occurring, since thaw requires s_umount.
>
> The solution is to add a check, vfs_is_frozen(), to all code paths
> that write while holding sb->s_umount and return without writing if it
> is true.
Mikulas Patocka correctly pointed out (in thread starting by
https://lkml.org/lkml/2011/11/25/169) that skipping frozen filesystem is
currently actually wrong for filesystems such as ext2. These filesystems
cannot stop modifications from happening and thus sync must not skip them
even when they are marked as frozen. That complicates the solution of the
deadlock you observe.
Another issue is that even with ext4 current freezing code can leave dirty
data on frozen filesystem. Consider the following race:
Thread 1 Thread 2
freeze_super() __generic_file_aio_write()
... vfs_check_frozen(sb)
sb->s_frozen = SB_FREEZE_WRITE; ...
sync_filesystem(sb);
now we go and write to the fs
sb->s_frozen = SB_FREEZE_TRANS;
Your patches would just hide this race (which I can actually trigger rather
easily in my testing).
Above two issues make me think more about how to really avoid having any
dirty bits set on frozen filesystem. Then we shouldn't need this patch at
all. The trouble is that race like above can really happen with any
operation modifying the filesystem so it's not really easy to fix. I'll
write email about that tomorrow...
Honza
> BugLink: https://bugs.launchpad.net/bugs/897421
> Signed-off-by: Valerie Aurora <[email protected]>
> Cc: Kamal Mostafa <[email protected]>
> Tested-by: Peter M. Petrakis <[email protected]>
> [[email protected]: minor changes; patch restructure]
> Signed-off-by: Kamal Mostafa <[email protected]>
> ---
> fs/fs-writeback.c | 8 ++++++++
> fs/quota/quota.c | 21 ++++++++++++++++++++-
> fs/super.c | 8 ++++++++
> fs/sync.c | 4 ++--
> include/linux/fs.h | 7 ++++++-
> 5 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 73c3992..eee01cd 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -554,6 +554,14 @@ static long writeback_sb_inodes(struct super_block *sb,
> while (!list_empty(&wb->b_io)) {
> struct inode *inode = wb_inode(wb->b_io.prev);
>
> + if (inode->i_sb == sb && vfs_is_frozen(sb)) {
> + /*
> + * Inode is on a frozen superblock; skip it for now.
> + */
> + redirty_tail(inode, wb);
> + continue;
> + }
> +
> if (inode->i_sb != sb) {
> if (work->sb) {
> /*
> diff --git a/fs/quota/quota.c b/fs/quota/quota.c
> index 35f4b0e..1d770f2 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -47,7 +47,7 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,
>
> static void quota_sync_one(struct super_block *sb, void *arg)
> {
> - if (sb->s_qcop && sb->s_qcop->quota_sync)
> + if (sb->s_qcop && sb->s_qcop->quota_sync && !vfs_is_frozen(sb))
> sb->s_qcop->quota_sync(sb, *(int *)arg, 1);
> }
>
> @@ -368,8 +368,27 @@ SYSCALL_DEFINE4(quotactl, unsigned int, cmd, const char __user *, special,
> goto out;
> }
>
> + /*
> + * If the fs is frozen, only allow read-only quota subcmds.
> + */
> + if (vfs_is_frozen(sb)) {
> + switch (cmd) {
> + case Q_GETFMT:
> + case Q_GETINFO:
> + case Q_XGETQSTAT:
> + ret = 0;
> + break;
> + default:
> + ret = -EBUSY;
> + break;
> + }
> + if ( ret != 0 )
> + goto out_drop_super;
> + }
> +
> ret = do_quotactl(sb, type, cmds, id, addr, pathp);
>
> +out_drop_super:
> drop_super(sb);
> out:
> if (pathp && !IS_ERR(pathp))
> diff --git a/fs/super.c b/fs/super.c
> index afd0f1a..5629d06 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -526,6 +526,10 @@ void sync_supers(void)
> *
> * Scans the superblock list and calls given function, passing it
> * locked superblock and given argument.
> + *
> + * Note: If @f is going to write to the file system, it must first
> + * check if the file system is frozen (via vfs_is_frozen(sb)) and
> + * refuse to write if so.
> */
> void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> {
> @@ -595,6 +599,10 @@ EXPORT_SYMBOL(iterate_supers_type);
> *
> * Scans the superblock list and finds the superblock of the file system
> * mounted on the device given. %NULL is returned if no match is found.
> + *
> + * Note: If caller is going to write to the superblock, it must first
> + * check if the file system is frozen (via vfs_is_frozen(sb)) and
> + * refuse to write if so.
> */
>
> struct super_block *get_super(struct block_device *bdev)
> diff --git a/fs/sync.c b/fs/sync.c
> index 101b8ef..e8166db 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -68,7 +68,7 @@ int sync_filesystem(struct super_block *sb)
> /*
> * No point in syncing out anything if the filesystem is read-only.
> */
> - if (sb->s_flags & MS_RDONLY)
> + if ((sb->s_flags & MS_RDONLY) || vfs_is_frozen(sb))
> return 0;
>
> ret = __sync_filesystem(sb, 0);
> @@ -80,7 +80,7 @@ EXPORT_SYMBOL_GPL(sync_filesystem);
>
> static void sync_one_sb(struct super_block *sb, void *arg)
> {
> - if (!(sb->s_flags & MS_RDONLY))
> + if (!(sb->s_flags & MS_RDONLY) && !vfs_is_frozen(sb))
> __sync_filesystem(sb, *(int *)arg);
> }
> /*
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 019dc55..ec33b4c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1490,7 +1490,7 @@ extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
> extern struct timespec current_fs_time(struct super_block *sb);
>
> /*
> - * Snapshotting support.
> + * Snapshotting support. See freeze_super() for documentation.
> */
> enum {
> SB_UNFROZEN = 0,
> @@ -1501,6 +1501,11 @@ enum {
> #define vfs_check_frozen(sb, level) \
> wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
>
> +static inline int vfs_is_frozen(struct super_block *sb)
> +{
> + return sb->s_frozen == SB_FREEZE_TRANS;
> +}
> +
> /*
> * until VFS tracks user namespaces for inodes, just make all files
> * belong to init_user_ns
> --
> 1.7.5.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 12/8/11 12:04 PM, Kamal Mostafa wrote:
> From: Surbhi Palande <[email protected]>
>
> The journal should be frozen when a filesystem freezes. What this means is
> that until the filesystem is thawed again, no new transactions should be
> accepted by the journal. When the filesystem thaws, inturn it should thaw the
> journal and this should allow the journal to resume accepting new
> transactions. While the the filesystem has frozen the journal, the clients of
> the journal on calling jbd2_journal_start() will sleep on a wait queue.
> Thawing the journal will wake up the sleeping clients and journalling can
> progress normally.
>
> An example of the race condition that can happen without this patch is as
> follows:
>
> Say the filesystem is thawed when we begin. Let tx be the time at unit x
>
> P1: Process doing an aio write
> t1) ext4_file_write()
> t2) generic_file_aio_write()
> t3) __generic_file_aio_write()
> // filesystem is not frozen, so we do not block in the next check.
> t4) vfs_check_frozen()
> t5) generic_write_checks()
> ----------------- Prempted------------------
>
> P2: Process that does filesystem freeze
>
> t6) freeze_super()
> t7) sync_filesystem()
> t8) sync_blockdev()
> t9) sb->s_op->freeze_fs() (= ext4_freeze)
> t10) jbd2_journal_lock_updates()
> t11) jbd2_journal_flush()
> // Need to unlock the journal before returning to user space.
> t12) jbd2_journal_unlock_updates()
> // Journal is unlocked and so we can start accepting new transactions now.
>
> // freezing process completes execution. Page cache is now clean and should
> // remain clean till the filesystem is frozen.
> --------------------------------------------
>
> P1: writing process gets the control back
> t13) generic_file_buffered_write()
> t14) generic_perform_write()
> t15) a_ops->write_begin() (= ext4_write_begin)
> t16) ext4_journal_start()
> // New handle is started. We do not block here! Write continues
> // dirtying the page cache while the filesystem is frozen!
Hrm let me think through this a little more; we actually do:
t16) ext4_journal_start()
t17) ext4_journal_start_sb()
t18) handle = ext4_journal_current_handle();
t19) if (!handle) vfs_check_frozen()
t20) ... jbd2_journal_start()
So actually we *do* block new handles, but let *existing* ones
continue (see commits 6b0310fbf087ad6e9e3b8392adca97cd77184084
and be4f27d324e8ddd57cc0d4d604fe85ee0425cba9)
So your assertion that a new handle is started is incorrect
in general, isn't it? So then does the fix seem necessary?
Or, at least, in the fashion below - maybe we need to just make
sure all started handles complete before the unlock_updates?
Or am I missing something...?
-Eric
>
> BugLink: https://bugs.launchpad.net/bugs/897421
> Signed-off-by: Surbhi Palande <[email protected]>
> Acked-by: Jan Kara <[email protected]>
> Reviewed-by: Andreas Dilger <[email protected]>
> Cc: Kamal Mostafa <[email protected]>
> Tested-by: Peter M. Petrakis <[email protected]>
> Signed-off-by: Kamal Mostafa <[email protected]>
> ---
> fs/jbd2/journal.c | 1 +
> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h | 7 +++++++
> 3 files changed, 50 insertions(+), 0 deletions(-)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 0fa0123..f0170cc 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -894,6 +894,7 @@ static journal_t * journal_init_common (void)
> init_waitqueue_head(&journal->j_wait_checkpoint);
> init_waitqueue_head(&journal->j_wait_commit);
> init_waitqueue_head(&journal->j_wait_updates);
> + init_waitqueue_head(&journal->j_wait_frozen);
> mutex_init(&journal->j_barrier);
> mutex_init(&journal->j_checkpoint_mutex);
> spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index a0e41a4..340ee35 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -173,6 +173,17 @@ repeat:
> journal->j_barrier_count == 0);
> goto repeat;
> }
> + /* Don't let a new handle start when a journal is frozen.
> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> + * the j_flags indicate that the journal is frozen. So if the
> + * j_barrier_count is 0, then check if this was made 0 by the freezing
> + * process
> + */
> + if (journal->j_flags & JBD2_FROZEN) {
> + read_unlock(&journal->j_state_lock);
> + wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN));
> + goto repeat;
> + }
>
> if (!journal->j_running_transaction) {
> read_unlock(&journal->j_state_lock);
> @@ -492,6 +503,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> }
> EXPORT_SYMBOL(jbd2_journal_restart);
>
> +int jbd2_journal_freeze(journal_t *journal)
> +{
> + int error = 0;
> + /* Now we set up the journal barrier. */
> + jbd2_journal_lock_updates(journal);
> +
> + /*
> + * Don't clear the needs_recovery flag if we failed to flush
> + * the journal.
> + */
> + error = jbd2_journal_flush(journal);
> + if (error >= 0) {
> + write_lock(&journal->j_state_lock);
> + journal->j_flags |= JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + }
> + jbd2_journal_unlock_updates(journal);
> + return error;
> +}
> +EXPORT_SYMBOL(jbd2_journal_freeze);
> +
> +void jbd2_journal_thaw(journal_t * journal)
> +{
> + write_lock(&journal->j_state_lock);
> + journal->j_flags &= ~JBD2_FROZEN;
> + write_unlock(&journal->j_state_lock);
> + wake_up(&journal->j_wait_frozen);
> +}
> +EXPORT_SYMBOL(jbd2_journal_thaw);
> +
> +
> /**
> * void jbd2_journal_lock_updates () - establish a transaction barrier.
> * @journal: Journal to establish a barrier on.
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 2092ea2..bfa0752 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -658,6 +658,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> * @j_wait_checkpoint: Wait queue to trigger checkpointing
> * @j_wait_commit: Wait queue to trigger commit
> * @j_wait_updates: Wait queue to wait for updates to complete
> + * @j_wait_frozen: Wait queue to wait for journal to thaw
> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> * @j_head: Journal head - identifies the first unused block in the journal
> * @j_tail: Journal tail - identifies the oldest still-used block in the
> @@ -775,6 +776,9 @@ struct journal_s
> /* Wait queue to wait for updates to complete */
> wait_queue_head_t j_wait_updates;
>
> + /* Wait queue to wait for journal to thaw*/
> + wait_queue_head_t j_wait_frozen;
> +
> /* Semaphore for locking against concurrent checkpoints */
> struct mutex j_checkpoint_mutex;
>
> @@ -953,6 +957,7 @@ struct journal_s
> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
> * data write error in ordered
> * mode */
> +#define JBD2_FROZEN 0x080 /* Journal thread frozen along with filesystem */
>
> /*
> * Function declarations for the journaling transaction and buffer
> @@ -1060,6 +1065,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
> struct page *, unsigned long);
> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> extern int jbd2_journal_stop(handle_t *);
> +extern int jbd2_journal_freeze(journal_t *);
> +extern void jbd2_journal_thaw(journal_t *);
> extern int jbd2_journal_flush (journal_t *);
> extern void jbd2_journal_lock_updates (journal_t *);
> extern void jbd2_journal_unlock_updates (journal_t *);
On Tue 10-01-12 14:20:23, Eric Sandeen wrote:
> On 12/8/11 12:04 PM, Kamal Mostafa wrote:
> > From: Surbhi Palande <[email protected]>
> >
> > The journal should be frozen when a filesystem freezes. What this means is
> > that until the filesystem is thawed again, no new transactions should be
> > accepted by the journal. When the filesystem thaws, inturn it should thaw the
> > journal and this should allow the journal to resume accepting new
> > transactions. While the the filesystem has frozen the journal, the clients of
> > the journal on calling jbd2_journal_start() will sleep on a wait queue.
> > Thawing the journal will wake up the sleeping clients and journalling can
> > progress normally.
> >
> > An example of the race condition that can happen without this patch is as
> > follows:
> >
> > Say the filesystem is thawed when we begin. Let tx be the time at unit x
> >
> > P1: Process doing an aio write
> > t1) ext4_file_write()
> > t2) generic_file_aio_write()
> > t3) __generic_file_aio_write()
> > // filesystem is not frozen, so we do not block in the next check.
> > t4) vfs_check_frozen()
> > t5) generic_write_checks()
> > ----------------- Prempted------------------
> >
> > P2: Process that does filesystem freeze
> >
> > t6) freeze_super()
> > t7) sync_filesystem()
> > t8) sync_blockdev()
> > t9) sb->s_op->freeze_fs() (= ext4_freeze)
> > t10) jbd2_journal_lock_updates()
> > t11) jbd2_journal_flush()
> > // Need to unlock the journal before returning to user space.
> > t12) jbd2_journal_unlock_updates()
> > // Journal is unlocked and so we can start accepting new transactions now.
> >
> > // freezing process completes execution. Page cache is now clean and should
> > // remain clean till the filesystem is frozen.
> > --------------------------------------------
> >
> > P1: writing process gets the control back
> > t13) generic_file_buffered_write()
> > t14) generic_perform_write()
> > t15) a_ops->write_begin() (= ext4_write_begin)
> > t16) ext4_journal_start()
> > // New handle is started. We do not block here! Write continues
> > // dirtying the page cache while the filesystem is frozen!
>
> Hrm let me think through this a little more; we actually do:
>
> t16) ext4_journal_start()
> t17) ext4_journal_start_sb()
> t18) handle = ext4_journal_current_handle();
> t19) if (!handle) vfs_check_frozen()
> t20) ... jbd2_journal_start()
Ah, right. I forgot.
> So actually we *do* block new handles, but let *existing* ones
> continue (see commits 6b0310fbf087ad6e9e3b8392adca97cd77184084
> and be4f27d324e8ddd57cc0d4d604fe85ee0425cba9)
>
> So your assertion that a new handle is started is incorrect
> in general, isn't it? So then does the fix seem necessary?
> Or, at least, in the fashion below - maybe we need to just make
> sure all started handles complete before the unlock_updates?
> Or am I missing something...?
Well, the problem with running operations and freezing is more
fundamental I believe. See my email
http://marc.info/?l=linux-fsdevel&m=132585911925796&w=2
So I believe we'll need some better exclusion mechanism already in VFS.
Honza
> > BugLink: https://bugs.launchpad.net/bugs/897421
> > Signed-off-by: Surbhi Palande <[email protected]>
> > Acked-by: Jan Kara <[email protected]>
> > Reviewed-by: Andreas Dilger <[email protected]>
> > Cc: Kamal Mostafa <[email protected]>
> > Tested-by: Peter M. Petrakis <[email protected]>
> > Signed-off-by: Kamal Mostafa <[email protected]>
> > ---
> > fs/jbd2/journal.c | 1 +
> > fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
> > include/linux/jbd2.h | 7 +++++++
> > 3 files changed, 50 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 0fa0123..f0170cc 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -894,6 +894,7 @@ static journal_t * journal_init_common (void)
> > init_waitqueue_head(&journal->j_wait_checkpoint);
> > init_waitqueue_head(&journal->j_wait_commit);
> > init_waitqueue_head(&journal->j_wait_updates);
> > + init_waitqueue_head(&journal->j_wait_frozen);
> > mutex_init(&journal->j_barrier);
> > mutex_init(&journal->j_checkpoint_mutex);
> > spin_lock_init(&journal->j_revoke_lock);
> > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > index a0e41a4..340ee35 100644
> > --- a/fs/jbd2/transaction.c
> > +++ b/fs/jbd2/transaction.c
> > @@ -173,6 +173,17 @@ repeat:
> > journal->j_barrier_count == 0);
> > goto repeat;
> > }
> > + /* Don't let a new handle start when a journal is frozen.
> > + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
> > + * the j_flags indicate that the journal is frozen. So if the
> > + * j_barrier_count is 0, then check if this was made 0 by the freezing
> > + * process
> > + */
> > + if (journal->j_flags & JBD2_FROZEN) {
> > + read_unlock(&journal->j_state_lock);
> > + wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN));
> > + goto repeat;
> > + }
> >
> > if (!journal->j_running_transaction) {
> > read_unlock(&journal->j_state_lock);
> > @@ -492,6 +503,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
> > }
> > EXPORT_SYMBOL(jbd2_journal_restart);
> >
> > +int jbd2_journal_freeze(journal_t *journal)
> > +{
> > + int error = 0;
> > + /* Now we set up the journal barrier. */
> > + jbd2_journal_lock_updates(journal);
> > +
> > + /*
> > + * Don't clear the needs_recovery flag if we failed to flush
> > + * the journal.
> > + */
> > + error = jbd2_journal_flush(journal);
> > + if (error >= 0) {
> > + write_lock(&journal->j_state_lock);
> > + journal->j_flags |= JBD2_FROZEN;
> > + write_unlock(&journal->j_state_lock);
> > + }
> > + jbd2_journal_unlock_updates(journal);
> > + return error;
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_freeze);
> > +
> > +void jbd2_journal_thaw(journal_t * journal)
> > +{
> > + write_lock(&journal->j_state_lock);
> > + journal->j_flags &= ~JBD2_FROZEN;
> > + write_unlock(&journal->j_state_lock);
> > + wake_up(&journal->j_wait_frozen);
> > +}
> > +EXPORT_SYMBOL(jbd2_journal_thaw);
> > +
> > +
> > /**
> > * void jbd2_journal_lock_updates () - establish a transaction barrier.
> > * @journal: Journal to establish a barrier on.
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 2092ea2..bfa0752 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -658,6 +658,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
> > * @j_wait_checkpoint: Wait queue to trigger checkpointing
> > * @j_wait_commit: Wait queue to trigger commit
> > * @j_wait_updates: Wait queue to wait for updates to complete
> > + * @j_wait_frozen: Wait queue to wait for journal to thaw
> > * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
> > * @j_head: Journal head - identifies the first unused block in the journal
> > * @j_tail: Journal tail - identifies the oldest still-used block in the
> > @@ -775,6 +776,9 @@ struct journal_s
> > /* Wait queue to wait for updates to complete */
> > wait_queue_head_t j_wait_updates;
> >
> > + /* Wait queue to wait for journal to thaw*/
> > + wait_queue_head_t j_wait_frozen;
> > +
> > /* Semaphore for locking against concurrent checkpoints */
> > struct mutex j_checkpoint_mutex;
> >
> > @@ -953,6 +957,7 @@ struct journal_s
> > #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
> > * data write error in ordered
> > * mode */
> > +#define JBD2_FROZEN 0x080 /* Journal thread frozen along with filesystem */
> >
> > /*
> > * Function declarations for the journaling transaction and buffer
> > @@ -1060,6 +1065,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
> > struct page *, unsigned long);
> > extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
> > extern int jbd2_journal_stop(handle_t *);
> > +extern int jbd2_journal_freeze(journal_t *);
> > +extern void jbd2_journal_thaw(journal_t *);
> > extern int jbd2_journal_flush (journal_t *);
> > extern void jbd2_journal_lock_updates (journal_t *);
> > extern void jbd2_journal_unlock_updates (journal_t *);
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
> > Hrm let me think through this a little more; we actually do:
> >
> > t16) ext4_journal_start()
> > t17) ext4_journal_start_sb()
> > t18) handle = ext4_journal_current_handle();
> > t19) if (!handle) vfs_check_frozen()
> > t20) ... jbd2_journal_start()
> Ah, right. I forgot.
>
> > So actually we *do* block new handles, but let *existing* ones
> > continue (see commits 6b0310fbf087ad6e9e3b8392adca97cd77184084
> > and be4f27d324e8ddd57cc0d4d604fe85ee0425cba9)
> >
> > So your assertion that a new handle is started is incorrect
> > in general, isn't it? So then does the fix seem necessary?
> > Or, at least, in the fashion below - maybe we need to just make
> > sure all started handles complete before the unlock_updates?
> > Or am I missing something...?
> Well, the problem with running operations and freezing is more
> fundamental I believe. See my email
> http://marc.info/?l=linux-fsdevel&m=132585911925796&w=2
>
> So I believe we'll need some better exclusion mechanism already in VFS.
>
> Honza
>
If all the write operations were journaled, then this patch would not
allow ext4 filesystem to have any dirty data after its frozen.
(as journal_start() would block).
I think the only one candidate that creates dirty data without
calling ext4_journal_start() is mmapped?
Regards,
Surbhi.
> Hrm let me think through this a little more; we actually do:
>
> t16) ext4_journal_start()
> t17) ext4_journal_start_sb()
> t18) handle = ext4_journal_current_handle();
> t19) if (!handle) vfs_check_frozen()
> t20) ... jbd2_journal_start()
>
> So actually we *do* block new handles, but let *existing* ones
> continue (see commits 6b0310fbf087ad6e9e3b8392adca97cd77184084
> and be4f27d324e8ddd57cc0d4d604fe85ee0425cba9)
>
> So your assertion that a new handle is started is incorrect
> in general, isn't it? So then does the fix seem necessary?
> Or, at least, in the fashion below - maybe we need to just make
> sure all started handles complete before the unlock_updates?
> Or am I missing something...?
>
> -Eric
This patch addresses the race that occurs between filesystem freeze
and jbd2_journal_start().
Task1 Task2
t1) vfs_check_frozen()
t2) ---------------------------------------------------------------------
t3) filesystem is frozen.
t4) jbd2_journal_start()
Without this patch a new handle can be started because of the race.
With this patch, no new journaled transaction can start after the
journal is frozen and hence no new journaled writes can be made.
Regards,
Surbhi
>
>>
>> BugLink: https://bugs.launchpad.net/bugs/897421
>> Signed-off-by: Surbhi Palande <[email protected]>
>> Acked-by: Jan Kara <[email protected]>
>> Reviewed-by: Andreas Dilger <[email protected]>
>> Cc: Kamal Mostafa <[email protected]>
>> Tested-by: Peter M. Petrakis <[email protected]>
>> Signed-off-by: Kamal Mostafa <[email protected]>
>> ---
>> fs/jbd2/journal.c | 1 +
>> fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>> include/linux/jbd2.h | 7 +++++++
>> 3 files changed, 50 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index 0fa0123..f0170cc 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -894,6 +894,7 @@ static journal_t * journal_init_common (void)
>> init_waitqueue_head(&journal->j_wait_checkpoint);
>> init_waitqueue_head(&journal->j_wait_commit);
>> init_waitqueue_head(&journal->j_wait_updates);
>> + init_waitqueue_head(&journal->j_wait_frozen);
>> mutex_init(&journal->j_barrier);
>> mutex_init(&journal->j_checkpoint_mutex);
>> spin_lock_init(&journal->j_revoke_lock);
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index a0e41a4..340ee35 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -173,6 +173,17 @@ repeat:
>> journal->j_barrier_count == 0);
>> goto repeat;
>> }
>> + /* Don't let a new handle start when a journal is frozen.
>> + * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
>> + * the j_flags indicate that the journal is frozen. So if the
>> + * j_barrier_count is 0, then check if this was made 0 by the freezing
>> + * process
>> + */
>> + if (journal->j_flags & JBD2_FROZEN) {
>> + read_unlock(&journal->j_state_lock);
>> + wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN));
>> + goto repeat;
>> + }
>>
>> if (!journal->j_running_transaction) {
>> read_unlock(&journal->j_state_lock);
>> @@ -492,6 +503,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
>> }
>> EXPORT_SYMBOL(jbd2_journal_restart);
>>
>> +int jbd2_journal_freeze(journal_t *journal)
>> +{
>> + int error = 0;
>> + /* Now we set up the journal barrier. */
>> + jbd2_journal_lock_updates(journal);
>> +
>> + /*
>> + * Don't clear the needs_recovery flag if we failed to flush
>> + * the journal.
>> + */
>> + error = jbd2_journal_flush(journal);
>> + if (error >= 0) {
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags |= JBD2_FROZEN;
>> + write_unlock(&journal->j_state_lock);
>> + }
>> + jbd2_journal_unlock_updates(journal);
>> + return error;
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_freeze);
>> +
>> +void jbd2_journal_thaw(journal_t * journal)
>> +{
>> + write_lock(&journal->j_state_lock);
>> + journal->j_flags &= ~JBD2_FROZEN;
>> + write_unlock(&journal->j_state_lock);
>> + wake_up(&journal->j_wait_frozen);
>> +}
>> +EXPORT_SYMBOL(jbd2_journal_thaw);
>> +
>> +
>> /**
>> * void jbd2_journal_lock_updates () - establish a transaction barrier.
>> * @journal: Journal to establish a barrier on.
>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>> index 2092ea2..bfa0752 100644
>> --- a/include/linux/jbd2.h
>> +++ b/include/linux/jbd2.h
>> @@ -658,6 +658,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
>> * @j_wait_checkpoint: Wait queue to trigger checkpointing
>> * @j_wait_commit: Wait queue to trigger commit
>> * @j_wait_updates: Wait queue to wait for updates to complete
>> + * @j_wait_frozen: Wait queue to wait for journal to thaw
>> * @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
>> * @j_head: Journal head - identifies the first unused block in the journal
>> * @j_tail: Journal tail - identifies the oldest still-used block in the
>> @@ -775,6 +776,9 @@ struct journal_s
>> /* Wait queue to wait for updates to complete */
>> wait_queue_head_t j_wait_updates;
>>
>> + /* Wait queue to wait for journal to thaw*/
>> + wait_queue_head_t j_wait_frozen;
>> +
>> /* Semaphore for locking against concurrent checkpoints */
>> struct mutex j_checkpoint_mutex;
>>
>> @@ -953,6 +957,7 @@ struct journal_s
>> #define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
>> * data write error in ordered
>> * mode */
>> +#define JBD2_FROZEN 0x080 /* Journal thread frozen along with filesystem */
>>
>> /*
>> * Function declarations for the journaling transaction and buffer
>> @@ -1060,6 +1065,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
>> struct page *, unsigned long);
>> extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
>> extern int jbd2_journal_stop(handle_t *);
>> +extern int jbd2_journal_freeze(journal_t *);
>> +extern void jbd2_journal_thaw(journal_t *);
>> extern int jbd2_journal_flush (journal_t *);
>> extern void jbd2_journal_lock_updates (journal_t *);
>> extern void jbd2_journal_unlock_updates (journal_t *);
>
On Tue 10-01-12 13:50:22, Surbhi Palande wrote:
> > > Hrm let me think through this a little more; we actually do:
> > >
> > > t16) ext4_journal_start()
> > > t17) ext4_journal_start_sb()
> > > t18) handle = ext4_journal_current_handle();
> > > t19) if (!handle) vfs_check_frozen()
> > > t20) ... jbd2_journal_start()
> > Ah, right. I forgot.
> >
> > > So actually we *do* block new handles, but let *existing* ones
> > > continue (see commits 6b0310fbf087ad6e9e3b8392adca97cd77184084
> > > and be4f27d324e8ddd57cc0d4d604fe85ee0425cba9)
> > >
> > > So your assertion that a new handle is started is incorrect
> > > in general, isn't it? So then does the fix seem necessary?
> > > Or, at least, in the fashion below - maybe we need to just make
> > > sure all started handles complete before the unlock_updates?
> > > Or am I missing something...?
> > Well, the problem with running operations and freezing is more
> > fundamental I believe. See my email
> > http://marc.info/?l=linux-fsdevel&m=132585911925796&w=2
> >
> > So I believe we'll need some better exclusion mechanism already in VFS.
> >
> > Honza
> >
>
>
> If all the write operations were journaled, then this patch would not allow
> ext4 filesystem to have any dirty data after its frozen.
> (as journal_start() would block).
>
> I think the only one candidate that creates dirty data without calling
> ext4_journal_start() is mmapped?
No, the problem is in any write path. The problem is with operations
that happen during the phase when s_frozen == SB_FREEZE_WRITE. These
operations dirty the filesystem but running sync may easily miss them.
During this phase journal is not frozen so that does not help you in any
way.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Jan,
>>
>>
>> If all the write operations were journaled, then this patch would not allow
>> ext4 filesystem to have any dirty data after its frozen.
>> (as journal_start() would block).
>>
>> I think the only one candidate that creates dirty data without calling
>> ext4_journal_start() is mmapped?
> No, the problem is in any write path. The problem is with operations
> that happen during the phase when s_frozen == SB_FREEZE_WRITE. These
> operations dirty the filesystem but running sync may easily miss them.
> During this phase journal is not frozen so that does not help you in any
> way.
>
> Honza
Ok! No new transaction can really start after the journal is frozen.
But we can have dirty data after SB_FREEZE_WRITE and before
SB_FREEZE_TRANS.
I agree with you. However, can this be fixed by adding a
sync_filesystem() in freeze_super() after the sb->s_op->freeze_fs() is
over?
So then essentially, when freeze_super() returns, the page cache is clean?
I do definitely agree that the fix is to add a lock for mutual
exclusion between freeze filesystem and writes to a frozen filesystem.
Thanks!
Regards,
Surbhi.
On Tue 10-01-12 16:13:46, Surbhi Palande wrote:
> >> If all the write operations were journaled, then this patch would not allow
> >> ext4 filesystem to have any dirty data after its frozen.
> >> (as journal_start() would block).
> >>
> >> ?I think the only one candidate that creates dirty data without calling
> >> ext4_journal_start() is mmapped?
> > ?No, the problem is in any write path. The problem is with operations
> > that happen during the phase when s_frozen == SB_FREEZE_WRITE. These
> > operations dirty the filesystem but running sync may easily miss them.
> > During this phase journal is not frozen so that does not help you in any
> > way.
> >
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Honza
>
> Ok! No new transaction can really start after the journal is frozen.
> But we can have dirty data after SB_FREEZE_WRITE and before
> SB_FREEZE_TRANS.
> I agree with you. However, can this be fixed by adding a
> sync_filesystem() in freeze_super() after the sb->s_op->freeze_fs() is
> over?
The problem with this is that to do writeback you need to start a
transaction. So you would deadlock.
> So then essentially, when freeze_super() returns, the page cache is clean?
>
> I do definitely agree that the fix is to add a lock for mutual
> exclusion between freeze filesystem and writes to a frozen filesystem.
Yes, that would be the cleanest solution but it's quite some work.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 1/10/12 3:31 PM, Jan Kara wrote:
> On Tue 10-01-12 14:20:23, Eric Sandeen wrote:
<snip>
>> Hrm let me think through this a little more; we actually do:
>>
>> t16) ext4_journal_start()
>> t17) ext4_journal_start_sb()
>> t18) handle = ext4_journal_current_handle();
>> t19) if (!handle) vfs_check_frozen()
>> t20) ... jbd2_journal_start()
> Ah, right. I forgot.
>
>> So actually we *do* block new handles, but let *existing* ones
>> continue (see commits 6b0310fbf087ad6e9e3b8392adca97cd77184084
>> and be4f27d324e8ddd57cc0d4d604fe85ee0425cba9)
>>
>> So your assertion that a new handle is started is incorrect
>> in general, isn't it? So then does the fix seem necessary?
>> Or, at least, in the fashion below - maybe we need to just make
>> sure all started handles complete before the unlock_updates?
>> Or am I missing something...?
> Well, the problem with running operations and freezing is more
> fundamental I believe. See my email
> http://marc.info/?l=linux-fsdevel&m=132585911925796&w=2
>
> So I believe we'll need some better exclusion mechanism already in VFS.
>
> Honza
>
Yep, saw it, just wasn't sure if this patchset was still under active consideration.
Thanks,
-Eric
On second thoughts, I fail to see why there is still a race window
after this patch.
Here are the reasons why i fail to see how the data can be dirtied
when all the operations involve a journal:
----------
So here is the problem that we see
CPU1 CPU2
Task1 (write operation) Task2
---------------------------------------------------------------------------------------
t1 ext4_journal_start()
t2 ext4_journal_start_sb()
t3 vfs_check_frozen sb->frozen=SB_FREEZE_WRITE
t4 jbd2_journal_start() /* hence forth all processes calling
vfs_check_frozen will wait */
Now, our aim is to stop Task1 from dirtying the page cache ie in
starting this transaction. However if it is successful in starting
this transaction, then we want to make sure that this transaction is
flushed out.
Correct?
(with the journal freeze patch applied)
* Task1 is now executing the code of jbd2_journal_start(). Task2 is
the freezing process.
CPU1 CPU2
Task1 (write operation) Task2
t4 jbd2_journal_start()
...
tx read_lock(&journal->j_state_lock)
Now two things can happen at this point with respect to Task2:
a) either journal->j_flags is set to JBD2_FROZEN already
b) or it is not set.
Lets look at both the cases:
A) When journal->j_flags is set to JBD2_FROZEN already then
jbd2_journal_start() will get stuck on the waitqueue as long as the
journal->j_flags is JBD2_FROZEN. So we cannot create dirty data in
this case.
B) When journal->j_flags is not set to JBD2_ FROZEN then two more
things could happen:
Task2 has already finished the call to jbd2_journal_flush() by the
time Task1 calls read_lock(). So now no new task (T1) should create
any new dirty data as that will not get flushed out. So we really want
to stop the jbd2_journal_start() from succeeding.
CPU1 CPU2
Task1 (write operation) Task2
----------------------------------------------------------------------------------------------
tx read_lock(&journal->j_state_lock)
jbd2_journal_lock_updates() /*journal->j_barrier_count incremented -
so non zero ! */
tx+1 jbd2_journal_flush()
tx+2 write_lock(&journal->j_state_lock);
/* till the read_lock is relinquished by
task1 , we are stuck */
tx+3 if(journal->j_barrier_count)
read_unlock(&journal->j_state_lock);
/* so we can set the journal->j_flags now as write_lock()
succeeds here */
tx+4 goto repeat
The above case is where the writing process/jbd2_journal_start() calls
read_lock() *after* jbd2_journal_lock_updates() are called by the
freezing process and hence is protected by the j_barrier_count check.
This transaction can definitely not start!
Now following is the case where the writing
process/jbd2_journal_start() calls read_lock() *before*
jbd2_journal_lock_updates() are called by the freezing process.
However, if Task1 already finished starting the transaction as follows:
CPU1 CPU2
Task1 (write operation)
Task2
-------------------------------------------------------------------------------------------------------------------------------------------------------------
tx read_lock(&journal->j_state_lock)
tx+1
jbd2_journal_lock_updates() (will be stuck at the next instance...)
tx+3 if(journal->j_barrier_count)
write_lock(journal->j_state_lock)
/* journal->j_barrier_count = 0, so we proceed here*/
/* stuck here till Task1 relinquishes the lock*/
tx+4 read_unlock(&journal->j_state_lock);
tx+5 now start_this_handle() returns successfully
and associates this handle with the running
transaction.
tx+6
jbd2_journal_lock_updates() succeeds
tx+7
jbd2_journal_flush() /* This shall flush the running transactions */
/* There are no
outstanding transcations. No new ext4_journal_start() will succeed by
this point*/
--------------------------------
Thus:
Though a racy transaction can really be started after SB_FREEZE_WRITE,
either that transaction will be flushed by jbd2_journal_flush or it
cannot be started at all because of the barrier count.
So, I do not understand why the journaled writes should really have a
race window after this patch is applied?
Thanks!
Regards,
Surbhi
On Tue, Jan 10, 2012 at 4:13 PM, Surbhi Palande <[email protected]> wrote:
> Hi Jan,
>
>>>
>>>
>>> If all the write operations were journaled, then this patch would not allow
>>> ext4 filesystem to have any dirty data after its frozen.
>>> (as journal_start() would block).
>>>
>>> I think the only one candidate that creates dirty data without calling
>>> ext4_journal_start() is mmapped?
>> No, the problem is in any write path. The problem is with operations
>> that happen during the phase when s_frozen == SB_FREEZE_WRITE. These
>> operations dirty the filesystem but running sync may easily miss them.
>> During this phase journal is not frozen so that does not help you in any
>> way.
>>
>> Honza
>
> Ok! No new transaction can really start after the journal is frozen.
> But we can have dirty data after SB_FREEZE_WRITE and before
> SB_FREEZE_TRANS.
> I agree with you. However, can this be fixed by adding a
> sync_filesystem() in freeze_super() after the sb->s_op->freeze_fs() is
> over?
>
>
> So then essentially, when freeze_super() returns, the page cache is clean?
>
> I do definitely agree that the fix is to add a lock for mutual
> exclusion between freeze filesystem and writes to a frozen filesystem.
>
> Thanks!
>
> Regards,
> Surbhi.
(Resending one of the last cases, as the two task scenario got mixed! )
> Now following is the case where the writing
> process/jbd2_journal_start() calls read_lock() *before*
> jbd2_journal_lock_updates() are called by the freezing process.
>
CPU1 CPU2
Task1 (write operation) Task2
tx read_lock(&journal->j_state_lock)
tx+1 jbd2_journal_lock_updates() (will be stuck
at the next instance...)
tx+3 if(journal->j_barrier_count)
write_lock(journal->j_state_lock) /* stuck here till Task1
relinquishes the lock*/
tx+4 handle started and associated with
running transaction.
tx+5 read_unlock(&journal->j_state_lock)
tx+6 jbd2_journal_lock_updates() succeeds
tx+7 jbd2_journal_flush() /* This shall flush the
running transactions */
/* There are no outstanding transcations. No
new ext4_journal_start() will succeed by this point */
> --------------------------------
> Thus:
>
> Though a racy transaction can really be started after SB_FREEZE_WRITE,
> either that transaction will be flushed by jbd2_journal_flush or it
> cannot be started at all because of the barrier count.
>
> So, I do not understand why the journaled writes should really have a
> race window after this patch is applied?
>
Regards,
Surbhi
On Tue, Jan 10, 2012 at 9:38 PM, Surbhi Palande <[email protected]> wrote:
> On second thoughts, I fail to see why there is still a race window
> after this patch.
>
> Here are the reasons why i fail to see how the data can be dirtied
> when all the operations involve a journal:
>
> ----------
> So here is the problem that we see
> CPU1 CPU2
> Task1 (write operation) Task2
> ---------------------------------------------------------------------------------------
> t1 ext4_journal_start()
> t2 ext4_journal_start_sb()
> t3 vfs_check_frozen sb->frozen=SB_FREEZE_WRITE
> t4 jbd2_journal_start() /* hence forth all processes calling
> vfs_check_frozen will wait */
>
> Now, our aim is to stop Task1 from dirtying the page cache ie in
> starting this transaction. However if it is successful in starting
> this transaction, then we want to make sure that this transaction is
> flushed out.
> Correct?
>
> (with the journal freeze patch applied)
> * Task1 is now executing the code of jbd2_journal_start(). Task2 is
> the freezing process.
>
> CPU1 CPU2
> Task1 (write operation) Task2
> t4 jbd2_journal_start()
> ...
> tx read_lock(&journal->j_state_lock)
>
> Now two things can happen at this point with respect to Task2:
> a) either journal->j_flags is set to JBD2_FROZEN already
> b) or it is not set.
>
>
> Lets look at both the cases:
> A) When journal->j_flags is set to JBD2_FROZEN already then
> jbd2_journal_start() will get stuck on the waitqueue as long as the
> journal->j_flags is JBD2_FROZEN. So we cannot create dirty data in
> this case.
>
> B) When journal->j_flags is not set to JBD2_ FROZEN then two more
> things could happen:
> Task2 has already finished the call to jbd2_journal_flush() by the
> time Task1 calls read_lock(). So now no new task (T1) should create
> any new dirty data as that will not get flushed out. So we really want
> to stop the jbd2_journal_start() from succeeding.
>
>
> CPU1 CPU2
> Task1 (write operation) Task2
> ----------------------------------------------------------------------------------------------
> tx read_lock(&journal->j_state_lock)
> jbd2_journal_lock_updates() /*journal->j_barrier_count incremented -
> so non zero ! */
> tx+1 jbd2_journal_flush()
> tx+2 write_lock(&journal->j_state_lock);
> /* till the read_lock is relinquished by
> task1 , we are stuck */
> tx+3 if(journal->j_barrier_count)
> read_unlock(&journal->j_state_lock);
>
> /* so we can set the journal->j_flags now as write_lock()
> succeeds here */
> tx+4 goto repeat
>
> The above case is where the writing process/jbd2_journal_start() calls
> read_lock() *after* jbd2_journal_lock_updates() are called by the
> freezing process and hence is protected by the j_barrier_count check.
> This transaction can definitely not start!
>
> Now following is the case where the writing
> process/jbd2_journal_start() calls read_lock() *before*
> jbd2_journal_lock_updates() are called by the freezing process.
>
> However, if Task1 already finished starting the transaction as follows:
> CPU1 CPU2
> Task1 (write operation)
> Task2
> -------------------------------------------------------------------------------------------------------------------------------------------------------------
> tx read_lock(&journal->j_state_lock)
> tx+1
> jbd2_journal_lock_updates() (will be stuck at the next instance...)
> tx+3 if(journal->j_barrier_count)
> write_lock(journal->j_state_lock)
> /* journal->j_barrier_count = 0, so we proceed here*/
> /* stuck here till Task1 relinquishes the lock*/
> tx+4 read_unlock(&journal->j_state_lock);
> tx+5 now start_this_handle() returns successfully
> and associates this handle with the running
> transaction.
> tx+6
> jbd2_journal_lock_updates() succeeds
> tx+7
> jbd2_journal_flush() /* This shall flush the running transactions */
> /* There are no
> outstanding transcations. No new ext4_journal_start() will succeed by
> this point*/
>
> --------------------------------
> Thus:
>
> Though a racy transaction can really be started after SB_FREEZE_WRITE,
> either that transaction will be flushed by jbd2_journal_flush or it
> cannot be started at all because of the barrier count.
>
> So, I do not understand why the journaled writes should really have a
> race window after this patch is applied?
>
> Thanks!
>
> Regards,
> Surbhi
>
>
>
>
>
>
>
> On Tue, Jan 10, 2012 at 4:13 PM, Surbhi Palande <[email protected]> wrote:
>> Hi Jan,
>>
>>>>
>>>>
>>>> If all the write operations were journaled, then this patch would not allow
>>>> ext4 filesystem to have any dirty data after its frozen.
>>>> (as journal_start() would block).
>>>>
>>>> I think the only one candidate that creates dirty data without calling
>>>> ext4_journal_start() is mmapped?
>>> No, the problem is in any write path. The problem is with operations
>>> that happen during the phase when s_frozen == SB_FREEZE_WRITE. These
>>> operations dirty the filesystem but running sync may easily miss them.
>>> During this phase journal is not frozen so that does not help you in any
>>> way.
>>>
>>> Honza
>>
>> Ok! No new transaction can really start after the journal is frozen.
>> But we can have dirty data after SB_FREEZE_WRITE and before
>> SB_FREEZE_TRANS.
>> I agree with you. However, can this be fixed by adding a
>> sync_filesystem() in freeze_super() after the sb->s_op->freeze_fs() is
>> over?
>>
>>
>> So then essentially, when freeze_super() returns, the page cache is clean?
>>
>> I do definitely agree that the fix is to add a lock for mutual
>> exclusion between freeze filesystem and writes to a frozen filesystem.
>>
>> Thanks!
>>
>> Regards,
>> Surbhi.
On Tue 10-01-12 21:38:29, Surbhi Palande wrote:
> On second thoughts, I fail to see why there is still a race window
> after this patch.
>
> Here are the reasons why i fail to see how the data can be dirtied
> when all the operations involve a journal:
>
> ----------
> So here is the problem that we see
> CPU1 CPU2
> Task1 (write operation) Task2
> ---------------------------------------------------------------------------------------
> t1 ext4_journal_start()
> t2 ext4_journal_start_sb()
> t3 vfs_check_frozen sb->frozen=SB_FREEZE_WRITE
> t4 jbd2_journal_start() /* hence forth all processes calling
> vfs_check_frozen will wait */
Note that we call vfs_check_frozen(sb, SB_FREEZE_TRANS) in
ext4_journal_start_sb(). Thus we start blocking only when s_frozen ==
SB_FREEZE_TRANS and we just ignore s_frozen == SB_FREEZE_WRITE.
> Now, our aim is to stop Task1 from dirtying the page cache ie in
> starting this transaction. However if it is successful in starting
> this transaction, then we want to make sure that this transaction is
> flushed out.
> Correct?
Not quite. Flushing a journal will flush dirty metadata but we will still
have dirty pages (dirty data is not part of any transaction). So in the
scenarion I describe in
http://marc.info/?l=linux-fsdevel&m=132585911925796&w=2
all metadata changes will be flushed inside ->freeze_fs (at least for
journalling filesystems) but pages will be left dirty. Is it clearer now?
But your comment makes me realize that the situation is simpler than I
thought by the fact that we only have to protect paths that create dirty
data as dirty metadata can be handled by flushing a journal. And there are
only a few places creating dirty data. So a reasonably clean solution
shouldn't be that complicated after all. I'll tweak my patch and try it in
a moment.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Jan,
Thanks a lot for your comments :)
Isn't dirty data flushed out in "ordered" mode? as
ext4_jbd2_file_inode() will get called for ordered writes. Thus this
inode's data is flushed at journal commit time through
journal_submit_data_buffers()?
However I do see that we will still have a dirty data problem for
"writeback" and "journalled" mode?
Regards,
Surbhi.
On Wed, Jan 11, 2012 at 4:10 AM, Jan Kara <[email protected]> wrote:
> On Tue 10-01-12 21:38:29, Surbhi Palande wrote:
>> On second thoughts, I fail to see why there is still a race window
>> after this patch.
>>
>> Here are the reasons why i fail to see how the data can be dirtied
>> when all the operations involve a journal:
>>
>> ----------
>> So here is the problem that we see
>> CPU1 CPU2
>> Task1 (write operation) Task2
>> ---------------------------------------------------------------------------------------
>> t1 ext4_journal_start()
>> t2 ext4_journal_start_sb()
>> t3 vfs_check_frozen sb->frozen=SB_FREEZE_WRITE
>> t4 jbd2_journal_start() /* hence forth all processes calling
>> vfs_check_frozen will wait */
> Note that we call vfs_check_frozen(sb, SB_FREEZE_TRANS) in
> ext4_journal_start_sb(). Thus we start blocking only when s_frozen ==
> SB_FREEZE_TRANS and we just ignore s_frozen == SB_FREEZE_WRITE.
>
>> Now, our aim is to stop Task1 from dirtying the page cache ie in
>> starting this transaction. However if it is successful in starting
>> this transaction, then we want to make sure that this transaction is
>> flushed out.
>> Correct?
> Not quite. Flushing a journal will flush dirty metadata but we will still
> have dirty pages (dirty data is not part of any transaction). So in the
> scenarion I describe in
> http://marc.info/?l=linux-fsdevel&m=132585911925796&w=2
> all metadata changes will be flushed inside ->freeze_fs (at least for
> journalling filesystems) but pages will be left dirty. Is it clearer now?
>
> But your comment makes me realize that the situation is simpler than I
> thought by the fact that we only have to protect paths that create dirty
> data as dirty metadata can be handled by flushing a journal. And there are
> only a few places creating dirty data. So a reasonably clean solution
> shouldn't be that complicated after all. I'll tweak my patch and try it in
> a moment.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
Hello,
On Wed 11-01-12 08:45:17, Surbhi Palande wrote:
> Isn't dirty data flushed out in "ordered" mode? as
> ext4_jbd2_file_inode() will get called for ordered writes. Thus this
> inode's data is flushed at journal commit time through
> journal_submit_data_buffers()?
Well, not with delayed allocation and also not for example for xfs. So
in some special cases it might happen but we cannot really depend on it.
> However I do see that we will still have a dirty data problem for
> "writeback" and "journalled" mode?
For journalled mode, data is treated as metadata so it's the mode where
the problems are smallest (although we'd still have problems because even
though kjournald writes the data, it clears only buffer dirty bits but not
page dirty bits). For writeback mode you are correct.
Honza
> On Wed, Jan 11, 2012 at 4:10 AM, Jan Kara <[email protected]> wrote:
> > On Tue 10-01-12 21:38:29, Surbhi Palande wrote:
> >> On second thoughts, I fail to see why there is still a race window
> >> after this patch.
> >>
> >> Here are the reasons why i fail to see how the data can be dirtied
> >> when all the operations involve a journal:
> >>
> >> ----------
> >> So here is the problem that we see
> >> ? ? ? CPU1 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? CPU2
> >> ? ? ? ?Task1 (write operation) ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Task2
> >> ---------------------------------------------------------------------------------------
> >> t1 ? ?ext4_journal_start()
> >> t2 ? ? ?ext4_journal_start_sb()
> >> t3 ? ? ? ?vfs_check_frozen ? ? ? ? ? ? ? ? ? ? ? ? ? ?sb->frozen=SB_FREEZE_WRITE
> >> t4 ? ? ? ? ? ?jbd2_journal_start() ? ? ? ? ? ? ? ? ? ?/* hence forth all processes calling
> >> vfs_check_frozen will wait */
> > ?Note that we call vfs_check_frozen(sb, SB_FREEZE_TRANS) in
> > ext4_journal_start_sb(). Thus we start blocking only when s_frozen ==
> > SB_FREEZE_TRANS and we just ignore s_frozen == SB_FREEZE_WRITE.
> >
> >> Now, our aim is to stop Task1 from dirtying the page cache ie in
> >> starting this transaction. However if it is successful in starting
> >> this transaction, then we want to make sure that this transaction is
> >> flushed out.
> >> Correct?
> > ?Not quite. Flushing a journal will flush dirty metadata but we will still
> > have dirty pages (dirty data is not part of any transaction). So in the
> > scenarion I describe in
> > http://marc.info/?l=linux-fsdevel&m=132585911925796&w=2
> > all metadata changes will be flushed inside ->freeze_fs (at least for
> > journalling filesystems) but pages will be left dirty. Is it clearer now?
> >
> > But your comment makes me realize that the situation is simpler than I
> > thought by the fact that we only have to protect paths that create dirty
> > data as dirty metadata can be handled by flushing a journal. And there are
> > only a few places creating dirty data. So a reasonably clean solution
> > shouldn't be that complicated after all. I'll tweak my patch and try it in
> > a moment.
> >
> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, 2012-01-06 at 01:35 +0100, Jan Kara wrote:
> On Thu 08-12-11 10:04:35, Kamal Mostafa wrote:
> > From: Valerie Aurora <[email protected]>
> >
> > Use trylock in try_to_writeback_inodes_sb to avoid read-write
> > deadlocks that could be triggered by freeze.
> Christoph asked about what is the exact deadlock this patch tries to fix.
> I don't think you answered that. So can you elaborate please? Is it somehow
> connected with the fact that ext4 calls try_to_writeback_inodes_sb() with
> i_mutex held?
>
> Honza
This was discussed in the thread
http://www.spinics.net/lists/linux-fsdevel/msg48754.html
Summarizing...
Jan> What's exactly the deadlock trylock protects from here?
Jan> Or is it just an optimization?
Val> The trylock is an optimization Dave Chinner suggested. The first
Val> version I wrote acquired the lock and then checked vfs_is_frozen().
Dave> It's not so much an optimisation, but the general case of avoiding
Dave> read-write deadlocks such that freezing can trigger. I think remount
Dave> can trigger the same deadlock as freezing, so the trylock avoids
Dave> both deadlock cases rather than just working around the freeze
Dave> problem....
-Kamal
> > BugLink: https://bugs.launchpad.net/bugs/897421
> > Signed-off-by: Valerie Aurora <[email protected]>
> > Cc: Kamal Mostafa <[email protected]>
> > Tested-by: Peter M. Petrakis <[email protected]>
> > [[email protected]: patch restructure]
> > Signed-off-by: Kamal Mostafa <[email protected]>
> > ---
> > fs/fs-writeback.c | 13 ++++++++-----
> > 1 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index ea89b3f..3a80f1b 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1274,8 +1274,9 @@ EXPORT_SYMBOL(writeback_inodes_sb);
> > * try_to_writeback_inodes_sb - start writeback if none underway
> > * @sb: the superblock
> > *
> > - * Invoke writeback_inodes_sb if no writeback is currently underway.
> > - * Returns 1 if writeback was started, 0 if not.
> > + * Invoke writeback_inodes_sb if no writeback is currently underway
> > + * and no one else holds the s_umount lock. Returns 1 if writeback
> > + * was started, 0 if not.
> > */
> > int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
> > {
> > @@ -1288,15 +1289,17 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
> > * @sb: the superblock
> > * @nr: the number of pages to write
> > *
> > - * Invoke writeback_inodes_sb if no writeback is currently underway.
> > - * Returns 1 if writeback was started, 0 if not.
> > + * Invoke writeback_inodes_sb if no writeback is currently underway
> > + * and no one else holds the s_umount lock. Returns 1 if writeback
> > + * was started, 0 if not.
> > */
> > int try_to_writeback_inodes_sb_nr(struct super_block *sb,
> > unsigned long nr,
> > enum wb_reason reason)
> > {
> > if (!writeback_in_progress(sb->s_bdi)) {
> > - down_read(&sb->s_umount);
> > + if (!down_read_trylock(&sb->s_umount))
> > + return 0;
> > if (nr == 0)
> > writeback_inodes_sb(sb, reason);
> > else
> > --
> > 1.7.5.4
> >
On Wed, 11 Jan 2012, Kamal Mostafa wrote:
> On Fri, 2012-01-06 at 01:35 +0100, Jan Kara wrote:
> > On Thu 08-12-11 10:04:35, Kamal Mostafa wrote:
> > > From: Valerie Aurora <[email protected]>
> > >
> > > Use trylock in try_to_writeback_inodes_sb to avoid read-write
> > > deadlocks that could be triggered by freeze.
>
> > Christoph asked about what is the exact deadlock this patch tries to fix.
> > I don't think you answered that. So can you elaborate please? Is it somehow
> > connected with the fact that ext4 calls try_to_writeback_inodes_sb() with
> > i_mutex held?
> >
> > Honza
>
> This was discussed in the thread
> http://www.spinics.net/lists/linux-fsdevel/msg48754.html
> Summarizing...
>
> Jan> What's exactly the deadlock trylock protects from here?
> Jan> Or is it just an optimization?
>
> Val> The trylock is an optimization Dave Chinner suggested. The first
> Val> version I wrote acquired the lock and then checked vfs_is_frozen().
>
> Dave> It's not so much an optimisation, but the general case of avoiding
> Dave> read-write deadlocks such that freezing can trigger. I think remount
> Dave> can trigger the same deadlock as freezing, so the trylock avoids
> Dave> both deadlock cases rather than just working around the freeze
> Dave> problem....
>
> -Kamal
As I wrote in
https://www.redhat.com/archives/dm-devel/2011-November/msg00151.html ,
down_read_trylock doesn't fix the freeze deadlock. Think of this sequence:
Process 1 (freezing)
down_write(&sb->s_umount);
set the filesystem to frozen state
up_write(&sb->s_umount);
Process 2 (executing the code from the patch)
down_read_trylock(&sb->s_umount); - succeeds, because s_umount is not held
writeback_inodes_sb(sb, reason); - waits, because the filesystem is frozen
Process 1 (unfreezing)
down_write(&sb->s_umount); - deadlock (process 1 is waiting for process 2
to drop the lock; process 2 is waiting for process 1 to unfreeze).
See the patch at
https://www.redhat.com/archives/dm-devel/2011-November/msg00151.html , it
has a different approach and it avoids the mentined freeze deadlock.
Mikulas
> > > BugLink: https://bugs.launchpad.net/bugs/897421
> > > Signed-off-by: Valerie Aurora <[email protected]>
> > > Cc: Kamal Mostafa <[email protected]>
> > > Tested-by: Peter M. Petrakis <[email protected]>
> > > [[email protected]: patch restructure]
> > > Signed-off-by: Kamal Mostafa <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 13 ++++++++-----
> > > 1 files changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index ea89b3f..3a80f1b 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -1274,8 +1274,9 @@ EXPORT_SYMBOL(writeback_inodes_sb);
> > > * try_to_writeback_inodes_sb - start writeback if none underway
> > > * @sb: the superblock
> > > *
> > > - * Invoke writeback_inodes_sb if no writeback is currently underway.
> > > - * Returns 1 if writeback was started, 0 if not.
> > > + * Invoke writeback_inodes_sb if no writeback is currently underway
> > > + * and no one else holds the s_umount lock. Returns 1 if writeback
> > > + * was started, 0 if not.
> > > */
> > > int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
> > > {
> > > @@ -1288,15 +1289,17 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
> > > * @sb: the superblock
> > > * @nr: the number of pages to write
> > > *
> > > - * Invoke writeback_inodes_sb if no writeback is currently underway.
> > > - * Returns 1 if writeback was started, 0 if not.
> > > + * Invoke writeback_inodes_sb if no writeback is currently underway
> > > + * and no one else holds the s_umount lock. Returns 1 if writeback
> > > + * was started, 0 if not.
> > > */
> > > int try_to_writeback_inodes_sb_nr(struct super_block *sb,
> > > unsigned long nr,
> > > enum wb_reason reason)
> > > {
> > > if (!writeback_in_progress(sb->s_bdi)) {
> > > - down_read(&sb->s_umount);
> > > + if (!down_read_trylock(&sb->s_umount))
> > > + return 0;
> > > if (nr == 0)
> > > writeback_inodes_sb(sb, reason);
> > > else
> > > --
> > > 1.7.5.4