2011-12-05 20:55:03

by Kamal Mostafa

[permalink] [raw]
Subject: [PATCH 0/5 resend] fix s_umount thaw/write and journal deadlock

[resend: This is an unchanged resend of the patches sent on 2011-11-30,
which were apparently not delivered to some recipients.]

This set of filesystem freeze/thaw deadlock patches is submitted
on behalf of the authors, Surbhi Palande and Valerie Aurora. The
patches fix the bug:

BugLink: https://bugs.launchpad.net/bugs/897421

Surbhi Palande (2):
Adding support to freeze and unfreeze a journal
Thaw the journal when you unfreeze the fs.

Valerie Aurora (3):
VFS: Fix s_umount thaw/write deadlock
VFS: Rename vfs_check_frozen() to
Documentation: Correct s_umount state for

Documentation/filesystems/Locking | 4 +-
fs/btrfs/disk-io.c | 4 +-
fs/btrfs/file.c | 2 +-
fs/buffer.c | 7 +++--
fs/drop_caches.c | 8 ++++++
fs/ext4/inode.c | 2 +-
fs/ext4/super.c | 28 +++++++++------------
fs/fs-writeback.c | 50 ++++++++++++++++++++++++------------
fs/fuse/file.c | 2 +-
fs/gfs2/file.c | 2 +-
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 42 +++++++++++++++++++++++++++++++
fs/nilfs2/ioctl.c | 2 +-
fs/nilfs2/segment.c | 2 +-
fs/ntfs/file.c | 2 +-
fs/ocfs2/file.c | 2 +-
fs/quota/quota.c | 12 ++++++++-
fs/super.c | 26 +++++++++++++++++++
fs/sync.c | 9 +++---
fs/xfs/xfs_mount.h | 2 +-
include/linux/fs.h | 8 ++++-
include/linux/jbd2.h | 7 +++++
mm/filemap.c | 2 +-
mm/filemap_xip.c | 2 +-
24 files changed, 170 insertions(+), 58 deletions(-)



2011-12-05 20:54:45

by Kamal Mostafa

[permalink] [raw]
Subject: [PATCH 1/5 resend] Adding support to freeze and unfreeze a journal

From: Surbhi Palande <[email protected]>

The journal should be frozen when a filesystem freezes. What this means is
that until the filesystem is thawed again, no new transactions should be
accepted by the journal. When the filesystem thaws, inturn it should thaw the
journal and this should allow the journal to resume accepting new
transactions. While the the filesystem has frozen the journal, the clients of
the journal on calling jbd2_journal_start() will sleep on a wait queue.
Thawing the journal will wake up the sleeping clients and journalling can
progress normally.

An example of the race condition that can happen without this patch is as
follows:

Say the filesystem is thawed when we begin. Let tx be the time at unit x

P1: Process doing an aio write
t1) ext4_file_write()
t2) generic_file_aio_write()
t3) __generic_file_aio_write()
// filesystem is not frozen, so we do not block in the next check.
t4) vfs_check_frozen()
t5) generic_write_checks()
----------------- Prempted------------------

P2: Process that does filesystem freeze

t6) freeze_super()
t7) sync_filesystem()
t8) sync_blockdev()
t9) sb->s_op->freeze_fs() (= ext4_freeze)
t10) jbd2_journal_lock_updates()
t11) jbd2_journal_flush()
// Need to unlock the journal before returning to user space.
t12) jbd2_journal_unlock_updates()
// Journal is unlocked and so we can start accepting new transactions now.

// freezing process completes execution. Page cache is now clean and should
// remain clean till the filesystem is frozen.
--------------------------------------------

P1: writing process gets the control back
t13) generic_file_buffered_write()
t14) generic_perform_write()
t15) a_ops->write_begin() (= ext4_write_begin)
t16) ext4_journal_start()
// New handle is started. We do not block here! Write continues
// dirtying the page cache while the filesystem is frozen!

BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Surbhi Palande <[email protected]>
Acked-by: Jan Kara <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/ext4/super.c | 20 ++++++--------------
fs/jbd2/journal.c | 1 +
fs/jbd2/transaction.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 7 +++++++
4 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3858767..5bfdc21 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4330,23 +4330,15 @@ static int ext4_freeze(struct super_block *sb)

journal = EXT4_SB(sb)->s_journal;

- /* Now we set up the journal barrier. */
- jbd2_journal_lock_updates(journal);
-
+ error = jbd2_journal_freeze(journal);
/*
- * Don't clear the needs_recovery flag if we failed to flush
+ * Don't clear the needs_recovery flag if we failed to freeze
* the journal.
*/
- error = jbd2_journal_flush(journal);
- if (error < 0)
- goto out;
-
- /* Journal blocked and flushed, clear needs_recovery flag. */
- EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
- error = ext4_commit_super(sb, 1);
-out:
- /* we rely on s_frozen to stop further updates */
- jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+ if (error >= 0) {
+ EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
+ error = ext4_commit_super(sb, 1);
+ }
return error;
}

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 0fa0123..f0170cc 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -894,6 +894,7 @@ static journal_t * journal_init_common (void)
init_waitqueue_head(&journal->j_wait_checkpoint);
init_waitqueue_head(&journal->j_wait_commit);
init_waitqueue_head(&journal->j_wait_updates);
+ init_waitqueue_head(&journal->j_wait_frozen);
mutex_init(&journal->j_barrier);
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a0e41a4..340ee35 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -173,6 +173,17 @@ repeat:
journal->j_barrier_count == 0);
goto repeat;
}
+ /* Don't let a new handle start when a journal is frozen.
+ * jbd2_journal_freeze calls jbd2_journal_unlock_updates() only after
+ * the j_flags indicate that the journal is frozen. So if the
+ * j_barrier_count is 0, then check if this was made 0 by the freezing
+ * process
+ */
+ if (journal->j_flags & JBD2_FROZEN) {
+ read_unlock(&journal->j_state_lock);
+ wait_event(journal->j_wait_frozen, (journal->j_flags & JBD2_FROZEN));
+ goto repeat;
+ }

if (!journal->j_running_transaction) {
read_unlock(&journal->j_state_lock);
@@ -492,6 +503,37 @@ int jbd2_journal_restart(handle_t *handle, int nblocks)
}
EXPORT_SYMBOL(jbd2_journal_restart);

+int jbd2_journal_freeze(journal_t *journal)
+{
+ int error = 0;
+ /* Now we set up the journal barrier. */
+ jbd2_journal_lock_updates(journal);
+
+ /*
+ * Don't clear the needs_recovery flag if we failed to flush
+ * the journal.
+ */
+ error = jbd2_journal_flush(journal);
+ if (error >= 0) {
+ write_lock(&journal->j_state_lock);
+ journal->j_flags |= JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ }
+ jbd2_journal_unlock_updates(journal);
+ return error;
+}
+EXPORT_SYMBOL(jbd2_journal_freeze);
+
+void jbd2_journal_thaw(journal_t * journal)
+{
+ write_lock(&journal->j_state_lock);
+ journal->j_flags &= ~JBD2_FROZEN;
+ write_unlock(&journal->j_state_lock);
+ wake_up(&journal->j_wait_frozen);
+}
+EXPORT_SYMBOL(jbd2_journal_thaw);
+
+
/**
* void jbd2_journal_lock_updates () - establish a transaction barrier.
* @journal: Journal to establish a barrier on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2092ea2..bfa0752 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -658,6 +658,7 @@ jbd2_time_diff(unsigned long start, unsigned long end)
* @j_wait_checkpoint: Wait queue to trigger checkpointing
* @j_wait_commit: Wait queue to trigger commit
* @j_wait_updates: Wait queue to wait for updates to complete
+ * @j_wait_frozen: Wait queue to wait for journal to thaw
* @j_checkpoint_mutex: Mutex for locking against concurrent checkpoints
* @j_head: Journal head - identifies the first unused block in the journal
* @j_tail: Journal tail - identifies the oldest still-used block in the
@@ -775,6 +776,9 @@ struct journal_s
/* Wait queue to wait for updates to complete */
wait_queue_head_t j_wait_updates;

+ /* Wait queue to wait for journal to thaw*/
+ wait_queue_head_t j_wait_frozen;
+
/* Semaphore for locking against concurrent checkpoints */
struct mutex j_checkpoint_mutex;

@@ -953,6 +957,7 @@ struct journal_s
#define JBD2_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JBD2_FROZEN 0x080 /* Journal thread frozen along with filesystem */

/*
* Function declarations for the journaling transaction and buffer
@@ -1060,6 +1065,8 @@ extern void jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned long);
extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
extern int jbd2_journal_stop(handle_t *);
+extern int jbd2_journal_freeze(journal_t *);
+extern void jbd2_journal_thaw(journal_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);
--
1.7.5.4

2011-12-05 20:54:47

by Kamal Mostafa

[permalink] [raw]
Subject: [PATCH 3/5 resend] VFS: Fix s_umount thaw/write deadlock

From: Valerie Aurora <[email protected]>

File system freeze/thaw require the superblock's s_umount lock.
However, we write to the file system while holding the s_umount lock
in several places (e.g., writeback and quotas). Any of these can
block on a frozen file system while holding s_umount, preventing any
later thaw from occurring, since thaw requires s_umount.

The solution is to add a check, vfs_is_frozen(), to all code paths
that write while holding sb->s_umount and return without writing if it
is true.

BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/drop_caches.c | 8 ++++++++
fs/fs-writeback.c | 50 +++++++++++++++++++++++++++++++++-----------------
fs/quota/quota.c | 12 +++++++++++-
fs/super.c | 26 ++++++++++++++++++++++++++
fs/sync.c | 9 +++++----
include/linux/fs.h | 6 +++++-
6 files changed, 88 insertions(+), 23 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c00e055..90f8f7e 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -57,6 +57,14 @@ int drop_caches_sysctl_handler(ctl_table *table, int write,
ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
if (ret)
return ret;
+
+ /*
+ * Any file-system specific routines eventually called by
+ * drop_pagecache_sb() and drop_slab() ought to check for a
+ * frozen file system before writing to the disk. Most file
+ * systems won't write in this case but XFS is a notable
+ * exception in certain cases.
+ */
if (write) {
if (sysctl_drop_caches & 1)
iterate_supers(drop_pagecache_sb, NULL);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 73c3992..dbeaede 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -554,6 +554,14 @@ static long writeback_sb_inodes(struct super_block *sb,
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);

+ if (inode->i_sb == sb && vfs_is_frozen(sb)) {
+ /*
+ * Inode is on a frozen superblock; skip it for now.
+ */
+ redirty_tail(inode, wb);
+ continue;
+ }
+
if (inode->i_sb != sb) {
if (work->sb) {
/*
@@ -1266,40 +1274,48 @@ EXPORT_SYMBOL(writeback_inodes_sb);
* writeback_inodes_sb_if_idle - start writeback if none underway
* @sb: the superblock
*
- * Invoke writeback_inodes_sb if no writeback is currently underway.
- * Returns 1 if writeback was started, 0 if not.
+ * Invoke writeback_inodes_sb if no writeback is currently underway
+ * and no one else holds the s_umount lock. Returns 1 if writeback
+ * was started, 0 if not.
*/
int writeback_inodes_sb_if_idle(struct super_block *sb, enum wb_reason reason)
{
if (!writeback_in_progress(sb->s_bdi)) {
- down_read(&sb->s_umount);
- writeback_inodes_sb(sb, reason);
- up_read(&sb->s_umount);
- return 1;
- } else
- return 0;
+ /*
+ * Trylock also avoids read-write deadlocks that could be
+ * triggered by freeze.
+ */
+ if (down_read_trylock(&sb->s_umount)) {
+ writeback_inodes_sb(sb, reason);
+ up_read(&sb->s_umount);
+ return 1;
+ }
+ }
+ return 0;
}
EXPORT_SYMBOL(writeback_inodes_sb_if_idle);

/**
- * writeback_inodes_sb_if_idle - start writeback if none underway
+ * writeback_inodes_sb_nr_if_idle - start writeback if none underway
* @sb: the superblock
* @nr: the number of pages to write
*
- * Invoke writeback_inodes_sb if no writeback is currently underway.
- * Returns 1 if writeback was started, 0 if not.
+ * Invoke writeback_inodes_sb_nr if no writeback is currently underway
+ * and no one else holds the s_umount lock. Returns 1 if writeback
+ * was started, 0 if not.
*/
int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
unsigned long nr,
enum wb_reason reason)
{
if (!writeback_in_progress(sb->s_bdi)) {
- down_read(&sb->s_umount);
- writeback_inodes_sb_nr(sb, nr, reason);
- up_read(&sb->s_umount);
- return 1;
- } else
- return 0;
+ if (down_read_trylock(&sb->s_umount)) {
+ writeback_inodes_sb_nr(sb, nr, reason);
+ up_read(&sb->s_umount);
+ return 1;
+ }
+ }
+ return 0;
}
EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);

diff --git a/fs/quota/quota.c b/fs/quota/quota.c
index 35f4b0e..47983d9 100644
--- a/fs/quota/quota.c
+++ b/fs/quota/quota.c
@@ -47,7 +47,7 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,

static void quota_sync_one(struct super_block *sb, void *arg)
{
- if (sb->s_qcop && sb->s_qcop->quota_sync)
+ if (sb->s_qcop && sb->s_qcop->quota_sync && !vfs_is_frozen(sb))
sb->s_qcop->quota_sync(sb, *(int *)arg, 1);
}

@@ -368,8 +368,18 @@ SYSCALL_DEFINE4(quotactl, unsigned int, cmd, const char __user *, special,
goto out;
}

+ /*
+ * It's not clear which quota functions may write to the file
+ * system (all?). Check for a frozen fs and bail out now.
+ */
+ if (vfs_is_frozen(sb)) {
+ ret = -EBUSY;
+ goto out_drop_super;
+ }
+
ret = do_quotactl(sb, type, cmds, id, addr, pathp);

+out_drop_super:
drop_super(sb);
out:
if (pathp && !IS_ERR(pathp))
diff --git a/fs/super.c b/fs/super.c
index afd0f1a..a56696b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -526,6 +526,10 @@ void sync_supers(void)
*
* Scans the superblock list and calls given function, passing it
* locked superblock and given argument.
+ *
+ * Note: If @f is going to write to the file system, it must first
+ * check if the file system is frozen (via vfs_is_frozen(sb)) and
+ * refuse to write if so.
*/
void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
{
@@ -595,6 +599,10 @@ EXPORT_SYMBOL(iterate_supers_type);
*
* Scans the superblock list and finds the superblock of the file system
* mounted on the device given. %NULL is returned if no match is found.
+ *
+ * Note: If caller is going to write to the superblock, it must first
+ * check if the file system is frozen (via vfs_is_frozen(sb)) and
+ * refuse to write if so.
*/

struct super_block *get_super(struct block_device *bdev)
@@ -1132,6 +1140,24 @@ out:
* Syncs the super to make sure the filesystem is consistent and calls the fs's
* freeze_fs. Subsequent calls to this without first thawing the fs will return
* -EBUSY.
+ *
+ * During this function, sb->s_frozen goes through these values:
+ *
+ * SB_UNFROZEN: File system is normal, all writes progress as usual.
+ *
+ * SB_FREEZE_WRITE: The file system is in the process of being frozen
+ * and any remaining out-standing writes are being synced. Writes
+ * that complete in-process writes should be permitted but new ones
+ * should be blocked.
+ *
+ * SB_FREEZE_TRANS: The file system is frozen. The ->freeze_fs and
+ * ->unfreeze_fs ops are the only operations permitted to write to the
+ * file system in this state.
+ *
+ * sb->s_frozen is protected by sb->s_umount. Additionally,
+ * SB_FREEZE_WRITE is only temporarily set during freeze/thaw while
+ * holding sb->s_umount for writing, so any other callers holding
+ * sb->s_umount will never see this state.
*/
int freeze_super(struct super_block *sb)
{
diff --git a/fs/sync.c b/fs/sync.c
index 101b8ef..f497be8 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -68,7 +68,7 @@ int sync_filesystem(struct super_block *sb)
/*
* No point in syncing out anything if the filesystem is read-only.
*/
- if (sb->s_flags & MS_RDONLY)
+ if ((sb->s_flags & MS_RDONLY) || vfs_is_frozen(sb))
return 0;

ret = __sync_filesystem(sb, 0);
@@ -80,7 +80,7 @@ EXPORT_SYMBOL_GPL(sync_filesystem);

static void sync_one_sb(struct super_block *sb, void *arg)
{
- if (!(sb->s_flags & MS_RDONLY))
+ if (!(sb->s_flags & MS_RDONLY) && !vfs_is_frozen(sb))
__sync_filesystem(sb, *(int *)arg);
}
/*
@@ -136,7 +136,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
{
struct file *file;
struct super_block *sb;
- int ret;
+ int ret = 0;
int fput_needed;

file = fget_light(fd, &fput_needed);
@@ -145,7 +145,8 @@ SYSCALL_DEFINE1(syncfs, int, fd)
sb = file->f_dentry->d_sb;

down_read(&sb->s_umount);
- ret = sync_filesystem(sb);
+ if (!(sb->s_flags & MS_RDONLY) && !vfs_is_frozen(sb))
+ ret = sync_filesystem(sb);
up_read(&sb->s_umount);

fput_light(file, fput_needed);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e313022..ac7a495 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1490,7 +1490,7 @@ extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
extern struct timespec current_fs_time(struct super_block *sb);

/*
- * Snapshotting support.
+ * Snapshotting support. See freeze_super() for documentation.
*/
enum {
SB_UNFROZEN = 0,
@@ -1501,6 +1501,10 @@ enum {
#define vfs_check_frozen(sb, level) \
wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))

+static inline int vfs_is_frozen(struct super_block *sb) {
+ return sb->s_frozen == SB_FREEZE_TRANS;
+}
+
/*
* until VFS tracks user namespaces for inodes, just make all files
* belong to init_user_ns
--
1.7.5.4

2011-12-05 20:54:48

by Kamal Mostafa

[permalink] [raw]
Subject: [PATCH 4/5 resend] VFS: Rename vfs_check_frozen() to vfs_block_until_thawed()

From: Valerie Aurora <[email protected]>

The name of vfs_check_frozen() tends to de-emphasize an important
fact: that it will block until the file system thaws to the specified
level. Make bugs slightly less likely by renaming to
vfs_block_until_thawed().

BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/btrfs/disk-io.c | 4 ++--
fs/btrfs/file.c | 2 +-
fs/buffer.c | 7 ++++---
fs/ext4/inode.c | 2 +-
fs/ext4/super.c | 4 ++--
fs/fuse/file.c | 2 +-
fs/gfs2/file.c | 2 +-
fs/nilfs2/ioctl.c | 2 +-
fs/nilfs2/segment.c | 2 +-
fs/ntfs/file.c | 2 +-
fs/ocfs2/file.c | 2 +-
fs/xfs/xfs_mount.h | 2 +-
include/linux/fs.h | 2 +-
mm/filemap.c | 2 +-
mm/filemap_xip.c | 2 +-
15 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 632f8f3..308afe1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1569,7 +1569,7 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;

do {
- vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(root->fs_info->sb, SB_FREEZE_WRITE);

if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
mutex_trylock(&root->fs_info->cleaner_mutex)) {
@@ -1603,7 +1603,7 @@ static int transaction_kthread(void *arg)

do {
delay = HZ * 30;
- vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(root->fs_info->sb, SB_FREEZE_WRITE);
mutex_lock(&root->fs_info->transaction_kthread_mutex);

spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index dafdfa0..40bff2e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1346,7 +1346,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
ssize_t err = 0;
size_t count, ocount;

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);

mutex_lock(&inode->i_mutex);

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..fc679ba 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2338,8 +2338,9 @@ EXPORT_SYMBOL(block_commit_write);
* beyond EOF, then the page is guaranteed safe against truncation until we
* unlock the page.
*
- * Direct callers of this function should call vfs_check_frozen() so that page
- * fault does not busyloop until the fs is thawed.
+ * Direct callers of this function should call
+ * vfs_block_until_thawed() so that page fault does not busyloop until
+ * the fs is thawed.
*/
int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block)
@@ -2401,7 +2402,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
* This check is racy but catches the common case. The check in
* __block_page_mkwrite() is reliable.
*/
- vfs_check_frozen(sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(sb, SB_FREEZE_WRITE);
ret = __block_page_mkwrite(vma, vmf, get_block);
return block_page_mkwrite_return(ret);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 848f436..3d54061 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4737,7 +4737,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
* This check is racy but catches the common case. We rely on
* __block_page_mkwrite() to do a reliable check.
*/
- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 95f28b8..9fd8836 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -314,7 +314,7 @@ handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
* handles are not stopped.
*/
if (!handle)
- vfs_check_frozen(sb, SB_FREEZE_TRANS);
+ vfs_block_until_thawed(sb, SB_FREEZE_TRANS);

if (!journal)
return ext4_get_nojournal();
@@ -4282,7 +4282,7 @@ int ext4_force_commit(struct super_block *sb)

journal = EXT4_SB(sb)->s_journal;
if (journal) {
- vfs_check_frozen(sb, SB_FREEZE_TRANS);
+ vfs_block_until_thawed(sb, SB_FREEZE_TRANS);
ret = ext4_journal_force_commit(journal);
}

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 594f07a..42e5e30 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -944,7 +944,7 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
return err;

mutex_lock(&inode->i_mutex);
- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);

/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index ce36a56..9fea809 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -373,7 +373,7 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
* and retry if the fs has been frozen after the page lock has
* been acquired
*/
- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);

gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
ret = gfs2_glock_nq(&gh);
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 41d6743..084c8d9 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -655,7 +655,7 @@ static int nilfs_ioctl_clean_segments(struct inode *inode, struct file *filp,
goto out_free;
}

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);

ret = nilfs_ioctl_move_blocks(inode->i_sb, &argv[0], kbufs[0]);
if (ret < 0)
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index bb24ab6..e3388f1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -189,7 +189,7 @@ int nilfs_transaction_begin(struct super_block *sb,
if (ret > 0)
return 0;

- vfs_check_frozen(sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(sb, SB_FREEZE_WRITE);

nilfs = sb->s_fs_info;
down_read(&nilfs->ns_segctor_sem);
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index c587e2d..16f1ac6 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2084,7 +2084,7 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb,
if (err)
return err;
pos = *ppos;
- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);
/* We can write back this queue in page reclaim. */
current->backing_dev_info = mapping->backing_dev_info;
written = 0;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index de4ea1a..f863955 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2240,7 +2240,7 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
if (iocb->ki_left == 0)
return 0;

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);

appending = file->f_flags & O_APPEND ? 1 : 0;
direct_io = file->f_flags & O_DIRECT ? 1 : 0;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index bb24dac..bacfab8 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -312,7 +312,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
#define SHUTDOWN_DEVICE_REQ 0x0020 /* failed all paths to the device */

#define xfs_test_for_freeze(mp) ((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l) vfs_check_frozen((mp)->m_super, (l))
+#define xfs_wait_for_freeze(mp,l) vfs_block_until_thawed((mp)->m_super, (l))

/*
* Flags for xfs_mountfs
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac7a495..662e412 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1498,7 +1498,7 @@ enum {
SB_FREEZE_TRANS = 2,
};

-#define vfs_check_frozen(sb, level) \
+#define vfs_block_until_thawed(sb, level) \
wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))

static inline int vfs_is_frozen(struct super_block *sb) {
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..a973a13 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2529,7 +2529,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
count = ocount;
pos = *ppos;

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);

/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f91b2f6..4539564 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -406,7 +406,7 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
pos = *ppos;
count = len;

- vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+ vfs_block_until_thawed(inode->i_sb, SB_FREEZE_WRITE);

/* We can write back this queue in page reclaim */
current->backing_dev_info = mapping->backing_dev_info;
--
1.7.5.4

2011-12-05 20:54:49

by Kamal Mostafa

[permalink] [raw]
Subject: [PATCH 5/5 resend] Documentation: Correct s_umount state for freeze_fs/unfreeze_fs

From: Valerie Aurora <[email protected]>

freeze_fs/unfreeze_fs ops are called with s_umount held for write, not read.

BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Valerie Aurora <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
---
Documentation/filesystems/Locking | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index d819ba1..7e46a94 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -134,8 +134,8 @@ evict_inode:
put_super: write
write_super: read
sync_fs: read
-freeze_fs: read
-unfreeze_fs: read
+freeze_fs: write
+unfreeze_fs: write
statfs: maybe(read) (see below)
remount_fs: write
umount_begin: no
--
1.7.5.4

2011-12-05 20:55:05

by Kamal Mostafa

[permalink] [raw]
Subject: [PATCH 2/5 resend] Thaw the journal when you unfreeze the fs.

From: Surbhi Palande <[email protected]>

Thaw the journal so that transactions can resume

BugLink: https://bugs.launchpad.net/bugs/897421
Signed-off-by: Surbhi Palande <[email protected]>
Cc: Kamal Mostafa <[email protected]>
Tested-by: Peter M. Petrakis <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
---
fs/ext4/super.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 5bfdc21..95f28b8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4348,10 +4348,14 @@ static int ext4_freeze(struct super_block *sb)
*/
static int ext4_unfreeze(struct super_block *sb)
{
+ journal_t *journal;
if (sb->s_flags & MS_RDONLY)
return 0;

lock_super(sb);
+ journal = EXT4_SB(sb)->s_journal;
+
+ jbd2_journal_thaw(journal);
/* Reset the needs_recovery flag before the fs is unlocked. */
EXT4_SET_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
ext4_commit_super(sb, 1);
--
1.7.5.4


2011-12-06 11:07:24

by Matt Fleming

[permalink] [raw]
Subject: Re: [PATCH 2/5 resend] Thaw the journal when you unfreeze the fs.

On Mon, 2011-12-05 at 12:54 -0800, Kamal Mostafa wrote:
> From: Surbhi Palande <[email protected]>
>
> Thaw the journal so that transactions can resume
>
> BugLink: https://bugs.launchpad.net/bugs/897421
> Signed-off-by: Surbhi Palande <[email protected]>
> Cc: Kamal Mostafa <[email protected]>
> Tested-by: Peter M. Petrakis <[email protected]>
> Signed-off-by: Kamal Mostafa <[email protected]>
> ---
> fs/ext4/super.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)

Shouldn't this patch really be merged with [PATCH 1/5] ? Otherwise it's
possible to freeze a journal that will never be unfrozen, if say, you're
doing a git bisect and you have [PATCH 1/5] applied but not [PATCH 2/5].

--
Matt Fleming, Intel Open Source Technology Center


2011-12-06 11:35:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/5 resend] VFS: Fix s_umount thaw/write deadlock

On Mon, Dec 05, 2011 at 12:54:47PM -0800, Kamal Mostafa wrote:
> + /*
> + * Any file-system specific routines eventually called by
> + * drop_pagecache_sb() and drop_slab() ought to check for a
> + * frozen file system before writing to the disk. Most file
> + * systems won't write in this case but XFS is a notable
> + * exception in certain cases.
> + */

Instead of adding a comment like this please fix whatever issue you
found or at least report it to the the people that might be able to
fix if you don't understand the code well enough.

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 73c3992..dbeaede 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -554,6 +554,14 @@ static long writeback_sb_inodes(struct super_block *sb,
> while (!list_empty(&wb->b_io)) {
> struct inode *inode = wb_inode(wb->b_io.prev);
>
> + if (inode->i_sb == sb && vfs_is_frozen(sb)) {
> + /*
> + * Inode is on a frozen superblock; skip it for now.
> + */
> + redirty_tail(inode, wb);
> + continue;
> + }
> +

We make sure to not dirty any new inodes after the first phase of the
freeze, so this should be a BUG_ON/WARN_ON.

> + /*
> + * Trylock also avoids read-write deadlocks that could be
> + * triggered by freeze.
> + */
> + if (down_read_trylock(&sb->s_umount)) {
> + writeback_inodes_sb(sb, reason);
> + up_read(&sb->s_umount);
> + return 1;
> + }
> + }

What exactly deadlock? The comment as-is really doesn't tell the reader
anything interesting. Note that switching it to a trylock really should
be a separate patch. We've been through it a while ago and it got lost,
and now Miao Xie is looking into the general issue again.

Also as mentioned to Miao Xie a little elarier
writeback_inodes_if_idle should be rewritten to be a trivial wrapper
around writeback_inodes_sb_nr_if_idle instead of duplicating the logic.

While we're at it: can anyway suggest a less cumbersome name than
writeback_inodes_sb_if_idle? I'd go for
try_to_writeback_inodes_sb(_nr).

> index 35f4b0e..47983d9 100644
> --- a/fs/quota/quota.c
> +++ b/fs/quota/quota.c
> @@ -47,7 +47,7 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,
>
> static void quota_sync_one(struct super_block *sb, void *arg)
> {
> - if (sb->s_qcop && sb->s_qcop->quota_sync)
> + if (sb->s_qcop && sb->s_qcop->quota_sync && !vfs_is_frozen(sb))
> sb->s_qcop->quota_sync(sb, *(int *)arg, 1);
> }

Again, this should be a BUG_ON/WARN_ON. We shouldn't redirty quotas
after the freeze.

> @@ -368,8 +368,18 @@ SYSCALL_DEFINE4(quotactl, unsigned int, cmd, const char __user *, special,
> goto out;
> }
>
> + /*
> + * It's not clear which quota functions may write to the file
> + * system (all?). Check for a frozen fs and bail out now.
> + */
> + if (vfs_is_frozen(sb)) {
> + ret = -EBUSY;
> + goto out_drop_super;
> + }

How about spending the three minutes to figure it out?
Q_GETFMT/Q_GETINFO/Q_XGETQSTAT and Q_GETQUOTA are the obvious read-only
candidates.

> + * Note: If @f is going to write to the file system, it must first
> + * check if the file system is frozen (via vfs_is_frozen(sb)) and
> + * refuse to write if so.

That's an overgeneralization.

> + * During this function, sb->s_frozen goes through these values:
> + *
> + * SB_UNFROZEN: File system is normal, all writes progress as usual.
> + *
> + * SB_FREEZE_WRITE: The file system is in the process of being frozen
> + * and any remaining out-standing writes are being synced. Writes
> + * that complete in-process writes should be permitted but new ones
> + * should be blocked.
> + *
> + * SB_FREEZE_TRANS: The file system is frozen. The ->freeze_fs and
> + * ->unfreeze_fs ops are the only operations permitted to write to the
> + * file system in this state.
> + *
> + * sb->s_frozen is protected by sb->s_umount. Additionally,
> + * SB_FREEZE_WRITE is only temporarily set during freeze/thaw while
> + * holding sb->s_umount for writing, so any other callers holding
> + * sb->s_umount will never see this state.
> */

Please split adding this useful documentation into a separate patch.

> index 101b8ef..f497be8 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -68,7 +68,7 @@ int sync_filesystem(struct super_block *sb)
> /*
> * No point in syncing out anything if the filesystem is read-only.
> */
> - if (sb->s_flags & MS_RDONLY)
> + if ((sb->s_flags & MS_RDONLY) || vfs_is_frozen(sb))

Again, BUG_ON/WARN_ON.

> static void sync_one_sb(struct super_block *sb, void *arg)
> {
> - if (!(sb->s_flags & MS_RDONLY))
> + if (!(sb->s_flags & MS_RDONLY) && !vfs_is_frozen(sb))
> __sync_filesystem(sb, *(int *)arg);

Same here.

> + if (!(sb->s_flags & MS_RDONLY) && !vfs_is_frozen(sb))
> + ret = sync_filesystem(sb);

Same here.

> +static inline int vfs_is_frozen(struct super_block *sb) {
> + return sb->s_frozen == SB_FREEZE_TRANS;
> +}

wrong brace placement.


2011-12-06 11:40:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 4/5 resend] VFS: Rename vfs_check_frozen() to vfs_block_until_thawed()

On Mon, Dec 05, 2011 at 12:54:48PM -0800, Kamal Mostafa wrote:
> From: Valerie Aurora <[email protected]>
>
> The name of vfs_check_frozen() tends to de-emphasize an important
> fact: that it will block until the file system thaws to the specified
> level. Make bugs slightly less likely by renaming to
> vfs_block_until_thawed().

I really hate that name. The old one wasn't perfect either, but unless
you have a really good new one I'm strongly in favour of not repainting
the bikeshed.


2011-12-06 14:25:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 1/5 resend] Adding support to freeze and unfreeze a journal

On Mon, Dec 05, 2011 at 12:54:45PM -0800, Kamal Mostafa wrote:
> From: Surbhi Palande <[email protected]>
> @@ -4330,23 +4330,15 @@ static int ext4_freeze(struct super_block *sb)
>
> journal = EXT4_SB(sb)->s_journal;
>
> - /* Now we set up the journal barrier. */
> - jbd2_journal_lock_updates(journal);
> -
> + error = jbd2_journal_freeze(journal);
> /*
> - * Don't clear the needs_recovery flag if we failed to flush
> + * Don't clear the needs_recovery flag if we failed to freeze
> * the journal.
> */
> - error = jbd2_journal_flush(journal);
> - if (error < 0)
> - goto out;
> -
> - /* Journal blocked and flushed, clear needs_recovery flag. */
> - EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> - error = ext4_commit_super(sb, 1);
> -out:
> - /* we rely on s_frozen to stop further updates */
> - jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> + if (error >= 0) {
> + EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> + error = ext4_commit_super(sb, 1);
> + }
> return error;
> }

A strange goto-avoidance fetish here? Why not stick to the same style
as the function already had:

error = jbd2_journal_freeze(journal);
if (error)
goto out;

EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
error = ext4_commit_super(sb, 1);
out:
return error;

Easier to read the patch, easier to add new potential error cases to this
function later.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2011-12-06 16:23:41

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/5 resend] Thaw the journal when you unfreeze the fs.

On 12/6/11 5:07 AM, Matt Fleming wrote:
> On Mon, 2011-12-05 at 12:54 -0800, Kamal Mostafa wrote:
>> From: Surbhi Palande <[email protected]>
>>
>> Thaw the journal so that transactions can resume
>>
>> BugLink: https://bugs.launchpad.net/bugs/897421
>> Signed-off-by: Surbhi Palande <[email protected]>
>> Cc: Kamal Mostafa <[email protected]>
>> Tested-by: Peter M. Petrakis <[email protected]>
>> Signed-off-by: Kamal Mostafa <[email protected]>
>> ---
>> fs/ext4/super.c | 4 ++++
>> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> Shouldn't this patch really be merged with [PATCH 1/5] ? Otherwise it's
> possible to freeze a journal that will never be unfrozen, if say, you're
> doing a git bisect and you have [PATCH 1/5] applied but not [PATCH 2/5].

I second that; I was going to send the same comment but you beat
me to it. Doing it this way will make it a tough go when bisecting
freeze issues, it needs to be rolled into patch 1.

Thanks,
-Eric

2011-12-07 22:46:12

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5 resend] VFS: Rename vfs_check_frozen() to vfs_block_until_thawed()

On Tue 06-12-11 06:40:25, Christoph Hellwig wrote:
> On Mon, Dec 05, 2011 at 12:54:48PM -0800, Kamal Mostafa wrote:
> > From: Valerie Aurora <[email protected]>
> >
> > The name of vfs_check_frozen() tends to de-emphasize an important
> > fact: that it will block until the file system thaws to the specified
> > level. Make bugs slightly less likely by renaming to
> > vfs_block_until_thawed().
>
> I really hate that name. The old one wasn't perfect either, but unless
> you have a really good new one I'm strongly in favour of not repainting
> the bikeshed.
Maybe vfs_wait_frozen()?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-12-07 23:17:00

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/5 resend] VFS: Fix s_umount thaw/write deadlock

On Tue 06-12-11 06:35:44, Christoph Hellwig wrote:
> On Mon, Dec 05, 2011 at 12:54:47PM -0800, Kamal Mostafa wrote:
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 73c3992..dbeaede 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -554,6 +554,14 @@ static long writeback_sb_inodes(struct super_block *sb,
> > while (!list_empty(&wb->b_io)) {
> > struct inode *inode = wb_inode(wb->b_io.prev);
> >
> > + if (inode->i_sb == sb && vfs_is_frozen(sb)) {
> > + /*
> > + * Inode is on a frozen superblock; skip it for now.
> > + */
> > + redirty_tail(inode, wb);
> > + continue;
> > + }
> > +
>
> We make sure to not dirty any new inodes after the first phase of the
> freeze, so this should be a BUG_ON/WARN_ON.
This is not really true in presence of mmaped writes. To block mmaped
writes on a frozen filesystem, we need some synchronization between
page_mkwrite() and freezing code. Currently, to avoid any additional
locking overhead, we set page dirty and *then* check for filesystem being
frozen. Only this order can make sure either the page is written (and
write-protected) or the frozen check triggers and we wait... (see the
comment in block_page_mkwrite()). The nasty sideeffect of this is that
there can be dirty pages & inodes on a frozen filesystem. We are blocked in
the page fault of these pages so user cannot write any data to these pages
but still they are marked dirty.

Alternatively we could have a different mechanism (rw semaphore?) to
synchronize page faults and freezing but I'd hate the overhead for the case
almost noone cares about...

> While we're at it: can anyway suggest a less cumbersome name than
> writeback_inodes_sb_if_idle? I'd go for
> try_to_writeback_inodes_sb(_nr).
Sounds good.

> > index 35f4b0e..47983d9 100644
> > --- a/fs/quota/quota.c
> > +++ b/fs/quota/quota.c
> > @@ -47,7 +47,7 @@ static int check_quotactl_permission(struct super_block *sb, int type, int cmd,
> >
> > static void quota_sync_one(struct super_block *sb, void *arg)
> > {
> > - if (sb->s_qcop && sb->s_qcop->quota_sync)
> > + if (sb->s_qcop && sb->s_qcop->quota_sync && !vfs_is_frozen(sb))
> > sb->s_qcop->quota_sync(sb, *(int *)arg, 1);
> > }
>
> Again, this should be a BUG_ON/WARN_ON. We shouldn't redirty quotas
> after the freeze.
You cannot really turn the check into WARN_ON because we iterate over all
superblocks and only in ->quota_sync() check whether something is dirty.
But the check seems to be unnecessary, that is correct.

> > @@ -368,8 +368,18 @@ SYSCALL_DEFINE4(quotactl, unsigned int, cmd, const char __user *, special,
> > goto out;
> > }
> >
> > + /*
> > + * It's not clear which quota functions may write to the file
> > + * system (all?). Check for a frozen fs and bail out now.
> > + */
> > + if (vfs_is_frozen(sb)) {
> > + ret = -EBUSY;
> > + goto out_drop_super;
> > + }
>
> How about spending the three minutes to figure it out?
> Q_GETFMT/Q_GETINFO/Q_XGETQSTAT and Q_GETQUOTA are the obvious read-only
> candidates.
Q_GETQUOTA can actually cause filesystem modification (reservation of
space in quota file) but the others are read-only. Also after some thought
I'd prefer that quotactl(8) just blocks to be consistent with how other
syscalls behave...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-12-07 23:49:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 3/5 resend] VFS: Fix s_umount thaw/write deadlock

On Thu, Dec 08, 2011 at 12:16:58AM +0100, Jan Kara wrote:
> On Tue 06-12-11 06:35:44, Christoph Hellwig wrote:
> > We make sure to not dirty any new inodes after the first phase of the
> > freeze, so this should be a BUG_ON/WARN_ON.
> This is not really true in presence of mmaped writes. To block mmaped
> writes on a frozen filesystem, we need some synchronization between
> page_mkwrite() and freezing code. Currently, to avoid any additional
> locking overhead, we set page dirty and *then* check for filesystem being
> frozen. Only this order can make sure either the page is written (and
> write-protected) or the frozen check triggers and we wait... (see the
> comment in block_page_mkwrite()). The nasty sideeffect of this is that
> there can be dirty pages & inodes on a frozen filesystem. We are blocked in
> the page fault of these pages so user cannot write any data to these pages
> but still they are marked dirty.

Can't we make the mmaps read-only during the freezing process?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2011-12-08 00:05:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/5 resend] VFS: Fix s_umount thaw/write deadlock

On Wed 07-12-11 16:49:20, Matthew Wilcox wrote:
> On Thu, Dec 08, 2011 at 12:16:58AM +0100, Jan Kara wrote:
> > On Tue 06-12-11 06:35:44, Christoph Hellwig wrote:
> > > We make sure to not dirty any new inodes after the first phase of the
> > > freeze, so this should be a BUG_ON/WARN_ON.
> > This is not really true in presence of mmaped writes. To block mmaped
> > writes on a frozen filesystem, we need some synchronization between
> > page_mkwrite() and freezing code. Currently, to avoid any additional
> > locking overhead, we set page dirty and *then* check for filesystem being
> > frozen. Only this order can make sure either the page is written (and
> > write-protected) or the frozen check triggers and we wait... (see the
> > comment in block_page_mkwrite()). The nasty sideeffect of this is that
> > there can be dirty pages & inodes on a frozen filesystem. We are blocked in
> > the page fault of these pages so user cannot write any data to these pages
> > but still they are marked dirty.
>
> Can't we make the mmaps read-only during the freezing process?
But then application will get EFAULT instead of blocking? I don't think
that is acceptable (we'd see lots of applications dying ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-12-09 11:47:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/5 resend] VFS: Fix s_umount thaw/write deadlock

On Thu, Dec 08, 2011 at 12:16:58AM +0100, Jan Kara wrote:
> > We make sure to not dirty any new inodes after the first phase of the
> > freeze, so this should be a BUG_ON/WARN_ON.
> This is not really true in presence of mmaped writes. To block mmaped
> writes on a frozen filesystem, we need some synchronization between
> page_mkwrite() and freezing code. Currently, to avoid any additional
> locking overhead, we set page dirty and *then* check for filesystem being
> frozen. Only this order can make sure either the page is written (and
> write-protected) or the frozen check triggers and we wait... (see the
> comment in block_page_mkwrite()). The nasty sideeffect of this is that
> there can be dirty pages & inodes on a frozen filesystem. We are blocked in
> the page fault of these pages so user cannot write any data to these pages
> but still they are marked dirty.
>
> Alternatively we could have a different mechanism (rw semaphore?) to
> synchronize page faults and freezing but I'd hate the overhead for the case
> almost noone cares about...

I think the is the only sensible way to go forward. Requiring hacks in
lots of random places to work around the fact that a single place that
might actually dirty pages despite supposedly blocking that from happen
simply isn't maintainable over the long run.

> > > + */
> > > + if (vfs_is_frozen(sb)) {
> > > + ret = -EBUSY;
> > > + goto out_drop_super;
> > > + }
> >
> > How about spending the three minutes to figure it out?
> > Q_GETFMT/Q_GETINFO/Q_XGETQSTAT and Q_GETQUOTA are the obvious read-only
> > candidates.
> Q_GETQUOTA can actually cause filesystem modification (reservation of
> space in quota file) but the others are read-only. Also after some thought
> I'd prefer that quotactl(8) just blocks to be consistent with how other
> syscalls behave...

How can a simple dqget cause modifications in the VFS quota code?
Dirting anything for a simple read of the quota information is not
only completely non-obvious but also doesn't make much sene. We don't
dirty metadata on stat() either..