2007-05-16 17:05:46

by Jörn Engel

[permalink] [raw]
Subject: [PATCH resend] introduce I_SYNC

While others are busy coming up with new silly names, here is something
substantial to worry about.

Patches fixes a deadlock problem well enough for LogFS to survive. The
problem itself is generic and seems to be ancient. Shaggy has code in
JFS from about 2.4.20 that seems to work around the deadlock. Dave
Chinner indicated that this could cause latency problems (not a
deadlock) on the NFS server side. I still suspect that NTFS has hit the
same deadlock and its current "fix" will cause data corruption instead.

After all that, I consider it important that some version of this patch
gets merged. But before doing so, the patch needs better review and
testing. The code it changes is quite subtle and and easy to get wrong.


1. Introduction

In its write path, LogFS may have to do some Garbage Collection when
space is getting tight. GC requires reading inodes, so the I_LOCK bit
is taken for some random inodes.

I_LOCK is also held when syncing inodes to flash, so LogFS has to wait
for those inodes. Inodes are written by the same code path as regular
file data and needs to acquire an fs-global mutex. Call stacks of the
1-2 processes will look roughly like this:

Process A: Process B:
inode_wait [filesystem locking write path]
__wait_on_bit __writeback_single_inode
out_of_line_wait_on_bit
ifind_fast
[filesystem calling iget()]
[filesystem locking write path]


2. The usage of inode_lock and I_LOCK

Almost all modifications of inodes are protected by the inode_lock, a
global spinlock. Some modifications, however, can block for various
reasons and require the inode_lock to get dropped temporarily. In the
meantime, the individual inode needs to get protected somehow. Usually
this happens through the use of I_LOCK.

But I_LOCK is not a simple mutex. It is a Janus-faced bit in the inode
that is used for several things, including mutual exclusion and
completion notification. Most users are open-coded, so it is not easy
to follow, but can be summarized in the table below.

In this table columns indicate events when I_LOCK is either set or
reset (or not reset but all waiters are notified anyway). Rows
indicate code that either checks for I_LOCK and changes behaviour
depending on its state or is waiting until I_LOCK gets reset (or is
waiting even if I_LOCK is not set).

__sync_single_inode
| get_new_inode[_fast]
| | unlock_new_inode
| | | dispose_list
| | | | generic_delete_inode
| | | | | generic_forget_inode
lock v v | | | |
unlock/complete v v v v v comment
-------------------------------------------------------------------------------
__writeback_single_inodeX O O O O sync
write_inode_now X O O O O sync
clear_inode X O O O O sync
__mark_inode_dirty X O O O O lists
generic_osync_inode X O O O O sync
get_new_inode[_fast] O X O O O mutex
find_inode[_fast] O O X X X I_FREEING
ifind[_fast] O X O O O read

jfs txCommit ? ? ? ? ? ?
xfs_ichgtime[_fast] X O O O O sync

Comments:
sync - wait for writeout to finish
lists - move inode to dirty list without racing against __sync_single_inode
mutex - protect against two concurrent get_new_inode[_fast] creating two inodes
I_FREEING - wait for inode to get freed, then repeat
read - don't return inode until it is read from medium

Now, the "X"s mark combinations where columns and rows are related.
"O"s mark combinations where afaics columns and rows share no
relationship whatsoever except that both use either I_LOCK or
wake_up_inode()/wait_on_inode() or any other of the partially open-coded
variants.

The table shows that two large usage groups exist for I_LOCK, one
dealing exclusively with the various sync() functions in
fs/fs-writeback.c and another basically confined to fs/inode.c code.
JFS has one remaining user that is unclear to me.


This patch introduces a new flag, I_SYNC and seperates out all sync()
users of I_LOCK to use the new flag instead.


fs/fs-writeback.c | 39 ++++++++++++++++++++++++---------------
fs/xfs/linux-2.6/xfs_iops.c | 4 ++--
include/linux/fs.h | 2 ++
include/linux/writeback.h | 7 +++++++
4 files changed, 35 insertions(+), 17 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -99,11 +99,11 @@ void __mark_inode_dirty(struct inode *in
inode->i_state |= flags;

/*
- * If the inode is locked, just update its dirty state.
+ * If the inode is being synced, just update its dirty state.
* The unlocker will place the inode on the appropriate
* superblock list, based upon its state.
*/
- if (inode->i_state & I_LOCK)
+ if (inode->i_state & I_SYNC)
goto out;

/*
@@ -139,6 +139,15 @@ static int write_inode(struct inode *ino
return 0;
}

+static void inode_sync_complete(struct inode *inode)
+{
+ /*
+ * Prevent speculative execution through spin_unlock(&inode_lock);
+ */
+ smp_mb();
+ wake_up_bit(&inode->i_state, __I_SYNC);
+}
+
/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
@@ -158,11 +167,11 @@ __sync_single_inode(struct inode *inode,
int wait = wbc->sync_mode == WB_SYNC_ALL;
int ret;

- BUG_ON(inode->i_state & I_LOCK);
+ BUG_ON(inode->i_state & I_SYNC);

- /* Set I_LOCK, reset I_DIRTY */
+ /* Set I_SYNC, reset I_DIRTY */
dirty = inode->i_state & I_DIRTY;
- inode->i_state |= I_LOCK;
+ inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY;

spin_unlock(&inode_lock);
@@ -183,7 +192,7 @@ __sync_single_inode(struct inode *inode,
}

spin_lock(&inode_lock);
- inode->i_state &= ~I_LOCK;
+ inode->i_state &= ~I_SYNC;
if (!(inode->i_state & I_FREEING)) {
if (!(inode->i_state & I_DIRTY) &&
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -231,7 +240,7 @@ __sync_single_inode(struct inode *inode,
list_move(&inode->i_list, &inode_unused);
}
}
- wake_up_inode(inode);
+ inode_sync_complete(inode);
return ret;
}

@@ -250,7 +259,7 @@ __writeback_single_inode(struct inode *i
else
WARN_ON(inode->i_state & I_WILL_FREE);

- if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
+ if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
struct address_space *mapping = inode->i_mapping;
int ret;

@@ -269,16 +278,16 @@ __writeback_single_inode(struct inode *i
/*
* It's a data-integrity sync. We must wait.
*/
- if (inode->i_state & I_LOCK) {
- DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LOCK);
+ if (inode->i_state & I_SYNC) {
+ DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);

- wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
+ wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
do {
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait,
TASK_UNINTERRUPTIBLE);
spin_lock(&inode_lock);
- } while (inode->i_state & I_LOCK);
+ } while (inode->i_state & I_SYNC);
}
return __sync_single_inode(inode, wbc);
}
@@ -311,7 +320,7 @@ __writeback_single_inode(struct inode *i
* The inodes to be written are parked on sb->s_io. They are moved back onto
* sb->s_dirty as they are selected for writing. This way, none can be missed
* on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on __wait_on_inode.
+ * throttled threads: we don't want them all piling up on inode_sync_wait.
*/
static void
sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
@@ -583,7 +592,7 @@ int write_inode_now(struct inode *inode,
ret = __writeback_single_inode(inode, &wbc);
spin_unlock(&inode_lock);
if (sync)
- wait_on_inode(inode);
+ inode_sync_wait(inode);
return ret;
}
EXPORT_SYMBOL(write_inode_now);
@@ -658,7 +667,7 @@ int generic_osync_inode(struct inode *in
err = err2;
}
else
- wait_on_inode(inode);
+ inode_sync_wait(inode);

return err;
}
unchanged:
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1184,6 +1184,8 @@ #define I_FREEING 16
#define I_CLEAR 32
#define I_NEW 64
#define I_WILL_FREE 128
+#define __I_SYNC 8
+#define I_SYNC (1 << __I_SYNC) /* Currently being synced */

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

unchanged:
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -77,6 +77,13 @@ static inline void wait_on_inode(struct
wait_on_bit(&inode->i_state, __I_LOCK, inode_wait,
TASK_UNINTERRUPTIBLE);
}
+static inline void inode_sync_wait(struct inode *inode)
+{
+ might_sleep();
+ wait_on_bit(&inode->i_state, __I_SYNC, inode_wait,
+ TASK_UNINTERRUPTIBLE);
+}
+

/*
* mm/page-writeback.c
only in patch2:
unchanged:
--- a/fs/xfs/linux-2.6/xfs_iops.c
+++ b/fs/xfs/linux-2.6/xfs_iops.c
@@ -133,7 +133,7 @@ xfs_ichgtime(
*/
SYNCHRONIZE();
ip->i_update_core = 1;
- if (!(inode->i_state & I_LOCK))
+ if (!(inode->i_state & I_SYNC))
mark_inode_dirty_sync(inode);
}

@@ -185,7 +185,7 @@ xfs_ichgtime_fast(
*/
SYNCHRONIZE();
ip->i_update_core = 1;
- if (!(inode->i_state & I_LOCK))
+ if (!(inode->i_state & I_SYNC))
mark_inode_dirty_sync(inode);
}


2007-05-16 17:23:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

On Wed, 16 May 2007 19:01:14 +0200 J?rn Engel <[email protected]> wrote:

> This patch introduces a new flag, I_SYNC and seperates out all sync()
> users of I_LOCK to use the new flag instead.

gack, you like sticking your head in dark and dusty holes.

If we're going to do this then please let's get some exhaustive commentary
in there so that others have a chance of understanding these flags without
having to do the amount of reverse-engineering which you've been put through.

2007-05-16 17:52:20

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

On Wed, 16 May 2007 10:15:35 -0700, Andrew Morton wrote:
> On Wed, 16 May 2007 19:01:14 +0200 Jörn Engel <[email protected]> wrote:
>
> > This patch introduces a new flag, I_SYNC and seperates out all sync()
> > users of I_LOCK to use the new flag instead.
>
> gack, you like sticking your head in dark and dusty holes.

I don't remember enjoying the experience too much. For a day or two I
was thinking about quitting and becoming a monk somewhere.

> If we're going to do this then please let's get some exhaustive commentary
> in there so that others have a chance of understanding these flags without
> having to do the amount of reverse-engineering which you've been put through.

Not sure how the monasteries would cope with that.

Jörn

--
I've never met a human being who would want to read 17,000 pages of
documentation, and if there was, I'd kill him to get him out of the
gene pool.
-- Joseph Costello

2007-05-31 14:30:27

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

On Wed, 16 May 2007 10:15:35 -0700, Andrew Morton wrote:
>
> If we're going to do this then please let's get some exhaustive commentary
> in there so that others have a chance of understanding these flags without
> having to do the amount of reverse-engineering which you've been put through.

Done. Found and fixed some bugs in the process. By now I feal
reasonable certain that the patch fixes more than it breaks.

Jörn

--
Good warriors cause others to come to them and do not go to others.
-- Sun Tzu

Introduce I_SYNC.

I_LOCK was used for several unrelated purposes, which caused deadlock
situations in certain filesystems as a side effect. One of the purposes
now uses the new I_SYNC bit.

Also document the various bits and change their order from historical to
logical.

Signed-off-by: Jörn Engel <[email protected]>
---

fs/fs-writeback.c | 39 +++++++++++++++---------
fs/hugetlbfs/inode.c | 2 -
fs/inode.c | 6 +--
fs/jfs/jfs_txnmgr.c | 9 +++++
fs/xfs/linux-2.6/xfs_iops.c | 4 +-
include/linux/fs.h | 70 ++++++++++++++++++++++++++++++++++++++------
include/linux/writeback.h | 7 ++++
mm/page-writeback.c | 2 -
8 files changed, 107 insertions(+), 32 deletions(-)

--- linux-2.6.21logfs/fs/fs-writeback.c~I_LOCK 2007-05-07 10:28:53.000000000 +0200
+++ linux-2.6.21logfs/fs/fs-writeback.c 2007-05-07 13:29:35.000000000 +0200
@@ -99,11 +99,11 @@ void __mark_inode_dirty(struct inode *in
inode->i_state |= flags;

/*
- * If the inode is locked, just update its dirty state.
+ * If the inode is being synced, just update its dirty state.
* The unlocker will place the inode on the appropriate
* superblock list, based upon its state.
*/
- if (inode->i_state & I_LOCK)
+ if (inode->i_state & I_SYNC)
goto out;

/*
@@ -139,6 +139,15 @@ static int write_inode(struct inode *ino
return 0;
}

+static void inode_sync_complete(struct inode *inode)
+{
+ /*
+ * Prevent speculative execution through spin_unlock(&inode_lock);
+ */
+ smp_mb();
+ wake_up_bit(&inode->i_state, __I_SYNC);
+}
+
/*
* Write a single inode's dirty pages and inode data out to disk.
* If `wait' is set, wait on the writeout.
@@ -158,11 +167,11 @@ __sync_single_inode(struct inode *inode,
int wait = wbc->sync_mode == WB_SYNC_ALL;
int ret;

- BUG_ON(inode->i_state & I_LOCK);
+ BUG_ON(inode->i_state & I_SYNC);

- /* Set I_LOCK, reset I_DIRTY */
+ /* Set I_SYNC, reset I_DIRTY */
dirty = inode->i_state & I_DIRTY;
- inode->i_state |= I_LOCK;
+ inode->i_state |= I_SYNC;
inode->i_state &= ~I_DIRTY;

spin_unlock(&inode_lock);
@@ -183,7 +192,7 @@ __sync_single_inode(struct inode *inode,
}

spin_lock(&inode_lock);
- inode->i_state &= ~I_LOCK;
+ inode->i_state &= ~I_SYNC;
if (!(inode->i_state & I_FREEING)) {
if (!(inode->i_state & I_DIRTY) &&
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
@@ -231,7 +240,7 @@ __sync_single_inode(struct inode *inode,
list_move(&inode->i_list, &inode_unused);
}
}
- wake_up_inode(inode);
+ inode_sync_complete(inode);
return ret;
}

@@ -250,7 +259,7 @@ __writeback_single_inode(struct inode *i
else
WARN_ON(inode->i_state & I_WILL_FREE);

- if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_LOCK)) {
+ if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
struct address_space *mapping = inode->i_mapping;
int ret;

@@ -269,16 +278,16 @@ __writeback_single_inode(struct inode *i
/*
* It's a data-integrity sync. We must wait.
*/
- if (inode->i_state & I_LOCK) {
- DEFINE_WAIT_BIT(wq, &inode->i_state, __I_LOCK);
+ if (inode->i_state & I_SYNC) {
+ DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);

- wqh = bit_waitqueue(&inode->i_state, __I_LOCK);
+ wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
do {
spin_unlock(&inode_lock);
__wait_on_bit(wqh, &wq, inode_wait,
TASK_UNINTERRUPTIBLE);
spin_lock(&inode_lock);
- } while (inode->i_state & I_LOCK);
+ } while (inode->i_state & I_SYNC);
}
return __sync_single_inode(inode, wbc);
}
@@ -311,7 +320,7 @@ __writeback_single_inode(struct inode *i
* The inodes to be written are parked on sb->s_io. They are moved back onto
* sb->s_dirty as they are selected for writing. This way, none can be missed
* on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on __wait_on_inode.
+ * throttled threads: we don't want them all piling up on inode_sync_wait.
*/
static void
sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
@@ -583,7 +592,7 @@ int write_inode_now(struct inode *inode,
ret = __writeback_single_inode(inode, &wbc);
spin_unlock(&inode_lock);
if (sync)
- wait_on_inode(inode);
+ inode_sync_wait(inode);
return ret;
}
EXPORT_SYMBOL(write_inode_now);
@@ -658,7 +667,7 @@ int generic_osync_inode(struct inode *in
err = err2;
}
else
- wait_on_inode(inode);
+ inode_sync_wait(inode);

return err;
}
--- linux-2.6.21logfs/include/linux/fs.h~I_LOCK 2007-05-07 13:24:00.000000000 +0200
+++ linux-2.6.21logfs/include/linux/fs.h 2007-05-29 08:42:18.000000000 +0200
@@ -1174,16 +1174,68 @@ struct super_operations {
#endif
};

-/* Inode state bits. Protected by inode_lock. */
-#define I_DIRTY_SYNC 1 /* Not dirty enough for O_DATASYNC */
-#define I_DIRTY_DATASYNC 2 /* Data-related inode changes pending */
-#define I_DIRTY_PAGES 4 /* Data-related inode changes pending */
-#define __I_LOCK 3
+/*
+ * Inode state bits. Protected by inode_lock.
+ *
+ * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
+ * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
+ *
+ * Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
+ * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
+ * various stages of removing an inode.
+ *
+ * Two bits are used for locking and completion notification, I_LOCK and I_SYNC.
+ *
+ * I_DIRTY_SYNC Inode itself is dirty.
+ * I_DIRTY_DATASYNC Data-related inode changes pending
+ * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean.
+ * I_NEW get_new_inode() sets i_state to I_LOCK|I_NEW. Both
+ * are cleared by unlock_new_inode(), called from iget().
+ * I_WILL_FREE Must be set when calling write_inode_now() if i_count
+ * is zero. I_FREEING must be set when I_WILL_FREE is
+ * cleared.
+ * I_FREEING Set when inode is about to be freed but still has dirty
+ * pages or buffers attached or the inode itself is still
+ * dirty.
+ * I_CLEAR Set by clear_inode(). In this state the inode is clean
+ * and can be destroyed.
+ *
+ * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
+ * prohibited for many purposes. iget() must wait for
+ * the inode to be completely released, then create it
+ * anew. Other functions will just ignore such inodes,
+ * if appropriate. I_LOCK is used for waiting.
+ *
+ * I_LOCK Serves as both a mutex and completion notification.
+ * New inodes set I_LOCK. If two processes both create
+ * the same inode, one of them will release its inode and
+ * wait for I_LOCK to be released before returning.
+ * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can
+ * also cause waiting on I_LOCK, without I_LOCK actually
+ * being set. find_inode() uses this to prevent returning
+ * nearly-dead inodes.
+ * I_SYNC Similar to I_LOCK, but limited in scope to writeback
+ * of inode dirty data. Having a seperate lock for this
+ * purpose reduces latency and prevents some filesystem-
+ * specific deadlocks.
+ *
+ * Q: Why does I_DIRTY_DATASYNC exist? It appears as if it could be replaced
+ * by (I_DIRTY_SYNC|I_DIRTY_PAGES).
+ * Q: What is the difference between I_WILL_FREE and I_FREEING?
+ * Q: igrab() only checks on (I_FREEING|I_WILL_FREE). Should it also check on
+ * I_CLEAR? If not, why?
+ */
+#define I_DIRTY_SYNC 1
+#define I_DIRTY_DATASYNC 2
+#define I_DIRTY_PAGES 4
+#define I_NEW 8
+#define I_WILL_FREE 16
+#define I_FREEING 32
+#define I_CLEAR 64
+#define __I_LOCK 7
#define I_LOCK (1 << __I_LOCK)
-#define I_FREEING 16
-#define I_CLEAR 32
-#define I_NEW 64
-#define I_WILL_FREE 128
+#define __I_SYNC 8
+#define I_SYNC (1 << __I_SYNC)

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

--- linux-2.6.21logfs/include/linux/writeback.h~I_LOCK 2007-05-07 13:24:01.000000000 +0200
+++ linux-2.6.21logfs/include/linux/writeback.h 2007-05-07 13:29:35.000000000 +0200
@@ -77,6 +77,13 @@ static inline void wait_on_inode(struct
wait_on_bit(&inode->i_state, __I_LOCK, inode_wait,
TASK_UNINTERRUPTIBLE);
}
+static inline void inode_sync_wait(struct inode *inode)
+{
+ might_sleep();
+ wait_on_bit(&inode->i_state, __I_SYNC, inode_wait,
+ TASK_UNINTERRUPTIBLE);
+}
+

/*
* mm/page-writeback.c
--- linux-2.6.21logfs/fs/inode.c~I_LOCK 2007-05-07 10:28:55.000000000 +0200
+++ linux-2.6.21logfs/fs/inode.c 2007-05-29 13:26:57.000000000 +0200
@@ -228,7 +228,7 @@ void __iget(struct inode * inode)
return;
}
atomic_inc(&inode->i_count);
- if (!(inode->i_state & (I_DIRTY|I_LOCK)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_in_use);
inodes_stat.nr_unused--;
}
@@ -249,7 +249,7 @@ void clear_inode(struct inode *inode)
BUG_ON(inode->i_data.nrpages);
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
- wait_on_inode(inode);
+ inode_sync_wait(inode);
DQUOT_DROP(inode);
if (inode->i_sb && inode->i_sb->s_op->clear_inode)
inode->i_sb->s_op->clear_inode(inode);
@@ -1038,7 +1038,7 @@ static void generic_forget_inode(struct
struct super_block *sb = inode->i_sb;

if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_LOCK)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_unused);
inodes_stat.nr_unused++;
if (!sb || (sb->s_flags & MS_ACTIVE)) {
--- linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c~I_LOCK 2007-05-07 10:28:55.000000000 +0200
+++ linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c 2007-05-29 13:10:32.000000000 +0200
@@ -1286,7 +1286,14 @@ int txCommit(tid_t tid, /* transaction
* commit the transaction synchronously, so the last iput
* will be done by the calling thread (or later)
*/
- if (tblk->u.ip->i_state & I_LOCK)
+ /*
+ * I believe this code is no longer needed. Splitting I_LOCK
+ * into two bits, I_LOCK and I_SYNC should prevent this
+ * deadlock as well. But since I don't have a JFS testload
+ * to verify this, only a trivial s/I_LOCK/I_SYNC/ was done.
+ * Joern
+ */
+ if (tblk->u.ip->i_state & I_SYNC)
tblk->xflag &= ~COMMIT_LAZY;
}

--- linux-2.6.21logfs/fs/xfs/linux-2.6/xfs_iops.c~I_LOCK 2007-05-07 10:29:00.000000000 +0200
+++ linux-2.6.21logfs/fs/xfs/linux-2.6/xfs_iops.c 2007-05-29 13:15:14.000000000 +0200
@@ -133,7 +133,7 @@ xfs_ichgtime(
*/
SYNCHRONIZE();
ip->i_update_core = 1;
- if (!(inode->i_state & I_LOCK))
+ if (!(inode->i_state & I_SYNC))
mark_inode_dirty_sync(inode);
}

@@ -185,7 +185,7 @@ xfs_ichgtime_fast(
*/
SYNCHRONIZE();
ip->i_update_core = 1;
- if (!(inode->i_state & I_LOCK))
+ if (!(inode->i_state & I_SYNC))
mark_inode_dirty_sync(inode);
}

--- linux-2.6.21logfs/fs/hugetlbfs/inode.c~I_LOCK 2007-05-07 10:28:55.000000000 +0200
+++ linux-2.6.21logfs/fs/hugetlbfs/inode.c 2007-05-29 13:25:05.000000000 +0200
@@ -229,7 +229,7 @@ static void hugetlbfs_forget_inode(struc
struct super_block *sb = inode->i_sb;

if (!hlist_unhashed(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_LOCK)))
+ if (!(inode->i_state & (I_DIRTY|I_SYNC)))
list_move(&inode->i_list, &inode_unused);
inodes_stat.nr_unused++;
if (!sb || (sb->s_flags & MS_ACTIVE)) {
--- linux-2.6.21logfs/mm/page-writeback.c~I_LOCK 2007-05-07 13:24:03.000000000 +0200
+++ linux-2.6.21logfs/mm/page-writeback.c 2007-05-29 13:28:10.000000000 +0200
@@ -36,7 +36,7 @@

/*
* The maximum number of pages to writeout in a single bdflush/kupdate
- * operation. We do this so we don't hold I_LOCK against an inode for
+ * operation. We do this so we don't hold I_SYNC against an inode for
* enormous amounts of time, which would block a userspace task which has
* been forced to throttle against that inode. Also, the code reevaluates
* the dirty each time it has written this many pages.

2007-05-31 18:06:20

by Dave Kleikamp

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

On Thu, 2007-05-31 at 16:25 +0200, J?rn Engel wrote:
> --- linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c~I_LOCK 2007-05-07
> 10:28:55.000000000 +0200
> +++ linux-2.6.21logfs/fs/jfs/jfs_txnmgr.c 2007-05-29
> 13:10:32.000000000 +0200
> @@ -1286,7 +1286,14 @@ int txCommit(tid_t tid, /*
> transaction
> * commit the transaction synchronously, so the last
> iput
> * will be done by the calling thread (or later)
> */
> - if (tblk->u.ip->i_state & I_LOCK)
> + /*
> + * I believe this code is no longer needed. Splitting
> I_LOCK
> + * into two bits, I_LOCK and I_SYNC should prevent
> this
> + * deadlock as well. But since I don't have a JFS
> testload
> + * to verify this, only a trivial s/I_LOCK/I_SYNC/ was
> done.
> + * Joern
> + */
> + if (tblk->u.ip->i_state & I_SYNC)
> tblk->xflag &= ~COMMIT_LAZY;
> }

I think the code is still needed, and I think this change is correct.
The deadlock that this code is avoiding is caused by clear_inode()
calling wait_on_inode(). Since clear_inode() now calls
inode_sync_wait(inode), we want to avoid the lazily committing this
transaction when the I_SYNC flag is set.

Unfortunately, recreating the deadlock is hard, and I haven't been able
to recreate it with this code commented out.

Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center

2007-05-31 22:48:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

On Thu, 31 May 2007 16:25:35 +0200
J?rn Engel <[email protected]> wrote:

> On Wed, 16 May 2007 10:15:35 -0700, Andrew Morton wrote:
> >
> > If we're going to do this then please let's get some exhaustive commentary
> > in there so that others have a chance of understanding these flags without
> > having to do the amount of reverse-engineering which you've been put through.
>
> Done. Found and fixed some bugs in the process. By now I feal
> reasonable certain that the patch fixes more than it breaks.
>

>
> --
> Good warriors cause others to come to them and do not go to others.
> -- Sun Tzu
>
> Introduce I_SYNC.
>
> I_LOCK was used for several unrelated purposes, which caused deadlock
> situations in certain filesystems as a side effect. One of the purposes
> now uses the new I_SYNC bit.

Do we know what those deadlocks were? It's a bit of a mystery patch otherwise.

Put yourself in the position of random-distro-engineer wondering "should I
backport this?".

> Also document the various bits and change their order from historical to
> logical.

What a nice comment you added ;)

2007-06-01 08:59:44

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

Hi,

On 16 May 2007, at 18:01, J?rn Engel wrote:
> Patches fixes a deadlock problem well enough for LogFS to survive.
> The
> problem itself is generic and seems to be ancient. Shaggy has code in
> JFS from about 2.4.20 that seems to work around the deadlock. Dave
> Chinner indicated that this could cause latency problems (not a
> deadlock) on the NFS server side.

I agree that your patch is a good idea. I reviewed the latest
incarnation and it makes sense to me. And your comment concerning
the flags is a very welcome addition. Probably ought to find its way
into Documentation/filesystems/Locking or vfs.txt or somewhere like
that also.

Note that once your patch is applied I think it would make sense to
follow up with a second patch to remove the I_LOCK flag completely.
The only remaining uses are either together with I_NEW in which case
I_LOCK can be removed altogether or can be substituted with I_NEW
when only I_LOCK is used. This is because no places remain where we
set I_LOCK by itself any more with your patch. The only place where
we set it is the place where a new inode gets created in memory and
in that place we also set I_NEW at the same time as I_LOCK.
wait_on_inode() can then be changed to wait on I_NEW instead of on
I_LOCKED. That way we have one less confusing flag to worry about
and things are much easier to understand.

> I still suspect that NTFS has hit the same deadlock and its current
> "fix" will cause data corruption instead.

The NTFS "fix" will not cause data corruption at all. The usage in
NTFS is very different... I am afraid your patch does not address
the deadlock with NTFS or rather it only addresses the inode write
deadlock and does not address the get_new_inode() deadlock that
exists with ilookup5() and is avoided by ilookup5_nowait(). This
deadlock is inherent to what NTFS does so you don't need to worry
about it. (If you want I am happy to explain it but I would rather
not waste my time explaining if no-one except me cares about it...)

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


2007-06-01 12:01:35

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

On Fri, 1 June 2007 09:59:17 +0100, Anton Altaparmakov wrote:
>
> I agree that your patch is a good idea. I reviewed the latest
> incarnation and it makes sense to me. And your comment concerning

Thanks.

> the flags is a very welcome addition. Probably ought to find its way
> into Documentation/filesystems/Locking or vfs.txt or somewhere like
> that also.

Might make sense. Right now I would be more interested in getting the
questions at the bottom answered. Possibly the right answer might be
"here is a patch to fix it".

> Note that once your patch is applied I think it would make sense to
> follow up with a second patch to remove the I_LOCK flag completely.
> The only remaining uses are either together with I_NEW in which case
> I_LOCK can be removed altogether or can be substituted with I_NEW
> when only I_LOCK is used. This is because no places remain where we
> set I_LOCK by itself any more with your patch. The only place where
> we set it is the place where a new inode gets created in memory and
> in that place we also set I_NEW at the same time as I_LOCK.
> wait_on_inode() can then be changed to wait on I_NEW instead of on
> I_LOCKED. That way we have one less confusing flag to worry about
> and things are much easier to understand.

True. Waiting on I_NEW would be equivalent to waiting on I_LOCK. To
some degree I still prefer the current method. I_NEW is a state, while
I_LOCK is a lock or completion method. Having a confusing mix of
state/lock/completion bits is bad enough. Having such a mix of uses for
a single bit could be even worse.

> >I still suspect that NTFS has hit the same deadlock and its current
> >"fix" will cause data corruption instead.
>
> The NTFS "fix" will not cause data corruption at all. The usage in
> NTFS is very different... I am afraid your patch does not address
> the deadlock with NTFS or rather it only addresses the inode write
> deadlock and does not address the get_new_inode() deadlock that
> exists with ilookup5() and is avoided by ilookup5_nowait(). This
> deadlock is inherent to what NTFS does so you don't need to worry
> about it. (If you want I am happy to explain it but I would rather
> not waste my time explaining if no-one except me cares about it...)

Two seperate deadlocks exist, we agree on this. I_SYNC only solves one
of the two. LogFS solved the second deadlock by implementing its own
destroy_inode() and drop_inode() methods. Any inodes that would cause
the get_new_inode() deadlock get cached and logfs implements its own
iget() method to return a cached inode from any deadlock-prone codepath.

It is not a pretty solution, but ilookup_nowait() would definitely cause
data corruption for logfs, so that was not an option. Better ideas are
very welcome.

Jörn

--
The strong give up and move away, while the weak give up and stay.
-- unknown

2007-06-01 12:04:53

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH resend] introduce I_SYNC

On Thu, 31 May 2007 15:46:48 -0700, Andrew Morton wrote:
> >
> > I_LOCK was used for several unrelated purposes, which caused deadlock
> > situations in certain filesystems as a side effect. One of the purposes
> > now uses the new I_SYNC bit.
>
> Do we know what those deadlocks were? It's a bit of a mystery patch otherwise.
>
> Put yourself in the position of random-distro-engineer wondering "should I
> backport this?".

The logfs deadlock is well-known. All others are very handwavy and may
or may not really exist.

Will resend with description and without the jfs comment.

> > Also document the various bits and change their order from historical to
> > logical.
>
> What a nice comment you added ;)

And now I know how to bribe you into accepting patches. ;)

Jörn

--
Unless something dramatically changes, by 2015 we'll be largely
wondering what all the fuss surrounding Linux was really about.
-- Rob Enderle