2008-06-26 04:41:32

by Dave Chinner

Subject: [PATCH 0/6] Remove most users of semaphores from XFS.

This series aims to convert all but one of the remaining users of
semaphores in the XFS code to use completions. Two of these
semaphores don't quite match completion semantics, but a small
amount of additional code on top of the completions fixes this
problem. I'm open to suggestions on different/better ways to implement
this.

The patch series does not touch the b_lock semaphore in the
xfs_buf_t. At this point I'm not sure what we want to do with that
semaphore, so I've ignored that for now. Also, this lock uses
Linux primitives, not the XFS sema_t primitives, so it doesn't
need changing to allow me to remove the sema_t.


2008-06-26 04:41:44

by Dave Chinner

Subject: [PATCH 1/6] Extend completions to provide XFS object flush requirements

XFS object flushing doesn't quite match existing completion semantics. It
mixes exclusive access with completion. That is, we need to mark an object as
being flushed before flushing it to disk, and then block any other attempt to
flush it until the completion occurs.

To do this we introduce:

void init_completion_flush(struct completion *x)
which initialises x->done = 1

void completion_flush_start(struct completion *x)
which blocks if done == 0, otherwise decrements done to zero and
allows the caller to continue.

bool completion_flush_start_nowait(struct completion *x)
returns a failure status if done == 0, otherwise decrements done
to zero and returns a "flush started" status. This is provided
to allow flushing to begin safely while holding object locks in
inverted order.

This replaces the use of semaphores for providing this exclusion
and completion mechanism.
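
As an illustration, the intended calling pattern looks roughly like the
following (the object and function names here are made up for the example
and are not part of the patch):

	/* Hypothetical object protected by a flush completion. */
	struct foo {
		struct completion	flush;
	};

	void foo_init(struct foo *f)
	{
		init_completion_flush(&f->flush);	/* done = 1: no flush active */
	}

	/* Thread A: mark the object as being flushed and write it out. */
	void foo_flush(struct foo *f)
	{
		completion_flush_start(&f->flush);	/* blocks while another flush is active */
		/* ... object is now exclusively "flush in progress"; issue the I/O ... */
	}

	/* Thread B: I/O completion, possibly a different thread to A. */
	void foo_flush_done(struct foo *f)
	{
		complete(&f->flush);			/* done back to 1; wakes the next flusher */
	}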

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/completion.h | 65 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 65 insertions(+), 0 deletions(-)

diff --git a/include/linux/completion.h b/include/linux/completion.h
index d2961b6..82d9fa4 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -55,4 +55,69 @@ extern void complete_all(struct completion *);

#define INIT_COMPLETION(x) ((x).done = 0)

+/*
+ * Give completions flush lock semantics.
+ *
+ * A flush lock is a completion that allows only a single thread to be flushing
+ * an object. That is, the first thread does not block on the completion, but
+ * forces all subsequent threads to block until the first thread issues a
+ * complete(). This allows only a single I/O at a time on an object protected
+ * by such a completion queue.
+ *
+ * Rather than make the calling code confusing by overloading
+ * "wait_for_completion()" for this, introduce a new interface that "starts
+ * a flush". In some cases we may want to try to start a flush
+ * but not block if it would involve sleeping, so allow those
+ * semantics as well.
+ */
+
+static inline void init_completion_flush(struct completion *x)
+{
+ x->done = 1;
+ init_waitqueue_head(&x->wait);
+}
+
+
+static inline void completion_flush_start(struct completion *x)
+{
+ wait_for_completion(x);
+}
+
+/*
+ * Start a flush without blocking.
+ *
+ * Returns 0 if a flush is already in progress, without
+ * modifying the done counter. Otherwise, takes a count
+ * so that others will see a flush in progress and
+ * returns 1 to indicate that a flush has been started.
+ */
+static inline bool completion_flush_start_nowait(struct completion *x)
+{
+ int ret = 1;
+
+ spin_lock_irq(&x->wait.lock);
+ if (!x->done)
+ ret = 0;
+ else
+ x->done--;
+ spin_unlock_irq(&x->wait.lock);
+ return ret;
+}
+
+/*
+ * Test to see if a flush is in progress.
+ *
+ * Returns 1 if a flush is in progress, otherwise 0.
+ */
+static inline bool completion_flush_inprogress(struct completion *x)
+{
+ int ret = 0;
+
+ spin_lock_irq(&x->wait.lock);
+ if (!x->done)
+ ret = 1;
+ spin_unlock_irq(&x->wait.lock);
+ return ret;
+}
+
#endif
--
1.5.5.4

2008-06-26 04:42:21

by Dave Chinner

Subject: [PATCH 5/6] Remove the sema_t from XFS.

Now that all users of the sema_t are gone from XFS we
can finally kill it.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/sema.h | 52 ------------------------------------------
fs/xfs/linux-2.6/xfs_linux.h | 2 +-
2 files changed, 1 insertions(+), 53 deletions(-)
delete mode 100644 fs/xfs/linux-2.6/sema.h

diff --git a/fs/xfs/linux-2.6/sema.h b/fs/xfs/linux-2.6/sema.h
deleted file mode 100644
index 3abe7e9..0000000
--- a/fs/xfs/linux-2.6/sema.h
+++ /dev/null
@@ -1,52 +0,0 @@
-/*
- * Copyright (c) 2000-2002,2005 Silicon Graphics, Inc.
- * All Rights Reserved.
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it would be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write the Free Software Foundation,
- * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
- */
-#ifndef __XFS_SUPPORT_SEMA_H__
-#define __XFS_SUPPORT_SEMA_H__
-
-#include <linux/time.h>
-#include <linux/wait.h>
-#include <linux/semaphore.h>
-#include <asm/atomic.h>
-
-/*
- * sema_t structure just maps to struct semaphore in Linux kernel.
- */
-
-typedef struct semaphore sema_t;
-
-#define initnsema(sp, val, name) sema_init(sp, val)
-#define psema(sp, b) down(sp)
-#define vsema(sp) up(sp)
-#define freesema(sema) do { } while (0)
-
-static inline int issemalocked(sema_t *sp)
-{
- return down_trylock(sp) || (up(sp), 0);
-}
-
-/*
- * Map cpsema (try to get the sema) to down_trylock. We need to switch
- * the return values since cpsema returns 1 (acquired) 0 (failed) and
- * down_trylock returns the reverse 0 (acquired) 1 (failed).
- */
-static inline int cpsema(sema_t *sp)
-{
- return down_trylock(sp) ? 0 : 1;
-}
-
-#endif /* __XFS_SUPPORT_SEMA_H__ */
diff --git a/fs/xfs/linux-2.6/xfs_linux.h b/fs/xfs/linux-2.6/xfs_linux.h
index 4d45d93..4fb838f 100644
--- a/fs/xfs/linux-2.6/xfs_linux.h
+++ b/fs/xfs/linux-2.6/xfs_linux.h
@@ -45,13 +45,13 @@
#include <mrlock.h>
#include <sv.h>
#include <mutex.h>
-#include <sema.h>
#include <time.h>

#include <support/ktrace.h>
#include <support/debug.h>
#include <support/uuid.h>

+#include <linux/semaphore.h>
#include <linux/mm.h>
#include <linux/kernel.h>
#include <linux/blkdev.h>
--
1.5.5.4

2008-06-26 04:41:58

by Dave Chinner

Subject: [PATCH 2/6] Replace inode flush semaphore with a completion

Use the new completion flush code to implement the inode
flush lock. Removes one of the final users of semaphores
in the XFS code base.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_iget.c | 15 +++++++--------
fs/xfs/xfs_inode.c | 11 +++++------
fs/xfs/xfs_inode.h | 2 +-
fs/xfs/xfs_inode_item.c | 11 +++++------
4 files changed, 18 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/xfs_iget.c b/fs/xfs/xfs_iget.c
index b07604b..5eed054 100644
--- a/fs/xfs/xfs_iget.c
+++ b/fs/xfs/xfs_iget.c
@@ -216,7 +216,7 @@ finish_inode:
mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
init_waitqueue_head(&ip->i_ipin_wait);
atomic_set(&ip->i_pincount, 0);
- initnsema(&ip->i_flock, 1, "xfsfino");
+ init_completion_flush(&ip->i_flush);

if (lock_flags)
xfs_ilock(ip, lock_flags);
@@ -776,25 +776,24 @@ xfs_isilocked(
#endif

/*
- * The following three routines simply manage the i_flock
- * semaphore embedded in the inode. This semaphore synchronizes
- * processes attempting to flush the in-core inode back to disk.
+ * Manage the i_flush queue embedded in the inode. This completion
+ * queue synchronizes processes attempting to flush the in-core
+ * inode back to disk.
*/
void
xfs_iflock(xfs_inode_t *ip)
{
- psema(&(ip->i_flock), PINOD|PLTWAIT);
+ completion_flush_start(&ip->i_flush);
}

int
xfs_iflock_nowait(xfs_inode_t *ip)
{
- return (cpsema(&(ip->i_flock)));
+ return completion_flush_start_nowait(&ip->i_flush);
}

void
xfs_ifunlock(xfs_inode_t *ip)
{
- ASSERT(issemalocked(&(ip->i_flock)));
- vsema(&(ip->i_flock));
+ complete(&ip->i_flush);
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index bedc661..81e2040 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2630,7 +2630,6 @@ xfs_idestroy(
xfs_idestroy_fork(ip, XFS_ATTR_FORK);
mrfree(&ip->i_lock);
mrfree(&ip->i_iolock);
- freesema(&ip->i_flock);

#ifdef XFS_INODE_TRACE
ktrace_free(ip->i_trace);
@@ -3048,10 +3047,10 @@ cluster_corrupt_out:
/*
* xfs_iflush() will write a modified inode's changes out to the
* inode's on disk home. The caller must have the inode lock held
- * in at least shared mode and the inode flush semaphore must be
- * held as well. The inode lock will still be held upon return from
+ * in at least shared mode and the inode flush completion must be
+ * active as well. The inode lock will still be held upon return from
* the call and the caller is free to unlock it.
- * The inode flush lock will be unlocked when the inode reaches the disk.
+ * The inode flush will be completed when the inode reaches the disk.
* The flags indicate how the inode's buffer should be written out.
*/
int
@@ -3070,7 +3069,7 @@ xfs_iflush(
XFS_STATS_INC(xs_iflush_count);

ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
- ASSERT(issemalocked(&(ip->i_flock)));
+ ASSERT(completion_flush_inprogress(&ip->i_flush));
ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
ip->i_d.di_nextents > ip->i_df.if_ext_max);

@@ -3233,7 +3232,7 @@ xfs_iflush_int(
#endif

ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
- ASSERT(issemalocked(&(ip->i_flock)));
+ ASSERT(completion_flush_inprogress(&ip->i_flush));
ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
ip->i_d.di_nextents > ip->i_df.if_ext_max);

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 17a04b6..caaec3c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -223,7 +223,7 @@ typedef struct xfs_inode {
struct xfs_inode_log_item *i_itemp; /* logging information */
mrlock_t i_lock; /* inode lock */
mrlock_t i_iolock; /* inode IO lock */
- sema_t i_flock; /* inode flush lock */
+ struct completion i_flush; /* inode flush completion q */
atomic_t i_pincount; /* inode pin count */
wait_queue_head_t i_ipin_wait; /* inode pinning wait queue */
spinlock_t i_flags_lock; /* inode i_flags lock */
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0eee08a..7c54046 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -779,11 +779,10 @@ xfs_inode_item_pushbuf(
ASSERT(iip->ili_push_owner == current_pid());

/*
- * If flushlock isn't locked anymore, chances are that the
- * inode flush completed and the inode was taken off the AIL.
- * So, just get out.
+ * If a flush is not in progress anymore, chances are that the
+ * inode was taken off the AIL. So, just get out.
*/
- if (!issemalocked(&(ip->i_flock)) ||
+ if (!completion_flush_inprogress(&ip->i_flush) ||
((iip->ili_item.li_flags & XFS_LI_IN_AIL) == 0)) {
iip->ili_pushbuf_flag = 0;
xfs_iunlock(ip, XFS_ILOCK_SHARED);
@@ -805,7 +804,7 @@ xfs_inode_item_pushbuf(
* If not, we can flush it async.
*/
dopush = ((iip->ili_item.li_flags & XFS_LI_IN_AIL) &&
- issemalocked(&(ip->i_flock)));
+ completion_flush_inprogress(&ip->i_flush));
iip->ili_pushbuf_flag = 0;
xfs_iunlock(ip, XFS_ILOCK_SHARED);
xfs_buftrace("INODE ITEM PUSH", bp);
@@ -858,7 +857,7 @@ xfs_inode_item_push(
ip = iip->ili_inode;

ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
- ASSERT(issemalocked(&(ip->i_flock)));
+ ASSERT(completion_flush_inprogress(&ip->i_flush));
/*
* Since we were able to lock the inode's flush lock and
* we found it on the AIL, the inode must be dirty. This
--
1.5.5.4

2008-06-26 04:42:37

by Dave Chinner

Subject: [PATCH 3/6] Replace dquot flush semaphore with a completion

Use the new completion flush code to implement the dquot
flush lock. Removes one of the final users of semaphores
in the XFS code base.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/quota/xfs_dquot.c | 18 +++++-------------
fs/xfs/quota/xfs_dquot.h | 23 +++++++++++++----------
fs/xfs/quota/xfs_dquot_item.c | 6 +++---
3 files changed, 21 insertions(+), 26 deletions(-)

diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index fc9f3fb..c3e3080 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -101,7 +101,7 @@ xfs_qm_dqinit(
if (brandnewdquot) {
dqp->dq_flnext = dqp->dq_flprev = dqp;
mutex_init(&dqp->q_qlock);
- initnsema(&dqp->q_flock, 1, "fdq");
+ init_completion_flush(&dqp->q_flush);
sv_init(&dqp->q_pinwait, SV_DEFAULT, "pdq");

#ifdef XFS_DQUOT_TRACE
@@ -150,7 +150,6 @@ xfs_qm_dqdestroy(
ASSERT(! XFS_DQ_IS_ON_FREELIST(dqp));

mutex_destroy(&dqp->q_qlock);
- freesema(&dqp->q_flock);
sv_destroy(&dqp->q_pinwait);

#ifdef XFS_DQUOT_TRACE
@@ -1211,7 +1210,7 @@ xfs_qm_dqflush(
int error;

ASSERT(XFS_DQ_IS_LOCKED(dqp));
- ASSERT(XFS_DQ_IS_FLUSH_LOCKED(dqp));
+ ASSERT(completion_flush_inprogress(&dqp->q_flush));
xfs_dqtrace_entry(dqp, "DQFLUSH");

/*
@@ -1353,14 +1352,7 @@ int
xfs_qm_dqflock_nowait(
xfs_dquot_t *dqp)
{
- int locked;
-
- locked = cpsema(&((dqp)->q_flock));
-
- /* XXX ifdef these out */
- if (locked)
- (dqp)->dq_flags |= XFS_DQ_FLOCKED;
- return (locked);
+ return completion_flush_start_nowait(&dqp->q_flush);
}


@@ -1368,14 +1360,14 @@ int
xfs_qm_dqlock_nowait(
xfs_dquot_t *dqp)
{
- return (mutex_trylock(&((dqp)->q_qlock)));
+ return mutex_trylock(&dqp->q_qlock);
}

void
xfs_dqlock(
xfs_dquot_t *dqp)
{
- mutex_lock(&(dqp->q_qlock));
+ mutex_lock(&dqp->q_qlock);
}

void
diff --git a/fs/xfs/quota/xfs_dquot.h b/fs/xfs/quota/xfs_dquot.h
index f7393bb..c127a45 100644
--- a/fs/xfs/quota/xfs_dquot.h
+++ b/fs/xfs/quota/xfs_dquot.h
@@ -82,7 +82,7 @@ typedef struct xfs_dquot {
xfs_qcnt_t q_res_icount; /* total inos allocd+reserved */
xfs_qcnt_t q_res_rtbcount;/* total realtime blks used+reserved */
mutex_t q_qlock; /* quota lock */
- sema_t q_flock; /* flush lock */
+ struct completion q_flush; /* flush completion queue */
uint q_pincount; /* pin count for this dquot */
sv_t q_pinwait; /* sync var for pinning */
#ifdef XFS_DQUOT_TRACE
@@ -113,17 +113,20 @@ XFS_DQ_IS_LOCKED(xfs_dquot_t *dqp)


/*
- * The following three routines simply manage the q_flock
- * semaphore embedded in the dquot. This semaphore synchronizes
- * processes attempting to flush the in-core dquot back to disk.
+ * Manage the q_flush completion queue embedded in the dquot. This completion
+ * queue synchronizes processes attempting to flush the in-core dquot back to
+ * disk.
*/
-#define xfs_dqflock(dqp) { psema(&((dqp)->q_flock), PINOD | PRECALC);\
- (dqp)->dq_flags |= XFS_DQ_FLOCKED; }
-#define xfs_dqfunlock(dqp) { ASSERT(issemalocked(&((dqp)->q_flock))); \
- vsema(&((dqp)->q_flock)); \
- (dqp)->dq_flags &= ~(XFS_DQ_FLOCKED); }
+static inline void xfs_dqflock(xfs_dquot_t *dqp)
+{
+ completion_flush_start(&dqp->q_flush);
+}
+
+static inline void xfs_dqfunlock(xfs_dquot_t *dqp)
+{
+ complete(&dqp->q_flush);
+}

-#define XFS_DQ_IS_FLUSH_LOCKED(dqp) (issemalocked(&((dqp)->q_flock)))
#define XFS_DQ_IS_ON_FREELIST(dqp) ((dqp)->dq_flnext != (dqp))
#define XFS_DQ_IS_DIRTY(dqp) ((dqp)->dq_flags & XFS_DQ_DIRTY)
#define XFS_QM_ISUDQ(dqp) ((dqp)->dq_flags & XFS_DQ_USER)
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index 08d2fc8..a91efa8 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -151,7 +151,7 @@ xfs_qm_dquot_logitem_push(
dqp = logitem->qli_dquot;

ASSERT(XFS_DQ_IS_LOCKED(dqp));
- ASSERT(XFS_DQ_IS_FLUSH_LOCKED(dqp));
+ ASSERT(completion_flush_inprogress(&dqp->q_flush));

/*
* Since we were able to lock the dquot's flush lock and
@@ -245,7 +245,7 @@ xfs_qm_dquot_logitem_pushbuf(
* inode flush completed and the inode was taken off the AIL.
* So, just get out.
*/
- if (!issemalocked(&(dqp->q_flock)) ||
+ if (!completion_flush_inprogress(&dqp->q_flush) ||
((qip->qli_item.li_flags & XFS_LI_IN_AIL) == 0)) {
qip->qli_pushbuf_flag = 0;
xfs_dqunlock(dqp);
@@ -258,7 +258,7 @@ xfs_qm_dquot_logitem_pushbuf(
if (bp != NULL) {
if (XFS_BUF_ISDELAYWRITE(bp)) {
dopush = ((qip->qli_item.li_flags & XFS_LI_IN_AIL) &&
- issemalocked(&(dqp->q_flock)));
+ completion_flush_inprogress(&dqp->q_flush));
qip->qli_pushbuf_flag = 0;
xfs_dqunlock(dqp);

--
1.5.5.4

2008-06-26 04:43:15

by Dave Chinner

Subject: [PATCH 6/6] Clean up stale references to semaphores

A lot of code has been converted away from semaphores,
but there are still comments that reference semaphore
behaviour. The log code is the worst offender. Update
the comments to reflect what the code really does now.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_vnode.c | 6 ++--
fs/xfs/xfs_log.c | 67 ++++++++++++++++++++----------------------
fs/xfs/xfs_log_priv.h | 12 ++++----
3 files changed, 41 insertions(+), 44 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_vnode.c b/fs/xfs/linux-2.6/xfs_vnode.c
index bc7afe0..1881c79 100644
--- a/fs/xfs/linux-2.6/xfs_vnode.c
+++ b/fs/xfs/linux-2.6/xfs_vnode.c
@@ -33,7 +33,7 @@


/*
- * Dedicated vnode inactive/reclaim sync semaphores.
+ * Dedicated vnode inactive/reclaim sync wait queues.
* Prime number of hash buckets since address is used as the key.
*/
#define NVSYNC 37
@@ -84,8 +84,8 @@ vn_ioerror(

/*
* Revalidate the Linux inode from the XFS inode.
- * Note: i_size _not_ updated; we must hold the inode
- * semaphore when doing that - callers responsibility.
+ * Note: i_size _not_ updated; we must hold the i_mutex when doing
+ * that - callers responsibility.
*/
int
vn_revalidate(
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 2497de8..760d543 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -357,11 +357,11 @@ xfs_log_done(xfs_mount_t *mp,
* Asynchronous forces are implemented by setting the WANT_SYNC
* bit in the appropriate in-core log and then returning.
*
- * Synchronous forces are implemented with a semaphore. All callers
- * to force a given lsn to disk will wait on a semaphore attached to the
+ * Synchronous forces are implemented with a signal variable. All callers
+ * to force a given lsn to disk will wait on the sv attached to the
* specific in-core log. When given in-core log finally completes its
* write to disk, that thread will wake up all threads waiting on the
- * semaphore.
+ * sv.
*/
int
_xfs_log_force(
@@ -707,7 +707,7 @@ xfs_log_unmount_write(xfs_mount_t *mp)
if (!(iclog->ic_state == XLOG_STATE_ACTIVE ||
iclog->ic_state == XLOG_STATE_DIRTY)) {
if (!XLOG_FORCED_SHUTDOWN(log)) {
- sv_wait(&iclog->ic_forcesema, PMEM,
+ sv_wait(&iclog->ic_force_wait, PMEM,
&log->l_icloglock, s);
} else {
spin_unlock(&log->l_icloglock);
@@ -748,7 +748,7 @@ xfs_log_unmount_write(xfs_mount_t *mp)
|| iclog->ic_state == XLOG_STATE_DIRTY
|| iclog->ic_state == XLOG_STATE_IOERROR) ) {

- sv_wait(&iclog->ic_forcesema, PMEM,
+ sv_wait(&iclog->ic_force_wait, PMEM,
&log->l_icloglock, s);
} else {
spin_unlock(&log->l_icloglock);
@@ -838,7 +838,7 @@ xfs_log_move_tail(xfs_mount_t *mp,
break;
tail_lsn = 0;
free_bytes -= tic->t_unit_res;
- sv_signal(&tic->t_sema);
+ sv_signal(&tic->t_wait);
tic = tic->t_next;
} while (tic != log->l_write_headq);
}
@@ -859,7 +859,7 @@ xfs_log_move_tail(xfs_mount_t *mp,
break;
tail_lsn = 0;
free_bytes -= need_bytes;
- sv_signal(&tic->t_sema);
+ sv_signal(&tic->t_wait);
tic = tic->t_next;
} while (tic != log->l_reserve_headq);
}
@@ -1285,8 +1285,8 @@ xlog_alloc_log(xfs_mount_t *mp,

ASSERT(XFS_BUF_ISBUSY(iclog->ic_bp));
ASSERT(XFS_BUF_VALUSEMA(iclog->ic_bp) <= 0);
- sv_init(&iclog->ic_forcesema, SV_DEFAULT, "iclog-force");
- sv_init(&iclog->ic_writesema, SV_DEFAULT, "iclog-write");
+ sv_init(&iclog->ic_force_wait, SV_DEFAULT, "iclog-force");
+ sv_init(&iclog->ic_write_wait, SV_DEFAULT, "iclog-write");

iclogp = &iclog->ic_next;
}
@@ -1565,8 +1565,8 @@ xlog_dealloc_log(xlog_t *log)

iclog = log->l_iclog;
for (i=0; i<log->l_iclog_bufs; i++) {
- sv_destroy(&iclog->ic_forcesema);
- sv_destroy(&iclog->ic_writesema);
+ sv_destroy(&iclog->ic_force_wait);
+ sv_destroy(&iclog->ic_write_wait);
xfs_buf_free(iclog->ic_bp);
#ifdef XFS_LOG_TRACE
if (iclog->ic_trace != NULL) {
@@ -1976,7 +1976,7 @@ xlog_write(xfs_mount_t * mp,
/* Clean iclogs starting from the head. This ordering must be
* maintained, so an iclog doesn't become ACTIVE beyond one that
* is SYNCING. This is also required to maintain the notion that we use
- * a counting semaphore to hold off would be writers to the log when every
+ * an ordered wait queue to hold off would-be writers to the log when every
* iclog is trying to sync to disk.
*
* State Change: DIRTY -> ACTIVE
@@ -2240,7 +2240,7 @@ xlog_state_do_callback(
xlog_state_clean_log(log);

/* wake up threads waiting in xfs_log_force() */
- sv_broadcast(&iclog->ic_forcesema);
+ sv_broadcast(&iclog->ic_force_wait);

iclog = iclog->ic_next;
} while (first_iclog != iclog);
@@ -2302,8 +2302,7 @@ xlog_state_do_callback(
* the second completion goes through.
*
* Callbacks could take time, so they are done outside the scope of the
- * global state machine log lock. Assume that the calls to cvsema won't
- * take a long time. At least we know it won't sleep.
+ * global state machine log lock.
*/
STATIC void
xlog_state_done_syncing(
@@ -2339,7 +2338,7 @@ xlog_state_done_syncing(
* iclog buffer, we wake them all, one will get to do the
* I/O, the others get to wait for the result.
*/
- sv_broadcast(&iclog->ic_writesema);
+ sv_broadcast(&iclog->ic_write_wait);
spin_unlock(&log->l_icloglock);
xlog_state_do_callback(log, aborted, iclog); /* also cleans log */
} /* xlog_state_done_syncing */
@@ -2347,11 +2346,9 @@ xlog_state_done_syncing(

/*
* If the head of the in-core log ring is not (ACTIVE or DIRTY), then we must
- * sleep. The flush semaphore is set to the number of in-core buffers and
- * decremented around disk syncing. Therefore, if all buffers are syncing,
- * this semaphore will cause new writes to sleep until a sync completes.
- * Otherwise, this code just does p() followed by v(). This approximates
- * a sleep/wakeup except we can't race.
+ * sleep. We wait on the flush queue on the head iclog as that should be
+ * the first iclog to complete flushing. Hence if all iclogs are syncing,
+ * we will wait here and all new writes will sleep until a sync completes.
*
* The in-core logs are used in a circular fashion. They are not used
* out-of-order even when an iclog past the head is free.
@@ -2501,7 +2498,7 @@ xlog_grant_log_space(xlog_t *log,
goto error_return;

XFS_STATS_INC(xs_sleep_logspace);
- sv_wait(&tic->t_sema, PINOD|PLTWAIT, &log->l_grant_lock, s);
+ sv_wait(&tic->t_wait, PINOD|PLTWAIT, &log->l_grant_lock, s);
/*
* If we got an error, and the filesystem is shutting down,
* we'll catch it down below. So just continue...
@@ -2527,7 +2524,7 @@ redo:
xlog_trace_loggrant(log, tic,
"xlog_grant_log_space: sleep 2");
XFS_STATS_INC(xs_sleep_logspace);
- sv_wait(&tic->t_sema, PINOD|PLTWAIT, &log->l_grant_lock, s);
+ sv_wait(&tic->t_wait, PINOD|PLTWAIT, &log->l_grant_lock, s);

if (XLOG_FORCED_SHUTDOWN(log)) {
spin_lock(&log->l_grant_lock);
@@ -2626,7 +2623,7 @@ xlog_regrant_write_log_space(xlog_t *log,
if (free_bytes < ntic->t_unit_res)
break;
free_bytes -= ntic->t_unit_res;
- sv_signal(&ntic->t_sema);
+ sv_signal(&ntic->t_wait);
ntic = ntic->t_next;
} while (ntic != log->l_write_headq);

@@ -2637,7 +2634,7 @@ xlog_regrant_write_log_space(xlog_t *log,
xlog_trace_loggrant(log, tic,
"xlog_regrant_write_log_space: sleep 1");
XFS_STATS_INC(xs_sleep_logspace);
- sv_wait(&tic->t_sema, PINOD|PLTWAIT,
+ sv_wait(&tic->t_wait, PINOD|PLTWAIT,
&log->l_grant_lock, s);

/* If we're shutting down, this tic is already
@@ -2666,7 +2663,7 @@ redo:
if ((tic->t_flags & XLOG_TIC_IN_Q) == 0)
xlog_ins_ticketq(&log->l_write_headq, tic);
XFS_STATS_INC(xs_sleep_logspace);
- sv_wait(&tic->t_sema, PINOD|PLTWAIT, &log->l_grant_lock, s);
+ sv_wait(&tic->t_wait, PINOD|PLTWAIT, &log->l_grant_lock, s);

/* If we're shutting down, this tic is already off the queue */
if (XLOG_FORCED_SHUTDOWN(log)) {
@@ -2909,7 +2906,7 @@ xlog_state_switch_iclogs(xlog_t *log,
* 2. the current iclog is drity, and the previous iclog is in the
* active or dirty state.
*
- * We may sleep (call psema) if:
+ * We may sleep if:
*
* 1. the current iclog is not in the active nor dirty state.
* 2. the current iclog dirty, and the previous iclog is not in the
@@ -3006,7 +3003,7 @@ maybe_sleep:
return XFS_ERROR(EIO);
}
XFS_STATS_INC(xs_log_force_sleep);
- sv_wait(&iclog->ic_forcesema, PINOD, &log->l_icloglock, s);
+ sv_wait(&iclog->ic_force_wait, PINOD, &log->l_icloglock, s);
/*
* No need to grab the log lock here since we're
* only deciding whether or not to return EIO
@@ -3089,7 +3086,7 @@ try_again:
XLOG_STATE_SYNCING))) {
ASSERT(!(iclog->ic_state & XLOG_STATE_IOERROR));
XFS_STATS_INC(xs_log_force_sleep);
- sv_wait(&iclog->ic_prev->ic_writesema, PSWP,
+ sv_wait(&iclog->ic_prev->ic_write_wait, PSWP,
&log->l_icloglock, s);
*log_flushed = 1;
already_slept = 1;
@@ -3109,7 +3106,7 @@ try_again:
!(iclog->ic_state & (XLOG_STATE_ACTIVE | XLOG_STATE_DIRTY))) {

/*
- * Don't wait on the forcesema if we know that we've
+ * Don't wait on completion if we know that we've
* gotten a log write error.
*/
if (iclog->ic_state & XLOG_STATE_IOERROR) {
@@ -3117,7 +3114,7 @@ try_again:
return XFS_ERROR(EIO);
}
XFS_STATS_INC(xs_log_force_sleep);
- sv_wait(&iclog->ic_forcesema, PSWP, &log->l_icloglock, s);
+ sv_wait(&iclog->ic_force_wait, PSWP, &log->l_icloglock, s);
/*
* No need to grab the log lock here since we're
* only deciding whether or not to return EIO
@@ -3173,7 +3170,7 @@ STATIC void
xlog_ticket_put(xlog_t *log,
xlog_ticket_t *ticket)
{
- sv_destroy(&ticket->t_sema);
+ sv_destroy(&ticket->t_wait);
kmem_zone_free(xfs_log_ticket_zone, ticket);
} /* xlog_ticket_put */

@@ -3263,7 +3260,7 @@ xlog_ticket_get(xlog_t *log,
tic->t_trans_type = 0;
if (xflags & XFS_LOG_PERM_RESERV)
tic->t_flags |= XLOG_TIC_PERM_RESERV;
- sv_init(&(tic->t_sema), SV_DEFAULT, "logtick");
+ sv_init(&(tic->t_wait), SV_DEFAULT, "logtick");

xlog_tic_reset_res(tic);

@@ -3550,14 +3547,14 @@ xfs_log_force_umount(
*/
if ((tic = log->l_reserve_headq)) {
do {
- sv_signal(&tic->t_sema);
+ sv_signal(&tic->t_wait);
tic = tic->t_next;
} while (tic != log->l_reserve_headq);
}

if ((tic = log->l_write_headq)) {
do {
- sv_signal(&tic->t_sema);
+ sv_signal(&tic->t_wait);
tic = tic->t_next;
} while (tic != log->l_write_headq);
}
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 6245913..7dcf11e 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -241,7 +241,7 @@ typedef struct xlog_res {
} xlog_res_t;

typedef struct xlog_ticket {
- sv_t t_sema; /* sleep on this semaphore : 20 */
+ sv_t t_wait; /* ticket wait queue : 20 */
struct xlog_ticket *t_next; /* :4|8 */
struct xlog_ticket *t_prev; /* :4|8 */
xlog_tid_t t_tid; /* transaction identifier : 4 */
@@ -314,7 +314,7 @@ typedef struct xlog_rec_ext_header {
* xlog_rec_header_t into the reserved space.
* - ic_data follows, so a write to disk can start at the beginning of
* the iclog.
- * - ic_forcesema is used to implement synchronous forcing of the iclog to disk.
+ * - ic_force_wait is used to implement synchronous forcing of the iclog to disk.
* - ic_next is the pointer to the next iclog in the ring.
* - ic_bp is a pointer to the buffer used to write this incore log to disk.
* - ic_log is a pointer back to the global log structure.
@@ -339,8 +339,8 @@ typedef struct xlog_rec_ext_header {
* and move everything else out to subsequent cachelines.
*/
typedef struct xlog_iclog_fields {
- sv_t ic_forcesema;
- sv_t ic_writesema;
+ sv_t ic_force_wait;
+ sv_t ic_write_wait;
struct xlog_in_core *ic_next;
struct xlog_in_core *ic_prev;
struct xfs_buf *ic_bp;
@@ -377,8 +377,8 @@ typedef struct xlog_in_core {
/*
* Defines to save our code from this glop.
*/
-#define ic_forcesema hic_fields.ic_forcesema
-#define ic_writesema hic_fields.ic_writesema
+#define ic_force_wait hic_fields.ic_force_wait
+#define ic_write_wait hic_fields.ic_write_wait
#define ic_next hic_fields.ic_next
#define ic_prev hic_fields.ic_prev
#define ic_bp hic_fields.ic_bp
--
1.5.5.4

2008-06-26 04:42:51

by Dave Chinner

Subject: [PATCH 4/6] Replace the XFS buf iodone semaphore with a completion

The xfs_buf_t b_iodonesema is really just a semaphore
that wants to be a completion. Change it to a completion
and remove the last user of the sema_t from XFS.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_buf.c | 6 +++---
fs/xfs/linux-2.6/xfs_buf.h | 4 ++--
fs/xfs/xfs_buf_item.c | 2 +-
fs/xfs/xfs_rw.c | 2 +-
4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 9cc8f02..9438c90 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -253,7 +253,7 @@ _xfs_buf_initialize(

memset(bp, 0, sizeof(xfs_buf_t));
atomic_set(&bp->b_hold, 1);
- init_MUTEX_LOCKED(&bp->b_iodonesema);
+ init_completion(&bp->b_iowait);
INIT_LIST_HEAD(&bp->b_list);
INIT_LIST_HEAD(&bp->b_hash_list);
init_MUTEX_LOCKED(&bp->b_sema); /* held, no waiters */
@@ -1037,7 +1037,7 @@ xfs_buf_ioend(
xfs_buf_iodone_work(&bp->b_iodone_work);
}
} else {
- up(&bp->b_iodonesema);
+ complete(&bp->b_iowait);
}
}

@@ -1275,7 +1275,7 @@ xfs_buf_iowait(
XB_TRACE(bp, "iowait", 0);
if (atomic_read(&bp->b_io_remaining))
blk_run_address_space(bp->b_target->bt_mapping);
- down(&bp->b_iodonesema);
+ wait_for_completion(&bp->b_iowait);
XB_TRACE(bp, "iowaited", (long)bp->b_error);
return bp->b_error;
}
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index 29d1d4a..fe01099 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -157,7 +157,7 @@ typedef struct xfs_buf {
xfs_buf_iodone_t b_iodone; /* I/O completion function */
xfs_buf_relse_t b_relse; /* releasing function */
xfs_buf_bdstrat_t b_strat; /* pre-write function */
- struct semaphore b_iodonesema; /* Semaphore for I/O waiters */
+ struct completion b_iowait; /* queue for I/O waiters */
void *b_fspriv;
void *b_fspriv2;
void *b_fspriv3;
@@ -352,7 +352,7 @@ extern void xfs_buf_trace(xfs_buf_t *, char *, void *, void *);
#define XFS_BUF_CPSEMA(bp) (xfs_buf_cond_lock(bp) == 0)
#define XFS_BUF_VSEMA(bp) xfs_buf_unlock(bp)
#define XFS_BUF_PSEMA(bp,x) xfs_buf_lock(bp)
-#define XFS_BUF_V_IODONESEMA(bp) up(&bp->b_iodonesema);
+#define XFS_BUF_FINISH_IOWAIT(bp) complete(&bp->b_iowait);

#define XFS_BUF_SET_TARGET(bp, target) ((bp)->b_target = (target))
#define XFS_BUF_TARGET(bp) ((bp)->b_target)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index d86ca2c..215c36d 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1056,7 +1056,7 @@ xfs_buf_iodone_callbacks(
anyway. */
XFS_BUF_SET_BRELSE_FUNC(bp,xfs_buf_error_relse);
XFS_BUF_DONE(bp);
- XFS_BUF_V_IODONESEMA(bp);
+ XFS_BUF_FINISH_IOWAIT(bp);
}
return;
}
diff --git a/fs/xfs/xfs_rw.c b/fs/xfs/xfs_rw.c
index b0f31c0..3a82576 100644
--- a/fs/xfs/xfs_rw.c
+++ b/fs/xfs/xfs_rw.c
@@ -314,7 +314,7 @@ xfs_bioerror_relse(
* ASYNC buffers.
*/
XFS_BUF_ERROR(bp, EIO);
- XFS_BUF_V_IODONESEMA(bp);
+ XFS_BUF_FINISH_IOWAIT(bp);
} else {
xfs_buf_relse(bp);
}
--
1.5.5.4

2008-06-26 07:41:33

by Christoph Hellwig

Subject: Re: [PATCH 4/6] Replace the XFS buf iodone semaphore with a completion

On Thu, Jun 26, 2008 at 02:41:15PM +1000, Dave Chinner wrote:
> The xfs_buf_t b_iodonesema is really just a semaphore
> that wants to be a completion. Change it to a completion
> and remove the last user of the sema_t from XFS.

Looks good and I've been running with an equivalent patch in my tree
for a while.

2008-06-26 07:46:50

by Christoph Hellwig

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 02:41:12PM +1000, Dave Chinner wrote:
> XFS object flushing doesn't quite match existing completion semantics. It
> mixed exclusive access with completion. That is, we need to mark an object as
> being flushed before flushing it to disk, and then block any other attempt to
> flush it until the completion occurs.
>
> To do this we introduce:
>
> void init_completion_flush(struct completion *x)
> which initialises x->done = 1
>
> void completion_flush_start(struct completion *x)
> which blocks if done == 0, otherwise decrements done to zero and
> allows the caller to continue.
>
> bool completion_flush_start_nowait(struct completion *x)
> returns a failure status if done == 0, otherwise decrements done
> to zero and returns a "flush started" status. This is provided
> to allow flushing to begin safely while holding object locks in
> inverted order.
>
> This replaces the use of semaphores for providing this exclusion
> and completion mechanism.

Given that the only API call shared with normal completions is
complete(), I'd rather make this a primitive of its own, even if
internally implemented as completions.

Also, please add kerneldoc comments for all new APIs exported to modules.

2008-06-26 07:47:29

by Christoph Hellwig

Subject: Re: [PATCH 6/6] Clean up stale references to semaphores

On Thu, Jun 26, 2008 at 02:41:17PM +1000, Dave Chinner wrote:
> A lot of code has been converted away from semaphores,
> but there are still comments that reference semaphore
> behaviour. The log code is the worst offender. Update
> the comments to reflect what the code really does now.

Looks good.

2008-06-26 11:21:48

by Dave Chinner

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 03:46:36AM -0400, Christoph Hellwig wrote:
> On Thu, Jun 26, 2008 at 02:41:12PM +1000, Dave Chinner wrote:
> > XFS object flushing doesn't quite match existing completion semantics. It
> > mixed exclusive access with completion. That is, we need to mark an object as
> > being flushed before flushing it to disk, and then block any other attempt to
> > flush it until the completion occurs.
> >
> > To do this we introduce:
> >
> > void init_completion_flush(struct completion *x)
> > which initialises x->done = 1
> >
> > void completion_flush_start(struct completion *x)
> > which blocks if done == 0, otherwise decrements done to zero and
> > allows the caller to continue.
> >
> > bool completion_flush_start_nowait(struct completion *x)
> > returns a failure status if done == 0, otherwise decrements done
> > to zero and returns a "flush started" status. This is provided
> > to allow flushing to begin safely while holding object locks in
> > inverted order.
> >
> > This replaces the use of semaphores for providing this exclusion
> > and completion mechanism.
>
> Given that the only API call shared with normal completions is
> complete() I'd rather make this a primitive of it's own, even if
> internally implemented as completions.

Ok, so that involves exactly what? A new header file, a new API name
(ideas anyone?) and kerneldoc comments?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-26 11:26:38

by Matthew Wilcox

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 02:41:12PM +1000, Dave Chinner wrote:
> XFS object flushing doesn't quite match existing completion semantics. It
> mixed exclusive access with completion. That is, we need to mark an object as
> being flushed before flushing it to disk, and then block any other attempt to
> flush it until the completion occurs.

This sounds like mutex semantics. Why are the existing mutexes not
appropriate for your needs?

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-26 11:32:22

by Dave Chinner

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 05:26:12AM -0600, Matthew Wilcox wrote:
> On Thu, Jun 26, 2008 at 02:41:12PM +1000, Dave Chinner wrote:
> > XFS object flushing doesn't quite match existing completion semantics. It
> > mixed exclusive access with completion. That is, we need to mark an object as
> > being flushed before flushing it to disk, and then block any other attempt to
> > flush it until the completion occurs.
>
> This sounds like mutex semantics. Why are the existing mutexes not
> appropriate for your needs?

Different threads doing wait and complete.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-26 11:43:14

by Matthew Wilcox

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 09:32:09PM +1000, Dave Chinner wrote:
> On Thu, Jun 26, 2008 at 05:26:12AM -0600, Matthew Wilcox wrote:
> > On Thu, Jun 26, 2008 at 02:41:12PM +1000, Dave Chinner wrote:
> > > XFS object flushing doesn't quite match existing completion semantics. It
> > > mixed exclusive access with completion. That is, we need to mark an object as
> > > being flushed before flushing it to disk, and then block any other attempt to
> > > flush it until the completion occurs.
> >
> > This sounds like mutex semantics. Why are the existing mutexes not
> > appropriate for your needs?
>
> Different threads doing wait and complete.

Then let's leave it as a semaphore. You can get rid of the sema_t if
you like, but I don't think that turning completions into semaphores is
a good idea (because it's confusing).

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-26 12:21:24

by Dave Chinner

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 05:42:42AM -0600, Matthew Wilcox wrote:
> On Thu, Jun 26, 2008 at 09:32:09PM +1000, Dave Chinner wrote:
> > On Thu, Jun 26, 2008 at 05:26:12AM -0600, Matthew Wilcox wrote:
> > > On Thu, Jun 26, 2008 at 02:41:12PM +1000, Dave Chinner wrote:
> > > > XFS object flushing doesn't quite match existing completion semantics. It
> > > > mixed exclusive access with completion. That is, we need to mark an object as
> > > > being flushed before flushing it to disk, and then block any other attempt to
> > > > flush it until the completion occurs.
> > >
> > > This sounds like mutex semantics. Why are the existing mutexes not
> > > appropriate for your needs?
> >
> > Different threads doing wait and complete.
>
> Then let's leave it as a semaphore. You can get rid of the sema_t if
> you like, but I don't think that turning completions into semaphores is
> a good idea (because it's confusing).

So remind me what the point of the semaphore removal tree is again?

As Christoph suggested, I can put this under another API that
is implemented using completions. If I have to do that in XFS,
so be it....

The main reason for this is that we've just uncovered the fact that the
way XFS uses semaphores is completely unsafe [*] on x86/x86_64 for
kernels prior to the new generic semaphores.

[*] 2.6.20 panics in up() because of this race when I/O completion
(the up call) races with a simultaneous down() (iowaiter):

    T1                              T2
    up()                            down()
                                    kmem_free()

When the down() call completes, the up() call can still be
referencing the semaphore, and hence if we free the structure after
the down call then the up() will reference freed memory. This is
probably the cause of many unexplained log replay or unmount panics
that we've been hitting for years with buffers that been freed while
apparently still in use....

Hence I'd prefer just to move completely away from semaphores for
this flush interface. I'd like to start with getting the upstream
code fixed in a sane manner so all the backports to older kernels
start from the same series of commits.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-26 12:55:17

by Matthew Wilcox

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 10:21:12PM +1000, Dave Chinner wrote:
> On Thu, Jun 26, 2008 at 05:42:42AM -0600, Matthew Wilcox wrote:
> > Then let's leave it as a semaphore. You can get rid of the sema_t if
> > you like, but I don't think that turning completions into semaphores is
> > a good idea (because it's confusing).
>
> So remind me what the point of the semaphore removal tree is again?

To remove the semaphores which don't need to be semaphores any more.

> As Christoph suggested, I can put this under another API that
> is implemented using completions. If I have to do that in XFS,
> so be it....

You could, yes. But you could just use completions directly ...

> The main reason for this that we've just uncovered the fact that the
> way XFS uses semaphores is completely unsafe [*] on x86/x86_64 for
> kernels prior to the new generic semaphores.
>
> [*] 2.6.20 panics in up() because of this race when I/O completion
> (the up call) races with a simultaneous down() (iowaiter):
>
>     T1                              T2
>     up()                            down()
>                                     kmem_free()
>
> When the down() call completes, the up() call can still be
> referencing the semaphore, and hence if we free the structure after
> the down call then the up() will reference freed memory. This is
> probably the cause of many unexplained log replay or unmount panics
> that we've been hitting for years with buffers that been freed while
> apparently still in use....

This is exactly the kind of thing completions were supposed to be used
for. T1 should be calling complete() and T2 should be calling
wait_for_completion().

> Hence I'd prefer just to move completely away from semaphores for
> this flush interface. I'd like to start with getting the upstream
> code fixed in a sane manner so all the backports to older kernels
> start from the same series of commits.

That's sane.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-26 13:07:05

by Christoph Hellwig

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 06:40:09AM -0600, Matthew Wilcox wrote:
> On Thu, Jun 26, 2008 at 10:21:12PM +1000, Dave Chinner wrote:
> > On Thu, Jun 26, 2008 at 05:42:42AM -0600, Matthew Wilcox wrote:
> > > Then let's leave it as a semaphore. You can get rid of the sema_t if
> > > you like, but I don't think that turning completions into semaphores is
> > > a good idea (because it's confusing).
> >
> > So remind me what the point of the semaphore removal tree is again?
>
> To remove the semaphores which don't need to be semaphores any more.
>
> > As Christoph suggested, I can put this under another API that
> > is implemented using completions. If I have to do that in XFS,
> > so be it....
>
> You could, yes. But you could just use completions directly ...
>
> > The main reason for this that we've just uncovered the fact that the
> > way XFS uses semaphores is completely unsafe [*] on x86/x86_64 for
> > kernels prior to the new generic semaphores.
> >
> > [*] 2.6.20 panics in up() because of this race when I/O completion
> > (the up call) races with a simultaneous down() (iowaiter):
> >
> >     T1                              T2
> >     up()                            down()
> >                                     kmem_free()
> >
> > When the down() call completes, the up() call can still be
> > referencing the semaphore, and hence if we free the structure after
> > the down call then the up() will reference freed memory. This is
> > probably the cause of many unexplained log replay or unmount panics
> > that we've been hitting for years with buffers that been freed while
> > apparently still in use....
>
> This is exactly the kind of thing completions were supposed to be used
> for. T1 should be calling complete() and T2 should be calling
> wait_for_completion().

Please read Dave's introductory mail. What XFS wants is completions
with a little bit extra, so he implemented the little bit extra. This
little bit extra is pretty well described in the mail starting this
thread.

2008-06-26 13:11:38

by Dave Chinner

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 06:40:09AM -0600, Matthew Wilcox wrote:
> On Thu, Jun 26, 2008 at 10:21:12PM +1000, Dave Chinner wrote:
> > On Thu, Jun 26, 2008 at 05:42:42AM -0600, Matthew Wilcox wrote:
> > > Then let's leave it as a semaphore. You can get rid of the sema_t if
> > > you like, but I don't think that turning completions into semaphores is
> > > a good idea (because it's confusing).
> >
> > So remind me what the point of the semaphore removal tree is again?
>
> To remove the semaphores which don't need to be semaphores any more.

Or shouldn't be semaphores in the first place?

> > As Christoph suggested, I can put this under another API that
> > is implemented using completions. If I have to do that in XFS,
> > so be it....
>
> You could, yes. But you could just use completions directly ...

Not that I can see.

> > The main reason for this that we've just uncovered the fact that the
> > way XFS uses semaphores is completely unsafe [*] on x86/x86_64 for
> > kernels prior to the new generic semaphores.
> >
> > [*] 2.6.20 panics in up() because of this race when I/O completion
> > (the up call) races with a simultaneous down() (iowaiter):
> >
> >     T1                              T2
> >     up()                            down()
> >                                     kmem_free()
> >
> > When the down() call completes, the up() call can still be
> > referencing the semaphore, and hence if we free the structure after
> > the down call then the up() will reference freed memory. This is
> > probably the cause of many unexplained log replay or unmount panics
> > that we've been hitting for years with buffers that been freed while
> > apparently still in use....
>
> This is exactly the kind of thing completions were supposed to be used
> for. T1 should be calling complete() and T2 should be calling
> wait_for_completion().

Yes, certainly. But as should be obvious by now, completions don't
quite fit the bill for XFS - they only work for *synchronisation*
after the I/O. XFS needs *exclusion* during the I/O as well as
*synchronisation* after the I/O. The completion extensions provided the
exclusion part of the deal. How else do you suggest I implement
this?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-26 13:12:42

by Christoph Hellwig

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 09:21:33PM +1000, Dave Chinner wrote:
> Ok, so that involves exactly what? A new header file, a new API name
> (ideas anyone?) and kerneldoc comments?

Yes, probably just a new header with properly documented functions.
The two non-trivial ones you added should probably not be inlines,
btw.

flush_lock_init/flush_lock/flush_trylock/flush_done/flush_is_locked?
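
(As a sketch: with that naming, the interface could be thin wrappers
around the completion helpers from patch 1. The bodies below are only an
illustration, not code posted in this thread:)

	static inline void flush_lock_init(struct completion *x)
	{
		init_completion_flush(x);
	}

	static inline void flush_lock(struct completion *x)
	{
		completion_flush_start(x);
	}

	static inline int flush_trylock(struct completion *x)
	{
		return completion_flush_start_nowait(x);
	}

	static inline void flush_done(struct completion *x)
	{
		complete(x);
	}

	static inline int flush_is_locked(struct completion *x)
	{
		return completion_flush_inprogress(x);
	}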

2008-06-26 13:18:45

by Dave Chinner

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 09:07:00AM -0400, Christoph Hellwig wrote:
> On Thu, Jun 26, 2008 at 09:21:33PM +1000, Dave Chinner wrote:
> > Ok, so that involves exactly what? A new header file, a new API name
> > (ideas anyone?) and kerneldoc comments?
>
> Yes, probably just a new header with properly documented functions.
> Thew two non-trivial ones you added should probably not be inlines,
> btw.

Yeah, they grew a little bigger than expected...

> flush_lock_init/flush_lock/flush_trylock/flush_done/flush_is_locked?

Ok, I was thinking along those lines. I'll redo the patch series
using that interface tomorrow.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-26 20:33:41

by Daniel Walker

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements


On Thu, 2008-06-26 at 14:41 +1000, Dave Chinner wrote:
> XFS object flushing doesn't quite match existing completion semantics. It
> mixed exclusive access with completion. That is, we need to mark an object as
> being flushed before flushing it to disk, and then block any other attempt to
> flush it until the completion occurs.
>
> To do this we introduce:
>
> void init_completion_flush(struct completion *x)
> which initialises x->done = 1
>
> void completion_flush_start(struct completion *x)
> which blocks if done == 0, otherwise decrements done to zero and
> allows the caller to continue.
>
> bool completion_flush_start_nowait(struct completion *x)
> returns a failure status if done == 0, otherwise decrements done
> to zero and returns a "flush started" status. This is provided
> to allow flushing to begin safely while holding object locks in
> inverted order.
>
> This replaces the use of semaphores for providing this exclusion
> and completion mechanism.

I think there is some basis to make the changes that you have here.
Specifically this email and thread,

http://lkml.org/lkml/2008/4/15/232

However, I don't like how you're implementing this as specifically a
"flush" mechanism for XFS, and the count is limited to just 1 .. There
are several other places that do this kind of counting with semaphores,
and have counts above 1..

> +
> +static inline void completion_flush_start(struct completion *x)
> +{
> + wait_for_completion(x);
> +}

Above seems completely pointless.. I would just call
wait_for_completion(), and make the rest of the interface generic.

Daniel

2008-06-27 01:52:57

by Dave Chinner

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 01:33:25PM -0700, Daniel Walker wrote:
>
> On Thu, 2008-06-26 at 14:41 +1000, Dave Chinner wrote:
> > XFS object flushing doesn't quite match existing completion semantics. It
> > mixed exclusive access with completion. That is, we need to mark an object as
> > being flushed before flushing it to disk, and then block any other attempt to
> > flush it until the completion occurs.
> >
> > To do this we introduce:
> >
> > void init_completion_flush(struct completion *x)
> > which initialises x->done = 1
> >
> > void completion_flush_start(struct completion *x)
> > which blocks if done == 0, otherwise decrements done to zero and
> > allows the caller to continue.
> >
> > bool completion_flush_start_nowait(struct completion *x)
> > returns a failure status if done == 0, otherwise decrements done
> > to zero and returns a "flush started" status. This is provided
> > to allow flushing to begin safely while holding object locks in
> > inverted order.
> >
> > This replaces the use of semaphores for providing this exclusion
> > and completion mechanism.
>
> I think there is some basis to make the changes that you have here.
> Specifically this email and thread,
>
> http://lkml.org/lkml/2008/4/15/232
>
> However, I don't like how your implementing this as specifically a
> "flush" mechanism for XFS, and the count is limited to just 1 .. There
> are several other places that do this kind of counting with semaphores,
> and have counts above 1..

Agreed - but the extension has to start somewhere. So, do I simply
add a "init_completion_count()" that passes a count value for the
completion (i.e. replaces init_completion_flush())?

> > +
> > +static inline void completion_flush_start(struct completion *x)
> > +{
> > + wait_for_completion(x);
> > +}
>
> Above seems completely pointless.. I would just call
> wait_for_completion(), and make the rest of the interface generic.

Except then wait_for_completion_nowait() makes absolutely no sense ;)
If I use wait_for_completion() for this, then perhaps the
non-blocking version becomes "try_wait_for_completion()". Would
this be acceptable?

i.e. the extra functions in the completion API would be:

void init_completion_count(struct completion *x, int count);
int try_wait_for_completion(struct completion *x);
int completion_in_progress(struct completion *x);
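
Following the patch 1 helpers, possible bodies for those would be
something like this (sketch only; not necessarily what would end up
being merged):

	static inline void init_completion_count(struct completion *x, int count)
	{
		x->done = count;
		init_waitqueue_head(&x->wait);
	}

	/* Returns 1 if a count was taken, 0 if the caller would have blocked. */
	static inline int try_wait_for_completion(struct completion *x)
	{
		int ret = 1;

		spin_lock_irq(&x->wait.lock);
		if (!x->done)
			ret = 0;
		else
			x->done--;
		spin_unlock_irq(&x->wait.lock);
		return ret;
	}

	/* Returns 1 if all counts have been taken, i.e. work is in progress. */
	static inline int completion_in_progress(struct completion *x)
	{
		int ret = 0;

		spin_lock_irq(&x->wait.lock);
		if (!x->done)
			ret = 1;
		spin_unlock_irq(&x->wait.lock);
		return ret;
	}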

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-27 02:24:35

by Matthew Wilcox

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 01:33:25PM -0700, Daniel Walker wrote:
> I think there is some basis to make the changes that you have here.
> Specifically this email and thread,
>
> http://lkml.org/lkml/2008/4/15/232

You've completely missed the point. The current semaphore code is
_more_ efficient than the current completion code. I'm very comfortable
having two APIs here, one for completion-like semantics and one for
mutex-like semantics. Confusing them like this makes no sense at all.

> However, I don't like how your implementing this as specifically a
> "flush" mechanism for XFS, and the count is limited to just 1 .. There
> are several other places that do this kind of counting with semaphores,
> and have counts above 1..

Then leave them as semaphores. Really.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-27 02:30:26

by Matthew Wilcox

Subject: Re: [PATCH 2/6] Replace inode flush semaphore with a completion

On Thu, Jun 26, 2008 at 02:41:13PM +1000, Dave Chinner wrote:
> Use the new completion flush code to implement the inode
> flush lock. Removes one of the final users of semaphores
> in the XFS code base.

Let's demonstrate converting this one to completions ...

> --- a/fs/xfs/xfs_iget.c
> +++ b/fs/xfs/xfs_iget.c
> @@ -216,7 +216,7 @@ finish_inode:
> mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
> init_waitqueue_head(&ip->i_ipin_wait);
> atomic_set(&ip->i_pincount, 0);
> - initnsema(&ip->i_flock, 1, "xfsfino");
> + init_completion_flush(&ip->i_flush);

+ init_completion(&ip->i_flush);
+ complete(&ip->i_flush);

>
> if (lock_flags)
> xfs_ilock(ip, lock_flags);
> @@ -776,25 +776,24 @@ xfs_isilocked(
> #endif
>
> /*
> - * The following three routines simply manage the i_flock
> - * semaphore embedded in the inode. This semaphore synchronizes
> - * processes attempting to flush the in-core inode back to disk.
> + * Manage the i_flush queue embedded in the inode. This completion
> + * queue synchronizes processes attempting to flush the in-core
> + * inode back to disk.
> */
> void
> xfs_iflock(xfs_inode_t *ip)
> {
> - psema(&(ip->i_flock), PINOD|PLTWAIT);
> + completion_flush_start(&ip->i_flush);

+ wait_for_completion(&ip->i_flush);

> }
>
> int
> xfs_iflock_nowait(xfs_inode_t *ip)
> {
> - return (cpsema(&(ip->i_flock)));
> + return completion_flush_start_nowait(&ip->i_flush);

This is where you need a new function ...

+ return nowait_for_completion(&ip->i_flush);

Yes, we probably need a better name for the down_trylock() equivalent.

> }
>
> void
> xfs_ifunlock(xfs_inode_t *ip)
> {
> - ASSERT(issemalocked(&(ip->i_flock)));
> - vsema(&(ip->i_flock));
> + complete(&ip->i_flush);

Yep.

> }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index bedc661..81e2040 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2630,7 +2630,6 @@ xfs_idestroy(
> xfs_idestroy_fork(ip, XFS_ATTR_FORK);
> mrfree(&ip->i_lock);
> mrfree(&ip->i_iolock);
> - freesema(&ip->i_flock);
>
> #ifdef XFS_INODE_TRACE
> ktrace_free(ip->i_trace);
> @@ -3048,10 +3047,10 @@ cluster_corrupt_out:
> /*
> * xfs_iflush() will write a modified inode's changes out to the
> * inode's on disk home. The caller must have the inode lock held
> - * in at least shared mode and the inode flush semaphore must be
> - * held as well. The inode lock will still be held upon return from
> + * in at least shared mode and the inode flush completion must be
> + * active as well. The inode lock will still be held upon return from
> * the call and the caller is free to unlock it.
> - * The inode flush lock will be unlocked when the inode reaches the disk.
> + * The inode flush will be completed when the inode reaches the disk.
> * The flags indicate how the inode's buffer should be written out.
> */
> int
> @@ -3070,7 +3069,7 @@ xfs_iflush(
> XFS_STATS_INC(xs_iflush_count);
>
> ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
> - ASSERT(issemalocked(&(ip->i_flock)));
> + ASSERT(completion_flush_inprogress(&ip->i_flush));

is_complete()?

> ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
> ip->i_d.di_nextents > ip->i_df.if_ext_max);

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-06-27 03:26:59

by Daniel Walker

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements


On Thu, 2008-06-26 at 20:24 -0600, Matthew Wilcox wrote:

>
> Then leave them as semaphores. Really.
>

Are you saying you want to keep semaphores? Cause I think that goes
against the overall agenda that I've seen, which is complete semaphore
removal .. If you have issues with the removal method then we could
change that ..

Daniel

2008-06-27 04:14:10

by Dave Chinner

Subject: Re: [PATCH 2/6] Replace inode flush semaphore with a completion

On Thu, Jun 26, 2008 at 08:30:11PM -0600, Matthew Wilcox wrote:
> On Thu, Jun 26, 2008 at 02:41:13PM +1000, Dave Chinner wrote:
> > Use the new completion flush code to implement the inode
> > flush lock. Removes one of the final users of semaphores
> > in the XFS code base.
>
> Let's demonstrate converting this one to completions ...

Great. I need all the help I can get. ;)

> > --- a/fs/xfs/xfs_iget.c
> > +++ b/fs/xfs/xfs_iget.c
> > @@ -216,7 +216,7 @@ finish_inode:
> > mrlock_init(&ip->i_iolock, MRLOCK_BARRIER, "xfsio", ip->i_ino);
> > init_waitqueue_head(&ip->i_ipin_wait);
> > atomic_set(&ip->i_pincount, 0);
> > - initnsema(&ip->i_flock, 1, "xfsfino");
> > + init_completion_flush(&ip->i_flush);
>
> + init_completion(&ip->i_flush);
> + complete(&ip->i_flush);

Ok, that would work. Need commenting, though.

> > int
> > xfs_iflock_nowait(xfs_inode_t *ip)
> > {
> > - return (cpsema(&(ip->i_flock)));
> > + return completion_flush_start_nowait(&ip->i_flush);
>
> This is where you need a new function ...
>
> + return nowait_for_completion(&ip->i_flush);
>
> Yes, we probably need a better name for the down_trylock() equivalent.

I suggested try_wait_for_completion() in a different email - it's
a difficult one to say - "a wait_for_completion() call that doesn't
wait"....

I think I'll stick with the try_.... version to match other "try"
operations.


> > int
> > @@ -3070,7 +3069,7 @@ xfs_iflush(
> > XFS_STATS_INC(xs_iflush_count);
> >
> > ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
> > - ASSERT(issemalocked(&(ip->i_flock)));
> > + ASSERT(completion_flush_inprogress(&ip->i_flush));
>
> is_complete()?

That's a little too general for my liking - I note that there
is already an existing is_complete() function in a driver.
completion_done() perhaps?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-06-27 09:15:52

by Christoph Hellwig

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements

On Thu, Jun 26, 2008 at 08:26:39PM -0700, Daniel Walker wrote:
> Are you saying you want to keep semaphores? Cause I think that goes
> against the overall agenda that I've seen,

Which overall agenda?

2008-06-27 14:37:31

by Daniel Walker

Subject: Re: [PATCH 1/6] Extend completions to provide XFS object flush requirements


On Fri, 2008-06-27 at 05:15 -0400, Christoph Hellwig wrote:
> On Thu, Jun 26, 2008 at 08:26:39PM -0700, Daniel Walker wrote:
> > Are you saying you want to keep semaphores? Cause I think that goes
> > against the overall agenda that I've seen,
>
> Which overall agenda?

Semaphore removal ..

Daniel