2010-08-04 16:39:18

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH -v2 0/3] jbd2 scalability patches

This version fixes three bugs in the 2nd patch of this series that
caused kernel BUG when the system was under race. We weren't accounting
with t_oustanding_credits correctly, and there were race conditions
caused by the fact the I had overlooked the fact that
__jbd2_log_wait_for_space() and jbd2_get_transaction() requires
j_state_lock to be write locked.

Theodore Ts'o (3):
jbd2: Use atomic variables to avoid taking t_handle_lock in
jbd2_journal_stop
jbd2: Change j_state_lock to be a rwlock_t
jbd2: Remove t_handle_lock from start_this_handle()

fs/ext4/inode.c | 4 +-
fs/ext4/super.c | 4 +-
fs/jbd2/checkpoint.c | 18 +++---
fs/jbd2/commit.c | 42 ++++++------
fs/jbd2/journal.c | 94 +++++++++++++--------------
fs/jbd2/transaction.c | 172 ++++++++++++++++++++++++++++---------------------
fs/ocfs2/journal.c | 4 +-
include/linux/jbd2.h | 12 ++--
8 files changed, 188 insertions(+), 162 deletions(-)



2010-08-04 16:39:18

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH -v2 3/3] jbd2: Remove t_handle_lock from start_this_handle()

This should remove the last exclusive lock from start_this_handle(),
so that we should now be able to start multiple transactions at the
same time on large SMP systems.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/commit.c | 3 ++-
fs/jbd2/transaction.c | 33 ++++++++++++++++++++++-----------
include/linux/jbd2.h | 2 +-
3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 67bb0a2..f52e5e8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -1004,7 +1004,8 @@ restart_loop:
* File the transaction statistics
*/
stats.ts_tid = commit_transaction->t_tid;
- stats.run.rs_handle_count = commit_transaction->t_handle_count;
+ stats.run.rs_handle_count =
+ atomic_read(&commit_transaction->t_handle_count);
trace_jbd2_run_stats(journal->j_fs_dev->bd_dev,
commit_transaction->t_tid, &stats.run);

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6630651..0752bcd 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -57,6 +57,7 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
spin_lock_init(&transaction->t_handle_lock);
atomic_set(&transaction->t_updates, 0);
atomic_set(&transaction->t_outstanding_credits, 0);
+ atomic_set(&transaction->t_handle_count, 0);
INIT_LIST_HEAD(&transaction->t_inode_list);
INIT_LIST_HEAD(&transaction->t_private_list);

@@ -180,8 +181,8 @@ repeat:
* buffers requested by this operation, we need to stall pending a log
* checkpoint to free some more log space.
*/
- spin_lock(&transaction->t_handle_lock);
- needed = atomic_read(&transaction->t_outstanding_credits) + nblocks;
+ needed = atomic_add_return(nblocks,
+ &transaction->t_outstanding_credits);

if (needed > journal->j_max_transaction_buffers) {
/*
@@ -192,7 +193,7 @@ repeat:
DEFINE_WAIT(wait);

jbd_debug(2, "Handle %p starting new commit...\n", handle);
- spin_unlock(&transaction->t_handle_lock);
+ atomic_sub(nblocks, &transaction->t_outstanding_credits);
prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
TASK_UNINTERRUPTIBLE);
__jbd2_log_start_commit(journal, transaction->t_tid);
@@ -229,7 +230,7 @@ repeat:
*/
if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) {
jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
- spin_unlock(&transaction->t_handle_lock);
+ atomic_sub(nblocks, &transaction->t_outstanding_credits);
read_unlock(&journal->j_state_lock);
write_lock(&journal->j_state_lock);
if (__jbd2_log_space_left(journal) < jbd_space_needed(journal))
@@ -239,23 +240,33 @@ repeat:
}

/* OK, account for the buffers that this operation expects to
- * use and add the handle to the running transaction. */
-
- if (time_after(transaction->t_start, ts)) {
+ * use and add the handle to the running transaction.
+ *
+ * In order for t_max_wait to be reliable, it must be
+ * protected by a lock. But doing so will mean that
+ * start_this_handle() can not be run in parallel on SMP
+ * systems, which limits our scalability. So we only enable
+ * it when debugging is enabled. We may want to use a
+ * separate flag, eventually, so we can enable this
+ * independently of debugging.
+ */
+#ifdef CONFIG_JBD2_DEBUG
+ if (jbd2_journal_enable_debug &&
+ time_after(transaction->t_start, ts)) {
ts = jbd2_time_diff(ts, transaction->t_start);
+ spin_lock(&transaction->t_handle_lock);
if (ts > transaction->t_max_wait)
transaction->t_max_wait = ts;
+ spin_unlock(&transaction->t_handle_lock);
}
-
+#endif
handle->h_transaction = transaction;
- atomic_add(nblocks, &transaction->t_outstanding_credits);
atomic_inc(&transaction->t_updates);
- transaction->t_handle_count++;
+ atomic_inc(&transaction->t_handle_count);
jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n",
handle, nblocks,
atomic_read(&transaction->t_outstanding_credits),
__jbd2_log_space_left(journal));
- spin_unlock(&transaction->t_handle_lock);
read_unlock(&journal->j_state_lock);

lock_map_acquire(&handle->h_lockdep_map);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 15d5743..01743b5 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -629,7 +629,7 @@ struct transaction_s
/*
* How many handles used this transaction? [t_handle_lock]
*/
- int t_handle_count;
+ atomic_t t_handle_count;

/*
* This transaction is being forced and some process is
--
1.7.0.4


2010-08-04 16:39:18

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH -v2 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop

By using an atomic_t for t_updates and t_outstanding credits, this
should allow us to not need to take transaction t_handle_lock in
jbd2_journal_stop().

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/jbd2/checkpoint.c | 2 +-
fs/jbd2/commit.c | 13 +++++----
fs/jbd2/transaction.c | 69 +++++++++++++++++++++++++++---------------------
include/linux/jbd2.h | 8 +++---
4 files changed, 51 insertions(+), 41 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 076d1cc..f8cdc02 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -775,7 +775,7 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
J_ASSERT(transaction->t_log_list == NULL);
J_ASSERT(transaction->t_checkpoint_list == NULL);
J_ASSERT(transaction->t_checkpoint_io_list == NULL);
- J_ASSERT(transaction->t_updates == 0);
+ J_ASSERT(atomic_read(&transaction->t_updates) == 0);
J_ASSERT(journal->j_committing_transaction != transaction);
J_ASSERT(journal->j_running_transaction != transaction);

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index af05681..fbd2c56 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -417,12 +417,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
stats.run.rs_locked);

spin_lock(&commit_transaction->t_handle_lock);
- while (commit_transaction->t_updates) {
+ while (atomic_read(&commit_transaction->t_updates)) {
DEFINE_WAIT(wait);

prepare_to_wait(&journal->j_wait_updates, &wait,
TASK_UNINTERRUPTIBLE);
- if (commit_transaction->t_updates) {
+ if (atomic_read(&commit_transaction->t_updates)) {
spin_unlock(&commit_transaction->t_handle_lock);
spin_unlock(&journal->j_state_lock);
schedule();
@@ -433,7 +433,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
}
spin_unlock(&commit_transaction->t_handle_lock);

- J_ASSERT (commit_transaction->t_outstanding_credits <=
+ J_ASSERT (atomic_read(&commit_transaction->t_outstanding_credits) <=
journal->j_max_transaction_buffers);

/*
@@ -527,11 +527,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
stats.run.rs_logging = jiffies;
stats.run.rs_flushing = jbd2_time_diff(stats.run.rs_flushing,
stats.run.rs_logging);
- stats.run.rs_blocks = commit_transaction->t_outstanding_credits;
+ stats.run.rs_blocks =
+ atomic_read(&commit_transaction->t_outstanding_credits);
stats.run.rs_blocks_logged = 0;

J_ASSERT(commit_transaction->t_nr_buffers <=
- commit_transaction->t_outstanding_credits);
+ atomic_read(&commit_transaction->t_outstanding_credits));

err = 0;
descriptor = NULL;
@@ -616,7 +617,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
* the free space in the log, but this counter is changed
* by jbd2_journal_next_log_block() also.
*/
- commit_transaction->t_outstanding_credits--;
+ atomic_dec(&commit_transaction->t_outstanding_credits);

/* Bump b_count to prevent truncate from stumbling over
the shadowed buffer! @@@ This can go if we ever get
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 001e95f..9c64c7e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -55,6 +55,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
transaction->t_tid = journal->j_transaction_sequence++;
transaction->t_expires = jiffies + journal->j_commit_interval;
spin_lock_init(&transaction->t_handle_lock);
+ atomic_set(&transaction->t_updates, 0);
+ atomic_set(&transaction->t_outstanding_credits, 0);
INIT_LIST_HEAD(&transaction->t_inode_list);
INIT_LIST_HEAD(&transaction->t_private_list);

@@ -177,7 +179,7 @@ repeat_locked:
* checkpoint to free some more log space.
*/
spin_lock(&transaction->t_handle_lock);
- needed = transaction->t_outstanding_credits + nblocks;
+ needed = atomic_read(&transaction->t_outstanding_credits) + nblocks;

if (needed > journal->j_max_transaction_buffers) {
/*
@@ -240,11 +242,12 @@ repeat_locked:
}

handle->h_transaction = transaction;
- transaction->t_outstanding_credits += nblocks;
- transaction->t_updates++;
+ atomic_add(nblocks, &transaction->t_outstanding_credits);
+ atomic_inc(&transaction->t_updates);
transaction->t_handle_count++;
jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n",
- handle, nblocks, transaction->t_outstanding_credits,
+ handle, nblocks,
+ atomic_read(&transaction->t_outstanding_credits),
__jbd2_log_space_left(journal));
spin_unlock(&transaction->t_handle_lock);
spin_unlock(&journal->j_state_lock);
@@ -369,7 +372,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
}

spin_lock(&transaction->t_handle_lock);
- wanted = transaction->t_outstanding_credits + nblocks;
+ wanted = atomic_read(&transaction->t_outstanding_credits) + nblocks;

if (wanted > journal->j_max_transaction_buffers) {
jbd_debug(3, "denied handle %p %d blocks: "
@@ -384,7 +387,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
}

handle->h_buffer_credits += nblocks;
- transaction->t_outstanding_credits += nblocks;
+ atomic_add(nblocks, &transaction->t_outstanding_credits);
result = 0;

jbd_debug(3, "extended handle %p by %d\n", handle, nblocks);
@@ -426,15 +429,14 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
* First unlink the handle from its current transaction, and start the
* commit on that.
*/
- J_ASSERT(transaction->t_updates > 0);
+ J_ASSERT(atomic_read(&transaction->t_updates) > 0);
J_ASSERT(journal_current_handle() == handle);

spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
- transaction->t_outstanding_credits -= handle->h_buffer_credits;
- transaction->t_updates--;
-
- if (!transaction->t_updates)
+ atomic_sub(handle->h_buffer_credits,
+ &transaction->t_outstanding_credits);
+ if (atomic_dec_and_test(&transaction->t_updates))
wake_up(&journal->j_wait_updates);
spin_unlock(&transaction->t_handle_lock);

@@ -481,7 +483,7 @@ void jbd2_journal_lock_updates(journal_t *journal)
break;

spin_lock(&transaction->t_handle_lock);
- if (!transaction->t_updates) {
+ if (!atomic_read(&transaction->t_updates)) {
spin_unlock(&transaction->t_handle_lock);
break;
}
@@ -1258,7 +1260,8 @@ int jbd2_journal_stop(handle_t *handle)
{
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
- int err;
+ int err, wait_for_commit = 0;
+ tid_t tid;
pid_t pid;

J_ASSERT(journal_current_handle() == handle);
@@ -1266,7 +1269,7 @@ int jbd2_journal_stop(handle_t *handle)
if (is_handle_aborted(handle))
err = -EIO;
else {
- J_ASSERT(transaction->t_updates > 0);
+ J_ASSERT(atomic_read(&transaction->t_updates) > 0);
err = 0;
}

@@ -1334,14 +1337,8 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_sync)
transaction->t_synchronous_commit = 1;
current->journal_info = NULL;
- spin_lock(&transaction->t_handle_lock);
- transaction->t_outstanding_credits -= handle->h_buffer_credits;
- transaction->t_updates--;
- if (!transaction->t_updates) {
- wake_up(&journal->j_wait_updates);
- if (journal->j_barrier_count)
- wake_up(&journal->j_wait_transaction_locked);
- }
+ atomic_sub(handle->h_buffer_credits,
+ &transaction->t_outstanding_credits);

/*
* If the handle is marked SYNC, we need to set another commit
@@ -1350,15 +1347,13 @@ int jbd2_journal_stop(handle_t *handle)
* transaction is too old now.
*/
if (handle->h_sync ||
- transaction->t_outstanding_credits >
- journal->j_max_transaction_buffers ||
- time_after_eq(jiffies, transaction->t_expires)) {
+ (atomic_read(&transaction->t_outstanding_credits) >
+ journal->j_max_transaction_buffers) ||
+ time_after_eq(jiffies, transaction->t_expires)) {
/* Do this even for aborted journals: an abort still
* completes the commit thread, it just doesn't write
* anything to disk. */
- tid_t tid = transaction->t_tid;

- spin_unlock(&transaction->t_handle_lock);
jbd_debug(2, "transaction too old, requesting commit for "
"handle %p\n", handle);
/* This is non-blocking */
@@ -1369,11 +1364,25 @@ int jbd2_journal_stop(handle_t *handle)
* to wait for the commit to complete.
*/
if (handle->h_sync && !(current->flags & PF_MEMALLOC))
- err = jbd2_log_wait_commit(journal, tid);
- } else {
- spin_unlock(&transaction->t_handle_lock);
+ wait_for_commit = 1;
}

+ /*
+ * Once we drop t_updates, if it goes to zero the transaction
+ * could start commiting on us and eventually disappear. So
+ * once we do this, we must not dereference transaction
+ * pointer again.
+ */
+ tid = transaction->t_tid;
+ if (atomic_dec_and_test(&transaction->t_updates)) {
+ wake_up(&journal->j_wait_updates);
+ if (journal->j_barrier_count)
+ wake_up(&journal->j_wait_transaction_locked);
+ }
+
+ if (wait_for_commit)
+ err = jbd2_log_wait_commit(journal, tid);
+
lock_map_release(&handle->h_lockdep_map);

jbd2_free_handle(handle);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 5a72bc7..a72ce21 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -601,13 +601,13 @@ struct transaction_s
* Number of outstanding updates running on this transaction
* [t_handle_lock]
*/
- int t_updates;
+ atomic_t t_updates;

/*
* Number of buffers reserved for use by all handles in this transaction
* handle but not yet modified. [t_handle_lock]
*/
- int t_outstanding_credits;
+ atomic_t t_outstanding_credits;

/*
* Forward and backward links for the circular list of all transactions
@@ -1258,8 +1258,8 @@ static inline int jbd_space_needed(journal_t *journal)
{
int nblocks = journal->j_max_transaction_buffers;
if (journal->j_committing_transaction)
- nblocks += journal->j_committing_transaction->
- t_outstanding_credits;
+ nblocks += atomic_read(&journal->j_committing_transaction->
+ t_outstanding_credits);
return nblocks;
}

--
1.7.0.4


2010-08-04 16:39:18

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH -v2 2/3] jbd2: Change j_state_lock to be a rwlock_t

Lockstat reports have shown that j_state_lock is a major source of
lock contention, especially on systems with more than 4 CPU cores. So
change it to be a read/write spinlock.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/inode.c | 4 +-
fs/ext4/super.c | 4 +-
fs/jbd2/checkpoint.c | 16 ++++----
fs/jbd2/commit.c | 26 +++++++-------
fs/jbd2/journal.c | 94 ++++++++++++++++++++++++-------------------------
fs/jbd2/transaction.c | 74 +++++++++++++++++++++------------------
fs/ocfs2/journal.c | 4 +-
include/linux/jbd2.h | 2 +-
8 files changed, 114 insertions(+), 110 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 533b607..ab2247d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5066,7 +5066,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
transaction_t *transaction;
tid_t tid;

- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
if (journal->j_running_transaction)
transaction = journal->j_running_transaction;
else
@@ -5075,7 +5075,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
tid = transaction->t_tid;
else
tid = journal->j_commit_sequence;
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
ei->i_sync_tid = tid;
ei->i_datasync_tid = tid;
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3fd65eb..81cb3fc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3232,7 +3232,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
journal->j_min_batch_time = sbi->s_min_batch_time;
journal->j_max_batch_time = sbi->s_max_batch_time;

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
if (test_opt(sb, BARRIER))
journal->j_flags |= JBD2_BARRIER;
else
@@ -3241,7 +3241,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
journal->j_flags |= JBD2_ABORT_ON_SYNCDATA_ERR;
else
journal->j_flags &= ~JBD2_ABORT_ON_SYNCDATA_ERR;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
}

static journal_t *ext4_get_journal(struct super_block *sb,
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index f8cdc02..1c23a0f 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -118,13 +118,13 @@ static int __try_to_free_cp_buf(struct journal_head *jh)
void __jbd2_log_wait_for_space(journal_t *journal)
{
int nblocks, space_left;
- assert_spin_locked(&journal->j_state_lock);
+ /* assert_spin_locked(&journal->j_state_lock); */

nblocks = jbd_space_needed(journal);
while (__jbd2_log_space_left(journal) < nblocks) {
if (journal->j_flags & JBD2_ABORT)
return;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
mutex_lock(&journal->j_checkpoint_mutex);

/*
@@ -138,7 +138,7 @@ void __jbd2_log_wait_for_space(journal_t *journal)
* filesystem, so abort the journal and leave a stack
* trace for forensic evidence.
*/
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
spin_lock(&journal->j_list_lock);
nblocks = jbd_space_needed(journal);
space_left = __jbd2_log_space_left(journal);
@@ -149,7 +149,7 @@ void __jbd2_log_wait_for_space(journal_t *journal)
if (journal->j_committing_transaction)
tid = journal->j_committing_transaction->t_tid;
spin_unlock(&journal->j_list_lock);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
if (chkpt) {
jbd2_log_do_checkpoint(journal);
} else if (jbd2_cleanup_journal_tail(journal) == 0) {
@@ -167,7 +167,7 @@ void __jbd2_log_wait_for_space(journal_t *journal)
WARN_ON(1);
jbd2_journal_abort(journal, 0);
}
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
} else {
spin_unlock(&journal->j_list_lock);
}
@@ -474,7 +474,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
* next transaction ID we will write, and where it will
* start. */

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
spin_lock(&journal->j_list_lock);
transaction = journal->j_checkpoint_transactions;
if (transaction) {
@@ -496,7 +496,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
/* If the oldest pinned transaction is at the tail of the log
already then there's not much we can do right now. */
if (journal->j_tail_sequence == first_tid) {
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return 1;
}

@@ -516,7 +516,7 @@ int jbd2_cleanup_journal_tail(journal_t *journal)
journal->j_free += freed;
journal->j_tail_sequence = first_tid;
journal->j_tail = blocknr;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);

/*
* If there is an external journal, we need to make sure that
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index fbd2c56..67bb0a2 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -152,9 +152,9 @@ static int journal_submit_commit_record(journal_t *journal,
printk(KERN_WARNING
"JBD2: Disabling barriers on %s, "
"not supported by device\n", journal->j_devname);
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
journal->j_flags &= ~JBD2_BARRIER;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);

/* And try again, without the barrier */
lock_buffer(bh);
@@ -182,9 +182,9 @@ retry:
printk(KERN_WARNING
"JBD2: %s: disabling barries on %s - not supported "
"by device\n", __func__, journal->j_devname);
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
journal->j_flags &= ~JBD2_BARRIER;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);

lock_buffer(bh);
clear_buffer_dirty(bh);
@@ -400,7 +400,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
jbd_debug(1, "JBD: starting commit of transaction %d\n",
commit_transaction->t_tid);

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
commit_transaction->t_state = T_LOCKED;

/*
@@ -424,9 +424,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
TASK_UNINTERRUPTIBLE);
if (atomic_read(&commit_transaction->t_updates)) {
spin_unlock(&commit_transaction->t_handle_lock);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
schedule();
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
spin_lock(&commit_transaction->t_handle_lock);
}
finish_wait(&journal->j_wait_updates, &wait);
@@ -497,7 +497,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
start_time = ktime_get();
commit_transaction->t_log_start = journal->j_head;
wake_up(&journal->j_wait_transaction_locked);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);

jbd_debug (3, "JBD: commit phase 2\n");

@@ -519,9 +519,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
* transaction! Now comes the tricky part: we need to write out
* metadata. Loop over the transaction's entire buffer list:
*/
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
commit_transaction->t_state = T_COMMIT;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);

trace_jbd2_commit_logging(journal, commit_transaction);
stats.run.rs_logging = jiffies;
@@ -978,7 +978,7 @@ restart_loop:
* __jbd2_journal_drop_transaction(). Otherwise we could race with
* other checkpointing code processing the transaction...
*/
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
spin_lock(&journal->j_list_lock);
/*
* Now recheck if some buffers did not get attached to the transaction
@@ -986,7 +986,7 @@ restart_loop:
*/
if (commit_transaction->t_forget) {
spin_unlock(&journal->j_list_lock);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
goto restart_loop;
}

@@ -1038,7 +1038,7 @@ restart_loop:
journal->j_average_commit_time*3) / 4;
else
journal->j_average_commit_time = commit_time;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);

if (commit_transaction->t_checkpoint_list == NULL &&
commit_transaction->t_checkpoint_io_list == NULL) {
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a79d334..e7bf0fd 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -142,7 +142,7 @@ static int kjournald2(void *arg)
/*
* And now, wait forever for commit wakeup events.
*/
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);

loop:
if (journal->j_flags & JBD2_UNMOUNT)
@@ -153,10 +153,10 @@ loop:

if (journal->j_commit_sequence != journal->j_commit_request) {
jbd_debug(1, "OK, requests differ\n");
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
del_timer_sync(&journal->j_commit_timer);
jbd2_journal_commit_transaction(journal);
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
goto loop;
}

@@ -168,9 +168,9 @@ loop:
* be already stopped.
*/
jbd_debug(1, "Now suspending kjournald2\n");
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
refrigerator();
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
} else {
/*
* We assume on resume that commits are already there,
@@ -190,9 +190,9 @@ loop:
if (journal->j_flags & JBD2_UNMOUNT)
should_sleep = 0;
if (should_sleep) {
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
schedule();
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
}
finish_wait(&journal->j_wait_commit, &wait);
}
@@ -210,7 +210,7 @@ loop:
goto loop;

end_loop:
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
del_timer_sync(&journal->j_commit_timer);
journal->j_task = NULL;
wake_up(&journal->j_wait_done_commit);
@@ -233,16 +233,16 @@ static int jbd2_journal_start_thread(journal_t *journal)

static void journal_kill_thread(journal_t *journal)
{
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
journal->j_flags |= JBD2_UNMOUNT;

while (journal->j_task) {
wake_up(&journal->j_wait_commit);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit, journal->j_task == NULL);
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
}
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
}

/*
@@ -452,7 +452,7 @@ int __jbd2_log_space_left(journal_t *journal)
{
int left = journal->j_free;

- assert_spin_locked(&journal->j_state_lock);
+ /* assert_spin_locked(&journal->j_state_lock); */

/*
* Be pessimistic here about the number of those free blocks which
@@ -497,9 +497,9 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid)
{
int ret;

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
ret = __jbd2_log_start_commit(journal, tid);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return ret;
}

@@ -518,7 +518,7 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
transaction_t *transaction = NULL;
tid_t tid;

- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
if (journal->j_running_transaction && !current->journal_info) {
transaction = journal->j_running_transaction;
__jbd2_log_start_commit(journal, transaction->t_tid);
@@ -526,12 +526,12 @@ int jbd2_journal_force_commit_nested(journal_t *journal)
transaction = journal->j_committing_transaction;

if (!transaction) {
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
return 0; /* Nothing to retry */
}

tid = transaction->t_tid;
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
jbd2_log_wait_commit(journal, tid);
return 1;
}
@@ -545,7 +545,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
{
int ret = 0;

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
if (journal->j_running_transaction) {
tid_t tid = journal->j_running_transaction->t_tid;

@@ -564,7 +564,7 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
*ptid = journal->j_committing_transaction->t_tid;
ret = 1;
}
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return ret;
}

@@ -576,26 +576,24 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
{
int err = 0;

+ read_lock(&journal->j_state_lock);
#ifdef CONFIG_JBD2_DEBUG
- spin_lock(&journal->j_state_lock);
if (!tid_geq(journal->j_commit_request, tid)) {
printk(KERN_EMERG
"%s: error: j_commit_request=%d, tid=%d\n",
__func__, journal->j_commit_request, tid);
}
- spin_unlock(&journal->j_state_lock);
#endif
- spin_lock(&journal->j_state_lock);
while (tid_gt(tid, journal->j_commit_sequence)) {
jbd_debug(1, "JBD: want %d, j_commit_sequence=%d\n",
tid, journal->j_commit_sequence);
wake_up(&journal->j_wait_commit);
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_done_commit,
!tid_gt(tid, journal->j_commit_sequence));
- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
}
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);

if (unlikely(is_journal_aborted(journal))) {
printk(KERN_EMERG "journal commit I/O error\n");
@@ -612,7 +610,7 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
{
unsigned long blocknr;

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
J_ASSERT(journal->j_free > 1);

blocknr = journal->j_head;
@@ -620,7 +618,7 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
journal->j_free--;
if (journal->j_head == journal->j_last)
journal->j_head = journal->j_first;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return jbd2_journal_bmap(journal, blocknr, retp);
}

@@ -840,7 +838,7 @@ static journal_t * journal_init_common (void)
mutex_init(&journal->j_checkpoint_mutex);
spin_lock_init(&journal->j_revoke_lock);
spin_lock_init(&journal->j_list_lock);
- spin_lock_init(&journal->j_state_lock);
+ rwlock_init(&journal->j_state_lock);

journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
journal->j_min_batch_time = 0;
@@ -1106,14 +1104,14 @@ void jbd2_journal_update_superblock(journal_t *journal, int wait)
set_buffer_uptodate(bh);
}

- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
jbd_debug(1,"JBD: updating superblock (start %ld, seq %d, errno %d)\n",
journal->j_tail, journal->j_tail_sequence, journal->j_errno);

sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
sb->s_start = cpu_to_be32(journal->j_tail);
sb->s_errno = cpu_to_be32(journal->j_errno);
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);

BUFFER_TRACE(bh, "marking dirty");
mark_buffer_dirty(bh);
@@ -1134,12 +1132,12 @@ out:
* any future commit will have to be careful to update the
* superblock again to re-record the true start of the log. */

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
if (sb->s_start)
journal->j_flags &= ~JBD2_FLUSHED;
else
journal->j_flags |= JBD2_FLUSHED;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
}

/*
@@ -1551,7 +1549,7 @@ int jbd2_journal_flush(journal_t *journal)
transaction_t *transaction = NULL;
unsigned long old_tail;

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);

/* Force everything buffered to the log... */
if (journal->j_running_transaction) {
@@ -1564,10 +1562,10 @@ int jbd2_journal_flush(journal_t *journal)
if (transaction) {
tid_t tid = transaction->t_tid;

- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
jbd2_log_wait_commit(journal, tid);
} else {
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
}

/* ...and flush everything in the log out to disk. */
@@ -1591,12 +1589,12 @@ int jbd2_journal_flush(journal_t *journal)
* the magic code for a fully-recovered superblock. Any future
* commits of data to the journal will restore the current
* s_start value. */
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
old_tail = journal->j_tail;
journal->j_tail = 0;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
jbd2_journal_update_superblock(journal, 1);
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
journal->j_tail = old_tail;

J_ASSERT(!journal->j_running_transaction);
@@ -1604,7 +1602,7 @@ int jbd2_journal_flush(journal_t *journal)
J_ASSERT(!journal->j_checkpoint_transactions);
J_ASSERT(journal->j_head == journal->j_tail);
J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return 0;
}

@@ -1668,12 +1666,12 @@ void __jbd2_journal_abort_hard(journal_t *journal)
printk(KERN_ERR "Aborting journal on device %s.\n",
journal->j_devname);

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
journal->j_flags |= JBD2_ABORT;
transaction = journal->j_running_transaction;
if (transaction)
__jbd2_log_start_commit(journal, transaction->t_tid);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
}

/* Soft abort: record the abort error status in the journal superblock,
@@ -1758,12 +1756,12 @@ int jbd2_journal_errno(journal_t *journal)
{
int err;

- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
if (journal->j_flags & JBD2_ABORT)
err = -EROFS;
else
err = journal->j_errno;
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
return err;
}

@@ -1778,12 +1776,12 @@ int jbd2_journal_clear_err(journal_t *journal)
{
int err = 0;

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
if (journal->j_flags & JBD2_ABORT)
err = -EROFS;
else
journal->j_errno = 0;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return err;
}

@@ -1796,10 +1794,10 @@ int jbd2_journal_clear_err(journal_t *journal)
*/
void jbd2_journal_ack_err(journal_t *journal)
{
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
if (journal->j_errno)
journal->j_flags |= JBD2_ACK_ERR;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
}

int jbd2_journal_blocks_per_page(struct inode *inode)
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 9c64c7e..6630651 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -124,36 +124,38 @@ alloc_transaction:

jbd_debug(3, "New handle %p going live.\n", handle);

-repeat:
-
/*
* We need to hold j_state_lock until t_updates has been incremented,
* for proper journal barrier handling
*/
- spin_lock(&journal->j_state_lock);
-repeat_locked:
+repeat:
+ read_lock(&journal->j_state_lock);
if (is_journal_aborted(journal) ||
(journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
kfree(new_transaction);
return -EROFS;
}

/* Wait on the journal's transaction barrier if necessary */
if (journal->j_barrier_count) {
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
wait_event(journal->j_wait_transaction_locked,
journal->j_barrier_count == 0);
goto repeat;
}

if (!journal->j_running_transaction) {
- if (!new_transaction) {
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
+ if (!new_transaction)
goto alloc_transaction;
+ write_lock(&journal->j_state_lock);
+ if (!journal->j_running_transaction) {
+ jbd2_get_transaction(journal, new_transaction);
+ new_transaction = NULL;
}
- jbd2_get_transaction(journal, new_transaction);
- new_transaction = NULL;
+ write_unlock(&journal->j_state_lock);
+ goto repeat;
}

transaction = journal->j_running_transaction;
@@ -167,7 +169,7 @@ repeat_locked:

prepare_to_wait(&journal->j_wait_transaction_locked,
&wait, TASK_UNINTERRUPTIBLE);
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
schedule();
finish_wait(&journal->j_wait_transaction_locked, &wait);
goto repeat;
@@ -194,7 +196,7 @@ repeat_locked:
prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
TASK_UNINTERRUPTIBLE);
__jbd2_log_start_commit(journal, transaction->t_tid);
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
schedule();
finish_wait(&journal->j_wait_transaction_locked, &wait);
goto repeat;
@@ -228,8 +230,12 @@ repeat_locked:
if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) {
jbd_debug(2, "Handle %p waiting for checkpoint...\n", handle);
spin_unlock(&transaction->t_handle_lock);
- __jbd2_log_wait_for_space(journal);
- goto repeat_locked;
+ read_unlock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
+ if (__jbd2_log_space_left(journal) < jbd_space_needed(journal))
+ __jbd2_log_wait_for_space(journal);
+ write_unlock(&journal->j_state_lock);
+ goto repeat;
}

/* OK, account for the buffers that this operation expects to
@@ -250,7 +256,7 @@ repeat_locked:
atomic_read(&transaction->t_outstanding_credits),
__jbd2_log_space_left(journal));
spin_unlock(&transaction->t_handle_lock);
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);

lock_map_acquire(&handle->h_lockdep_map);
kfree(new_transaction);
@@ -362,7 +368,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)

result = 1;

- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);

/* Don't extend a locked-down transaction! */
if (handle->h_transaction->t_state != T_RUNNING) {
@@ -394,7 +400,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
unlock:
spin_unlock(&transaction->t_handle_lock);
error_out:
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
out:
return result;
}
@@ -432,7 +438,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
J_ASSERT(atomic_read(&transaction->t_updates) > 0);
J_ASSERT(journal_current_handle() == handle);

- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
atomic_sub(handle->h_buffer_credits,
&transaction->t_outstanding_credits);
@@ -442,7 +448,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)

jbd_debug(2, "restarting handle %p\n", handle);
__jbd2_log_start_commit(journal, transaction->t_tid);
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);

lock_map_release(&handle->h_lockdep_map);
handle->h_buffer_credits = nblocks;
@@ -472,7 +478,7 @@ void jbd2_journal_lock_updates(journal_t *journal)
{
DEFINE_WAIT(wait);

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
++journal->j_barrier_count;

/* Wait until there are no running updates */
@@ -490,12 +496,12 @@ void jbd2_journal_lock_updates(journal_t *journal)
prepare_to_wait(&journal->j_wait_updates, &wait,
TASK_UNINTERRUPTIBLE);
spin_unlock(&transaction->t_handle_lock);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
schedule();
finish_wait(&journal->j_wait_updates, &wait);
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
}
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);

/*
* We have now established a barrier against other normal updates, but
@@ -519,9 +525,9 @@ void jbd2_journal_unlock_updates (journal_t *journal)
J_ASSERT(journal->j_barrier_count != 0);

mutex_unlock(&journal->j_barrier);
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
--journal->j_barrier_count;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
wake_up(&journal->j_wait_transaction_locked);
}

@@ -1314,9 +1320,9 @@ int jbd2_journal_stop(handle_t *handle)

journal->j_last_sync_writer = pid;

- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
commit_time = journal->j_average_commit_time;
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);

trans_time = ktime_to_ns(ktime_sub(ktime_get(),
transaction->t_start_time));
@@ -1748,7 +1754,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh)
goto zap_buffer_unlocked;

/* OK, we have data buffer in journaled mode */
- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
jbd_lock_bh_state(bh);
spin_lock(&journal->j_list_lock);

@@ -1801,7 +1807,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh)
jbd2_journal_put_journal_head(jh);
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return ret;
} else {
/* There is no currently-running transaction. So the
@@ -1815,7 +1821,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh)
jbd2_journal_put_journal_head(jh);
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return ret;
} else {
/* The orphan record's transaction has
@@ -1839,7 +1845,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh)
jbd2_journal_put_journal_head(jh);
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
return 0;
} else {
/* Good, the buffer belongs to the running transaction.
@@ -1858,7 +1864,7 @@ zap_buffer:
zap_buffer_no_jh:
spin_unlock(&journal->j_list_lock);
jbd_unlock_bh_state(bh);
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
zap_buffer_unlocked:
clear_buffer_dirty(bh);
J_ASSERT_BH(bh, !buffer_jbddirty(bh));
@@ -2165,9 +2171,9 @@ int jbd2_journal_begin_ordered_truncate(journal_t *journal,
/* Locks are here just to force reading of recent values, it is
* enough that the transaction was not committing before we started
* a transaction adding the inode to orphan list */
- spin_lock(&journal->j_state_lock);
+ read_lock(&journal->j_state_lock);
commit_trans = journal->j_committing_transaction;
- spin_unlock(&journal->j_state_lock);
+ read_unlock(&journal->j_state_lock);
spin_lock(&journal->j_list_lock);
inode_trans = jinode->i_transaction;
spin_unlock(&journal->j_list_lock);
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 47878cf..9c1b92e 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -760,13 +760,13 @@ void ocfs2_set_journal_params(struct ocfs2_super *osb)
if (osb->osb_commit_interval)
commit_interval = osb->osb_commit_interval;

- spin_lock(&journal->j_state_lock);
+ write_lock(&journal->j_state_lock);
journal->j_commit_interval = commit_interval;
if (osb->s_mount_opt & OCFS2_MOUNT_BARRIER)
journal->j_flags |= JBD2_BARRIER;
else
journal->j_flags &= ~JBD2_BARRIER;
- spin_unlock(&journal->j_state_lock);
+ write_unlock(&journal->j_state_lock);
}

int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty)
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a72ce21..15d5743 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -764,7 +764,7 @@ struct journal_s
/*
* Protect the various scalars in the journal
*/
- spinlock_t j_state_lock;
+ rwlock_t j_state_lock;

/*
* Number of processes waiting to create a barrier lock [j_state_lock]
--
1.7.0.4


2010-08-05 01:59:03

by john stultz

[permalink] [raw]
Subject: Re: [PATCH -v2 0/3] jbd2 scalability patches

On Wed, 2010-08-04 at 12:39 -0400, Theodore Ts'o wrote:
> This version fixes three bugs in the 2nd patch of this series that
> caused kernel BUG when the system was under race. We weren't accounting
> with t_oustanding_credits correctly, and there were race conditions
> caused by the fact the I had overlooked the fact that
> __jbd2_log_wait_for_space() and jbd2_get_transaction() requires
> j_state_lock to be write locked.

So without the vfs patches, I don't see much change with this patchset
(similar to the last).

novfs + j_state lock
Throughput 763.105 MB/sec 8 procs
Throughput 1056.81 MB/sec 4 procs
Throughput 681.761 MB/sec 2 procs
Throughput 409.25 MB/sec 1 procs

vs

no vfs + j_state lock + jdb2 scalability queue
Throughput 767.778 MB/sec 8 procs
Throughput 1069.58 MB/sec 4 procs
Throughput 679.786 MB/sec 2 procs
Throughput 401.419 MB/sec 1 procs


But with the vfs patchset, there's a nice increase @8cpus.
vfs + j_state lock
Throughput 1061.44 MB/sec 8 procs
Throughput 1126.55 MB/sec 4 procs
Throughput 706.306 MB/sec 2 procs
Throughput 402.102 MB/sec 1 procs

vs

vfs + j_state lock + jdb2 scalability queue
Throughput 1214.21 MB/sec 8 procs
Throughput 1175.49 MB/sec 4 procs
Throughput 716.294 MB/sec 2 procs
Throughput 402.988 MB/sec 1 procs


I'll post perf log data tomorrow.

thanks
-john




2010-08-05 05:42:14

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -v2 0/3] jbd2 scalability patches

Thanks, John, for doing these numbers!!!

> vfs + j_state lock + jdb2 scalability queue
> Throughput 1214.21 MB/sec 8 procs
> Throughput 1175.49 MB/sec 4 procs
> Throughput 716.294 MB/sec 2 procs
> Throughput 402.988 MB/sec 1 procs

2.6.35-rc6-vfs + atomic
Throughput 2648.42 MB/sec 8 procs
Throughput 1586.68 MB/sec 4 procs
Throughput 851.545 MB/sec 2 procs
Throughput 453.106 MB/sec 1 procs

Your earlier numbers were quite a bit higher than this latest set. I
assume that's due to hardware differences?

- Ted

2010-08-05 17:42:31

by john stultz

[permalink] [raw]
Subject: Re: [PATCH -v2 0/3] jbd2 scalability patches

On Thu, 2010-08-05 at 01:42 -0400, Ted Ts'o wrote:
> Thanks, John, for doing these numbers!!!
>
> > vfs + j_state lock + jdb2 scalability queue
> > Throughput 1214.21 MB/sec 8 procs
> > Throughput 1175.49 MB/sec 4 procs
> > Throughput 716.294 MB/sec 2 procs
> > Throughput 402.988 MB/sec 1 procs
>
> 2.6.35-rc6-vfs + atomic
> Throughput 2648.42 MB/sec 8 procs
> Throughput 1586.68 MB/sec 4 procs
> Throughput 851.545 MB/sec 2 procs
> Throughput 453.106 MB/sec 1 procs
>
> Your earlier numbers were quite a bit higher than this latest set. I
> assume that's due to hardware differences?

So yes, it was run on a different but similar system.

But the numbers I just provided here were with CONFIG_PREEMPT_RT (vfs:
2.6.33-rt23 or novfs: 2.6.33-rt29). Sorry for not making that explicit.

Do you want me to generate non-rt 2.6.35 numbers as well?

thanks
-john



2010-08-09 16:07:30

by Joel Becker

[permalink] [raw]
Subject: Re: [Ocfs2-devel] [PATCH -v2 0/3] jbd2 scalability patches

On Wed, Aug 04, 2010 at 12:39:14PM -0400, Theodore Ts'o wrote:
> This version fixes three bugs in the 2nd patch of this series that
> caused kernel BUG when the system was under race. We weren't accounting
> with t_oustanding_credits correctly, and there were race conditions
> caused by the fact the I had overlooked the fact that
> __jbd2_log_wait_for_space() and jbd2_get_transaction() requires
> j_state_lock to be write locked.
>

Here's an initial set of runs. Not the fastest disk ever, but it shows
the patches don't hurt us.

http://oss.oracle.com/~mmatsuna/ocfs2-test/FFSB/FFSB.html

Joel

--

"I inject pure kryptonite into my brain.
It improves my kung fu, and it eases the pain."


Joel Becker
Consulting Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2010-08-09 17:02:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -v2 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop

> By using an atomic_t for t_updates and t_outstanding credits, this
> should allow us to not need to take transaction t_handle_lock in
> jbd2_journal_stop().
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
The patch looks OK to me besides:
@@ -426,15 +429,14 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
* First unlink the handle from its current transaction, and start the
* commit on that.
*/
- J_ASSERT(transaction->t_updates > 0);
+ J_ASSERT(atomic_read(&transaction->t_updates) > 0);
J_ASSERT(journal_current_handle() == handle);

spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
- transaction->t_outstanding_credits -= handle->h_buffer_credits;
- transaction->t_updates--;

2010-08-09 19:05:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -v2 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop

On Mon, Aug 09, 2010 at 07:02:16PM +0200, Jan Kara wrote:
> spin_lock(&journal->j_state_lock);
> spin_lock(&transaction->t_handle_lock);
> - transaction->t_outstanding_credits -= handle->h_buffer_credits;
> - transaction->t_updates--;
> -
> - if (!transaction->t_updates)
> + atomic_sub(handle->h_buffer_credits,
> + &transaction->t_outstanding_credits);
> + if (atomic_dec_and_test(&transaction->t_updates))
>
> After this a transaction can disappear so subsequent
> __jbd2_log_start_commit shouldn't dereference transaction->t_tid,
> right?

I think it should be ok because we're holding j_state_lock(), so the
transaction can't disappear until we release the j_state_lock.

- Ted

2010-08-09 19:46:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -v2 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop

On Mon 09-08-10 15:05:13, Ted Ts'o wrote:
> On Mon, Aug 09, 2010 at 07:02:16PM +0200, Jan Kara wrote:
> > spin_lock(&journal->j_state_lock);
> > spin_lock(&transaction->t_handle_lock);
> > - transaction->t_outstanding_credits -= handle->h_buffer_credits;
> > - transaction->t_updates--;
> > -
> > - if (!transaction->t_updates)
> > + atomic_sub(handle->h_buffer_credits,
> > + &transaction->t_outstanding_credits);
> > + if (atomic_dec_and_test(&transaction->t_updates))
> >
> > After this a transaction can disappear so subsequent
> > __jbd2_log_start_commit shouldn't dereference transaction->t_tid,
> > right?
>
> I think it should be ok because we're holding j_state_lock(), so the
> transaction can't disappear until we release the j_state_lock.
Ah, OK. You're right. I just thought we eventually want to remove the
lock but you're right that currently the code is fine. Sorry for the noise.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-08-10 16:30:48

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -v2 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop

On Mon, Aug 09, 2010 at 09:45:55PM +0200, Jan Kara wrote:
> Ah, OK. You're right. I just thought we eventually want to remove the
> lock but you're right that currently the code is fine. Sorry for the noise.

I would love to get rid of the j_state_lock, but looking through the
code, I couldn't figure out how to do this safely. Hence my
conversion of the j_state_lock to a rwlock_t, with the downside of
this causing more cache line bounces. If someone can suggest a way to
drop needing a global spinlock (whether it is an exclusive or rwlock)
in start_this_handle(), I'd love to hear them.

- Ted

2010-08-11 22:18:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -v2 1/3] jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop

On Tue 10-08-10 12:30:46, Ted Ts'o wrote:
> On Mon, Aug 09, 2010 at 09:45:55PM +0200, Jan Kara wrote:
> > Ah, OK. You're right. I just thought we eventually want to remove the
> > lock but you're right that currently the code is fine. Sorry for the noise.
>
> I would love to get rid of the j_state_lock, but looking through the
> code, I couldn't figure out how to do this safely. Hence my
> conversion of the j_state_lock to a rwlock_t, with the downside of
> this causing more cache line bounces. If someone can suggest a way to
> drop needing a global spinlock (whether it is an exclusive or rwlock)
> in start_this_handle(), I'd love to hear them.
Thinking about this, I think there's a way:
1) Make transaction structures freed by RCU so when we get transaction
pointer from a journal we can operate on it without being afraid of
touching freed structure.
2) Increment t_updates count before doing anything else - with this, we are
sure that if a transaction is in LOCKED state or in some earlier state, it
won't proceed further before we drop our reference.
3) smp_mb() to get values from transaction structure after the ref count
increase.
4) Check that the transaction is actually running and has enough credits.
5) The check
if (__jbd2_log_space_left(journal) < jbd_space_needed(journal)) {
seems just useless if you look at it more in detail. It just transforms
to
7/8*(journal->j_free-32) < journal->j_max_transaction_buffers +
journal->j_committing_transaction->t_outstanding_credits
So it just doesn't seem to make sense to call it whenever we get a
handle. It should be enough to do the check only when we really start a
new transaction. If it is satisfied at that moment, it should be satisfied
during the whole lifetime of a transaction.

This should make the fast path (when a transaction does not need to be
started) of start_this_handle() lockless. Of course when any of the checks
fails, we have to bite the bullet and take the lock.
I can have a look into transforming this ideas into a patch but I'm not
sure when I get to it...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR