By using an atomic_t for t_updates and t_outstanding_credits, this
should allow us to avoid taking the transaction's t_handle_lock in
jbd2_journal_stop().
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Keith Maanthey <[email protected]>
Cc: Eric Whitney <[email protected]>
---
It would be interesting to get some quick performance tests with and
without this patch on (a) dbench w/ PREEMPT-RT, (b) Keith's 8-way, and
(c) Eric's 48-core machine. I have some other ideas for how to improve
things in start_this_handle(), but they're going to be trickier, and I
want to do some hard testing using xfstests on a big machine to make
sure this is really safe before I keep going. This has passed light
testing, but I haven't had a chance to do stress tests on a large SMP
machine yet. That'll come later this week, hopefully.
fs/jbd2/checkpoint.c | 2 +-
fs/jbd2/commit.c | 13 +++++----
fs/jbd2/transaction.c | 69 +++++++++++++++++++++++++++---------------------
include/linux/jbd2.h | 8 +++---
4 files changed, 51 insertions(+), 41 deletions(-)
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index 076d1cc..f8cdc02 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -775,7 +775,7 @@ void __jbd2_journal_drop_transaction(journal_t *journal, transaction_t *transact
J_ASSERT(transaction->t_log_list == NULL);
J_ASSERT(transaction->t_checkpoint_list == NULL);
J_ASSERT(transaction->t_checkpoint_io_list == NULL);
- J_ASSERT(transaction->t_updates == 0);
+ J_ASSERT(atomic_read(&transaction->t_updates) == 0);
J_ASSERT(journal->j_committing_transaction != transaction);
J_ASSERT(journal->j_running_transaction != transaction);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index af05681..fbd2c56 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -417,12 +417,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
stats.run.rs_locked);
spin_lock(&commit_transaction->t_handle_lock);
- while (commit_transaction->t_updates) {
+ while (atomic_read(&commit_transaction->t_updates)) {
DEFINE_WAIT(wait);
prepare_to_wait(&journal->j_wait_updates, &wait,
TASK_UNINTERRUPTIBLE);
- if (commit_transaction->t_updates) {
+ if (atomic_read(&commit_transaction->t_updates)) {
spin_unlock(&commit_transaction->t_handle_lock);
spin_unlock(&journal->j_state_lock);
schedule();
@@ -433,7 +433,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
}
spin_unlock(&commit_transaction->t_handle_lock);
- J_ASSERT (commit_transaction->t_outstanding_credits <=
+ J_ASSERT (atomic_read(&commit_transaction->t_outstanding_credits) <=
journal->j_max_transaction_buffers);
/*
@@ -527,11 +527,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
stats.run.rs_logging = jiffies;
stats.run.rs_flushing = jbd2_time_diff(stats.run.rs_flushing,
stats.run.rs_logging);
- stats.run.rs_blocks = commit_transaction->t_outstanding_credits;
+ stats.run.rs_blocks =
+ atomic_read(&commit_transaction->t_outstanding_credits);
stats.run.rs_blocks_logged = 0;
J_ASSERT(commit_transaction->t_nr_buffers <=
- commit_transaction->t_outstanding_credits);
+ atomic_read(&commit_transaction->t_outstanding_credits));
err = 0;
descriptor = NULL;
@@ -616,7 +617,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
* the free space in the log, but this counter is changed
* by jbd2_journal_next_log_block() also.
*/
- commit_transaction->t_outstanding_credits--;
+ atomic_dec(&commit_transaction->t_outstanding_credits);
/* Bump b_count to prevent truncate from stumbling over
the shadowed buffer! @@@ This can go if we ever get
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 001e95f..9c64c7e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -55,6 +55,8 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
transaction->t_tid = journal->j_transaction_sequence++;
transaction->t_expires = jiffies + journal->j_commit_interval;
spin_lock_init(&transaction->t_handle_lock);
+ atomic_set(&transaction->t_updates, 0);
+ atomic_set(&transaction->t_outstanding_credits, 0);
INIT_LIST_HEAD(&transaction->t_inode_list);
INIT_LIST_HEAD(&transaction->t_private_list);
@@ -177,7 +179,7 @@ repeat_locked:
* checkpoint to free some more log space.
*/
spin_lock(&transaction->t_handle_lock);
- needed = transaction->t_outstanding_credits + nblocks;
+ needed = atomic_read(&transaction->t_outstanding_credits) + nblocks;
if (needed > journal->j_max_transaction_buffers) {
/*
@@ -240,11 +242,12 @@ repeat_locked:
}
handle->h_transaction = transaction;
- transaction->t_outstanding_credits += nblocks;
- transaction->t_updates++;
+ atomic_add(nblocks, &transaction->t_outstanding_credits);
+ atomic_inc(&transaction->t_updates);
transaction->t_handle_count++;
jbd_debug(4, "Handle %p given %d credits (total %d, free %d)\n",
- handle, nblocks, transaction->t_outstanding_credits,
+ handle, nblocks,
+ atomic_read(&transaction->t_outstanding_credits),
__jbd2_log_space_left(journal));
spin_unlock(&transaction->t_handle_lock);
spin_unlock(&journal->j_state_lock);
@@ -369,7 +372,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
}
spin_lock(&transaction->t_handle_lock);
- wanted = transaction->t_outstanding_credits + nblocks;
+ wanted = atomic_read(&transaction->t_outstanding_credits) + nblocks;
if (wanted > journal->j_max_transaction_buffers) {
jbd_debug(3, "denied handle %p %d blocks: "
@@ -384,7 +387,7 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
}
handle->h_buffer_credits += nblocks;
- transaction->t_outstanding_credits += nblocks;
+ atomic_add(nblocks, &transaction->t_outstanding_credits);
result = 0;
jbd_debug(3, "extended handle %p by %d\n", handle, nblocks);
@@ -426,15 +429,14 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int gfp_mask)
* First unlink the handle from its current transaction, and start the
* commit on that.
*/
- J_ASSERT(transaction->t_updates > 0);
+ J_ASSERT(atomic_read(&transaction->t_updates) > 0);
J_ASSERT(journal_current_handle() == handle);
spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
- transaction->t_outstanding_credits -= handle->h_buffer_credits;
- transaction->t_updates--;
-
- if (!transaction->t_updates)
+ atomic_sub(handle->h_buffer_credits,
+ &transaction->t_outstanding_credits);
+ if (atomic_dec_and_test(&transaction->t_updates))
wake_up(&journal->j_wait_updates);
spin_unlock(&transaction->t_handle_lock);
@@ -481,7 +483,7 @@ void jbd2_journal_lock_updates(journal_t *journal)
break;
spin_lock(&transaction->t_handle_lock);
- if (!transaction->t_updates) {
+ if (!atomic_read(&transaction->t_updates)) {
spin_unlock(&transaction->t_handle_lock);
break;
}
@@ -1258,7 +1260,8 @@ int jbd2_journal_stop(handle_t *handle)
{
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
- int err;
+ int err, wait_for_commit = 0;
+ tid_t tid;
pid_t pid;
J_ASSERT(journal_current_handle() == handle);
@@ -1266,7 +1269,7 @@ int jbd2_journal_stop(handle_t *handle)
if (is_handle_aborted(handle))
err = -EIO;
else {
- J_ASSERT(transaction->t_updates > 0);
+ J_ASSERT(atomic_read(&transaction->t_updates) > 0);
err = 0;
}
@@ -1334,14 +1337,8 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_sync)
transaction->t_synchronous_commit = 1;
current->journal_info = NULL;
- spin_lock(&transaction->t_handle_lock);
- transaction->t_outstanding_credits -= handle->h_buffer_credits;
- transaction->t_updates--;
- if (!transaction->t_updates) {
- wake_up(&journal->j_wait_updates);
- if (journal->j_barrier_count)
- wake_up(&journal->j_wait_transaction_locked);
- }
+ atomic_sub(handle->h_buffer_credits,
+ &transaction->t_outstanding_credits);
/*
* If the handle is marked SYNC, we need to set another commit
@@ -1350,15 +1347,13 @@ int jbd2_journal_stop(handle_t *handle)
* transaction is too old now.
*/
if (handle->h_sync ||
- transaction->t_outstanding_credits >
- journal->j_max_transaction_buffers ||
- time_after_eq(jiffies, transaction->t_expires)) {
+ (atomic_read(&transaction->t_outstanding_credits) >
+ journal->j_max_transaction_buffers) ||
+ time_after_eq(jiffies, transaction->t_expires)) {
/* Do this even for aborted journals: an abort still
* completes the commit thread, it just doesn't write
* anything to disk. */
- tid_t tid = transaction->t_tid;
- spin_unlock(&transaction->t_handle_lock);
jbd_debug(2, "transaction too old, requesting commit for "
"handle %p\n", handle);
/* This is non-blocking */
@@ -1369,11 +1364,25 @@ int jbd2_journal_stop(handle_t *handle)
* to wait for the commit to complete.
*/
if (handle->h_sync && !(current->flags & PF_MEMALLOC))
- err = jbd2_log_wait_commit(journal, tid);
- } else {
- spin_unlock(&transaction->t_handle_lock);
+ wait_for_commit = 1;
}
+ /*
+ * Once we drop t_updates, if it goes to zero the transaction
+ * could start committing on us and eventually disappear. So
+ * once we do this, we must not dereference transaction
+ * pointer again.
+ */
+ tid = transaction->t_tid;
+ if (atomic_dec_and_test(&transaction->t_updates)) {
+ wake_up(&journal->j_wait_updates);
+ if (journal->j_barrier_count)
+ wake_up(&journal->j_wait_transaction_locked);
+ }
+
+ if (wait_for_commit)
+ err = jbd2_log_wait_commit(journal, tid);
+
lock_map_release(&handle->h_lockdep_map);
jbd2_free_handle(handle);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 5a72bc7..a72ce21 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -601,13 +601,13 @@ struct transaction_s
* Number of outstanding updates running on this transaction
* [t_handle_lock]
*/
- int t_updates;
+ atomic_t t_updates;
/*
* Number of buffers reserved for use by all handles in this transaction
* handle but not yet modified. [t_handle_lock]
*/
- int t_outstanding_credits;
+ atomic_t t_outstanding_credits;
/*
* Forward and backward links for the circular list of all transactions
@@ -1258,8 +1258,8 @@ static inline int jbd_space_needed(journal_t *journal)
{
int nblocks = journal->j_max_transaction_buffers;
if (journal->j_committing_transaction)
- nblocks += journal->j_committing_transaction->
- t_outstanding_credits;
+ nblocks += atomic_read(&journal->j_committing_transaction->
+ t_outstanding_credits);
return nblocks;
}
--
1.7.0.4
On Mon, 2010-08-02 at 08:48 -0400, Theodore Ts'o wrote:
> By using an atomic_t for t_updates and t_outstanding_credits, this
> should allow us to avoid taking the transaction's t_handle_lock in
> jbd2_journal_stop().
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Keith Maanthey <[email protected]>
> Cc: Eric Whitney <[email protected]>
> ---
>
> It would be interesting to get some quick performance tests with and
> without this patch on (a) dbench w/ PREEMPT-RT, (b) Keith's 8-way, and
> (c) Eric's 48-core machine. I have some other ideas for how to improve
> things in start_this_handle(), but they're going to be trickier, and I
> want to do some hard testing using xfstests on a big machine to make
> sure this is really safe before I keep going. This has passed light
> testing, but I haven't had a chance to do stress tests on a large SMP
> machine yet. That'll come later this week, hopefully.
Some quick and dirty numbers. I'd not trust them too far, but they
give a basic feel for how things are doing.
So this is dbench #s on an 8-core blade:
                                1     2     4     8
vfs+j_state lock              401   724  1203  1142
vfs+j_state lock+atomic       408   745  1266  1263
no vfs, no ext4 patches       410   711  1071   434
no vfs, j_state lock          411   714  1113   806
no vfs, j_state lock+atomic   411   723  1149   816
Graphically:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33-rt-ext4-atomic/preempt-rt-ext4-dbench.png
So the vfs-scalability patchset was reverted from -rt due to some
difficult stability issues that were seen in testing. Hopefully we'll be
able to sort them out when -rt moves forward to 2.6.35 or later and we
can use Nick's more cleaned-up current queue rather than my poor forward
port of his work from last September.
Anyway, I wanted to show how this patch affects dbench numbers on -rt
both with and without the vfs-scalability patches and the earlier
j_state lock change.
The vfs versions use the 2.6.33-rt23 kernel, and the non vfs use the
2.6.33-rt26 kernel.
From these numbers, it looks like the atomic variables are a minor
improvement for -rt, but the improvement isn't as drastic as the earlier
j_state lock change, or the vfs scalability patchset.
Didn't run into any troubles with this patch in my testing, but again,
it was a fairly quick set of runs.
thanks
-john
On Mon, Aug 02, 2010 at 04:02:32PM -0700, john stultz wrote:
> From these numbers, it looks like the atomic variables are a minor
> improvement for -rt, but the improvement isn't as drastic as the earlier
> j_state lock change, or the vfs scalability patchset.
Thanks for doing this quick test run! I was expecting to see a more
dramatic difference, since the j_state_lock patch removed one of the
two global locks in jbd2_journal_stop, and the t_handle_lock patch
removed the second of the two global locks. But I guess the
j_state_lock contention in start_this_handle() is still the dominating factor.
It's interesting that apparently the latest t_handle_lock patch
doesn't seem to make much difference unless the VFS scalability patch
is also applied. I'm not sure why that makes a difference, but it's
nice to know that with the VFS scalability patch it does seem to help,
even if it doesn't help as much as I had hoped.
OK, I guess we'll have to start working on the more aggressive
scalability fix ups....
- Ted
On Mon, 2010-08-02 at 20:06 -0400, Ted Ts'o wrote:
> On Mon, Aug 02, 2010 at 04:02:32PM -0700, john stultz wrote:
> > From these numbers, it looks like the atomic variables are a minor
> > improvement for -rt, but the improvement isn't as drastic as the earlier
> > j_state lock change, or the vfs scalability patchset.
>
> Thanks for doing this quick test run! I was expecting to see a more
> dramatic difference, since the j_state_lock patch removed one of the
> two global locks in jbd2_journal_stop, and the t_handle_lock patch
> removed the second of the two global locks. But I guess the
> j_state_lock contention in start_this_handle() is still the dominating factor.
>
> It's interesting that apparently the latest t_handle_lock patch
> doesn't seem to make much difference unless the VFS scalability patch
> is also applied. I'm not sure why that makes a difference, but it's
> nice to know that with the VFS scalability patch it does seem to help,
> even if it doesn't help as much as I had hoped.
Well, it's likely that with the -rt kernel and without the
vfs-scalability changes, we're just burning way more time on vfs lock
contention than we are on anything in the ext4 code. Just a theory, but
I can try to verify with perf logs if you'd like.
> OK, I guess we'll have to start working on the more aggressive
> scalability fix ups....
I've generated mainline results with and without Nick's current
vfs-scalability tree. So far any benefit from the atomic patch seems to
be < 1% there, but I'm probably not hitting much contention at only 8 cores:
2.6.35-rc6
Throughput 2345.72 MB/sec 8 procs
Throughput 1424.11 MB/sec 4 procs
Throughput 811.371 MB/sec 2 procs
Throughput 444.129 MB/sec 1 procs
2.6.35-rc6 + atomic
Throughput 2354.66 MB/sec 8 procs
Throughput 1427.64 MB/sec 4 procs
Throughput 794.961 MB/sec 2 procs
Throughput 443.464 MB/sec 1 procs
2.6.35-rc6-vfs
Throughput 2639.04 MB/sec 8 procs
Throughput 1583.28 MB/sec 4 procs
Throughput 858.337 MB/sec 2 procs
Throughput 452.774 MB/sec 1 procs
2.6.35-rc6-vfs + atomic
Throughput 2648.42 MB/sec 8 procs
Throughput 1586.68 MB/sec 4 procs
Throughput 851.545 MB/sec 2 procs
Throughput 453.106 MB/sec 1 procs
thanks
-john
On Mon, 2010-08-02 at 17:53 -0700, john stultz wrote:
> On Mon, 2010-08-02 at 20:06 -0400, Ted Ts'o wrote:
> > On Mon, Aug 02, 2010 at 04:02:32PM -0700, john stultz wrote:
> > > From these numbers, it looks like the atomic variables are a minor
> > > improvement for -rt, but the improvement isn't as drastic as the earlier
> > > j_state lock change, or the vfs scalability patchset.
> >
> > Thanks for doing this quick test run! I was expecting to see a more
> > dramatic difference, since the j_state_lock patch removed one of the
> > two global locks in jbd2_journal_stop, and the t_handle_lock patch
> > removed the second of the two global locks. But I guess the
> > j_state_lock contention in start_this_handle() is still the dominating factor.
> >
> > It's interesting that apparently the latest t_handle_lock patch
> > doesn't seem to make much difference unless the VFS scalability patch
> > is also applied. I'm not sure why that makes a difference, but it's
> > nice to know that with the VFS scalability patch it does seem to help,
> > even if it doesn't help as much as I had hoped.
>
> Well, it's likely that with the -rt kernel and without the
> vfs-scalability changes, we're just burning way more time on vfs lock
> contention than we are on anything in the ext4 code. Just a theory, but
> I can try to verify with perf logs if you'd like.
I went ahead and generated perf data for the PREEMPT_RT cases that you
can find here:
http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.33-rt-ext4-atomic/
As a reminder the 2.6.33.5-rt23 kernel includes both the vfs-scalability
patches and the j_state locking change. The 2.6.33.6-rt26 kernel does
not include those changes.
From those logs you can see that the atomic change on top of the vfs
patch (ie: comparing the two 2.6.33.5-rt23 logs) pulls start_this_handle
down quite a bit.
With the non-vfs scalability patched kernels, we see that the j_state
lock and atomic changes pull start_this_handle out of the top contender
spot, but there is still quite a large amount of contention on the
dput paths.
So yeah, the change does help, but it's just not the top cause of
contention when we aren't using the vfs patches, so we don't see as much
benefit at this point.
thanks
-john
On Mon, Aug 02, 2010 at 07:52:29PM -0700, john stultz wrote:
> With the non-vfs scalability patched kernels, we see that the j_state
> lock and atomic changes pull start_this_handle out of the top contender
> spot, but there is still quite a large amount of contention on the
> dput paths.
>
> So yeah, the change does help, but it's just not the top cause of
> contention when we aren't using the vfs patches, so we don't see as much
> benefit at this point.
Great, thanks for uploading the lockstats. Since dbench is so
metadata heavy, it makes a lot of sense that further jbd2
optimizations probably won't make much difference until the VFS
bottlenecks can be solved.
Other benchmarks, such as the FFSB benchmarks used by Steven Pratt and
Eric Whitney, would probably show more of a difference.
In any case, I've just sent two more patches which completely remove
any exclusive spinlocks from start_this_handle() by converting
j_state_lock to a rwlock_t, and dropping the need to take
t_handle_lock. This will add more cache line bouncing, so on NUMA
workloads this may make things worse, but I guess we'll have to see.
Anyone have access to an SGI Altix? I'm assuming the old Sequent NUMA
boxes are long gone by now...
- Ted
Ted Ts'o wrote:
> On Mon, Aug 02, 2010 at 07:52:29PM -0700, john stultz wrote:
>> With the non-vfs scalability patched kernels, we see that the j_state
>> lock and atomic changes pull start_this_handle out of the top contender
>> spot, but there is still quite a large amount of contention on the
>> dput paths.
>>
>> So yeah, the change does help, but it's just not the top cause of
>> contention when we aren't using the vfs patches, so we don't see as much
>> benefit at this point.
>
> Great, thanks for uploading the lockstats. Since dbench is so
> metadata heavy, it makes a lot of sense that further jbd2
> optimizations probably won't make much difference until the VFS
> bottlenecks can be solved.
>
> Other benchmarks, such as the FFSB benchmarks used by Steven Pratt and
> Eric Whitney, would probably show more of a difference.
>
> In any case, I've just sent two more patches which completely remove
> any exclusive spinlocks from start_this_handle() by converting
> j_state_lock to a rwlock_t, and dropping the need to take
> t_handle_lock. This will add more cache line bouncing, so on NUMA
> workloads this may make things worse, but I guess we'll have to see.
> Anyone have access to an SGI Altix? I'm assuming the old Sequent NUMA
> boxes are long gone by now...
>
> - Ted
Ted:
The 48 core system I'm running on is an eight node NUMA box with three
hop worst case latency, and it tends to let you know if you're bouncing
cache lines too enthusiastically. If someone's got access to a larger
system with more hops across the topology, that would naturally be even
better.
I'm taking my 2.6.35 ext4 baseline now. It takes me about 36 hours of
running time to get in a complete set of runs, so with luck I should
have data and lockstats to post in a few days.
Eric
P.S. And yes, I think I'll make a set of non-accidental no-journal runs
this time as well...