2010-04-07 23:21:24

by john stultz

Subject: ext4 dbench performance with CONFIG_PREEMPT_RT

I've recently been working on scalability issues with the PREEMPT_RT
kernel patch, and one area I've hit has been ext4 journal j_state_lock
contention.

With the PREEMPT_RT kernel, most spinlocks become pi-aware mutexes, so
they sleep when the lock cannot be acquired. This means lock contention
has a *much* greater impact on the PREEMPT_RT kernel than on mainline.
Thus scalability issues hit the PREEMPT_RT kernel at lower cpu counts
than they hit mainline.

Now, you might not care about PREEMPT_RT, but consider that any lock
contention will get worse in mainline as the cpu count increases. So in
this way, the PREEMPT_RT kernel allows us to see what scalability issues
are on the horizon for mainline kernels running on larger systems.

When running dbench on ext4 with the -rt kernel, I saw a fairly severe
performance drop-off (~60%) going from 4 to 8 clients (on an 8-cpu
machine).

Looking at the perf log, I could see there was quite a bit of lock
contention in the jbd2 start_this_handle/jbd2_journal_stop code paths:

27.39% dbench [kernel] [k] _raw_spin_lock_irqsave
|
|--90.91%-- rt_spin_lock_slowlock
| rt_spin_lock
| |
| |--66.92%-- start_this_handle
| | jbd2_journal_start
| | ext4_journal_start_sb
| | |
...
| |
| |--32.31%-- jbd2_journal_stop
| | __ext4_journal_stop
| | |
| | |--92.86%-- ext4_da_write_end
| | | generic_file_buffered_write

Full perf log here:
http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.33/linux-2.6.33-rt11-vfs-ext4-8cpu.log

The perf logs for mainline showed similar contention, but at 8 cpus it
was not severe enough to drastically hurt performance in the same way.

Further, using lockstat I was able to isolate the contention down to
the journal j_state_lock, and then, adding some lock owner tracking, I
was able to see that when we saw contention the lock owners were almost
always in start_this_handle and jbd2_journal_stop (with the frequency
breakdown being about 55% in jbd2_journal_stop and 45% in
start_this_handle).


Now, I'm very new to this code, and I don't fully understand the locking
rules. There is some documentation in the journal_s structure about
what's protected by the j_state_lock, but I'm not sure if it's 100%
correct or not.

For the most part, in jbd2_journal_stop we're only reading values in the
journal_t struct, so, throwing caution to the wind, I simply removed the
locking there (trying to be careful in the one case of a structure write
to use cmpxchg so it's sort of an atomic write).
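
(For reference, the cmpxchg trick I mean is roughly the following.
This is just a sketch of the idea, not the actual hack; the helper
name and the word being updated are made up for illustration.)

/* Lock-free read-modify-write of a single word via cmpxchg(): read,
 * compute the new value, and retry if someone else changed the word
 * underneath us in the meantime. */
static void update_word_locklessly(unsigned long *word, long delta)
{
        unsigned long old, new;

        do {
                old = ACCESS_ONCE(*word);
                new = old + delta;
        } while (cmpxchg(word, old, new) != old);
}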

Doing this more than doubled performance (a 2.5x increase).

Now, I know it's terribly, terribly wrong, probably in more ways than I
imagine, and I'm in no way suggesting this "locks are for data integrity
nerds, look at this performance!" mentality is valid. But it does show
exactly where contention is hurting us, and what might be achievable if
we can rework some of the locking here.


The terrible terrible disk-eating* patch can be found here:
http://sr71.net/~jstultz/dbench-scalability/patches/2.6.33-ext4-hack/state_lock_hack.patch

Here's a chart comparing 2.6.33 vs 2.6.33-rt11-vfs both with and
without the terrible terrible disk-eating* locking hack:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33-ext4-hack/2.6.33-ext4-hack.png


All the perf logs for the 4 cases above:
http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.33-ext4-hack/


As you can see in the charts, the hack doesn't seem to give any benefit
to mainline at only 8 cpus, but looking at the perf logs it does seem to
lower the overhead of the journal code.

The charts also show that start_this_handle still sees some contention,
so there are still gains to be had by reworking the j_state_lock.


So questions from there are:
1) Some of the values protected by the j_state_lock are protected only
for quick modification or reading. Could these be converted to
atomic_t? If not, why? (There's a rough sketch of what I mean below.)

2) Alternatively, can some of the values protected by the j_state_lock
which seem to be mostly read-type values be converted to something like
a seqlock? Again, I'm looking more for the locking-dependency reasons
than just a yes or no. (Also sketched below.)

3) Is RCU too crazy for filesystems?
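
To make questions 1 and 2 concrete, here are rough sketches of the
kind of conversion I'm asking about. These are purely illustrative:
any name not already in journal_t/transaction_t (j_state_seqlock,
j_some_state) is made up, and I'm not claiming either conversion is
actually safe -- that's exactly the question.

/* Question 1: if, say, transaction->t_updates were an atomic_t
 * (today it's a plain int manipulated under locks), the counter
 * updates could stand on their own: */

        /* in start_this_handle() */
        atomic_inc(&transaction->t_updates);

        /* in jbd2_journal_stop() */
        if (atomic_dec_and_test(&transaction->t_updates))
                wake_up(&journal->j_wait_updates);

/* Question 2: the usual seqlock pattern -- writers still serialize,
 * but readers retry instead of blocking: */

        static DEFINE_SEQLOCK(j_state_seqlock);

        /* writer */
        write_seqlock(&j_state_seqlock);
        journal->j_some_state = new_val;
        write_sequnlock(&j_state_seqlock);

        /* reader */
        unsigned seq;
        unsigned long val;

        do {
                seq = read_seqbegin(&j_state_seqlock);
                val = journal->j_some_state;
        } while (read_seqretry(&j_state_seqlock, seq));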


If there's any better documentation on the locking rules here, I'd also
appreciate a pointer.

Any other thoughts or feedback would be greatly appreciated.

thanks
-john

* No disks were actually eaten in the creation of this data. The patch
actually ran fine the whole time, but that said, DON'T TRY IT AT HOME!



2010-04-08 03:46:35

by Theodore Ts'o

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Wed, Apr 07, 2010 at 04:21:18PM -0700, john stultz wrote:
> Further using lockstat I was able to isolate it the contention down to
> the journal j_state_lock, and then adding some lock owner tracking, I
> was able to see that the lock owners were almost always in
> start_this_handle, and jbd2_journal_stop when we saw contention (with
> the freq breakdown being about 55% in jbd2_journal_stop and 45% in
> start_this_handle).

Hmm.... I've taken a very close look at jbd2_journal_stop(), and I
don't think we need to take j_state_lock() at all except if we need to
call jbd2_log_start_commit(). t_outstanding_credits,
h_buffer_credits, and t_updates are all documented (and verified by
me) to be protected by the t_handle_lock spinlock.

So I ***think*** the following might be safe. WARNING! WARNING!! No
real testing done on this patch, other than "it compiles! ship it!!".

I'll let other people review it, and maybe you could give this a run
and see what happens with this patch?

- Ted

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index bfc70f5..e214d68 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1311,7 +1311,6 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_sync)
transaction->t_synchronous_commit = 1;
current->journal_info = NULL;
- spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
transaction->t_outstanding_credits -= handle->h_buffer_credits;
transaction->t_updates--;
@@ -1340,8 +1339,7 @@ int jbd2_journal_stop(handle_t *handle)
jbd_debug(2, "transaction too old, requesting commit for "
"handle %p\n", handle);
/* This is non-blocking */
- __jbd2_log_start_commit(journal, transaction->t_tid);
- spin_unlock(&journal->j_state_lock);
+ jbd2_log_start_commit(journal, transaction->t_tid);

/*
* Special case: JBD2_SYNC synchronous updates require us
@@ -1351,7 +1349,6 @@ int jbd2_journal_stop(handle_t *handle)
err = jbd2_log_wait_commit(journal, tid);
} else {
spin_unlock(&transaction->t_handle_lock);
- spin_unlock(&journal->j_state_lock);
}

lock_map_release(&handle->h_lockdep_map);

2010-04-08 10:18:36

by Theodore Ts'o

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT


On Apr 7, 2010, at 11:46 PM, [email protected] wrote:

> On Wed, Apr 07, 2010 at 04:21:18PM -0700, john stultz wrote:
>> Further using lockstat I was able to isolate it the contention down to
>> the journal j_state_lock, and then adding some lock owner tracking, I
>> was able to see that the lock owners were almost always in
>> start_this_handle, and jbd2_journal_stop when we saw contention (with
>> the freq breakdown being about 55% in jbd2_journal_stop and 45% in
>> start_this_handle).
>
> Hmm.... I've taken a very close look at jbd2_journal_stop(), and I
> don't think we need to take j_state_lock() at all except if we need to
> call jbd2_log_start_commit(). t_outstanding_credits,
> h_buffer_credits, and t_updates are all documented (and verified by
> me) to be protected by the t_handle_lock spinlock.

BTW, it might be possible to remove the need to take t_handle_lock by
converting t_outstanding_credits and t_updates to atomic_t's, but that
might have other performance impacts for other cases. This patch
shouldn't cause any performance regressions, since we're just removing
code. As I said, I'm pretty sure it's safe, but it could use more
review, and I should look at it again with fresh eyes. In the meantime,
it would be great if you could let us know what sort of results you get
with this patch.

- Ted


2010-04-08 20:42:04

by john stultz

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Wed, 2010-04-07 at 23:46 -0400, [email protected] wrote:
> On Wed, Apr 07, 2010 at 04:21:18PM -0700, john stultz wrote:
> > Further using lockstat I was able to isolate it the contention down to
> > the journal j_state_lock, and then adding some lock owner tracking, I
> > was able to see that the lock owners were almost always in
> > start_this_handle, and jbd2_journal_stop when we saw contention (with
> > the freq breakdown being about 55% in jbd2_journal_stop and 45% in
> > start_this_handle).
>
> Hmm.... I've taken a very close look at jbd2_journal_stop(), and I
> don't think we need to take j_state_lock() at all except if we need to
> call jbd2_log_start_commit(). t_outstanding_credits,
> h_buffer_credits, and t_updates are all documented (and verified by
> me) to be protected by the t_handle_lock spinlock.
>
> So I ***think*** the following might be safe. WARNING! WARNING!! No
> real testing done on this patch, other than "it compiles! ship it!!".
>
> I'll let other people review it, and maybe you could give this a run
> and see what happens with this patch?


So this patch seems to match the performance of my hack, and produces
similar perf log output to what I was getting with it.

Very very cool!

I'll continue to play with your patch and see if I can con some
folks with more interesting storage setups to do some testing as well.

Any thoughts for ways to rework the j_state_lock in start_this_handle?
(Now that it's at the top of the contention logs? :)

thanks so much!
-john






2010-04-08 21:11:21

by Theodore Ts'o

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Thu, Apr 08, 2010 at 01:41:57PM -0700, john stultz wrote:
>
> I'll continue to play with your patch and see if I can con some some
> folks with more interesting storage setups to do some testing as well.

You might want to ask djwong to play with it on his nice big
machine. (We don't need a big file system, but we want as many CPUs
as possible, and to use his "mailserver" workload to really stress the
journal. I'd recommend using barrier=0 for additional journal
lock-level stress testing, then trying some forced sysrq-b reboots
and making sure that the filesystem is consistent after the journal
replay.)

I've since done basic two-CPU testing using xfstests under KVM, but
that's really not going to test any locking issues.

> Any thoughts for ways to rework the state_lock in start_this_handle?
> (Now that its at the top of the contention logs? :)

That's going to be much harder. We're going to have to take
j_state_lock at some point inside start_this_handle. We might be able
to decrease the amount of code which is run while the spinlock is
taken, but I very much doubt it's possible to eliminate that spinlock
entirely.
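
Just to illustrate the sort of rework I mean (a sketch only; the names
other than j_state_lock are made up, this is not actual jbd2 code):

/* Do the expensive part outside the lock and only publish the result
 * while holding it, so the hold time stays as short as possible. */
static void publish_result(journal_t *journal)
{
        unsigned long val = compute_new_value(journal); /* hypothetical */

        spin_lock(&journal->j_state_lock);
        journal->j_some_field = val;                    /* hypothetical */
        spin_unlock(&journal->j_state_lock);
}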

Do you have detailed lockstat information showing the hold-time and
wait-time of the j_state_lock (especially in start_this_handle)?

- Ted

2010-04-08 22:37:30

by Mingming Cao

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Wed, 2010-04-07 at 23:46 -0400, [email protected] wrote:
> On Wed, Apr 07, 2010 at 04:21:18PM -0700, john stultz wrote:
> > Further using lockstat I was able to isolate it the contention down to
> > the journal j_state_lock, and then adding some lock owner tracking, I
> > was able to see that the lock owners were almost always in
> > start_this_handle, and jbd2_journal_stop when we saw contention (with
> > the freq breakdown being about 55% in jbd2_journal_stop and 45% in
> > start_this_handle).
>
> Hmm.... I've taken a very close look at jbd2_journal_stop(), and I
> don't think we need to take j_state_lock() at all except if we need to
> call jbd2_log_start_commit(). t_outstanding_credits,
> h_buffer_credits, and t_updates are all documented (and verified by
> me) to be protected by the t_handle_lock spinlock.
>

Seems so. I verified the code, and it looks like we could drop the
j_state_lock() there.


Also, I wonder if we could make journal->j_average_commit_time
atomic, so we could drop the j_state_lock() further in
jbd2_journal_stop()? I'm not sure how much this will improve the rt
kernel, but it might be worth doing since the j_state_lock() seems to
be the hottest one.
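
Something like the following is what I have in mind -- a sketch only,
assuming j_average_commit_time were converted to an atomic64_t (it is
not today), and with the averaging formula shown purely for
illustration:

/* Fold a new commit time into the running average without holding
 * j_state_lock, using a cmpxchg retry loop. */
static void fold_in_commit_time(atomic64_t *avg, u64 commit_time)
{
        u64 old, new;

        do {
                old = atomic64_read(avg);
                new = (commit_time + 3 * old) / 4;
        } while (atomic64_cmpxchg(avg, old, new) != old);
}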


Mingming
> So I ***think*** the following might be safe. WARNING! WARNING!! No
> real testing done on this patch, other than "it compiles! ship it!!".
>
> I'll let other people review it, and maybe you could give this a run
> and see what happens with this patch?
>
> - Ted
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index bfc70f5..e214d68 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1311,7 +1311,6 @@ int jbd2_journal_stop(handle_t *handle)
> if (handle->h_sync)
> transaction->t_synchronous_commit = 1;
> current->journal_info = NULL;
> - spin_lock(&journal->j_state_lock);
> spin_lock(&transaction->t_handle_lock);
> transaction->t_outstanding_credits -= handle->h_buffer_credits;
> transaction->t_updates--;
> @@ -1340,8 +1339,7 @@ int jbd2_journal_stop(handle_t *handle)
> jbd_debug(2, "transaction too old, requesting commit for "
> "handle %p\n", handle);
> /* This is non-blocking */
> - __jbd2_log_start_commit(journal, transaction->t_tid);
> - spin_unlock(&journal->j_state_lock);
> + jbd2_log_start_commit(journal, transaction->t_tid);
>
> /*
> * Special case: JBD2_SYNC synchronous updates require us
> @@ -1351,7 +1349,6 @@ int jbd2_journal_stop(handle_t *handle)
> err = jbd2_log_wait_commit(journal, tid);
> } else {
> spin_unlock(&transaction->t_handle_lock);
> - spin_unlock(&journal->j_state_lock);
> }
>
> lock_map_release(&handle->h_lockdep_map);



2010-04-09 15:50:01

by Andi Kleen

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

john stultz <[email protected]> writes:
>
> Further using lockstat I was able to isolate it the contention down to
> the journal j_state_lock, and then adding some lock owner tracking, I
> was able to see that the lock owners were almost always in
> start_this_handle, and jbd2_journal_stop when we saw contention (with
> the freq breakdown being about 55% in jbd2_journal_stop and 45% in
> start_this_handle).

FWIW we've also been seeing this on larger systems without RT.
The journal locks are the number one contention point in some workloads.
So it's not just an RT problem.

-Andi

--
[email protected] -- Speaking for myself only.

2010-04-09 23:33:38

by Theodore Ts'o

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Fri, Apr 09, 2010 at 05:49:54PM +0200, Andi Kleen wrote:
> john stultz <[email protected]> writes:
> >
> > Further using lockstat I was able to isolate it the contention down to
> > the journal j_state_lock, and then adding some lock owner tracking, I
> > was able to see that the lock owners were almost always in
> > start_this_handle, and jbd2_journal_stop when we saw contention (with
> > the freq breakdown being about 55% in jbd2_journal_stop and 45% in
> > start_this_handle).
>
> FWIW we've been also seeing this on larger systems without RT.
> The journal locks are the number one contention in some workloads.
> So it's not just a RT problem.

Yeah, I'm very much aware of that. What worries me is that locking
problems in the jbd2 layer could be very hard to debug, so we need to
make sure we have some really good testing as we make any changes.

Not taking the j_state_lock spinlock in jbd2_journal_stop() was
relatively easy to prove to be safe, but I'm really worried about
start_this_handle(); the locking around that is going to be subtle, and
it's not just about the specific fields in the transaction and journal
handle.

And even with the jbd2_journal_stop() change, I'd really prefer some
pretty exhaustive testing, including power-fail testing, just to make
sure we're safe in practice when/if we make more subtle or more invasive
changes to the jbd2 layer...

So I'm not waving the red flag, but the yellow flag (as they would say
in auto racing circles).

Regards,

- Ted

2010-04-09 23:48:25

by Chen, Tim C

Subject: RE: ext4 dbench performance with CONFIG_PREEMPT_RT



>[email protected] wrote
>
>Yeah, I'm very much aware of that. What worries me is that locking
>problems in the jbd2 layer could be very hard to debug, so we need to
>make sure we have some really good testing as we make any changes.
>
>Not taking the j_state_lock spinlock in jbd2_stop_lock() was relatively
>easy to prove to be safe, but I'm really worried about
>start_this_handle() the locking around that is going to be subtle, and
>it's not just the specific fields in the transaction and journal
>handle.
>
>And even with the jbd2_stop_lock() change, I'd really prefer some
>pretty exhaustive testing, including power fail testing, just to make
>sure we're in practice when/if we make more subtle or more invasive
>changes to the jbd2 layer...
>
>So I'm mot waving the red flag, but the yellow flag (as they would say
>in auto racing circles).
>

Your patch did remove the contention on the j_state_lock for dbench
in my testing with 64 threads. The contention point now
moves to the dcache_lock, which is another tricky bottleneck.

In our other testing with FFSB, which creates/renames/removes a lot of
directories, we found that journal->j_revoke_lock was also heavily contended.

Tim

2010-04-09 23:57:43

by john stultz

Subject: RE: ext4 dbench performance with CONFIG_PREEMPT_RT

On Fri, 2010-04-09 at 17:48 -0600, Chen, Tim C wrote:
>
> >[email protected] wrote
> >
> >Yeah, I'm very much aware of that. What worries me is that locking
> >problems in the jbd2 layer could be very hard to debug, so we need to
> >make sure we have some really good testing as we make any changes.
> >
> >Not taking the j_state_lock spinlock in jbd2_stop_lock() was relatively
> >easy to prove to be safe, but I'm really worried about
> >start_this_handle() the locking around that is going to be subtle, and
> >it's not just the specific fields in the transaction and journal
> >handle.
> >
> >And even with the jbd2_stop_lock() change, I'd really prefer some
> >pretty exhaustive testing, including power fail testing, just to make
> >sure we're in practice when/if we make more subtle or more invasive
> >changes to the jbd2 layer...
> >
> >So I'm mot waving the red flag, but the yellow flag (as they would say
> >in auto racing circles).
> >
>
> Your patch did remove the contention on the j_state_lock for dbench
> in my testing with 64 threads. The contention point now
> moves dcache_lock, which is also another tricky bottleneck.

Nick Piggin's vfs scalability patches take care of the dcache_lock
contention. I'm actually using them with the -rt patch in my testing
here.


> In our other testing with FFSB that creates/rename/remove a lot of directories,
> we found that journal->j_revoke_lock was also heavily contended.

Yep. This also shows up in my -rt patch testing with Ted's patch.

thanks
-john



2010-04-10 11:59:28

by Theodore Ts'o

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Fri, Apr 09, 2010 at 05:48:23PM -0600, Chen, Tim C wrote:
>
> Your patch did remove the contention on the j_state_lock for dbench
> in my testing with 64 threads. The contention point now
> moves dcache_lock, which is also another tricky bottleneck.
>
> In our other testing with FFSB that creates/rename/remove a lot of directories,
> we found that journal->j_revoke_lock was also heavily contended.
>

Do you have lock_stat reports handy? I'd love to take a look at them.

Thanks,

- Ted

2010-04-12 19:46:29

by Jan Kara

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

> On Wed, Apr 07, 2010 at 04:21:18PM -0700, john stultz wrote:
> > Further using lockstat I was able to isolate it the contention down to
> > the journal j_state_lock, and then adding some lock owner tracking, I
> > was able to see that the lock owners were almost always in
> > start_this_handle, and jbd2_journal_stop when we saw contention (with
> > the freq breakdown being about 55% in jbd2_journal_stop and 45% in
> > start_this_handle).
>
> Hmm.... I've taken a very close look at jbd2_journal_stop(), and I
> don't think we need to take j_state_lock() at all except if we need to
> call jbd2_log_start_commit(). t_outstanding_credits,
> h_buffer_credits, and t_updates are all documented (and verified by
> me) to be protected by the t_handle_lock spinlock.
>
> So I ***think*** the following might be safe. WARNING! WARNING!! No
> real testing done on this patch, other than "it compiles! ship it!!".
>
> I'll let other people review it, and maybe you could give this a run
> and see what happens with this patch?
I had a look and it really seems that t_handle_lock is enough for
all we do in jbd2_journal_stop (except for jbd2_log_start_commit). So
I think your patch is fine.
I also had a look at jbd2_journal_start. What probably makes things bad there
is that lots of threads accumulate waiting for the transaction to get out of
the T_LOCKED state. When that happens, all the threads are woken up and start
pounding on the j_state_lock, which creates contention. This is just a theory
and I might be completely wrong... Some lockstat data would be useful to
confirm / refute this.

Honza

> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index bfc70f5..e214d68 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1311,7 +1311,6 @@ int jbd2_journal_stop(handle_t *handle)
> if (handle->h_sync)
> transaction->t_synchronous_commit = 1;
> current->journal_info = NULL;
> - spin_lock(&journal->j_state_lock);
> spin_lock(&transaction->t_handle_lock);
> transaction->t_outstanding_credits -= handle->h_buffer_credits;
> transaction->t_updates--;
> @@ -1340,8 +1339,7 @@ int jbd2_journal_stop(handle_t *handle)
> jbd_debug(2, "transaction too old, requesting commit for "
> "handle %p\n", handle);
> /* This is non-blocking */
> - __jbd2_log_start_commit(journal, transaction->t_tid);
> - spin_unlock(&journal->j_state_lock);
> + jbd2_log_start_commit(journal, transaction->t_tid);
>
> /*
> * Special case: JBD2_SYNC synchronous updates require us
> @@ -1351,7 +1349,6 @@ int jbd2_journal_stop(handle_t *handle)
> err = jbd2_log_wait_commit(journal, tid);
> } else {
> spin_unlock(&transaction->t_handle_lock);
> - spin_unlock(&journal->j_state_lock);
> }
>
> lock_map_release(&handle->h_lockdep_map);
--
Jan Kara <[email protected]>
SuSE CR Labs

2010-04-12 19:54:30

by Chen, Tim C

Subject: RE: ext4 dbench performance with CONFIG_PREEMPT_RT

>[email protected] wrote
>
>On Fri, Apr 09, 2010 at 05:48:23PM -0600, Chen, Tim C wrote:
>>
>> Your patch did remove the contention on the j_state_lock for dbench
>> in my testing with 64 threads. The contention point now
>> moves dcache_lock, which is also another tricky bottleneck.
>>
>> In our other testing with FFSB that creates/rename/remove a
>lot of directories,
>> we found that journal->j_revoke_lock was also heavily contended.
>>
>
>Do you have lock_stat reports handy? I'd love to take a look at them.
>
>Thanks,
>

Here's the lock_stat for dbench and ffsb respectively.

-- Tim

lock_stat for dbench:

class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
dcache_lock: 6025583 6290348 0.11 19839.44 1798082999.37 6444352 6455257 0.09 1098.09 18850261.22
&(&vfsmount_lock)->rlock: 73901 76686 0.11 60.15 113381.57 1453130 1729218 0.11 77.81 1256803.55
&(&dentry->d_lock)->rlock: 70544 70816 0.10 81.63 57218.45 3260359 9367952 0.10 87.68 2882612.48
cpufreq_driver_lock: 13957 13957 0.15 16.34 11674.72 191861 191900 0.15 18.25 127363.61
&sb->s_type->i_mutex_key#6: 11469 11471 9.81 23837.58 15425101.39 19705 19832 0.99 12681.28 9526135.06
rcu_node_level_1: 6074 6078 0.14 119.79 61347.68 195893 526772 0.09 107.31 250906.16
&(&sbinfo->stat_lock)->rlock: 4119 4201 0.11 58.20 5152.25 170931 485773 0.10 74.01 171147.12
clockevents_lock: 1303 1323 0.15 48.76 5762.88 32798 53769 0.46 47.28 132041.92
tick_broadcast_lock: 1072 1077 0.16 44.15 3567.97 51339 71398 0.18 66.81 145312.14
inode_lock: 862 870 0.16 19.43 896.55 122469 189716 0.10 39.86 132519.21

lock_stat for ffsb:

class name con-bounces contentions waittime-min waittime-max waittime-total acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
&(&journal->j_revoke_lock)->rlock#2: 2776732 2797303 0.09 1226.54 25572406.37 5534106 7180109 0.08 940.11 4072313.72
&(&inode->i_data.private_lock)->rlock: 877686 881325 0.10 1012.54 6852161.82 3877066 5037811 0.08 953.17 2419421.96
clockevents_lock: 748984 760650 0.12 1863.26 101406877.20 1060958 1382494 0.42 1308.88 11071530.40
dcache_lock: 388413 409694 0.12 864.01 1468208.41 1434252 2471228 0.09 2655.71 2629938.50
&(&q->__queue_lock)->rlock: 211707 214131 0.11 813.65 1255954.31 464310 2882306 0.08 984.38 3729431.78
&(&journal->j_state_lock)->rlock#2: 206642 206832 0.13 2312.55 3561899.14 1049032 1490515 0.08 2447.20 2079827.95
&(&transaction->t_handle_lock)->rlo#2: 179770 180327 0.11 745.09 438500.40 1724000 1997981 0.08 795.20 1097502.30
&sbi->s_orphan_lock#2: 123150 123161 3.87 41421.17 406113809.87 408087 532092 0.17 36588.86 5151196.68
&(&n->list_lock)->rlock: 116548 117330 0.11 88.35 65548.86 2219789 4243060 0.08 87.43 1324289.71
&(&journal->j_list_lock)->rlock#2: 71612 72458 0.11 814.83 257968.30 833686 3148049 0.08

2010-04-13 03:53:02

by john stultz

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Thu, 2010-04-08 at 17:10 -0400, [email protected] wrote:
> > Any thoughts for ways to rework the state_lock in start_this_handle?
> > (Now that its at the top of the contention logs? :)
>
> That's going to be much harder. We're going to have to take
> j_state_lock at some point inside start_this_handle. We might be able
> to decrease the amount of code which is run while the spinlock is
> taken, but I very much doubt it's possible to eliminate that spinlock
> entirely.
>
> Do you have detailed lockstat information showing the hold-time and
> wait-time of j_lock_stat (especially in start_this_handle)?

Hey Ted,
Sorry this took so long. I've been using a fairly large pile of patches
in my testing on top of -rt, and since with -rt lockstat is less useful
(you don't get any of the contention data for mutexes, and the contended
spinlocks are always the internal rtmutex locks), I tried to regenerate
the data on something closer to plain vanilla.

So I ran dbench with 2.6.33, 2.6.33 + Nick Piggin's VFS scalability
patches, and 2.6.33 + Nick's patches + your state-lock patch on an 8 cpu
system.

Here's the chart of the performance difference:
http://sr71.net/~jstultz/dbench-scalability/graphs/2.6.33-ext4-state-lock/2.6.33_ext4-state-lock.png

Here's the perf log output:
http://sr71.net/~jstultz/dbench-scalability/perflogs/2.6.33-ext4-state-lock/

And finally, as requested, here's the lockstat data:
http://sr71.net/~jstultz/dbench-scalability/lockstat/2.6.33-ext4-state-lock/


Now, again, because the -rt kernel amplifies the contention cost, the
data above doesn't show as much pain at only 8 cpus as we see with -rt.
However, the contention does show up, and your patch helps.

In fact, with your patch, I'm not seeing any major contention in the
perf logs at 8 cpus, although the lockstat logs still show:

t_handle_lock contention in start_this_handle/jbd2_journal_stop
- Likely the j_state_lock was previously serializing this

j_state_lock contention in start_this_handle
- Expected

j_revoke_lock contention in find_revoke_record
- Also observed by Tim Chen



Let me know if there's any other data you'd like to see.

thanks
-john






2010-04-13 16:25:33

by Darren Hart

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

[email protected] wrote:
> On Mon, Apr 12, 2010 at 09:46:28PM +0200, Jan Kara wrote:
>> I also had a look at jbd2_journal_start. What probably makes
>> things bad there is that lots of threads accumulate waiting for
>> transaction to get out of T_LOCKED state. When that happens, all the
>> threads are woken up and start pondering at j_state_lock which
>> creates contention. This is just a theory and I might be completely
>> wrong... Some lockstat data would be useful to confirm / refute
>> this.
>
> Yeah, that sounds right. We do have a classic thundering hurd problem
> when we while are draining handles from the transaction in the
> T_LOCKED state --- that is (for those who aren't jbd2 experts) when it
> comes time to close out the current transaction, one of the first
> things that fs/jbd2/commit.c will do is to set the transaction into
> T_LOCKED state. In that state we are waiting for currently active
> handles to complete, and we don't allow any new handles to start until
> the currently running transaction is completely drained of active
> handles, at which point we can swap in a new transaction, and continue
> the commit process on the previously running transaction.
>
> On a non-real time kernel, the spinlock will tie up the currently
> running CPU's until the transaction drains, which is usually pretty
> fast, since we don't allow transactions to be held for that long (the
> worst case being truncate/unlink operations). Dbench is a worst case,
> though since we have some large number of threads all doing file
> system I/O (John, how was dbench configured?) and the spinlocks will
> no longer tie up a CPU, but actually let some other dbench thread run,
> so it magnifies the thundering hurd problem from 8 threads, to nearly
> all of the CPU threads.


I didn't follow that part - how will dbench prevent threads from
spinning on a spinlock and instead allow other dbench threads to run?

>
> Also, the spinlock code has a "ticket" system which tries to protect
> against the thundering hurd effect --- do the PI mutexes which replace
> spinlocks in the -rt kernel have any technqiue to try to prevent
> scheduler thrashing in the face of thundering hurd scenarios?

Nothing specific per se; however, being a blocking lock, it will put all
the waiters to sleep and then wake them in priority FIFO order as the
lock becomes available. Unless dbench is being run at various priority
levels (I don't think John is doing that), the PI part won't really
come into play. If it were, then we would see some more scheduling
overhead: a high prio task would become available, block on the lock,
and boost the owner, which then would get scheduled to release the lock,
and then the high prio task would schedule back in - but that isn't the
case here.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-14 03:04:53

by john stultz

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Thu, 2010-04-08 at 17:10 -0400, [email protected] wrote:
> On Thu, Apr 08, 2010 at 01:41:57PM -0700, john stultz wrote:
> >
> > I'll continue to play with your patch and see if I can con some some
> > folks with more interesting storage setups to do some testing as well.
>
> You might want to ask djwong to play with it with his nice big
> machine. (We don't need a big file system, but we want as many CPU's
> as possible, and to use his "mailserver" workload to really stress the
> journal. I'd recommend using barrier=0 for additional journal
> lock-level stress testing, and then try some forced sysrq-b reboots
> and then make sure that the filesystem is consistent after the journal
> replay.)

So I did the above on a 16-way that I borrowed from Keith and Darrick.
I got in about 5 power-cycles, and the journal was recovered and reported
clean by fsck on bootup each time.

So, again, this doesn't prove it's right, but nothing blew up.

thanks
-john



2010-04-14 03:56:46

by Theodore Ts'o

Subject: Re: ext4 dbench performance with CONFIG_PREEMPT_RT

On Mon, Apr 12, 2010 at 09:46:28PM +0200, Jan Kara wrote:
> I also had a look at jbd2_journal_start. What probably makes
> things bad there is that lots of threads accumulate waiting for
> transaction to get out of T_LOCKED state. When that happens, all the
> threads are woken up and start pondering at j_state_lock which
> creates contention. This is just a theory and I might be completely
> wrong... Some lockstat data would be useful to confirm / refute
> this.

Yeah, that sounds right. We do have a classic thundering herd problem
while we are draining handles from the transaction in the T_LOCKED
state --- that is (for those who aren't jbd2 experts), when it comes
time to close out the current transaction, one of the first things
that fs/jbd2/commit.c will do is to set the transaction into the
T_LOCKED state. In that state we are waiting for currently active
handles to complete, and we don't allow any new handles to start until
the currently running transaction is completely drained of active
handles, at which point we can swap in a new transaction and continue
the commit process on the previously running transaction.

On a non-real-time kernel, the spinlock will tie up the currently
running CPUs until the transaction drains, which is usually pretty
fast, since we don't allow transactions to be held for that long (the
worst case being truncate/unlink operations). Dbench is a worst case,
though, since we have some large number of threads all doing file
system I/O (John, how was dbench configured?), and on -rt the
spinlocks no longer tie up a CPU but actually let some other dbench
thread run, so it magnifies the thundering herd problem from 8 threads
to nearly all of the dbench threads.

Also, the spinlock code has a "ticket" system which tries to protect
against the thundering herd effect --- do the PI mutexes which replace
spinlocks in the -rt kernel have any technique to try to prevent
scheduler thrashing in the face of thundering herd scenarios?
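
(For those who haven't looked at it, the ticket scheme is roughly the
following -- a conceptual sketch only, ignoring memory-ordering and
paravirt details, not the actual x86 implementation:)

struct ticket_lock {
        atomic_t next_ticket;   /* next ticket to hand out */
        atomic_t now_serving;   /* ticket currently allowed in */
};                              /* both fields start at ATOMIC_INIT(0) */

static void ticket_lock(struct ticket_lock *l)
{
        int ticket = atomic_inc_return(&l->next_ticket) - 1;

        while (atomic_read(&l->now_serving) != ticket)
                cpu_relax();    /* waiters acquire the lock in FIFO order */
}

static void ticket_unlock(struct ticket_lock *l)
{
        atomic_inc(&l->now_serving);    /* hand the lock to the next waiter */
}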

- Ted

2010-06-02 22:34:15

by Eric Whitney

Subject: j_state_lock patch data (was: Re: ext4 dbench performance with CONFIG_PREEMPT_RT)

Weeks ago, Ted asked if someone would be willing to run Steven Pratt's
and Keith Mannthey's ffsb filesystem workloads (as found at
http://btrfs.boxacle.net) on a large system running a non-RT kernel with
and without Ted's j_state_lock patch. In a previous thread, John Stultz
had identified the j_state_lock as an important scalability limiter for
RT kernels running ext4 on 8 core systems.

Ted wanted lock_stats as well as performance data, and I've collected
those on a 48 core x86-64 system running a 2.6.34 kernel. (The
j_state_lock patch is now in 2.6.35, and I'll try to collect fresh data
in the near future.)

Since my test system was bigger than the system Keith has been using
recently, I adjusted the thread counts, running 1, 48 (1 thread per
core), and 192 (4 threads per core) ffsb threads in my version of his
workloads.

The adjusted "boxacle" large_file_creates and random_writes workloads
benefited significantly from the patch on my test configuration. Three
sets of uninstrumented runs for each workload yielded throughput results
that were generally consistent and close to those obtained for the
lock_stat runs. (There were a few outliers - fewer than 10% of the
total - that showed throughput declines of more than 50% from the rest
of the runs.)

Using statistics taken in the lock_stat runs:

The ffsb transaction rate for large_file_create at 48 threads improved
85% with the patch over the unmodified baseline, while the number of
contentions dropped 29%. At 192 threads, the transaction rate improved
46%, and the number of contentions dropped 44%.

The ffsb transaction rate for random_writes at 48 threads improved 35%
with the patch over the unmodified baseline, while the number of
contentions dropped 8%. At 192 threads, the transaction rate improved
29%, and the number of contentions dropped 3%.

The single thread workloads and all versions of the mail_server workload
were not materially affected by the patch. (The remaining workloads are
read-only, and also were not affected.)

On my test configuration, the lock_stats in the patched cases show the
j_state_lock is still the most heavily contended, and the system is
usually CPU-bound in system mode.

Detailed data for the large_file_create and random_write lock_stat runs,
including lock_stats, vmstats, ffsb reports, logs, the patch, etc., can
be found at:
http://free.linux.hp.com/~enw/jbd2scaling/2.6.34

At that location, there are four directories containing the detailed
information for the four relevant lock_stat runs:
large-file-creates
large-file-creates.patched
random_writes
random_writes.patched

There are also a set of links at the top level to make it easier to
compare lock_stat data without navigating through those directories -
the names simply begin with "lock_stat".

The patch used is also there: j_state_lock.patch.txt

Test system configuration:

* 48 core x86-64 server (HP Proliant DL785) with 256 GB of memory
* One Fibre Channel-attached 4 Gb/second MSA2000 RAID controller
* Single 586 GB hardware RAID0 volume consisting of four disks
* e2fsprogs v1.41.11
* nobarrier mount option
* deadline I/O scheduler

Thanks to Steven and Keith for their workloads and benchmarking data.

Eric






[email protected] wrote:
> On Mon, Apr 12, 2010 at 09:46:28PM +0200, Jan Kara wrote:
>> I also had a look at jbd2_journal_start. What probably makes
>> things bad there is that lots of threads accumulate waiting for
>> transaction to get out of T_LOCKED state. When that happens, all the
>> threads are woken up and start pondering at j_state_lock which
>> creates contention. This is just a theory and I might be completely
>> wrong... Some lockstat data would be useful to confirm / refute
>> this.
>
> Yeah, that sounds right. We do have a classic thundering hurd problem
> when we while are draining handles from the transaction in the
> T_LOCKED state --- that is (for those who aren't jbd2 experts) when it
> comes time to close out the current transaction, one of the first
> things that fs/jbd2/commit.c will do is to set the transaction into
> T_LOCKED state. In that state we are waiting for currently active
> handles to complete, and we don't allow any new handles to start until
> the currently running transaction is completely drained of active
> handles, at which point we can swap in a new transaction, and continue
> the commit process on the previously running transaction.
>
> On a non-real time kernel, the spinlock will tie up the currently
> running CPU's until the transaction drains, which is usually pretty
> fast, since we don't allow transactions to be held for that long (the
> worst case being truncate/unlink operations). Dbench is a worst case,
> though since we have some large number of threads all doing file
> system I/O (John, how was dbench configured?) and the spinlocks will
> no longer tie up a CPU, but actually let some other dbench thread run,
> so it magnifies the thundering hurd problem from 8 threads, to nearly
> all of the CPU threads.
>
> Also, the spinlock code has a "ticket" system which tries to protect
> against the thundering hurd effect --- do the PI mutexes which replace
> spinlocks in the -rt kernel have any technqiue to try to prevent
> scheduler thrashing in the face of thundering hurd scenarios?
>
> - Ted