2008-10-07 13:14:08

by Theodore Ts'o

Subject: jbd/jbd2 performance improvements


As I mentioned on the ext4 call yesterday, there was an interesting
thread on LKML that wasn't cc'ed onto the linux-ext4 mailing list. So
in case folks missed it, it might be worth taking a look at this mail
thread:

[PATCH] Give kjournald a IOPRIO_CLASS_RT io priority

http://lkml.org/lkml/2008/10/1/405
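For anyone who hasn't clicked through yet: the patch boils down to running
kjournald in the CFQ real-time I/O class. Purely as an illustration of what
that class is, here is a userspace sketch of the ioprio_set syscall -- this
is not the kernel patch itself; the constants mirror the kernel's ioprio ABI
and the target pid is just a command-line placeholder (util-linux's ionice
does the same thing from the shell):

/* Userspace sketch: put an existing process (e.g. kjournald's pid) into
 * the CFQ real-time I/O class via the ioprio_set syscall.  Needs root.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* I/O priority ABI constants (mirroring include/linux/ioprio.h) */
#define IOPRIO_CLASS_SHIFT      13
#define IOPRIO_PRIO_VALUE(c, d) (((c) << IOPRIO_CLASS_SHIFT) | (d))
#define IOPRIO_CLASS_RT         1
#define IOPRIO_WHO_PROCESS      1

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return 1;
        }
        pid_t pid = atoi(argv[1]);      /* placeholder: pid of kjournald */

        /* RT class, highest level (0) within the class */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, pid,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0)) == -1) {
                perror("ioprio_set");
                return 1;
        }
        return 0;
}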

The main issue that got discussed was the age-old "entanglement" problem.
The jbd/jbd2 layer is supposed to avoid this by not blocking the
"current" transaction while the blocks from the previous "committing"
transaction are still being written out to disk. Apparently this was
broken sometime in the 2.5 time-frame:

http://lkml.org/lkml/2008/10/2/41
http://lkml.org/lkml/2008/10/2/322

Later in the thread, a major contention point in do_get_write_access()
was identified as the problem:

http://lkml.org/lkml/2008/10/3/7

... and then Andrew produced the following "hacky" fix:

http://lkml.org/lkml/2008/10/3/22
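To make the stall concrete for anyone who hasn't followed the thread, here
is a toy userspace model -- nothing to do with the real jbd code paths, just
two pthreads and a mutex standing in for the buffer lock, with made-up sleep
times -- showing how a task in the "current" transaction ends up waiting out
the committing transaction's I/O (build with gcc -pthread, plus -lrt on
older glibc):

/* Toy model of the entanglement: a foreground task wanting write access
 * to a metadata buffer stalls behind the "commit" thread that is still
 * writing that same buffer out.  Not the jbd code -- just an illustration
 * of why the stall shows up as foreground latency.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t buffer_lock = PTHREAD_MUTEX_INITIALIZER;

static double now_ms(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static void *commit_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&buffer_lock);   /* buffer belongs to committing txn */
        usleep(200 * 1000);                 /* simulate writing it to disk */
        pthread_mutex_unlock(&buffer_lock);
        return NULL;
}

int main(void)
{
        pthread_t committer;
        pthread_create(&committer, NULL, commit_thread, NULL);
        usleep(10 * 1000);                  /* let the "commit" start first */

        /* Task in the *current* transaction asks for write access to the
         * same buffer and ends up waiting for the commit I/O to finish. */
        double t0 = now_ms();
        pthread_mutex_lock(&buffer_lock);
        printf("got write access after %.1f ms\n", now_ms() - t0);
        pthread_mutex_unlock(&buffer_lock);

        pthread_join(committer, NULL);
        return 0;
}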

If someone has time to run some benchmarks to see how this improves
things, especially on a workload that has plenty of "entanglements",
that would be great. (I bet Ric's fs_mark run should do a good job;
it does fsyncs to create lots of commits, and it needs to modify blocks
that had been modified in the previous transactions.)

If we can get some quick testing done, and it shows really good results,
this could be something we could try fast-tracking into the 2.6.28
merge window.

- Ted



2008-10-15 17:29:57

by Ric Wheeler

Subject: Re: jbd/jbd2 performance improvements

Theodore Ts'o wrote:
> As I mentioned on the ext4 call yesterday, there was an interesting
> thread on LKML that wasn't cc'ed onto the linux-ext4 mailing list. So
> in case folks missed it, it might be worth taking a look at this mail
> thread:
>
> [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
>
> http://lkml.org/lkml/2008/10/1/405
>
> The main issue that got discussed was the age-old "entanglement" problem.
> The jbd/jbd2 layer is supposed to avoid this by not blocking the
> "current" transaction while the blocks from the previous "committing"
> transaction are still being written out to disk. Apparently this was
> broken sometime in the 2.5 time-frame:
>
> http://lkml.org/lkml/2008/10/2/41
> http://lkml.org/lkml/2008/10/2/322
>
> Later in the thread, a major contention point in do_get_write_access()
> was identified as the problem:
>
> http://lkml.org/lkml/2008/10/3/7
>
> ... and then Andrew produced the following "hacky" fix:
>
> http://lkml.org/lkml/2008/10/3/22
>
> If someone has time to run some benchmarks to see how this improves
> things, especially on a workload that has plenty of "entanglements",
> that would be great. (I bet Ric's fs_mark run should do a good job;
> it does fsyncs to create lots of commits, and it needs to modify blocks
> that had been modified in the previous transactions.)
>
> If we can get some quick testing done, and it shows really good results,
> this could be something we could try fast-tracking into the 2.6.28
> merge window.
>
> - Ted
>

We are going to try and poke at this - do you suspect a single or
multi-threaded test would work best?

Ric


2008-10-16 06:07:50

by Solofo.Ramangalahy

Subject: Re: jbd/jbd2 performance improvements

Hi Ric,

>>>>> On Wed, 15 Oct 2008 13:29:55 -0400, Ric Wheeler <[email protected]> said:
Ric> We are going to try and poke at this - do you suspect a single or
Ric> multi-threaded test would work best?

I've performed some tests:
http://www.bullopensource.org/ext4/20081013-2.6.27-rc9-ext4-1-akpm-fix-run6/
http://www.bullopensource.org/ext4/20081013-2.6.27-rc9-ext4-1-akpm-fix-run6/results_sorted.txt.html

I now realize that the results may not be valid since I used kvm, but
they do show variation wrt. the number of threads.

So you may want to test both single and multi-threaded.

--
solofo

2008-10-16 12:07:49

by Ric Wheeler

Subject: Re: jbd/jbd2 performance improvements

[email protected] wrote:
> Hi Ric,
>
>
>>>>>> On Wed, 15 Oct 2008 13:29:55 -0400, Ric Wheeler <[email protected]> said:
>>>>>>
> Ric> We are going to try and poke at this - do you suspect a single or
> Ric> multi-threaded test would work best?
>
> I've performed some tests:
> http://www.bullopensource.org/ext4/20081013-2.6.27-rc9-ext4-1-akpm-fix-run6/
> http://www.bullopensource.org/ext4/20081013-2.6.27-rc9-ext4-1-akpm-fix-run6/results_sorted.txt.html
>
> I now realize that the results may not be valid since I used kvm, but
> they do show variation wrt. the number of threads.
>
> So you may want to test both single and multi-threaded.
>
>
A very thorough test, but the results don't seem to point to a
consistent winner.

I agree that running without KVM in the picture might be very
interesting. Eric has some similar tests underway; I think that his
results were also inconclusive so far...

Ric



2008-10-16 12:40:00

by Eric Sandeen

Subject: Re: jbd/jbd2 performance improvements

Ric Wheeler wrote:
> [email protected] wrote:
>> Hi Ric,
>>
>>
>>>>>>> On Wed, 15 Oct 2008 13:29:55 -0400, Ric Wheeler <[email protected]> said:
>>>>>>>
>> Ric> We are going to try and poke at this - do you suspect a single or
>> Ric> multi-threaded test would work best?
>>
>> I've performed some tests:
>> http://www.bullopensource.org/ext4/20081013-2.6.27-rc9-ext4-1-akpm-fix-run6/
>> http://www.bullopensource.org/ext4/20081013-2.6.27-rc9-ext4-1-akpm-fix-run6/results_sorted.txt.html
>>
>> I now realize that the results may not be valid since I used kvm, but
>> they do show variation wrt. the number of threads.
>>
>> So you may want to test both single and multi-threaded.
>>
>>
> A very thorough test, but the results don't seem to point to a
> consistent winner.
>
> I agree that running without KVM in the picture might be very
> interesting. Eric has some similar tests underway, I think that his
> results were also inconclusive so far...

Yep, I've yet to find an fs_mark invocation, at least, which shows a
clear winner. I also ran w/ akpm's suggested io_schedule watcher patch
and never saw us waiting on this lock (I did set it to 1s though, which
is probably too long for my storage).

-Eric

2008-10-23 10:50:59

by Solofo.Ramangalahy

Subject: Re: jbd/jbd2 performance improvements

>>>>> On Thu, 16 Oct 2008 07:39:04 -0500, Eric Sandeen <[email protected]> said:
>> A very thorough test, but the results don't seem to point to a
>> consistent winner.
>>
>> I agree that running without KVM in the picture might be very
>> interesting. Eric has some similar tests underway, I think that
>> his results were also inconclusive so far...

Eric> Yep, I've yet to find an fs_mark invocation, at least, which
Eric> shows a clear winner. I also ran w/ akpm's suggested
Eric> io_schedule watcher patch and never see us waiting on this
Eric> lock (I did set it to 1s though, which is probably too long
Eric> for my storage).

I've redone the tests without kvm. Still no clear winner.

To sum up:
. kernel ext4-stable
. mkfs (1.41.3) default options
. mount options: default, akpm, akpm_lock_hack
. scheduler default (cfq)
. 8 cpus, single 15K rpm disk.
. without the high latency detection patch
. a broad range of fs_mark (all the sync strategies, from 1 to 32
threads, up to 10000 files/thread, several directories).
. a "tangled synchrony" workload as mentionned in the "Analysis and
evolution of journaling file systems" paper discussed monday.

First things first, maybe I should have spent more time
reproducing the behavior Arjan reported before testing.

This was not a complete waste of time though, as the following errors
were spotted during the runs:
1. EXT4-fs error (device sdb): ext4_free_inode: bit already cleared for inode 32769
2. EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 8
3. BUG: spinlock wrong CPU on CPU#3, fs_mark/1975
lock: ffff88015a44f480, .magic: dead4ead, .owner: fs_mark/1975, .owner_cpu: 1
Pid: 1975, comm: fs_mark Not tainted 2.6.27.1-ext4-stable-gcov #1

Call Trace:
[<ffffffff811a47a2>] spin_bug+0xa2/0xaa
[<ffffffff811a481f>] _raw_spin_unlock+0x75/0x8a
[<ffffffff814552c1>] _spin_unlock+0x26/0x2a
[<ffffffffa00d4fd3>] ext4_read_inode_bitmap+0xfa/0x14e [ext4]
[<ffffffffa00d564b>] ext4_new_inode+0x5d4/0xec4 [ext4]
[<ffffffff810562db>] ? __lock_acquire+0x481/0x7d8
[<ffffffffa00c2430>] ? jbd2_journal_start+0xef/0x11a [jbd2]
[<ffffffffa00c2430>] ? jbd2_journal_start+0xef/0x11a [jbd2]
[<ffffffffa00deb99>] ext4_create+0xc7/0x144 [ext4]
[<ffffffff810b6734>] vfs_create+0xdf/0x155
[<ffffffff810b8905>] do_filp_open+0x220/0x7fc
[<ffffffff814552c1>] ? _spin_unlock+0x26/0x2a
[<ffffffff810abe5a>] do_sys_open+0x53/0xd3
[<ffffffff810abf03>] sys_open+0x1b/0x1d
[<ffffffff8100bf8b>] system_call_fastpath+0x16/0x1b

Anybody seen this in their logs?

The "bit already cleared for inode" is triggered by:
fs_mark -v -d /mnt/test-ext4 -n10000 -D10 -N1000 -t8 -s4096 -S0

--
solofo


2008-10-23 12:01:57

by Ric Wheeler

Subject: Re: jbd/jbd2 performance improvements

[email protected] wrote:
>>>>>> On Thu, 16 Oct 2008 07:39:04 -0500, Eric Sandeen <[email protected]> said:
>>>>>>
> >> A very thorough test, but the results don't seem to point to a
> >> consistent winner.
> >>
> >> I agree that running without KVM in the picture might be very
> >> interesting. Eric has some similar tests underway, I think that
> >> his results were also inconclusive so far...
>
> Eric> Yep, I've yet to find an fs_mark invocation, at least, which
> Eric> shows a clear winner. I also ran w/ akpm's suggested
> Eric> io_schedule watcher patch and never see us waiting on this
> Eric> lock (I did set it to 1s though, which is probably too long
> Eric> for my storage).
>
> I've redone the tests without kvm. Still no clear winner.
>
> To sum up:
> . kernel ext4-stable
> . mkfs (1.41.3) default options
> . mount options: default, akpm, akpm_lock_hack
> . scheduler default (cfq)
> . 8 cpus, single 15K rpm disk.
> . without the high latency detection patch
> . a broad range of fs_mark (all the sync strategies, from 1 to 32
> threads, up to 10000 files/thread, several directories).
> . a "tangled synchrony" workload as mentionned in the "Analysis and
> evolution of journaling file systems" paper discussed monday.
>
> First things first, maybe I should have spent more time
> reproducing the behavior Arjan reported before testing.
>
> This was not a complete waste of time though, as the following errors
> were spotted during the runs:
> 1. EXT4-fs error (device sdb): ext4_free_inode: bit already cleared for inode 32769
> 2. EXT4-fs error (device sdb): ext4_init_inode_bitmap: Checksum bad for group 8
> 3. BUG: spinlock wrong CPU on CPU#3, fs_mark/1975
> lock: ffff88015a44f480, .magic: dead4ead, .owner: fs_mark/1975, .owner_cpu: 1
> Pid: 1975, comm: fs_mark Not tainted 2.6.27.1-ext4-stable-gcov #1
>
> Call Trace:
> [<ffffffff811a47a2>] spin_bug+0xa2/0xaa
> [<ffffffff811a481f>] _raw_spin_unlock+0x75/0x8a
> [<ffffffff814552c1>] _spin_unlock+0x26/0x2a
> [<ffffffffa00d4fd3>] ext4_read_inode_bitmap+0xfa/0x14e [ext4]
> [<ffffffffa00d564b>] ext4_new_inode+0x5d4/0xec4 [ext4]
> [<ffffffff810562db>] ? __lock_acquire+0x481/0x7d8
> [<ffffffffa00c2430>] ? jbd2_journal_start+0xef/0x11a [jbd2]
> [<ffffffffa00c2430>] ? jbd2_journal_start+0xef/0x11a [jbd2]
> [<ffffffffa00deb99>] ext4_create+0xc7/0x144 [ext4]
> [<ffffffff810b6734>] vfs_create+0xdf/0x155
> [<ffffffff810b8905>] do_filp_open+0x220/0x7fc
> [<ffffffff814552c1>] ? _spin_unlock+0x26/0x2a
> [<ffffffff810abe5a>] do_sys_open+0x53/0xd3
> [<ffffffff810abf03>] sys_open+0x1b/0x1d
> [<ffffffff8100bf8b>] system_call_fastpath+0x16/0x1b
>
> Anybody seen this in their logs?
>
> The "bit already cleared for inode" is triggered by:
> fs_mark -v -d /mnt/test-ext4 -n10000 -D10 -N1000 -t8 -s4096 -S0
>
>
Arjan,

Do you have any details on the test case that you ran that showed a
clear improvement? What kind of storage & IO pattern did you use?

Regards,

Ric


2008-10-23 12:22:20

by Solofo.Ramangalahy

Subject: Re: jbd/jbd2 performance improvements

Ric Wheeler writes:
> Do you have any details on the test case that you ran that showed a
> clear improvement? What kind of storage & IO pattern did you use?

Is it possible to record latencytop output (like top batch mode) to
capture the highest latency during a test run?

Or how did you collect this:
>my reproducer is sadly very simple (claws-mail is my mail client that uses maildir)
>Process claws-mail (4896) Total: 2829.7 msec
>EXT3: Waiting for journal access 2491.0 msec 88.4 %
>Writing back inodes 160.9 msec 5.7 %
>synchronous write 78.8 msec 3.0 %
>is an example of such a trace (this is with patch, without patch the numbers are about 3x bigger)
>Waiting for journal access is "journal_get_write_access"
>Writing back inodes is "writeback_inodes"
>synchronous write is "do_sync_write"
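
On my side I was thinking of simply snapshotting the latencytop procfs
interface during the run -- a rough sketch, assuming CONFIG_LATENCYTOP=y so
that /proc/latency_stats exists; the 1-second interval and the log file name
are arbitrary, and the worst offenders can be grepped out of the log after
the run:

/* Rough sketch: poll /proc/latency_stats (needs CONFIG_LATENCYTOP=y)
 * once a second during a test run and append timestamped snapshots to a
 * log file.  Stop it with ^C when the run is done.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        FILE *log = fopen("latencytop.log", "a");
        if (!log) {
                perror("latencytop.log");
                return 1;
        }

        for (;;) {
                FILE *stats = fopen("/proc/latency_stats", "r");
                if (!stats) {
                        perror("/proc/latency_stats");
                        return 1;
                }

                /* timestamped copy of the current latency statistics */
                fprintf(log, "=== %ld ===\n", (long)time(NULL));
                char line[512];
                while (fgets(line, sizeof(line), stats))
                        fputs(line, log);
                fclose(stats);
                fflush(log);

                sleep(1);
        }
}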


--
solofo