2014-01-13 20:13:22

by Benjamin LaHaise

Subject: high write latency bug in ext3 / jbd in 3.4

Hello all,

I've recently encountered a bug in ext3 where the occasional write is
showing extremely high latency, on the order of 2.2 to 11 seconds compared
to a more typical 200-300ms. This is happening on a 3.4.67 kernel. When
this occurs, the system is writing to disk somewhere between 290-330MB/s.
The test takes anywhere from 3 to 12 minutes into a run to trigger the
high latency write. During one of these high latency writes, vmstat reports
0 blocks being written to disk. The disk array being written to is able to
write quite a bit faster (about 770MB/s).

The setup is a bit complicated, but is completely reproducible. The
workload consists of about 8 worker threads creating and then writing out
spool files that are a little under 8MB in size. After each write, the file
and the directory it is in are then fsync()d. The latency measured is from
the beginning open() of a spool file until the final fsync() completes.

Poking around the system with latencytop shows that sleep_on_buffer() is
where all the latency is coming from, leading to log_wait_commit() showing
the very high latency for the fsync()s. This leads me to believe that jbd
is somehow not properly flushing a buffer being waited on in a timely
fashion. Changing the elevator in use has no effect.

Does anyone have any ideas on where to look in ext3 or jbd for something
that might be causing this behaviour? If I use ext4 to mount the ext3
filesystem being tested, the problem goes away. Testing on newer kernels
is not very easy to do (the system has other dependencies on the 3.4
kernel). Thoughts?

-ben
--
"Thought is the essence of where you are now."


2014-01-13 21:01:36

by Andreas Dilger

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On Jan 13, 2014, at 1:13 PM, Benjamin LaHaise <[email protected]> wrote:
> I've recently encountered a bug in ext3 where the occasional write
> is showing extremely high latency, on the order of 2.2 to 11 seconds
> compared to a more typical 200-300ms. This is happening on a 3.4.67
> kernel. When this occurs, the system is writing to disk somewhere
> between 290-330MB/s. The test takes anywhere from 3 to 12 minutes
> into a run to trigger the high latency write. During one of these
> high latency writes, vmstat reports 0 blocks being written to disk.
> The disk array being written to is able to write quite a bit faster
> (about ~770MB/s).
>
> The setup is a bit complicated, but is completely reproducible.
> The workload consists of about 8 worker threads creating and then
> writing out spool files that are a little under 8MB in size. After
> each write, the file and the directory it is in are then fsync()d.
> The latency measured is from the beginning open() of a spool file
> until the final fsync() completes.
>
> Poking around the system with latencytop shows that sleep_on_buffer()
> is where all the latency is coming from, leading to log_wait_commit()
> showing the very high latency for the fsync()s. This leads me to
> believe that jbd is somehow not properly flushing a buffer being
> waited on in a timely fashion. Changing elevator in use has no effect.
>
> Does anyone have any ideas on where to look in ext3 or jbd for something
> that might be causing this behaviour? If I use ext4 to mount the ext3
> filesystem being tested, the problem goes away. Testing on newer
> kernels is not very easy to do (the system has other dependencies on
> the 3.4 kernel). Thoughts?

Not to be flippant, but is there any reason NOT to just mount the
filesystem with ext4? There are a large number of improvements in
the ext4 code that don't require on-disk format changes (e.g. delayed
allocation, multi-block allocation, etc) if there is a concern about
being able to downgrade to an ext3-type mount in case of problems.

There are further improvements in ext4 that can be used on upgraded
ext3 filesystems if the feature bit is enabled (in particular extent
mapped files). However, extent mapped files are not accessible under
ext3, so it makes sense to run with ext4 w/o any new features for a
while until you are sure it is working for you.
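
(As a rough illustration only: mounting the unchanged ext3 filesystem with
the ext4 driver is just a matter of specifying "ext4" as the filesystem
type; the device and mount point below are hypothetical, and this needs
root.)

#include <stdio.h>
#include <sys/mount.h>

/* Illustration only: mount the existing ext3-formatted device using the
 * ext4 driver, without changing any on-disk features. */
int main(void)
{
	if (mount("/dev/sdb1", "/spool", "ext4", 0, "") != 0)
		perror("mount");
	return 0;
}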

Using delalloc, mballoc, and extents can reduce application visible
read, write, and unlink latency significantly, because the blocks are
being allocated and freed in contiguous chunks after the file is
written from userspace.

We've been discussing deleting the ext3 code in favour of ext4 for
a while already, and newer Fedora and RHEL kernels have been using
the ext4 code to mount ext2- and ext3-formatted filesystems for some
time now.

Cheers, Andreas

2014-01-13 21:16:13

by Benjamin LaHaise

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On Mon, Jan 13, 2014 at 02:01:08PM -0700, Andreas Dilger wrote:
> Not to be flippant, but is there any reason NOT to just mount the
> filesystem with ext4? There are a large number of improvements in
> the ext4 code that don't require on-disk format changes (e.g. delayed
> allocation, multi-block allocation, etc) if there is a concern about
> being able to downgrade to an ext3-type mount in case of problems.

I'm leaning towards doing this. The main reason for not doing so was
primarily that a few of the tweaks that I had made to ext3 would
have to be ported to ext4. Thankfully, I think we're still in an early
enough stage of release that I should be able to do so. The changes
are pretty specific, mostly allocator tweaks to improve the on-disk
layout for our specific use-case.

> There are further improvements in ext4 that can be used on upgraded
> ext3 filesystems if the feature bit is enabled (in particular extent
> mapped files). However, extent mapped files are not accessible under
> ext3, so it makes sense to run with ext4 w/o any new features for a
> while until you are sure it is working for you.

I had hoped to use ext4, but the recommended fsck after changing the
various feature bits is a non-starter during our upgrade process (a 22
minute outage isn't acceptable).

> Using delalloc, mballoc, and extents can reduce application visible
> read, write, and unlink latency significantly, because the blocks are
> being allocated and freed in contiguous chunks after the file is
> written from userspace.
>
> We've been discussing deleting the ext3 code in favour of ext4 for
> a while already, and newer Fedora and RHEL kernels have been using
> the ext4 code to mount ext2- and ext3-formatted filesystems for some
> time now.

That is reassuring to hear. I'll give it a try and see what happens.

-ben

> Cheers, Andreas

--
"Thought is the essence of where you are now."

2014-01-13 21:39:12

by Eric Sandeen

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On 1/13/14, 3:16 PM, Benjamin LaHaise wrote:
> On Mon, Jan 13, 2014 at 02:01:08PM -0700, Andreas Dilger wrote:
>> Not to be flippant, but is there any reason NOT to just mount the
>> filesystem with ext4? There are a large number of improvements in
>> the ext4 code that don't require on-disk format changes (e.g. delayed
>> allocation, multi-block allocation, etc) if there is a concern about
>> being able to downgrade to an ext3-type mount in case of problems.
>
> I'm leaning towards doing this. The main reason for not doing so was
> primarily that a few of the tweaks that I had made to ext3 would
> have to be ported to ext4. Thankfully, I think we're still in an early
> enough stage of release that I should be able to do so. The changes
> are pretty specific, mostly allocator tweaks to improve the on-disk
> layout for our specific use-case.
>
>> There are further improvements in ext4 that can be used on upgraded
>> ext3 filesystems if the feature bit is enabled (in particular extent
>> mapped files). However, extent mapped files are not accessible under
>> ext3, so it makes sense to run with ext4 w/o any new features for a
>> while until you are sure it is working for you.
>
> I had hoped to use ext4, but the recommended fsck after changing the
> various feature bits is a non-starter during our upgrade process (a 22
> minute outage isn't acceptable).

I would never recommend the ext3-ext4 "tune2fs migration" - you'll end
up with a really weird hybrid filesystem containing files with different
capabilities, and missing many of the metadata layout improvements.

-Eric

2014-01-13 22:52:21

by Theodore Ts'o

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On Mon, Jan 13, 2014 at 04:16:10PM -0500, Benjamin LaHaise wrote:
>
> I'm leaning towards doing this. The main reason for not doing so was
> primarily that a few of the tweaks that I had made to ext3 would
> have to be ported to ext4. Thankfully, I think we're still in an early
> enough stage of release that I should be able to do so. The changes
> are pretty specific, mostly allocator tweaks to improve the on-disk
> layout for our specific use-case.

We have been thinking about making some changes to the block
allocator, so I'd be interested in hearing what tweaks you made and a
bit more about your use case that drove the need for these allocator
tweaks.

> I had hoped to use ext4, but the recommended fsck after changing the
> various feature bits is a non-starter during our upgrade process (a 22
> minute outage isn't acceptable).

You can move to ext4 without necessarily using those features which
require an fsck after the upgrade process. That's how we handled the
upgrade to ext4 at Google. New disks were formatted using ext4, but
for legacy file systems, we enabled the extents feature (maybe one or
two other ones, but that was the main one) and then remounted those
file systems using ext4. We called file systems which were upgraded
in this way "ext2-as-ext4", and our benchmarking indicated that for
our workload, "ext2-as-ext4" got roughly half of the performance gain
seen when comparing file systems still using ext2 against newly
formatted ext4 file systems.

Given that file systems on a server got reformatted whenever it needed
some kind of hardware repair, between hardware refreshes and disks
getting reformatted as part of those repairs, the percentage of file
systems running as "ext2-as-ext4" dropped fairly quickly.

Mike Rubin gave a presentation about this two years ago at the LF
Collab Summit that went into a lot more detail about how ext4 was
adopted by Google. That presentation is available here:

http://www.youtube.com/watch?v=Wp5Ehw7ByuU

Cheers,

- Ted

2014-01-14 00:55:50

by Andreas Dilger

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On Jan 13, 2014, at 3:52 PM, Theodore Ts'o <[email protected]> wrote:
> On Mon, Jan 13, 2014 at 04:16:10PM -0500, Benjamin LaHaise wrote:
>> I had hoped to use ext4, but the recommended fsck after changing the
>> various feature bits is a non-starter during our upgrade process (a 22
>> minute outage isn't acceptable).
>
> You can move to ext4 without necessarily using those features which
> require an fsck after the upgrade process. That's how we handled the
> upgrade to ext4 at Google. New disks were formatted using ext4, but
> for legacy file systems, we enabled extents feature (maybe one or two
> other ones, but that was the main one) and then remounted those file
> systems using ext4.

We also did this for upgraded Lustre ext3 filesystems in the past
(just enabling the extents feature) without any problems. So long
as you don't need things like fallocate() (which presumably you don't,
since that doesn't work for ext3), the application can't tell the
difference between new extent-mapped and old block-mapped files.

This only affects new files, so old files are not changed.

Cheers, Andreas

> We called file systems which were upgraded in
> this way "ext2-as-ext4", and our benchmarking indicated that for our
> workload, that "ext2-as-ext4" got roughly half the performance gained
> when comparing file systems still using ext2 with newly formatted file
> systems using ext4.
>
> Given that file systems on a server got reformatted when it needs some
> kind of hardware repairs, between hardware refresh and disks getting
> reformatted as part of the refresh, the percentage of file systems
> running in "ext2-as-ext4" dropped fairly quickly.
>
> Mike Rubin gave a presentation about this two years ago at the LF
> Collab Summit that went into a lot more detail about how ext4 was
> adopted by Google. That presentation is available here:
>
> http://www.youtube.com/watch?v=Wp5Ehw7ByuU
>
> Cheers,
>
> - Ted


2014-01-14 01:01:58

by Eric Sandeen

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On 1/13/14, 6:55 PM, Andreas Dilger wrote:
> On Jan 13, 2014, at 3:52 PM, Theodore Ts'o <[email protected]> wrote:
>> On Mon, Jan 13, 2014 at 04:16:10PM -0500, Benjamin LaHaise wrote:
>>> I had hoped to use ext4, but the recommended fsck after changing the
>>> various feature bits is a non-starter during our upgrade process (a 22
>>> minute outage isn't acceptable).
>>
>> You can move to ext4 without necessarily using those features which
>> require an fsck after the upgrade process. That's how we handled the
>> upgrade to ext4 at Google. New disks were formatted using ext4, but
>> for legacy file systems, we enabled extents feature (maybe one or two
>> other ones, but that was the main one) and then remounted those file
>> systems using ext4.
>
> We also did this for upgraded Lustre ext3 filesystems in the past
> (just enabling the extents feature) without any problems. So long
> as you don't need things like fallocate() (which presumably you don't
> since that doesn't work for ext3) then the application can't tell the
> difference between new extent-mapped and old block-mapped files.
>
> This only affects new files, so old files are not changed.

Which was my point about ending up with a mishmash. Maybe that's
okay; it depends on your use case.

But you wind up with different files on the same fs having different
capabilities depending on whether they are extent format or not.
(For example, you can't do preallocation on the old-format files;
they have different maximum offset limits, different direct IO
behavior...)

Just something to be aware of.

-Eric


2014-01-14 01:21:21

by Benjamin LaHaise

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On Mon, Jan 13, 2014 at 05:52:19PM -0500, Theodore Ts'o wrote:
> On Mon, Jan 13, 2014 at 04:16:10PM -0500, Benjamin LaHaise wrote:
> >
> > I'm leaning towards doing this. The main reason for not doing so was
> > primarily that a few of the tweaks that I had made to ext3 would
> > have to be ported to ext4. Thankfully, I think we're still in an early
> > enough stage of release that I should be able to do so. The changes
> > are pretty specific, mostly allocator tweaks to improve the on-disk
> > layout for our specific use-case.
>
> We have been thinking about making some changes to the block
> allocator, so I'd be interested in hearing what tweaks you made and a
> bit more about your use case that drove the need for these allocator
> tweaks.

The main layout tweak is pretty specific to the ext2/3 style indirect /
double indirect block usage: instead of placing the ind/dind/tind blocks
throughout the file, they are all placed immediately before the first
data block at fallocate() time. With that change in place, all of the
metadata blocks are read at the same time as the first page of the file.
The reason for doing this is that our spool files have a header at the
beginning of the file that must always be read before we can find where
the needed data is located within the file. By pulling in the metadata
at the same time as the first data block, the number of seeks to get data
elsewhere in the file is reduced (as some requests are essentially random).
It also has a nice side effect of speeding up unlink and fsck times.
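
The userspace side is nothing special -- each spool file is preallocated
up front so the allocator sees the full size at fallocate() time. A
simplified sketch (hypothetical names and sizes; this assumes our patched
ext3, since stock ext3 has no fallocate() support):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/* Sketch only: preallocate the ~8MB spool file right after creating it so
 * the allocator sees the full size at fallocate() time and can place the
 * ind/dind blocks just before the first data block. */
int create_spool_file(const char *path, off_t size)
{
	int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);

	if (fd < 0)
		return -1;
	if (fallocate(fd, 0, 0, size) != 0)	/* e.g. size = 8 << 20 */
		perror("fallocate");
	return fd;
}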

The other allocator change, which is more relevant to ext4, is to not
use the Orlov allocator on subdirectories of the filesystem. There is a
notable performance difference when inodes are spread out across the
filesystem. Our usage pattern tends to be somewhat close to FIFO for the
files written and later read & deleted.

There are some other bits I plan to post shortly as well, including a fully
async implementation of readpage for use with ext2/3 style metadata. It was
necessary to make async reads fully non-blocking in order to hit the
performance targets, as switching to helper threads incurred a significant
amount of overhead compared to having aio completions from the interrupt
handler of the block device. I also did async read and readahead
implementations tied into aio. Development on the release I'm working on
is mostly done now, so I should have the time over the next few weeks to
clean up and merge these changes to 3.13.
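
For context, the consumer side of those reads is the existing Linux AIO
(io_submit) interface; a rough sketch of issuing a single read that way
(illustrative only, not the patches themselves; link with -laio):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>

/* Rough sketch: submit one read through the kernel aio interface and wait
 * for its completion.  O_DIRECT bypasses the page cache, so the buffer
 * must be suitably aligned.  Error-path cleanup omitted. */
int aio_read_header(const char *path, void **out, size_t len)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd = open(path, O_RDONLY | O_DIRECT);

	if (fd < 0 || posix_memalign(&buf, 4096, len) != 0)
		return -1;
	if (io_setup(1, &ctx) != 0)
		return -1;

	io_prep_pread(&cb, fd, buf, len, 0);	/* read the header at offset 0 */
	if (io_submit(ctx, 1, cbs) != 1)
		return -1;
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)	/* reap completion */
		return -1;

	io_destroy(ctx);
	*out = buf;
	return (int)ev.res;			/* bytes read, or negative errno */
}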

> > I had hoped to use ext4, but the recommended fsck after changing the
> > various feature bits is a non-starter during our upgrade process (a 22
> > minute outage isn't acceptable).
>
> You can move to ext4 without necessarily using those features which
> require an fsck after the upgrade process. That's how we handled the
> upgrade to ext4 at Google. New disks were formatted using ext4, but
> for legacy file systems, we enabled extents feature (maybe one or two
> other ones, but that was the main one) and then remounted those file
> systems using ext4. We called file systems which were upgraded in
> this way "ext2-as-ext4", and our benchmarking indicated that for our
> workload, that "ext2-as-ext4" got roughly half the performance gained
> when comparing file systems still using ext2 with newly formatted file
> systems using ext4.

Another reason for not being able to migrate to extents is that it breaks
the ability of our system to be downgraded smoothly. The previous kernel
being used was of 2.6.18 vintage, so this is the first version of our
product that supports using ext4. There were also concerns about testing
both the extent and non-extent code paths -- regression tests take months
to complete, so adding a 2x multiplier to everything is a hard sell.

> Given that file systems on a server got reformatted when it needs some
> kind of hardware repairs, between hardware refresh and disks getting
> reformatted as part of the refresh, the percentage of file systems
> running in "ext2-as-ext4" dropped fairly quickly.

Our filesystems are, unfortunately, rather long lived.

> Mike Rubin gave a presentation about this two years ago at the LF
> Collab Summit that went into a lot more detail about how ext4 was
> adopted by Google. That presentation is available here:
>
> http://www.youtube.com/watch?v=Wp5Ehw7ByuU

Thanks -- I'll pass that along to folks here at Solace.

-ben

> Cheers,
>
> - Ted

--
"Thought is the essence of where you are now."

2014-01-14 03:52:40

by Theodore Ts'o

Subject: Re: high write latency bug in ext3 / jbd in 3.4

On Mon, Jan 13, 2014 at 08:21:21PM -0500, Benjamin LaHaise wrote:
> Another reason for not being able to migrate to extents is that it breaks
> the ability of our system to be downgraded smoothly. The previous kernel
> being used was of 2.6.18 vintage, so this is the first version of our
> product that supports using ext4. There were also concerns about testing
> both the extent and non-extent code paths as well -- regression tests take
> months to complete, so adding a times 2 multiplier to everything is a hard
> sell.

Well, you can use ext4 without enabling extents; at that point the
major performance improvements are those related to delayed allocation.
That should address the large latency you are seeing on file system
commits. That's because we don't allocate blocks until right before we
write the data out. Hence, we don't have to force those blocks out to
the journal in order to provide the data=ordered guarantees.

You can probably also avoid the huge latency spikes you are seeing by
using data=writeback, BTW, but of course then you risk stale data
issues after a crash.

- Ted

2014-01-27 23:55:20

by Jan Kara

Subject: Re: high write latency bug in ext3 / jbd in 3.4

Hello,

On Mon 13-01-14 15:13:20, Benjamin LaHaise wrote:
> I've recently encountered a bug in ext3 where the occasional write is
> showing extremely high latency, on the order of 2.2 to 11 seconds compared
> to a more typical 200-300ms. This is happening on a 3.4.67 kernel. When
> this occurs, the system is writing to disk somewhere between 290-330MB/s.
> The test takes anywhere from 3 to 12 minutes into a run to trigger the
> high latency write. During one of these high latency writes, vmstat reports
> 0 blocks being written to disk. The disk array being written to is able to
> write quite a bit faster (about ~770MB/s).
>
> The setup is a bit complicated, but is completely reproducible. The
> workload consists of about 8 worker threads creating and then writing out
> spool files that are a little under 8MB in size. After each write, the file
> and the directory it is in are then fsync()d. The latency measured is from
> the beginning open() of a spool file until the final fsync() completes.
>
> Poking around the system with latencytop shows that sleep_on_buffer() is
> where all the latency is coming from, leading to log_wait_commit() showing
> the very high latency for the fsync()s. This leads me to believe that jbd
> is somehow not properly flushing a buffer being waited on in a timely
> fashion. Changing elevator in use has no effect.
I'm not sure whether you have already switched to ext4, as others have
suggested in this thread. If not:
1) Since the stall is so long, can you run
'echo w >/proc/sysrq-trigger'
when the stall happens and send the stack traces from the kernel log?
2) Are you running with the 'barrier' mount option?

> Does anyone have any ideas on where to look in ext3 or jbd for something
> that might be causing this behaviour? If I use ext4 to mount the ext3
> filesystem being tested, the problem goes away. Testing on newer kernels
> is not very easy to do (the system has other dependencies on the 3.4
> kernel). Thoughts?
My suspicion is that we are hanging on writing the 'commit' block of a
transaction. That issues a cache flush to the storage, and that can take
quite a bit of time if we are unlucky.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-01-28 16:06:28

by Benjamin LaHaise

Subject: Re: high write latency bug in ext3 / jbd in 3.4

Hi Jan,

On Tue, Jan 28, 2014 at 12:55:18AM +0100, Jan Kara wrote:
> Hello,
>
> On Mon 13-01-14 15:13:20, Benjamin LaHaise wrote:
...
> I'm not sure if you haven't switched to ext4 as others have suggested in
> this thread. If not:
> 1) Since the stall is so long, can you run
> 'echo w >/proc/sysrq-trigger'
> when the stall happens and send the stack traces from kernel log?

Unfortunately, I didn't capture that output while testing. I ended up
migrating to using the ext4 codebase for our ext3 filesystems. With a
couple of tweaks to the inode allocator, I was able to resolve the
regression that moving to ext4 had caused. If there is actually some desire
to fix this bug, I can certainly go back and reproduce it.

> 2) Are you running with 'barrier' option?

I didn't change the barrier setting from the default.

> > Does anyone have any ideas on where to look in ext3 or jbd for something
> > that might be causing this behaviour? If I use ext4 to mount the ext3
> > filesystem being tested, the problem goes away. Testing on newer kernels
> > is not very easy to do (the system has other dependencies on the 3.4
> > kernel). Thoughts?
> My suspicion is we are hanging on writing the 'commit' block of a
> transaction. That issues a cache flush to the storage and that can take
> quite a bit of time if we are unlucky.

I actually control both ends of the SAN (the two systems are connected via
fibre channel), and while the hang occurs, no I/O shows up as being queued
on the head end. It is as if the system is waiting on a write that hasn't
been submitted yet.

-ben

> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

--
"Thought is the essence of where you are now."