LinuxLists.cc - Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4

2011-06-23 19:20:09

Subject: Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4

Hello again everyone,

I'm in the middle of doing some software testing on a pre-production
clone of this system using some modified software configurations and a
testing-only data volume, and I've managed to trigger this panic again.

The trigger was exactly the same; I had a bunch of queued emails from
logcheck because my TLS configuration was wrong, then I fixed the TLS
configuration and typed "postqueue -f" to send the queued mail.

Ted, since this new iteration has no customer data, passwords, keys, or
any other private data, I'm going to try to get approval to release an
exact EC2 image of this system for you to test with, including the fake
data volume that I triggered the problem on.

If not I can certainly reproduce it now by stopping email delivery and
generating a lot of fake syslog spam; I can try applying kernel patches
and report what happens.

Hopefully you're still willing to help out tracking down the problem?

Thanks again!

Cheers,
Kyle Moffett

2011-06-23 20:55:50

by Sean Ryle

Subject: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4

Maybe I am wrong here, but shouldn't the cast be to (unsigned long) or to
(sector_t)?

Line 534 of commit.c:
jbd_debug(4, "JBD: got buffer %llu (%p)\n",
(unsigned long long)bh->b_blocknr,
bh->b_data);

Line 64 of buffer_head.h:
sector_t b_blocknr; /* start block number */

Lines 137-143 of include/linux/types.h:
#ifdef CONFIG_LBDAF
typedef u64 sector_t;
typedef u64 blkcnt_t;
#else
typedef unsigned long sector_t;
typedef unsigned long blkcnt_t;
#endif

Is it possible he is experiencing the panic due to a bad cast in the call to
jbd_debug() in fs/jbd2/commit.c? It would seem to me this should be cast to
(sector_t). Any thoughts?

On Thu, Jun 23, 2011 at 2:32 PM, Moffett, Kyle D
<[email protected]> wrote:

> Hello again everyone,
>
> I'm in the middle of doing some software testing on a pre-production
> clone of this system using some modified software configurations and a
> testing-only data volume, and I've managed to trigger this panic again.
>
> The trigger was exactly the same; I had a bunch of queued emails from
> logcheck because my TLS configuration was wrong, then I fixed the TLS
> configuration and typed "postqueue -f" to send the queued mail.
>
> Ted, since this new iteration has no customer data, passwords, keys, or
> any other private data, I'm going to try to get approval to release an
> exact EC2 image of this system for you to test with, including the fake
> data volume that I triggered the problem on.
>
> If not I can certainly reproduce it now by stopping email delivery and
> generating a lot of fake syslog spam; I can try applying kernel patches
> and report what happens.
>
> Hopefully you're still willing to help out tracking down the problem?
>
> Thanks again!
>
> Cheers,
> Kyle Moffett

2011-06-23 22:23:38

by Theodore Ts'o

Subject: Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4

On Thu, Jun 23, 2011 at 01:32:48PM -0500, Moffett, Kyle D wrote:
>
> Ted, since this new iteration has no customer data, passwords, keys, or
> any other private data, I'm going to try to get approval to release an
> exact EC2 image of this system for you to test with, including the fake
> data volume that I triggered the problem on.

That would be great! Approximately how big are the images involved?

- Ted

2011-06-23 22:43:44

by Moffett, Kyle D

Subject: Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4

On Jun 23, 2011, at 16:55, Sean Ryle wrote:
> Maybe I am wrong here, but shouldn't the cast be to (unsigned long) or to (sector_t)?
>
> Line 534 of commit.c:
> jbd_debug(4, "JBD: got buffer %llu (%p)\n",
> (unsigned long long)bh->b_blocknr, bh->b_data);

No, that printk() is fine, the format string says "%llu" so the cast is
unsigned long long.
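
To illustrate why the cast is correct: sector_t is either u64 or unsigned long depending on CONFIG_LBDAF, so widening to unsigned long long makes the argument match "%llu" in both configurations. Below is a minimal user-space sketch of that idea. The typedef and the helper name format_blocknr() are stand-ins for illustration, not kernel code.

```c
#include <stdio.h>
#include <stddef.h>

/* Stand-in for the kernel typedef (assumption for this sketch). With
 * CONFIG_LBDAF unset it is unsigned long, which is only 32 bits wide
 * on 32-bit targets. */
typedef unsigned long sector_t;

/* Hypothetical helper: format a block number the way the jbd_debug()
 * call does, widening to unsigned long long so the argument always
 * matches the "%llu" conversion regardless of sector_t's real width. */
static int format_blocknr(char *buf, size_t len, sector_t blocknr)
{
    return snprintf(buf, len, "JBD: got buffer %llu",
                    (unsigned long long)blocknr);
}
```

Casting to (sector_t) or (unsigned long) instead would leave the argument narrower than "%llu" expects whenever sector_t is 32 bits wide.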

Besides which, line 534 in the Debian 2.6.32 kernel I am using is this
one:

J_ASSERT(commit_transaction->t_nr_buffers <=
commit_transaction->t_outstanding_credits);

If somebody can tell me what information would help to debug this I'd be
more than happy to throw a whole bunch of debug printks under that error
condition and try to trigger the crash with that.

Alternatively I could remove that J_ASSERT() and instead add some debug
further down around the "commit_transaction->t_outstanding_credits--;"
to try to see exactly what IO it's handling when it runs out of credits.
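
As a rough idea of what such a debug check could look like, here is a user-space sketch that reports both counters instead of halting. The struct fake_transaction and the function credits_ok() are hypothetical stand-ins, and the int field types are an assumption. A real kernel patch would use printk() on the actual transaction_t.

```c
#include <stdio.h>

/* Hypothetical user-space stand-in for the two transaction fields that
 * the J_ASSERT() compares; field types are assumed to be int. */
struct fake_transaction {
    int t_nr_buffers;
    int t_outstanding_credits;
};

/* Returns 1 if the invariant holds. Otherwise logs both counters and
 * returns 0, instead of stopping the machine as J_ASSERT() would. */
static int credits_ok(const struct fake_transaction *t)
{
    if (t->t_nr_buffers <= t->t_outstanding_credits)
        return 1;
    fprintf(stderr, "credit overrun: %d buffers, %d credits\n",
            t->t_nr_buffers, t->t_outstanding_credits);
    return 0;
}
```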

Any ideas?

Cheers,
Kyle Moffett

2011-06-24 13:47:03

On Mon 27-06-11 23:21:17, Moffett, Kyle D wrote:
> On Jun 27, 2011, at 12:01, Ted Ts'o wrote:
> > On Mon, Jun 27, 2011 at 05:30:11PM +0200, Lukas Czerner wrote:
> >>> I've found some. So although data=journal users are minority, there are
> >>> some. That being said I agree with you we should do something about it
> >>> - either state that we want to fully support data=journal - and then we
> >>> should really do better with testing it or deprecate it and remove it
> >>> (which would save us some complications in the code).
> >>>
> >>> I would be slightly in favor of removing it (code simplicity, less options
> >>> to configure for admin, less options to test for us, some users I've come
> >>> across actually were not quite sure why they are using it - they just
> >>> thought it looks safer).
> >
> > Hmm... FYI, I hope to be able to bring on line automated testing for
> > ext4 later this summer (there's a testing person at Google who has
> > signed up to work on setting this up as his 20% project). The test
> > matrix that I gave him included data=journal, so we will be getting
> > better testing in the near future.
> >
> > At least historically, data=journalling was the *simpler* case, and
> > was the first thing supported by ext4. (data=ordered required revoke
> > handling which didn't land for six months or so). So I'm not really
> > that convinced that removing it really buys us that much code
> > simplification.
> >
> > That being said, it is true that data=journalled isn't necessarily
> > faster. For heavy disk-bound workloads, it can be slower. So I can
> > imagine adding some documentation that warns people not to use
> > data=journal unless they really know what they are doing, but at least
> > personally, I'm a bit reluctant to dispense with a bug report like
> > this by saying, "oh, that feature should be deprecated".
>
> I suppose I should chime in here, since I'm the one who (potentially
> incorrectly) thinks I should be using data=journalled mode.
>
> My basic impression is that the use of "data=journalled" can help
> reduce the risk (slightly) of serious corruption to some kinds of
> databases when the application does not provide appropriate syncs
> or journalling on its own (IE: such as text-based Wiki database files).
It depends on the way such programs update the database files. But
generally yes, data=journal provides somewhat stronger guarantees than
the other journaling modes - see below.

> Please correct me if this is horribly horribly wrong:
>
> no journal:
> Nothing is journalled
> + Very fast.
> + Works well for filesystems that are "mkfs"ed on every boot
> - Have to fsck after every reboot
Fsck is needed only after a crash / hard powerdown. Otherwise completely
correct. Plus you always have a possibility of exposing uninitialized
(potentially sensitive) data after a fsck.

Actually, a normal desktop might be quite happy with a non-journaled
filesystem when fsck is fast enough.

> data=writeback:
> Metadata is journalled, data (to allocated extents) may be written
> before or after the metadata is updated with a new file size.
> + Fast (not as fast as unjournalled)
> + No need to "fsck" after a hard power-down
> - A crash or power failure in the middle of a write could leave
> old data on disk at the end of a file. If security labeling
> such as SELinux is enabled, this could "contaminate" a file with
> data from a deleted file that was at a higher sensitivity.
> Log files (including binary database replication logs) may be
> effectively corrupted as a result.
Correct.

> data=ordered:
> Data appended to a file will be written before the metadata
> extending the length of the file is written, and in certain cases
> the data will be written before file renames (partial ordering),
> but the data itself is unjournalled, and may be only partially
> complete for updates.
> + Does not write data to the media twice
> + A crash or power failure will not leave old uninitialized data
> in files.
> - Data writes to files may only partially complete in the event
> of a crash. No problems for logfiles, or self-journalled
> application databases, but others may experience partial writes
> in the event of a crash and need recovery.
Correct. One should also note that no one guarantees the order in which data
hits the disk - i.e. when you do write(f,"a"); write(f,"b"); and these are
overwrites, it may happen that "b" is written while "a" is not.
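
An application that needs "a" on disk before "b" has to enforce that order itself. One way, sketched below, is to call fdatasync() between the two overwrites. The helper names are hypothetical, and the sketch assumes the storage stack honours cache flushes.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Overwrite one byte at `off`, then wait until it reaches stable
 * storage. Returns 0 on success, -1 on error. */
static int overwrite_durably(int fd, off_t off, char byte)
{
    if (pwrite(fd, &byte, 1, off) != 1)
        return -1;
    return fdatasync(fd);
}

/* Write "a" at offset 0 and then "b" at offset 1 of `path`. Because
 * "a" is synced before "b" is even issued, "b" should not be found on
 * disk without "a" after a crash. Returns 0 on success, -1 on error. */
static int write_a_then_b(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    int rc;

    if (fd < 0)
        return -1;
    rc = overwrite_durably(fd, 0, 'a');
    if (rc == 0)
        rc = overwrite_durably(fd, 1, 'b');
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

The cost is one synchronous flush per ordering point, which is why databases that care about this usually batch their writes behind a journal of their own.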

> data=journalled:
> Data and metadata are both journalled, meaning that a given data
> write will either complete or it will never occur, although the
> precise ordering is not guaranteed. This also implies all of the
> data<=>metadata guarantees of data=ordered.
> + Direct IO data writes are effectively "atomic", resulting in
> less likelihood of data loss for application databases which do
> not do their own journalling. This means that a power failure
> or system crash will not result in a partially-complete write.
Well, direct IO is atomic in data=journal the same way as in data=ordered.
It can happen that only half of a direct IO write is done when you hit the
power button at the right moment - note this holds for overwrites. Extending
writes or writes to holes are all-or-nothing for ext4 (again both in
data=journal and data=ordered mode).

> - Cached writes are not atomic
> + For small cached file writes (of only a few filesystem pages)
> there is a good chance that kernel writeback will queue the
> entire write as a single I/O and it will be "protected" as a
> result. This helps reduce the chance of serious damage to some
> text-based database files (such as those for some Wikis), but
> is obviously not a guarantee.
Page sized and page aligned writes are atomic (in both data=journal and
data=ordered modes). When a write spans multiple pages, there is a chance
the writes will be merged into a single transaction, but there are no
guarantees, as you correctly write.
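
For an application that wants to stay inside that atomic case, the condition can be checked before issuing the write. The sketch below uses a hypothetical helper, is_single_page_write(), and assumes a non-negative offset and that the page size reported by sysconf() is the one that matters.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if a write of `len` bytes at offset `off` covers exactly
 * one whole page, 0 if it does not, and -1 if the page size cannot be
 * determined. `off` is assumed to be non-negative. */
static int is_single_page_write(off_t off, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0)
        return -1;
    return (off % page == 0 && len == (size_t)page) ? 1 : 0;
}
```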

> - This writes all data to the block device twice (once to the FS
> journal and once to the data blocks). This may be especially bad
> for write-limited Flash-backed devices.
Correct.

To sum up, the only additional guarantee data=journal offers over
data=ordered is a total ordering of all IO operations. That is, if you do a
sequence of data and metadata operations, then you are guaranteed that
after a crash you will see the filesystem in a state corresponding exactly
to your sequence terminated at some (arbitrary) point. For the purposes of
this model, data writes are broken up into a sequence of page-sized,
page-aligned writes...
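
The last sentence can be pictured with a small sketch that splits one write into the page-bounded pieces the model reasons about. The struct piece and the function split_by_page() are hypothetical names for illustration only, and `page` is assumed to be non-zero. This is not how the kernel implements it.

```c
#include <stddef.h>

/* One page-bounded piece of a larger write. */
struct piece {
    size_t off;  /* byte offset in the file */
    size_t len;  /* bytes in this piece; never crosses a page boundary */
};

/* Split the range [off, off + len) at page boundaries. Stores up to
 * `max` pieces in `out` and returns the total number of pieces the
 * range needs. `page` must be non-zero. */
static size_t split_by_page(size_t off, size_t len, size_t page,
                            struct piece *out, size_t max)
{
    size_t n = 0;

    while (len > 0) {
        size_t room = page - (off % page);  /* bytes left in this page */
        size_t chunk = len < room ? len : room;

        if (n < max) {
            out[n].off = off;
            out[n].len = chunk;
        }
        n++;
        off += chunk;
        len -= chunk;
    }
    return n;
}
```

With a 4096-byte page, a 5000-byte write at offset 4000 splits into three pieces (96, 4096 and 808 bytes). Under the model above, a crash may leave any prefix of that sequence applied.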

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-06-28 13:59:06

On Tue 30-08-11 19:26:22, Moffett, Kyle D wrote:
> On Aug 30, 2011, at 18:12, Jan Kara wrote:
> >> I can still trigger it on my VM snapshot very easily, so if you have anything
> >> you think I should test I would be very happy to give it a shot.
> >
> > OK, so in the meantime I found a bug in data=journal code which could be
> > related to your problem. It is fixed by commit
> > 2d859db3e4a82a365572592d57624a5f996ed0ec which is in 3.1-rc1. Have you
> > tried that or newer kernel as well?
> >
> > If the problem still is not fixed, I can provide some debugging patch to
> > you. We spoke with Josef Bacik about how errors like yours could happen,
> > so I have some places to watch...
>
> I have not tried anything more recent; I'm actually a bit reluctant to move
> away from the Debian squeeze official kernels since I do need the security
> updates.
>
> I took a quick look and I can't find that function in 2.6.32, so I assume it
> would be a rather nontrivial back-port. It looks like the relevant code
> used to be in ext4_clear_inode somewhere?
It's not that hard - untested patch attached.

> Out of curiosity, what would happen in data=journal mode if you unlinked a
> file which still had buffers pending? That case does not seem to be handled
> by that commit you mentioned, was it already handled elsewhere?
Once the file is deleted, it's OK to discard its data after the
transaction doing the delete commits. The current code in JBD2 handles this
case fine - the problem was that for not-deleted files we cannot discard
dirty data after a transaction commits ;)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

Attachments:

(No filename) (1.60 kB)
ext4-2.6.32-data-journal-corruption.diff (2.95 kB)

2011-12-06 23:20:58

by Moffett, Kyle D

Subject: Re: Bug#615998: linux-image-2.6.32-5-xen-amd64: Repeatable "kernel BUG at fs/jbd2/commit.c:534" from Postfix on ext4

Hello again!

I know it's been ages, but I finally got some time to get that patch
tested out and try additional debugging.

On Sep 01, 2011, at 11:17, Jan Kara wrote:
> On Tue 30-08-11 19:26:22, Moffett, Kyle D wrote:
>> On Aug 30, 2011, at 18:12, Jan Kara wrote:
>>>> I can still trigger it on my VM snapshot very easily, so if you have anything
>>>> you think I should test I would be very happy to give it a shot.
>>>
>>> OK, so in the meantime I found a bug in data=journal code which could be
>>> related to your problem. It is fixed by commit
>>> 2d859db3e4a82a365572592d57624a5f996ed0ec which is in 3.1-rc1. Have you
>>> tried that or newer kernel as well?
>>>
>>> If the problem still is not fixed, I can provide some debugging patch to
>>> you. We spoke with Josef Bacik about how errors like yours could happen,
>>> so I have some places to watch...
>>
>> I have not tried anything more recent; I'm actually a bit reluctant to move
>> away from the Debian squeeze official kernels since I do need the security
>> updates.
>>
>> I took a quick look and I can't find that function in 2.6.32, so I assume it
>> would be a rather nontrivial back-port. It looks like the relevant code
>> used to be in ext4_clear_inode somewhere?
> It's not that hard - untested patch attached.

So this applied mostly cleanly (with one minor context-only conflict in
the 2.6.32.17 patch); unfortunately, it didn't resolve the problem.
Just as a sanity check, I upgraded to the Debian 3.1.0-1-amd64 kernel,
based on kernel version 3.1.1 and the problem still occurs there too
(additional info at the end of the email).

Looking at the issue again, I don't think it has anything to do with
file deletion at all.

Specifically, there are a grand total of 4 files in that filesystem
(alongside an empty "lost+found" directory):
master.lock
prng_exch
smtpd_scache.db
smtp_scache.db

As far as I can tell, none of those is ever deleted during normal
operation.

The crash occurs very quickly after starting postfix. It connects to
the external email server (using TLS) and begins to flush queued mail.

At that point, the "tlsmgr" daemon tries to update the "smtp_scache.db"
file, which is a Berkeley DB about 40k in size. Somewhere in there,
the Berkeley DB does an fdatasync().

The "fdatasync()" apparently triggers the bad behavior from the "jbd2"
thread, which then oopses in fs/jbd2/commit.c:485 (which appears to be
the same BUG_ON() as before).

The stack looks something like this:
jbd_journal_commit_transaction+0x4ea/0x1053 [jbd2]
kjournald2+0xc0/0x20a [jbd2]
add_wait_queue+0x3c/0x3c
commit_timeout+0x5/0x5 [jbd2]
kthread+0x76/0x7e

Cheers,
Kyle Moffett

--
Curious about my work on the Debian powerpcspe port?
I'm keeping a blog here: http://pureperl.blogspot.com/