2002-09-24 07:16:06

by Jakob Oestergaard

Subject: jbd bug(s) (?)


First:

In Linux-2.4.19, I was wondering about the following:

In fs/jbd/commit.c:583, we find the following:
/* AKPM: buglet - add `i' to tmp! */
for (i = 0; i < jh2bh(descriptor)->b_size; i += 512) {
	journal_header_t *tmp =
		(journal_header_t*)jh2bh(descriptor)->b_data;
	tmp->h_magic = htonl(JFS_MAGIC_NUMBER);
	tmp->h_blocktype = htonl(JFS_COMMIT_BLOCK);
	tmp->h_sequence = htonl(commit_transaction->t_tid);
}


As I see it, this means that jbd-using filesystems (ext3) will only
remember writing *ONE* entry from the journal.

Isn't this a problem ?
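
For what it's worth, my reading of the AKPM comment is that the intended
code would look something like this (an untested sketch on my part, just
what the comment seems to ask for):

/* stamp a header into every 512-byte sector of the block,
   not just the first one */
for (i = 0; i < jh2bh(descriptor)->b_size; i += 512) {
	journal_header_t *tmp =
		(journal_header_t*)(jh2bh(descriptor)->b_data + i);
	tmp->h_magic = htonl(JFS_MAGIC_NUMBER);
	tmp->h_blocktype = htonl(JFS_COMMIT_BLOCK);
	tmp->h_sequence = htonl(commit_transaction->t_tid);
}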

Second:

The jbd superblock contains an index into the journal for the first
transaction - but there is only *one* copy of the index, and there is no
reasonable way to detect if it got written correctly to disk.

If the system loses power while updating the superblock, and only *half*
of this index is written correctly, we have a journal which we cannot
reach.

Sort of removes the point of having the journal in the first place. (If
my above assertion is true).

As far as I know, Tux2 solves this problem by keeping multiple indexes
(yes it uses phase trees and not a journal, but Tux2 root nodes and the
journal index are identical wrt. this problem).

If one keeps two blocks, each holding:
 - index
 - timestamp
 - CRC
one can consider the two blocks and disregard the ones with invalid CRC.
This leaves us with one or two blocks left - we then pick the one with
the highest timestamp - and we are then guaranteed to *always* have a
valid index.

(The above works when the timestamp is incremented for every write,
index updates are written alternating between the two blocks, and the
complete block is sync()ed before the other is written to)
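
In rough code, the pick-a-valid-index logic would amount to something
like this (an untested sketch; crc32_of() is just a stand-in for
whatever checksum one would actually use):

#include <stdint.h>
#include <stddef.h>

struct index_copy {
	uint32_t index;       /* start of the oldest live transaction */
	uint32_t timestamp;   /* incremented on every index update */
	uint32_t crc;         /* CRC over the two fields above */
};

/* placeholder - whatever checksum one would actually use */
extern uint32_t crc32_of(const void *buf, size_t len);

static struct index_copy *pick_index(struct index_copy *a,
                                     struct index_copy *b)
{
	int a_ok = a->crc == crc32_of(a, offsetof(struct index_copy, crc));
	int b_ok = b->crc == crc32_of(b, offsetof(struct index_copy, crc));

	if (a_ok && b_ok)
		return a->timestamp > b->timestamp ? a : b;
	if (a_ok)
		return a;
	if (b_ok)
		return b;
	return NULL;	/* cannot happen if writes alternate and are synced */
}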

Wouldn't something like this be required for a journalling fs to be
worth anything ?

I know the window is rather small for half an index to be written - but
that doesn't mean it can't happen.


--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:


2002-09-25 16:30:54

by Stephen C. Tweedie

Subject: Re: jbd bug(s) (?)

Hi,

On Tue, Sep 24, 2002 at 09:21:17AM +0200, Jakob Oestergaard wrote:

> In Linux-2.4.19, I was wondering about the following:
>
> In fs/jbd/commit.c:583, we find the following:
> /* AKPM: buglet - add `i' to tmp! */
> for (i = 0; i < jh2bh(descriptor)->b_size; i += 512) {

> As I see it, this means that jbd-using filesystems (ext3) will only
> remember writing *ONE* entry from the journal.

> Isn't this a problem ?

Nah. In fact I should just remove the loop entirely. For commit
processing, only the header at the very, very start of a commit block
is cared about --- that way, we get atomic commits even if the commit
block is partially written out-of-order on disk. As long as sector
writes within the fs block are atomic, the header remains intact.

> The jbd superblock contains an index into the journal for the first
> transaction - but there is only *one* copy of the index, and there is no
> reasonable way to detect if it got written correctly to disk.
>
> If the system loses power while updating the superblock, and only *half*
> of this index is written correctly, we have a journal which we cannot
> reach.

Again, only the data in the first sector matters there, and we assume
that disks write individual sectors atomically, or return IO failure
if things get messed up. And the index sector is not updated all that
frequently anyway --- maybe once or twice per journal wrap, but it
doesn't have to be written for each transaction.

> Sort of removes the point of having the journal in the first place. (If
> my above assertion is true).

Actually, the number of single points of failure in a filesystem is
huge. If we lose, say, the root directory, we're toast too (and that
can be due to an inode block or a directory block failure); similarly,
other key directories are critical, and within the journal itself, an
unreadable metadata descriptor block will render parts of the journal
unusable for recovery.

So if we detect incomplete sector writes, we can recover by forcing a
fsck, but if you want to be able to survive actual data loss, you need
raid.

> one can consider the two blocks and disregard the ones with invalid CRC.
> This leaves us with one or two blocks left - we then pick the one with
> the highest timestamp - and we are then guaranteed to *always* have a
> valid index.

> Wouldn't something like this be required for a journalling fs to be
> worth anything ?

No, ext3 just relies on the sector atomicity guarantees instead.
There _are_ multi-sector data structures in the journal, but those are
all protected by the sector-atomic commit block --- if we don't see
the commit sector, then we ignore all of the blocks of the prior
transaction in the log, and the commit sector is never written until
we've got a guarantee that the whole of the preceding blocks are
consistent on-disk.
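
The recovery-time check boils down to something like this (just a
sketch; the real code in fs/jbd/recovery.c handles rather more):

/* a transaction in the log only counts if its commit block is
   present with the expected magic, type and sequence number */
static int commit_block_valid(const char *buf, tid_t expect_tid)
{
	const journal_header_t *h = (const journal_header_t *)buf;

	return ntohl(h->h_magic) == JFS_MAGIC_NUMBER &&
	       ntohl(h->h_blocktype) == JFS_COMMIT_BLOCK &&
	       ntohl(h->h_sequence) == expect_tid;
}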

Cheers,
Stephen

2002-09-26 12:16:10

by Jakob Oestergaard

Subject: Re: jbd bug(s) (?)

On Wed, Sep 25, 2002 at 05:36:05PM +0100, Stephen C. Tweedie wrote:
> Hi,
..
> > Isn't this a problem ?
>
> Nah. In fact I should just remove the loop entirely. For commit
> processing, only the header at the very, very start of a commit block
> is cared about --- that way, we get atomic commits even if the commit
> block is partially written out-of-order on disk. As long as sector
> writes within the fs block are atomic, the header remains intact.

Ok, fair enough.

The loop (which performs a number of repeated identical writes to the
*same* location) along with the comment does look pretty scary though

8)

>
> > The jbd superblock contains an index into the journal for the first
> > transaction - but there is only *one* copy of the index, and there is no
> > reasonable way to detect if it got written correctly to disk.
> >
> > If the system loses power while updating the superblock, and only *half*
> > of this index is written correctly, we have a journal which we cannot
> > reach.
>
> Again, only the data in the first sector matters there, and we assume
> that disks write individual sectors atomically, or return IO failure
> if things get messed up.

Yep - so does Reiser. I originally believed that this was not the case
with modern disks, but I think I was corrected on that one in the Reiser
bug thread :)

...
> > Sort of removes the point of having the journal in the first place. (If
> > my above assertion is true).
>
> Actually, the number of single points of failure in a filesystem is
> huge.
[snip]

Originally it was my impression that the index was written fairly
frequently, *and* that you did not have the atomic-sector-write
guarantee.

That's why I was very worried.

> So if we detect incomplete sector writes, we can recover by forcing a
> fsck, but if you want to be able to survive actual data loss, you need
> raid.

RAID wouldn't save me in the case where the journal index is screwed due
to a partial sector write and a power loss.

But anyway, that is a moot point :)


Thank you for explaining,

(/me hopes the scary-loop dies ;)

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-09-26 12:22:10

by Stephen C. Tweedie

Subject: Re: jbd bug(s) (?)

Hi,

On Thu, Sep 26, 2002 at 02:21:24PM +0200, Jakob Oestergaard wrote:

> Originally it was my impression that the index was written fairly
> frequently, *and* that you did not have the atomic-sector-write
> guarantee.

The index is only updated when we purge stuff out of the journal.
That can still be quite frequent on a really busy journal, but it's
definitely not a required part of a transaction.

That's deliberate --- the ext3 journal is designed to be written as
sequentially as possible, so seeking to the index block is an expense
which we try to avoid.

> RAID wouldn't save me in the case where the journal index is screwed due
> to a partial sector write and a power loss.

A partial sector write is essentially impossible. It's unlikely that
the data on disk would be synchronised beyond the point at which the
write stopped, and even if it was, the CRC would be invalid, so you'd
get a bad sector error return on subsequent attempts to read that data
--- you'd not be given silently corrupt data.

Making parts of the disk suddenly unreadable on power-fail is
generally considered a bad thing, though, so modern disks go to great
lengths to ensure the write finishes.

--Stephen

2002-09-26 12:51:31

by Jakob Oestergaard

Subject: Re: jbd bug(s) (?)

On Thu, Sep 26, 2002 at 01:27:23PM +0100, Stephen C. Tweedie wrote:
> Hi,

Hi !

...
>
> The index is only updated when we purge stuff out of the journal.

Yes, of course.

> That can still be quite frequent on a really busy journal, but it's
> definitely not a required part of a transaction.

No, but the correct writing of your index *is* required for a
successful re-play at recovery.

You can't re-play if you can't find the beginning of your journal :)

Just *imagine* that half your index was written successfully to disk and
then the power failed. That was what I imagined could happen.

Quite some people have pointed out to me by now, that I was wrong - so
don't worry about it :)

>
> That's deliberate --- the ext3 journal is designed to be written as
> sequentially as possible, so seeking to the index block is an expense
> which we try to avoid.

Of course - any sane journal should be designed that way.

>
> > RAID wouldn't save me in the case where the journal index is screwed due
> > to a partial sector write and a power loss.
>
> A partial sector write is essentially impossible. It's unlikely that
> the data on disk would be synchronised beyond the point at which the
> write stopped, and even if it was, the CRC would be invalid, so you'd
> get a bad sector error return on subsequent attempts to read that data
> --- you'd not be given silently corrupt data.

I know. What I imagined was that there were disks out there which
*internally* worked with smaller sector sizes, and merely presented a
512 byte sector to the outside world.

That way, it would be perfectly possible for a disk to write 2 bytes of
your index pointer along with its error correction codes and what not.
And it would be perfectly possible for it to return an invalid index
pointer (2 bytes of your new pointer, the remaining bytes from the old
pointer) - without returning a read error.

Let's hope that none of the partitioning formats or LVM projects out
there will misalign the filesystem so that your index actually *does*
cross a 512 byte boundary ;)

> Making parts of the disk suddenly unreadable on power-fail is
> generally considered a bad thing, though, so modern disks go to great
> lengths to ensure the write finishes.

Lucky us :)


Cheers,

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-09-26 13:40:02

by Theodore Ts'o

Subject: Re: jbd bug(s) (?)

On Thu, Sep 26, 2002 at 02:56:47PM +0200, Jakob Oestergaard wrote:
> I know. What I imagined was that there were disks out there which
> *internally* worked with smaller sector sizes, and merely presented a
> 512 byte sector to the outside world.

Actually, it's the other way around. Most disks are internally
actually using a sector size of 32k or now even 64k. So ideally, I'd
like to have ext2 be able to support such larger block sizes, since it
would be a win from a performance perspective. There are only two
problems with that. First of all, we need to support tail-merging, so
small files don't cause fragmentation problems. But this isn't an
absolute requirement, since without it a 32k block filesystem would
still be useful for a filesystem dedicated to large multimedia files
(and heck, a FAT16 filesystem would often use a 32k or larger block
size.)

The real problem is that the current VM has an intrinsic assumption
that the blocksize is less than or equal to the page size. It might
be possible to fake things by using an internal blocksize of 32k or
64k while emulating a 4k blocksize to the VM layer, but that provides
only a very limited benefit. Since the VM doesn't know that the block size
is really 32k, we don't get the automatic I/O clustering. Also, we
end up needing to use multiple buffer heads per real ext2 block (since
the VM still thinks the block size is PAGE_SIZE, not the larger ext2
block size). So we could add larger block sizes, but it would mean
adding a huge amount of complexity for minimal gain (and if you really
want that, you can always use XFS, which pays that complexity cost).

It'd be nice to get real VM support for this, but that will almost
certainly have to wait for 2.6.

> Let's hope that none of the partitioning formats or LVM projects out
> there will misalign the filesystem so that your index actually *does*
> cross a 512 byte boundary ;)

None of them would do anything as insane as that, not just because of
the safety issue, but because it would be a performance loss. Just as
misaligned data accesses in memory are slower (or prohibited on some
architectures), misaligned data on disks are bad for the same reason.

> > Making parts of the disk suddenly unreadable on power-fail is
> > generally considered a bad thing, though, so modern disks go to great
> > lengths to ensure the write finishes.
>
> Lucky us :)

Actually, disks try so hard to ensure the write finishes that
sometimes they ensure it past the point where the memory has already
started going insane because of the low voltage during a power
failure. This is why some people have reported extensive data loss
after just switching off the power switch. The system was in the
midst of writing out part of the inode table, and as the voltage of
the +5 voltage rail started dipping, the memory started returning bad
results, but the DMA engine and the disk drive still had enough juice
to complete the write. Oops.

This is one place where the physical journalling layer in ext3
actually helps us out tremendously, because before we write out a
metadata block (such as part of the inode table), it gets written to
the journal first. So each metadata block gets written twice to disk;
once to the journal, synchronously, and then later, when we have free
time, to the disk. This wastes some of our disk bandwidth --- which
won't be noticed if the system isn't busy, but if the workload
saturates the disk write bandwidth, then it will slow the system down.
However, this redundancy is worth it, because in the case of this
particular cause of corruption, although part of the on-disk inode
table might get corrupted on an unexpected power failure, it is also
on the journal, so the problem gets silently and automatically fixed
when the journal is run.
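
In pseudocode, that double-write path is roughly the following (a
sketch only; journal_write(), wait_for_commit() and write_home_block()
are placeholders, not real jbd interfaces):

#include <stddef.h>

/* placeholders, not real jbd interfaces */
extern void journal_write(const char *data, size_t size);
extern void wait_for_commit(void);
extern void write_home_block(unsigned long blocknr, const char *data, size_t size);

/* the full physical image of a metadata block goes to the journal
   first, synchronously; only later is it written to its home
   location.  If power fails in between, recovery replays the
   journal copy over whatever landed on disk. */
void commit_metadata_block(unsigned long blocknr, const char *data, size_t size)
{
	journal_write(data, size);
	wait_for_commit();
	write_home_block(blocknr, data, size);
}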

Filesystems such as JFS and XFS do not use physical journaling, but
use logical journalling instead. So they don't write the entire inode
table block to the journal, but just an abstract representation of
what changed. This is more efficient from the standpoint of
conserving your write bandwidth, but it leaves them helpless if the
disk subsystem writes out garbage due to an unclean power failure.

Is this tradeoff worth it? Well, arguably hardware Shouldn't Do That,
and it's not fair to require filesystems to deal with lousy hardware.
However, as I've said for a long time, the reason why PC hardware is
cheap is because PC hardware is crap, and since we live in the real
world, it's something we need to deal with.

Also, the double write overhead which ext3 imposes is only for the
metadata, and for real-life workloads, the overhead presented is
relatively minimal. There are some lousy benchmarks like dbench which
exaggerate the metadata costs, but they aren't really representative
of real life. So personally, I think the tradeoffs are worth it; of
course, though, I'm biased. :-)

- Ted

2002-09-26 14:00:55

by Christoph Hellwig

Subject: Re: jbd bug(s) (?)

On Thu, Sep 26, 2002 at 09:44:35AM -0400, Theodore Ts'o wrote:
> block size). So we could add larger block sizes, but it would mean
> adding a huge amount of complexity for minimal gain (and if you really
> want that, you can always use XFS, which pays that complexity cost).

XFS doesn't support blocksize > PAGE_CACHE_SIZE under Linux. In fact the
latest public XFS/Linux release doesn't even support any blocksize other
than PAGE_CACHE_SIZE. This has changed in the development tree now, and
the version merged into 2.5 and the next public 2.4 release will have that
support. Doing blocksize > PAGE_CACHE_SIZE will be difficult if not
impossible due to VM locking issues with the 2.4 and 2.5 VM code.

> It'd be nice to get real VM support for this, but that will almost
> certainly have to wait for 2.6.

I don't really see this happening before Halloween..

2002-09-26 14:20:26

by Theodore Ts'o

Subject: Re: jbd bug(s) (?)

On Thu, Sep 26, 2002 at 03:05:57PM +0100, Christoph Hellwig wrote:
> On Thu, Sep 26, 2002 at 09:44:35AM -0400, Theodore Ts'o wrote:
> > block size). So we could add larger block sizes, but it would mean
> > adding a huge amount of complexity for minimal gain (and if you really
> > want that, you can always use XFS, which pays that complexity cost).
>
> XFS doesn't support blocksize > PAGE_CACHE_SIZE under Linux. In fact the
> latest public XFS/Linux release doesn't even support any blocksize other
> than PAGE_CACHE_SIZE. This has changed in the development tree now, and
> the version merged into 2.5 and the next public 2.4 release will have that
> support. Doing blocksize > PAGE_CACHE_SIZE will be difficult if not
> impossible due to VM locking issues with the 2.4 and 2.5 VM code.

My mistake. At one point I was talking to Mark Lord and I had gotten
the impression they had some Irix-VM-to-Linux-VM mapping layer which
would make blocksize > PAGE_SIZE possible.

- Ted

2002-09-26 14:36:44

by Christoph Hellwig

Subject: Re: jbd bug(s) (?)

On Thu, Sep 26, 2002 at 10:25:04AM -0400, Theodore Ts'o wrote:
> My mistake. At one point I was talking to Mark Lord and I had gotten

^^^ Stephen Lord

> the impression they had some Irix-VM-to-Linux-VM mapping layer which
> would make blocksize > PAGE_SIZE possible.

The pagebuf layer presents an interface similar enough to the IRIX
buffercache to XFS, and is layered on top of the Linux pagecache. Today
it's only used for metadata and allows mapping of > PAGE_CACHE_SIZE
objects there (see the vmap/vunmap code I added to vmalloc.c in 2.5
for that). But it's not used for data I/O at all anymore and it would
also have problems for blocksize > PAGE_CACHE_SIZE.