LinuxLists.cc - Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Alan Cox wrote:
> Ar Maw, 2006-08-01 am 16:52 +0200, ysgrifennodd Adrian Ulrich:
>> WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
>> you don't need your filesystem beeing super-robust against bad sectors
>> and such stuff because:
>
> You do it turns out. Its becoming an issue more and more that the sheer
> amount of storage means that the undetected error rate from disks,
> hosts, memory, cables and everything else is rising.

Yikes. Undetected.

Wait, what? Disks, at least, would be protected by RAID. Are you
telling me RAID won't detect such an error?

It just seems wholly alien to me that errors would go undetected, and
we're OK with that, so long as our filesystems are robust enough. If
it's an _undetected_ error, doesn't that cause way more problems
(impossible problems) than FS corruption? Ok, your FS is fine -- but
now your bank database shows $1k less on random accounts -- is that ok?

> There has been a great deal of discussion about this at the filesystem
> and kernel summits - and data is getting kicked the way of networking -
> end to end not reliability in the middle.

Sounds good, but I've never let discussions by people smarter than me
prevent me from asking the stupid questions.

> The sort of changes this needs hit the block layer and ever fs.

Seems it would need to hit every application also...

2006-08-01 16:57:16

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Horst H. von Brand wrote:
> Bernd Schubert <[email protected]> wrote:

>> While filesystem speed is nice, it also would be great if reiser4.x would be
>> very robust against any kind of hardware failures.
>
> Can't have both.

Why not? I mean, other than TANSTAAFL, is there a technical reason for
them being mutually exclusive? I suspect it's more "we haven't found a
way yet..."

2006-08-01 17:01:25

by Alan

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Ar Maw, 2006-08-01 am 11:44 -0500, ysgrifennodd David Masover:
> Yikes. Undetected.
>
> Wait, what? Disks, at least, would be protected by RAID. Are you
> telling me RAID won't detect such an error?

Yes.

RAID deals with the case where a device fails. RAID 1 with 2 disks can
in theory detect an internal inconsistency but cannot fix it.

> we're OK with that, so long as our filesystems are robust enough. If
> it's an _undetected_ error, doesn't that cause way more problems
> (impossible problems) than FS corruption? Ok, your FS is fine -- but
> now your bank database shows $1k less on random accounts -- is that ok?

Not really no. Your bank is probably using a machine (hopefully using a
machine) with ECC memory, ECC cache and the like. The UDMA and SATA
storage subsystems use CRC checksums between the controller and the
device. SCSI uses various similar systems - some older ones just use a
parity bit so have only a 50/50 chance of noticing a bit error.

Similarly the media itself is recorded with a lot of FEC (forward error
correction) so will spot most changes.

Unfortunately when you throw this lot together with astronomical amounts
of data you get burned now and then, especially as most systems are not
using ECC ram, do not have ECC on the CPU registers and may not even
have ECC on the caches in the disks.

> > The sort of changes this needs hit the block layer and ever fs.
>
> Seems it would need to hit every application also...

Depending how far you propogate it. Someone people working with huge
data sets already write and check user level CRC values for this reason
(in fact bitkeeper does it for one example). It should be relatively
cheap to get much of that benefit without doing application to
application just as TCP gets most of its benefit without going app to
app.

Alan

2006-08-01 17:05:11

by Gregory Maxwell

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On 8/1/06, David Masover <[email protected]> wrote:
> Yikes. Undetected.
>
> Wait, what? Disks, at least, would be protected by RAID. Are you
> telling me RAID won't detect such an error?

Unless the disk ECC catches it raid won't know anything is wrong.

This is why ZFS offers block checksums... it can then try all the
permutations of raid regens to find a solution which gives the right
checksum.

Every level of the system must be paranoid and take measure to avoid
corruption if the system is to avoid it... it's a tough problem. It
seems that the ZFS folks have addressed this challenge by building as
much of what is classically separate layers into one part.

2006-08-01 17:40:45

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Alan Cox wrote:
> Ar Maw, 2006-08-01 am 11:44 -0500, ysgrifennodd David Masover:
>> Yikes. Undetected.
>>
>> Wait, what? Disks, at least, would be protected by RAID. Are you
>> telling me RAID won't detect such an error?
>
> Yes.
>
> RAID deals with the case where a device fails. RAID 1 with 2 disks can
> in theory detect an internal inconsistency but cannot fix it.

Still, if it does that, that should be enough. The scary part wasn't
that there's an internal inconsistency, but that you wouldn't know.

And it can fix it if you can figure out which disk went. Or give it 3
disks and it should be entirely automatic -- admin gets paged, admin
hotswaps in a new disk, done.

>> we're OK with that, so long as our filesystems are robust enough. If
>> it's an _undetected_ error, doesn't that cause way more problems
>> (impossible problems) than FS corruption? Ok, your FS is fine -- but
>> now your bank database shows $1k less on random accounts -- is that ok?
>
> Not really no. Your bank is probably using a machine (hopefully using a
> machine) with ECC memory, ECC cache and the like. The UDMA and SATA
> storage subsystems use CRC checksums between the controller and the
> device. SCSI uses various similar systems - some older ones just use a
> parity bit so have only a 50/50 chance of noticing a bit error.
>
> Similarly the media itself is recorded with a lot of FEC (forward error
> correction) so will spot most changes.
>
> Unfortunately when you throw this lot together with astronomical amounts
> of data you get burned now and then, especially as most systems are not
> using ECC ram, do not have ECC on the CPU registers and may not even
> have ECC on the caches in the disks.

It seems like this is the place to fix it, not the software. If the
software can fix it easily, great. But I'd much rather rely on the
hardware looking after itself, because when hardware goes bad, all bets
are off.

Specifically, it seems like you do mention lots of hardware solutions,
that just aren't always used. It seems like storage itself is getting
cheap enough that it's time to step back a year or two in Moore's Law to
get the reliability.

>>> The sort of changes this needs hit the block layer and ever fs.
>> Seems it would need to hit every application also...
>
> Depending how far you propogate it. Someone people working with huge
> data sets already write and check user level CRC values for this reason
> (in fact bitkeeper does it for one example). It should be relatively
> cheap to get much of that benefit without doing application to
> application just as TCP gets most of its benefit without going app to
> app.

And yet, if you can do that, I'd suspect you can, should, must do it at
a lower level than the FS. Again, FS robustness is good, but if the
disk itself is going, what good is having your directory (mostly) intact
if the files themselves have random corruptions?

If you can't trust the disk, you need more than just an FS which can
mostly survive hardware failure. You also need the FS itself (or maybe
the block layer) to support bad block relocation and all that good
stuff, or you need your apps designed to do that job by themselves.

It just doesn't make sense to me to do this at the FS level. You
mention TCP -- ok, but if TCP is doing its job, I shouldn't also need to
implement checksums and other robustness at the protocol layer (http,
ftp, ssh), should I? Because in this analogy, it looks like TCP is the
"block layer" and a protocol is the "fs".

As I understand it, TCP only lets the protocol/application know when
something's seriously FUBARed and it has to drop the connection.
Similarly, the FS (and the apps) shouldn't have to know about hardware
problems until it really can't do anything about it anymore, at which
point the right thing to do is for the FS and apps to go "oh shit" and
drop what they're doing, and the admin replaces hardware and restores
from backup. Or brings a backup server online, or...

I guess my main point was that _undetected_ problems are serious, but if
you can detect them, and you have at least a bit of redundancy, you
should be good. For instance, if your RAID reports errors that it can't
fix, you bring that server down and let the backup server run.

2006-08-01 17:42:06

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Gregory Maxwell wrote:
> On 8/1/06, David Masover <[email protected]> wrote:
>> Yikes. Undetected.
>>
>> Wait, what? Disks, at least, would be protected by RAID. Are you
>> telling me RAID won't detect such an error?
>
> Unless the disk ECC catches it raid won't know anything is wrong.
>
> This is why ZFS offers block checksums... it can then try all the
> permutations of raid regens to find a solution which gives the right
> checksum.

Isn't there a way to do this at the block layer? Something in
device-mapper?

> Every level of the system must be paranoid and take measure to avoid
> corruption if the system is to avoid it... it's a tough problem. It
> seems that the ZFS folks have addressed this challenge by building as
> much of what is classically separate layers into one part.

Sounds like bad design to me, and I can point to the antipattern, but
what do I know?

2006-08-01 18:11:37

by Adrian Ulrich

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

> You do it turns out. Its becoming an issue more and more that the sheer
> amount of storage means that the undetected error rate from disks,
> hosts, memory, cables and everything else is rising.

IMHO the possibility to hit such a random-so-far-undetected-corruption
is very low with one of the big/expensive raid systems as they are
doing fancy stuff like 'disk scrubbing' and usually do fail disks
at very early stages..

* I've seen storage systems from a BIG vendor die due to
firmware bugs
* I've seen FC-Cards die.. SAN-switches rebooted.. People used
my cables to do rope skipping
* We had Fire, non-working UPS and faulty diesel generators..

but so far the FSes (and applications) on the Storage never
complained about corrupted data.

..YMMV..

Btw: I don't think that Reiserfs really behaves this bad
with broken hardware. So far, Reiser3 survived 2 broken Harddrives
without problems while i've seen ext2/3 die 4 times so far...
(= everything inside /lost+found). Reiser4 survived
# mkisofs . > /dev/sda

Lucky me.. maybe..

To get back on-topic:

Some people try very hard to claim that the world doesn't need
Reiser4 and that you can do everything with ext3.

Ext3 may be fine for them but some people (like me) really need Reiser4
because they got applications/workloads that won't work good (fast) on ext3.

Why is it such a big thing to include a filesystem?
Even if it's unstable: does anyone care? Eg: the HFS+ driver
is buggy (corrupted the FS of my OSX installation 3 times so far) but
does this buggyness affect people *not* using it? No.

Regards,
Adrian

2006-08-01 18:14:52

by Adrian Ulrich

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

> > This is why ZFS offers block checksums... it can then try all the
> > permutations of raid regens to find a solution which gives the right
> > checksum.
>
> Isn't there a way to do this at the block layer? Something in
> device-mapper?

Remember: Suns new Filesystem + Suns new Volume Manager = ZFS

2006-08-01 18:21:42

by Ric Wheeler

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Alan Cox wrote:
> Ar Maw, 2006-08-01 am 16:52 +0200, ysgrifennodd Adrian Ulrich:
>
>>WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
>>you don't need your filesystem beeing super-robust against bad sectors
>>and such stuff because:
>
>
> You do it turns out. Its becoming an issue more and more that the sheer
> amount of storage means that the undetected error rate from disks,
> hosts, memory, cables and everything else is rising.

I agree with Alan despite being an enthusiastic supporter of neat array
based technologies.

Most people use absolutely giant disks in laptops and desktop systems
(300GB & 500GB are common, 750GB on the way). File systems need to be as
robust as possible for users of these systems as people are commonly
storing personal "critical" data like photos mostly on these unprotected
drives.

Even for the high end users, array based mirroring and so on can only do
so much to protect you.

Mirroring a corrupt file system to a remote data center will mirror your
corruption.

Rolling back to a snapshot typically only happens when you notice a
corruption which can go undetected for quite a while, so even that will
benefit from having "reliability" baked into the file system (i.e., it
should grumble about corruption to let you know that you need to roll
back or fsck or whatever).

An even larger issue is that our tools, like fsck, which are used to
uncover these silent corruptions need to scale up to the point that they
can uncover issues in minutes instead of days. A lot of the focus at
the file system workshop was around how to dramatically reduce the
repair time of file systems.

In a way, having super reliable storage hardware is only as good as the
file system layer on top of it - reliability needs to be baked into the
entire IO system stack...

ric

2006-08-01 18:36:56

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Alan, I have seen only anecdotal evidence against reiserfsck, and I have
seen formal tests from Vitaly (which it seems a user has replicated)
where our fsck did better than ext3s. Note that these tests are of the
latest fsck from us: I am sure everyone understands that it takes time
for an fsck to mature, and that our early fsck's were poor. I will also
say the V4's fsck is more robust than V3's because we made disk format
changes specifically to help fsck.

Now I am not dismissing your anecdotes as I will never dismiss data I
have not seen, and it sounds like you have seen more data than most
people, but I must dismiss your explanation of them.

Being able to throw away all of the tree but the leaves and twigs with
extent pointers and rebuild all of it makes V4 very robust, more so than
ext3. This business of inodes not moving, I don't see what the
advantage is, we can lose the directory entry and rebuild just as well
as ext3, probably better because we can at least figure out what
directory it was in.

Vitaly can say all of this more expertly than I....

Hans

2006-08-01 18:41:44

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Ric Wheeler wrote:

> Alan Cox wrote:
>
>>
>>
>> You do it turns out. Its becoming an issue more and more that the sheer
>> amount of storage means that the undetected error rate from disks,
>> hosts, memory, cables and everything else is rising.
>
>
>
> I agree with Alan

You will want to try our compression plugin, it has an ecc for every 64k....

Hans

2006-08-01 19:01:42

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Gregory Maxwell wrote:

> This is why ZFS offers block checksums... it can then try all the
> permutations of raid regens to find a solution which gives the right
> checksum.
>
ZFS performance is pretty bad in the only benchmark I have seen of it.
Does anyone have serious benchmarks of it? I suspect that our
compression plugin (with ecc) will outperform it.

2006-08-01 19:11:36

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Ric Wheeler wrote:
> Alan Cox wrote:
>> Ar Maw, 2006-08-01 am 16:52 +0200, ysgrifennodd Adrian Ulrich:
>>
>>> WriteCache, Mirroring between 2 Datacenters, snapshotting.. etc..
>>> you don't need your filesystem beeing super-robust against bad sectors
>>> and such stuff because:
>>
>>
>> You do it turns out. Its becoming an issue more and more that the sheer
>> amount of storage means that the undetected error rate from disks,
>> hosts, memory, cables and everything else is rising.

> Most people use absolutely giant disks in laptops and desktop systems
> (300GB & 500GB are common, 750GB on the way). File systems need to be as
> robust as possible for users of these systems as people are commonly
> storing personal "critical" data like photos mostly on these unprotected
> drives.

Their loss. Robust FS is good, but really, if you aren't doing backup,
you are going to lose data. End of story.

> Even for the high end users, array based mirroring and so on can only do
> so much to protect you.
>
> Mirroring a corrupt file system to a remote data center will mirror your
> corruption.

Assuming it's undetected. Why would it be undetected?

> Rolling back to a snapshot typically only happens when you notice a
> corruption which can go undetected for quite a while, so even that will
> benefit from having "reliability" baked into the file system (i.e., it
> should grumble about corruption to let you know that you need to roll
> back or fsck or whatever).

Yes, the filesystem should complain about corruption. So should the
block layer -- if you don't trust the FS, use a checksum at the block
layer. So should...

There are just so many other, better places to do this than the FS. The
FS should complain, yes, but if the disk is bad, there's going to be
corruption.

> An even larger issue is that our tools, like fsck, which are used to
> uncover these silent corruptions need to scale up to the point that they
> can uncover issues in minutes instead of days. A lot of the focus at
> the file system workshop was around how to dramatically reduce the
> repair time of file systems.

That would be interesting. I know from experience that fsck.reiser4 is
amazing. Blew away my data with something akin to an rm -rf, and fsck
fixed it. Tons of crashing/instability in the early days, but only once
-- before they even had a version instead of a date, I think -- did I
ever have a case where fsck couldn't fix it.

So I guess the next step would be to make fsck faster. Someone
mentioned a fsck that repairs the FS in the background?

> In a way, having super reliable storage hardware is only as good as the
> file system layer on top of it - reliability needs to be baked into the
> entire IO system stack...

That bit makes no sense. If you have super reliable storage failure
(never dies), and your FS is also reliable (never dies unless hardware
does, but may go bat-shit insane when hardware dies), then you've got a
super reliable system.

You're right, running Linux's HFS+ or NTFS write support is generally a
bad idea, no matter how reliable your hardware is. But this discussion
was not about whether an FS is stable, but how well an FS survives
hardware corruption.

2006-08-01 19:27:14

by Krzysztof Halasa

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

David Masover <[email protected]> writes:

>> RAID deals with the case where a device fails. RAID 1 with 2 disks
>> can
>> in theory detect an internal inconsistency but cannot fix it.
>
> Still, if it does that, that should be enough. The scary part wasn't
> that there's an internal inconsistency, but that you wouldn't know.

RAID1 can do that in theory but it practice there is no verification,
so the other disk can perform another read simultaneously (thus
increasing performance).

Some high-end systems, maybe.

That would be hardly economical. Per-block checksums (like used by the
ZFS) are different story, they add only little additional load.

> And it can fix it if you can figure out which disk went. Or give it 3
> disks and it should be entirely automatic -- admin gets paged, admin
> hotswaps in a new disk, done.

Yep, that could be done. Or with 2 disks with block checksums.
Actually, while I don't exactly buy their ads, I think ZFS employs
some useful ideas.

> And yet, if you can do that, I'd suspect you can, should, must do it
> at a lower level than the FS. Again, FS robustness is good, but if
> the disk itself is going, what good is having your directory (mostly)
> intact if the files themselves have random corruptions?

With per-block checksum you will know. Of course, that's still not
end to end checksum.

> If you can't trust the disk, you need more than just an FS which can
> mostly survive hardware failure. You also need the FS itself (or
> maybe the block layer) to support bad block relocation and all that
> good stuff, or you need your apps designed to do that job by
> themselves.

Drives have internal relocation mechanisms, I don't think the
filesystem needs to duplicate them (though it should try to work
with bad blocks - relocations are possible on write).

> It just doesn't make sense to me to do this at the FS level. You
> mention TCP -- ok, but if TCP is doing its job, I shouldn't also need
> to implement checksums and other robustness at the protocol layer
> (http, ftp, ssh), should I?

Sure you have to, if you value your data.

> Similarly, the FS (and the apps) shouldn't have to know
> about hardware problems until it really can't do anything about it
> anymore, at which point the right thing to do is for the FS and apps
> to go "oh shit" and drop what they're doing, and the admin replaces
> hardware and restores from backup. Or brings a backup server online,
> or...

I don't think so. Going read-only if the disk returns write error,
ok. But taking the fs offline? Why?

Continuous backups (or rather transaction logs) are possible but
who has them? Do you have them? Would you throw away several hours
of work just because some file (or, say, unused area) contained
unreadable block (which could probably be transient problem, and/or
could be corrected by write)?
--
Krzysztof Halasa

2006-08-03 13:58:23

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On Tue, 01 Aug 2006, David Masover wrote:

> >RAID deals with the case where a device fails. RAID 1 with 2 disks can
> >in theory detect an internal inconsistency but cannot fix it.
>
> Still, if it does that, that should be enough. The scary part wasn't
> that there's an internal inconsistency, but that you wouldn't know.

You won't usually know, unless you run a consistency check: RAID-1 will
only read from one of the two drives for speed - except if you make the
system check consistency as it goes, which would imply waiting for both
disks at the same time. And in that case, you'd better look for drives
that allow to synchronize their platter staples in order to avoid the
read access penalty that waiting for two drives entails.

> And it can fix it if you can figure out which disk went.

If it's decent and detects a bad block, it'll log it and rewrite it with
data from the mirror and let the drive do the remapping through ARWE.

> >Depending how far you propogate it. Someone people working with huge
> >data sets already write and check user level CRC values for this reason
> >(in fact bitkeeper does it for one example). It should be relatively
> >cheap to get much of that benefit without doing application to
> >application just as TCP gets most of its benefit without going app to
> >app.
>
> And yet, if you can do that, I'd suspect you can, should, must do it at
> a lower level than the FS. Again, FS robustness is good, but if the
> disk itself is going, what good is having your directory (mostly) intact
> if the files themselves have random corruptions?

Berkeley DB can, since version 4.1 (IIRC), write checksums (newer
versions document this as SHA1) on its database pages, to detect
corruptions and writes that were supposed to be atomic but failed
(because you cannot write 4K or 16K atomically on a disk drive).

--
Matthias Andree

2006-08-03 14:03:13

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On Tue, 01 Aug 2006, Ric Wheeler wrote:

> Mirroring a corrupt file system to a remote data center will mirror your
> corruption.
>
> Rolling back to a snapshot typically only happens when you notice a
> corruption which can go undetected for quite a while, so even that will
> benefit from having "reliability" baked into the file system (i.e., it
> should grumble about corruption to let you know that you need to roll
> back or fsck or whatever).
>
> An even larger issue is that our tools, like fsck, which are used to
> uncover these silent corruptions need to scale up to the point that they
> can uncover issues in minutes instead of days. A lot of the focus at
> the file system workshop was around how to dramatically reduce the
> repair time of file systems.

Which makes me wonder if backup systems shouldn't help with this. If
they are reading the whole file anyways, they can easily compute strong
checksums as they go, and record them for later use, and check so many
percent of unchanged files every day to complain about corruptions.

--
Matthias Andree

2006-08-03 14:03:50

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On Tue, 01 Aug 2006, Hans Reiser wrote:

> You will want to try our compression plugin, it has an ecc for every 64k....

What kind of forward error correction would that be, and how much and
what failure patterns can it correct? URL suffices.

--
Matthias Andree

2006-08-03 15:45:12

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Matthias Andree wrote:
> On Tue, 01 Aug 2006, Hans Reiser wrote:
>
>
>>You will want to try our compression plugin, it has an ecc for every 64k....
>
>
> What kind of forward error correction would that be,

Actually we use checksums, not ECC. If checksum is wrong, then run
fsck - it will remove the whole disk cluster, that represent 64K of
data.

and how much and
> what failure patterns can it correct? URL suffices.
>

Checksum is checked before unsafe decompression (when trying to
decompress incorrect data can lead to fatal things). It can be
broken because of many reasons. The main one is tree corruption
(for example, when disk cluster became incomplete - ECC can not
help here). Perhaps such checksumming is also useful for other
things, I didnt classify the patterns..

Edward.

2006-08-03 16:51:45

by Theodore Ts'o

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On Thu, Aug 03, 2006 at 04:03:07PM +0200, Matthias Andree wrote:
> On Tue, 01 Aug 2006, Ric Wheeler wrote:
>
> > Mirroring a corrupt file system to a remote data center will mirror your
> > corruption.
> >
>
> Which makes me wonder if backup systems shouldn't help with this. If
> they are reading the whole file anyways, they can easily compute strong
> checksums as they go, and record them for later use, and check so many
> percent of unchanged files every day to complain about corruptions.

They absolutely should do this sort of thing.

Also sounds like yet another option that could be added to rsync.
(Only half-joking. :-)

- Ted

2006-08-03 18:27:06

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Edward Shishkin wrote:

> Matthias Andree wrote:
>
>> On Tue, 01 Aug 2006, Hans Reiser wrote:
>>
>>
>>> You will want to try our compression plugin, it has an ecc for every
>>> 64k....
>>
>>
>>
>> What kind of forward error correction would that be,
>
>
>
> Actually we use checksums, not ECC. If checksum is wrong, then run
> fsck - it will remove the whole disk cluster, that represent 64K of
> data.

How about we switch to ecc, which would help with bit rot not sector loss?

>
>
> and how much and
>
>> what failure patterns can it correct? URL suffices.
>>
>
> Checksum is checked before unsafe decompression (when trying to
> decompress incorrect data can lead to fatal things). It can be
> broken because of many reasons. The main one is tree corruption
> (for example, when disk cluster became incomplete - ECC can not
> help here). Perhaps such checksumming is also useful for other
> things, I didnt classify the patterns..
>
> Edward.
>
>

2006-08-03 23:22:08

by Russell Leighton

[permalink] [raw]

Subject: Checksumming blocks? [was Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion]

If the software (filesystem like ZFS or database like Berkeley DB)
finds a mismatch for a checksum on a block read, then what?

Is there a recovery mechanism, or do you just be happy you know there is
a problem (and go to backup)?

Thx

Matthias Andree wrote:

>Berkeley DB can, since version 4.1 (IIRC), write checksums (newer
>versions document this as SHA1) on its database pages, to detect
>corruptions and writes that were supposed to be atomic but failed
>(because you cannot write 4K or 16K atomically on a disk drive).
>

2006-08-03 23:35:09

[permalink] [raw]

Subject: Re: Checksumming blocks? [was Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion]

(I've stripped the Cc: list down to the bones.
No need to shout side topics from the rooftops.)

On Thu, 03 Aug 2006, Russell Leighton wrote:

> If the software (filesystem like ZFS or database like Berkeley DB)
> finds a mismatch for a checksum on a block read, then what?

(Note that this assumes a Berkeley DB in transactional mode.) Complain,
demand recovery, set the panic flag (refusing further transactions
except close and open for recovery).

> Is there a recovery mechanism, or do you just be happy you know there is
> a problem (and go to backup)?

Recoverability depends on log retention policy (set by the user or
administrator) and how recently the block was written. There is a
recovery mechanism.

For applications that don't need their own recovery methods (few do),
db_recover can do the job.

In typical cases of power loss or kernel panic during write, the broken
page will probably either be in the log so it can be restored (recover
towards commit), or, if the commit hadn't completed but pages had been
written due to cache conflicts, the database will be rolled back to the
state before the interrupted transaction, effectively aborting the
transaction.

The details are in the Berkeley DB documentation, which please see.

--
Matthias Andree

2006-08-03 23:57:57

by Russell Leighton

[permalink] [raw]

Subject: Re: Checksumming blocks? [was Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion]

Thx, for a db this seems natural...

I am very curious about ZFS as I think we will seem more protection in
the FS layer as disks get larger...
If I have a very old file I am now half way throught reading and ZFS
finds a bad block, I assume I would
get some kind of read() error...but then what? Does anyone know if there
are tools with ZFS to inspect the file?

Matthias Andree wrote:

>(I've stripped the Cc: list down to the bones.
>No need to shout side topics from the rooftops.)
>
>On Thu, 03 Aug 2006, Russell Leighton wrote:
>
>
>
>>If the software (filesystem like ZFS or database like Berkeley DB)
>>finds a mismatch for a checksum on a block read, then what?
>>
>>
>
>(Note that this assumes a Berkeley DB in transactional mode.) Complain,
>demand recovery, set the panic flag (refusing further transactions
>except close and open for recovery).
>
>
>
>>Is there a recovery mechanism, or do you just be happy you know there is
>>a problem (and go to backup)?
>>
>>
>
>Recoverability depends on log retention policy (set by the user or
>administrator) and how recently the block was written. There is a
>recovery mechanism.
>
>For applications that don't need their own recovery methods (few do),
>db_recover can do the job.
>
>In typical cases of power loss or kernel panic during write, the broken
>page will probably either be in the log so it can be restored (recover
>towards commit), or, if the commit hadn't completed but pages had been
>written due to cache conflicts, the database will be rolled back to the
>state before the interrupted transaction, effectively aborting the
>transaction.
>
>The details are in the Berkeley DB documentation, which please see.
>
>
>

2006-08-04 11:41:34

by Tomasz Torcz

[permalink] [raw]

Subject: Re: Checksumming blocks? [was Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion]

On Thu, Aug 03, 2006 at 07:25:19PM -0400, Russell Leighton wrote:
>
> If the software (filesystem like ZFS or database like Berkeley DB)
> finds a mismatch for a checksum on a block read, then what?
>
> Is there a recovery mechanism, or do you just be happy you know there is
> a problem (and go to backup)?

ZFS readsthis block again from different mirror, and if checksum is
right -- returns good data to userspace and rewrites failed block with
good data.

Note, that there could be multiple mirrors, either physically (like
RAID1) or logically (blocks could be mirrored on different areas of the
same disk; some files can be protected with multiple mirrors, some left
unprotected without mirrors).

--
Tomasz Torcz To co nierealne -- tutaj jest normalne.
[email protected] Ziomale na ?ycie maj? tu patenty specjalne.

Attachments:

(No filename) (872.00 B)
(No filename) (229.00 B)
Download all attachments

2006-08-04 17:04:42

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Hans Reiser wrote:
> Edward Shishkin wrote:
>
>
>>Matthias Andree wrote:
>>
>>
>>>On Tue, 01 Aug 2006, Hans Reiser wrote:
>>>
>>>
>>>
>>>>You will want to try our compression plugin, it has an ecc for every
>>>>64k....
>>>
>>>
>>>
>>>What kind of forward error correction would that be,
>>
>>
>>
>>Actually we use checksums, not ECC. If checksum is wrong, then run
>>fsck - it will remove the whole disk cluster, that represent 64K of
>>data.
>
>
> How about we switch to ecc, which would help with bit rot not sector loss?

Interesting aspect.

Yes, we can implement ECC as a special crypto transform that inflates
data. As I mentioned earlier, it is possible via translation of key
offsets with scale factor > 1.

Of course, it is better then nothing, but anyway meta-data remains
ecc-unprotected, and, hence, robustness is not increased..

Edward.

>
>>
>> and how much and
>>
>>
>>>what failure patterns can it correct? URL suffices.
>>>
>>
>>Checksum is checked before unsafe decompression (when trying to
>>decompress incorrect data can lead to fatal things). It can be
>>broken because of many reasons. The main one is tree corruption
>>(for example, when disk cluster became incomplete - ECC can not
>>help here). Perhaps such checksumming is also useful for other
>>things, I didnt classify the patterns..
>>
>>Edward.
>>
>>
>
>
>
>

2006-08-04 18:57:59

by Antonio Vargas

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On 8/4/06, Edward Shishkin <[email protected]> wrote:
> Hans Reiser wrote:
> > Edward Shishkin wrote:
> >
> >
> >>Matthias Andree wrote:
> >>
> >>
> >>>On Tue, 01 Aug 2006, Hans Reiser wrote:
> >>>
> >>>
> >>>
> >>>>You will want to try our compression plugin, it has an ecc for every
> >>>>64k....
> >>>
> >>>
> >>>
> >>>What kind of forward error correction would that be,
> >>
> >>
> >>
> >>Actually we use checksums, not ECC. If checksum is wrong, then run
> >>fsck - it will remove the whole disk cluster, that represent 64K of
> >>data.
> >
> >
> > How about we switch to ecc, which would help with bit rot not sector loss?
>
> Interesting aspect.
>
> Yes, we can implement ECC as a special crypto transform that inflates
> data. As I mentioned earlier, it is possible via translation of key
> offsets with scale factor > 1.
>
> Of course, it is better then nothing, but anyway meta-data remains
> ecc-unprotected, and, hence, robustness is not increased..
>
> Edward.
>
> >
> >>
> >> and how much and
> >>
> >>
> >>>what failure patterns can it correct? URL suffices.
> >>>
> >>
> >>Checksum is checked before unsafe decompression (when trying to
> >>decompress incorrect data can lead to fatal things). It can be
> >>broken because of many reasons. The main one is tree corruption
> >>(for example, when disk cluster became incomplete - ECC can not
> >>help here). Perhaps such checksumming is also useful for other
> >>things, I didnt classify the patterns..
> >>
> >>Edward.
> >>
> >>

Would the storage + plugin subsystem support storing >1 copies of the
metadata tree?

--
Greetz, Antonio Vargas aka winden of network

http://network.amigascne.org/
[email protected]
[email protected]

Every day, every year
you have to work
you have to study
you have to scene.

2006-08-04 20:42:17

by Horst H. von Brand

[permalink] [raw]

Subject: Re: Checksumming blocks? [was Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion]

Tomasz Torcz <[email protected]> wrote:
> On Thu, Aug 03, 2006 at 07:25:19PM -0400, Russell Leighton wrote:
> >
> > If the software (filesystem like ZFS or database like Berkeley DB)
> > finds a mismatch for a checksum on a block read, then what?
> >
> > Is there a recovery mechanism, or do you just be happy you know there is
> > a problem (and go to backup)?
>
> ZFS readsthis block again from different mirror, and if checksum is
> right -- returns good data to userspace and rewrites failed block with
> good data.
>
> Note, that there could be multiple mirrors, either physically (like
> RAID1) or logically (blocks could be mirrored on different areas of the
> same disk; some files can be protected with multiple mirrors, some left
> unprotected without mirrors).

Murphy's law will ensure that the important files are unprotected. And the
1st Law of Disk Drives (they are always full) will ensure that there are no
mirrored pieces anyway...
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2006-08-04 20:52:03

[permalink] [raw]

Subject: Re: Checksumming blocks? [was Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion]

Russell Leighton wrote:

> Is there a recovery mechanism, or do you just be happy you know there is
> a problem (and go to backup)?

You probably go to backup anyway. The recovery mechanism just means you
get to choose the downtime to restore from backup (if there is
downtime), versus being suddenly down until you can restore.

2006-08-05 01:56:29

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Edward Shishkin wrote:

>
>>
>>
>> How about we switch to ecc, which would help with bit rot not sector
>> loss?
>
>
> Interesting aspect.
>
> Yes, we can implement ECC as a special crypto transform that inflates
> data. As I mentioned earlier, it is possible via translation of key
> offsets with scale factor > 1.
>
> Of course, it is better then nothing, but anyway meta-data remains
> ecc-unprotected, and, hence, robustness is not increased..
>
> Edward.

Would you prefer to do it as a node layout plugin instead, so as to get
the metadata?

Hans

2006-08-05 02:02:19

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Antonio Vargas wrote:

> On 8/4/06, Edward Shishkin <[email protected]> wrote:
>
>> Hans Reiser wrote:
>> > Edward Shishkin wrote:
>> >
>> >
>> >>Matthias Andree wrote:
>> >>
>> >>
>> >>>On Tue, 01 Aug 2006, Hans Reiser wrote:
>> >>>
>> >>>
>> >>>
>> >>>>You will want to try our compression plugin, it has an ecc for every
>> >>>>64k....
>> >>>
>> >>>
>> >>>
>> >>>What kind of forward error correction would that be,
>> >>
>> >>
>> >>
>> >>Actually we use checksums, not ECC. If checksum is wrong, then run
>> >>fsck - it will remove the whole disk cluster, that represent 64K of
>> >>data.
>> >
>> >
>> > How about we switch to ecc, which would help with bit rot not
>> sector loss?
>>
>> Interesting aspect.
>>
>> Yes, we can implement ECC as a special crypto transform that inflates
>> data. As I mentioned earlier, it is possible via translation of key
>> offsets with scale factor > 1.
>>
>> Of course, it is better then nothing, but anyway meta-data remains
>> ecc-unprotected, and, hence, robustness is not increased..
>>
>> Edward.
>>
>> >
>> >>
>> >> and how much and
>> >>
>> >>
>> >>>what failure patterns can it correct? URL suffices.
>> >>>
>> >>
>> >>Checksum is checked before unsafe decompression (when trying to
>> >>decompress incorrect data can lead to fatal things). It can be
>> >>broken because of many reasons. The main one is tree corruption
>> >>(for example, when disk cluster became incomplete - ECC can not
>> >>help here). Perhaps such checksumming is also useful for other
>> >>things, I didnt classify the patterns..
>> >>
>> >>Edward.
>> >>
>> >>
>
>
> Would the storage + plugin subsystem support storing >1 copies of the
> metadata tree?
>
>
I suppose....

What would be nice would be to have a plugin that when a node fails its
checksum/ecc it knows to get it from another mirror, and which generally
handles faults with a graceful understanding of its ability to get
copies from a mirror (or RAID parity calculation).

I would happily accept such a patch (subject to usual reservation of
right to complain about implementation details).

2006-08-06 22:19:23

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Hans Reiser wrote:
> Edward Shishkin wrote:
>
>
>>>
>>>How about we switch to ecc, which would help with bit rot not sector
>>>loss?
>>
>>
>>Interesting aspect.
>>
>>Yes, we can implement ECC as a special crypto transform that inflates
>>data. As I mentioned earlier, it is possible via translation of key
>>offsets with scale factor > 1.
>>
>>Of course, it is better then nothing, but anyway meta-data remains
>>ecc-unprotected, and, hence, robustness is not increased..
>>
>>Edward.
>
>
> Would you prefer to do it as a node layout plugin instead, so as to get
> the metadata?
>

Yes, it looks like a business of node plugin, but AFAIK, you
objected against such checks: currently only bitmap nodes have
a protection (checksum); supporting ecc-signatures is more
space/cpu expensive.

Edward.

2006-08-06 22:59:41

by Pavel Machek

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On Tue 01-08-06 11:57:10, David Masover wrote:
> Horst H. von Brand wrote:
> >Bernd Schubert <[email protected]> wrote:
>
> >>While filesystem speed is nice, it also would be great
> >>if reiser4.x would be very robust against any kind of
> >>hardware failures.
> >
> >Can't have both.
>
> Why not? I mean, other than TANSTAAFL, is there a
> technical reason for them being mutually exclusive? I
> suspect it's more "we haven't found a way yet..."

What does the acronym mean?

Yes, I'm afraid redundancy/checksums kill write speed, and you need
that for robustness...

You could have filesystem that can be tuned for reliability and tuned
for speed... but you can't have both in one filesystem instance.
--
Thanks for all the (sleeping) penguins.

2006-08-06 23:36:25

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Pavel Machek wrote:
> On Tue 01-08-06 11:57:10, David Masover wrote:
>> Horst H. von Brand wrote:
>>> Bernd Schubert <[email protected]> wrote:
>>>> While filesystem speed is nice, it also would be great
>>>> if reiser4.x would be very robust against any kind of
>>>> hardware failures.
>>> Can't have both.
>> Why not? I mean, other than TANSTAAFL, is there a
>> technical reason for them being mutually exclusive? I
>> suspect it's more "we haven't found a way yet..."
>
> What does the acronym mean?

There Ain't No Such Thing As A Free Lunch.

> Yes, I'm afraid redundancy/checksums kill write speed, and you need
> that for robustness...

Not necessarily -- if you do it on flush, and store it near the data it
relates to, you can expect a similar impact to compression, except that
due to slow disks, the compression can actually speed things up 2x,
whereas checksums should be some insignificant amount slower than 1x.

Redundancy, sure, but checksums should be easy, and I don't see what
robustness (abilities of fsck) has to do with it.

> You could have filesystem that can be tuned for reliability and tuned
> for speed... but you can't have both in one filesystem instance.

That's an example of TANSTAAFL, if it's true.

2006-08-07 07:57:48

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

[stripping Cc: list]

On Thu, 03 Aug 2006, Edward Shishkin wrote:

> >What kind of forward error correction would that be,
>
> Actually we use checksums, not ECC. If checksum is wrong, then run
> fsck - it will remove the whole disk cluster, that represent 64K of
> data.

Well, that's quite a difference...

> Checksum is checked before unsafe decompression (when trying to
> decompress incorrect data can lead to fatal things).

Is this sufficient? How about corruptions that lead to the same checksum
and can then confuse the decompressor? Is the decompressor safe in that
it does not scribble over memory it has not allocated?

--
Matthias Andree

2006-08-08 11:06:44

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Matthias Andree wrote:
> [stripping Cc: list]
>
> On Thu, 03 Aug 2006, Edward Shishkin wrote:
>
>
>>>What kind of forward error correction would that be,
>>
>>Actually we use checksums, not ECC. If checksum is wrong, then run
>>fsck - it will remove the whole disk cluster, that represent 64K of
>>data.
>
>
> Well, that's quite a difference...
>
>
>>Checksum is checked before unsafe decompression (when trying to
>>decompress incorrect data can lead to fatal things).
>
>
> Is this sufficient? How about corruptions that lead to the same checksum
> and can then confuse the decompressor?

It is a multiplication of two unlikely events: fs corruption
and 32-hash collision. Paranoid people can assign zlib-based
transform plugin: afaik everything is safe there.

Is the decompressor safe in that
> it does not scribble over memory it has not allocated?
>

yes

2006-08-09 09:37:48

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Pavel Machek wrote:

>
>
>Yes, I'm afraid redundancy/checksums kill write speed,
>
they kill write speed to cache, but not to disk.... our compression
plugin is faster than the uncompressed plugin.....

2006-08-09 09:40:56

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Edward Shishkin wrote:

> Hans Reiser wrote:
>
>> Edward Shishkin wrote:
>>
>>
>>>>
>>>> How about we switch to ecc, which would help with bit rot not sector
>>>> loss?
>>>
>>>
>>>
>>> Interesting aspect.
>>>
>>> Yes, we can implement ECC as a special crypto transform that inflates
>>> data. As I mentioned earlier, it is possible via translation of key
>>> offsets with scale factor > 1.
>>>
>>> Of course, it is better then nothing, but anyway meta-data remains
>>> ecc-unprotected, and, hence, robustness is not increased..
>>>
>>> Edward.
>>
>>
>>
>> Would you prefer to do it as a node layout plugin instead, so as to get
>> the metadata?
>>
>
> Yes, it looks like a business of node plugin, but AFAIK, you
> objected against such checks:

Did I really? Well, I think that allowing users to choose whether to
checksum or not is a reasonable thing to allow them. I personally would
skip the checksum on my computer, but others....

It could be a useful mkfs option....

> currently only bitmap nodes have
> a protection (checksum); supporting ecc-signatures is more
> space/cpu expensive.
>
> Edward.
>
>

2006-08-09 09:48:39

by Pavel Machek

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

On Wed 2006-08-09 02:37:45, Hans Reiser wrote:
> Pavel Machek wrote:
>
> >
> >
> >Yes, I'm afraid redundancy/checksums kill write speed,
> >
> they kill write speed to cache, but not to disk.... our compression
> plugin is faster than the uncompressed plugin.....

Yes, you can get clever. But your compression plugin also means that
single bit error means whole block is lost, so there _is_ speed
vs. stability-against-hw-problems.

But you are right that compression will catch same class of errors
checksums will, so that it is probably good thing w.r.t. stability.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2006-08-09 10:15:34

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Pavel Machek wrote:

>On Wed 2006-08-09 02:37:45, Hans Reiser wrote:
>
>
>>Pavel Machek wrote:
>>
>>
>>
>>>Yes, I'm afraid redundancy/checksums kill write speed,
>>>
>>>
>>>
>>they kill write speed to cache, but not to disk.... our compression
>>plugin is faster than the uncompressed plugin.....
>>
>>
>
>Yes, you can get clever. But your compression plugin also means that
>single bit error means whole block is lost, so there _is_ speed
>vs. stability-against-hw-problems.
>
>But you are right that compression will catch same class of errors
>checksums will, so that it is probably good thing w.r.t. stability.
>
> Pavel
>
>
So we need to use ecc not checksums if we want to increase
reliability. Edward, can you comment in more detail regarding your
views and the performance issues for ecc that you see?

2006-08-09 12:29:05

by Jan Engelhardt

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

>> Yes, it looks like a business of node plugin, but AFAIK, you
>> objected against such checks:
>
>Did I really? Well, I think that allowing users to choose whether to
>checksum or not is a reasonable thing to allow them. I personally would
>skip the checksum on my computer, but others....
>
>It could be a useful mkfs option....

It should preferably a runtime tunable variable, at best even
per-superblock and (overriding the sb setting), per-file.

Jan Engelhardt
--

2006-08-09 15:48:39

[permalink] [raw]

Subject: Re: the " 'official' point of view" expressed by kernelnewbies.org regarding reiser4 inclusion

Jan Engelhardt wrote:
>>> Yes, it looks like a business of node plugin, but AFAIK, you
>>> objected against such checks:
>> Did I really? Well, I think that allowing users to choose whether to
>> checksum or not is a reasonable thing to allow them. I personally would
>> skip the checksum on my computer, but others....
>>
>> It could be a useful mkfs option....
>
> It should preferably a runtime tunable variable, at best even
> per-superblock and (overriding the sb setting), per-file.

Sounds almost exactly like a plugin. And yes, that would be the way to
do it, especially considering some files will already have internal
consistency checking -- just as we should allow direct disk IO to some
files (no journaling) when the files in question are databases that do
their own journaling.

2006-08-09 15:52:36