2009-09-03 02:00:58

by Ric Wheeler

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 08/31/2009 09:21 AM, Christoph Hellwig wrote:
> On Mon, Aug 31, 2009 at 09:19:58AM -0400, Mark Lord wrote:
>>> In my opinion even that is too weak. We know how to control the cache
>>> settings on all common disks (that is scsi and ata), so we should always
>>> disable the write cache unless we know that the whole stack (filesystem,
>>> raid, volume managers) supports barriers. And even then we should make
>>> sure the filesystem does actually use barriers everywhere that's needed,
>>> which we failed at for years.
>> ..
>>
>> That stack does not know that my MD device has full battery backup,
>> so it bloody well better NOT prevent me from enabling the write caches.
>
> No one is going to prevent you from doing it. That question is one of
> sane defaults. And always safe, but slower if you have advanced
> equipment is a much better default than unsafe by default on most of
> the install base.
>

Just to add some support to this, all of the external RAID arrays that I know of
normally run with write cache disabled on the component drives. In addition,
many of them will disable their internal write cache if/when they detect that
they have lost their UPS.

I think that if we had done this kind of sane default earlier for MD levels that
do not handle barriers, we would not have left some people worried about our
software RAID.

To be clear, if a sophisticated user wants to override this default, that should
be supported. Leaving the write cache enabled is simply not (in my opinion) a safe
default behaviour.

Ric


2009-09-03 11:12:25

by Krzysztof Halasa

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Ric Wheeler <[email protected]> writes:

> Just to add some support to this, all of the external RAID arrays that
> I know of normally run with write cache disabled on the component
> drives.

Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
--
Krzysztof Halasa

2009-09-03 11:19:27

by Ric Wheeler

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 09/03/2009 07:12 AM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>> Just to add some support to this, all of the external RAID arrays that
>> I know of normally run with write cache disabled on the component
>> drives.
>
> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?

Which drives the various vendors ship changes with specific products. Usually, they
ship drives that have carefully vetted firmware, etc., but they are close to the
same drives you buy on the open market.

Seagate has a huge slice of the market.

ric

2009-09-03 13:34:47

by Krzysztof Halasa

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Ric Wheeler <[email protected]> writes:

>>> Just to add some support to this, all of the external RAID arrays that
>>> I know of normally run with write cache disabled on the component
>>> drives.
>>
>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>
>> Which drives the various vendors ship changes with specific products.
> Usually, they ship drives that have carefully vetted firmware, etc.
> but they are close to the same drives you buy on the open market.

But they aren't the same, are they? If they are not, the fact they can
run well with the write-through cache doesn't mean the off-the-shelf
ones can do as well.

Are they SATA (or PATA) at all? SCSI etc. are usually different
animals, though there are SCSI and SATA models which differ only in
electronics.

Do you have battery-backed write-back RAID cache (which acknowledges
flushes before the data is written out to disks)? A PC can't do that.
--
Krzysztof Halasa

2009-09-03 13:51:54

by Ric Wheeler

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On 09/03/2009 09:34 AM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>>>> Just to add some support to this, all of the external RAID arrays that
>>>> I know of normally run with write cache disabled on the component
>>>> drives.
>>>
>>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>>
>> Which drives the various vendors ship changes with specific products.
>> Usually, they ship drives that have carefully vetted firmware, etc.
>> but they are close to the same drives you buy on the open market.
>
> But they aren't the same, are they? If they are not, the fact they can
> run well with the write-through cache doesn't mean the off-the-shelf
> ones can do as well.

Storage vendors have a wide range of options, but what you get today is a
collection of S-ATA (not much any more), SAS or FC.

Sometimes they will have different firmware, other times it is the same.


>
> Are they SATA (or PATA) at all? SCSI etc. are usually different
> animals, though there are SCSI and SATA models which differ only in
> electronics.
>
> Do you have battery-backed write-back RAID cache (which acknowledges
> flushes before the data is written out to disks)? PC can't do that.

We (red hat) have all kinds of different raid boxes...

ric

2009-09-03 13:59:18

by Krzysztof Halasa

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

Ric Wheeler <[email protected]> writes:

> We (red hat) have all kinds of different raid boxes...

I have no doubt about it, but are those you know of equipped with
battery-backed write-back cache? Are they using SATA disks?

We can _at_best_ compare non-battery-backed RAID using SATA disks with
what we typically have in a PC.
--
Krzysztof Halasa

2009-09-03 14:16:01

by Ric Wheeler

Subject: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/03/2009 09:59 AM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>> We (red hat) have all kinds of different raid boxes...
>
> I have no doubt about it, but are those you know of equipped with
> battery-backed write-back cache? Are they using SATA disks?
>
> We can _at_best_ compare non-battery-backed RAID using SATA disks with
> what we typically have in a PC.

The whole thread above is about software MD using commodity drives (S-ATA or
SAS) without battery backed write cache.

We have that (and I have it personally) and do test it.

You must disable the write cache on these commodity drives *if* the MD RAID
level does not support barriers properly.

This will greatly reduce errors after a power loss (both in degraded state and
non-degraded state), but it will not eliminate data loss entirely. You simply
cannot do that with any storage device!

Note that even without MD raid, the file system issues IO's in file system block
size (4096 bytes normally) and most commodity storage devices use a 512 byte
sector size which means that we have to update 8 512b sectors.

Drives can (and do) have multiple platters and surfaces and it is perfectly
normal to have contiguous logical ranges of sectors map to non-contiguous
sectors physically. Imagine a 4KB write stripe that straddles two adjacent
tracks on one platter (requiring a seek) or mapped across two surfaces
(requiring a head switch). Also, a remapped sector can require more or less a
full surface seek from wherever you are to the remapped sector area of the drive.

These are all examples of ways that, after a power loss, even a local (non-MD)
device can do a partial update of that 4KB write range of sectors. Note that
unlike RAID/MD, local storage has no parity on the server to detect this
partial write.

This is why new file systems like btrfs and zfs do checksumming of data and
metadata. This won't prevent partial updates during a write, but can at least
detect them and try to do some kind of recovery.
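
To make the detection idea concrete, here is a toy sketch (hypothetical on-disk
layout with a CRC32 per 4KB block -- this is not btrfs or zfs code, just the
shape of the idea):

import zlib, struct

BLOCK = 4096
SECTOR = 512

def write_block(f, blkno, payload):
    # Store a CRC32 of the payload in a 4-byte footer inside the block.
    assert len(payload) == BLOCK - 4
    data = payload + struct.pack("<I", zlib.crc32(payload) & 0xffffffff)
    f.seek(blkno * BLOCK)
    f.write(data)

def read_block(f, blkno):
    f.seek(blkno * BLOCK)
    data = f.read(BLOCK)
    payload, (stored,) = data[:-4], struct.unpack("<I", data[-4:])
    if zlib.crc32(payload) & 0xffffffff != stored:
        raise IOError("torn or corrupt block %d" % blkno)  # caller may try a mirror
    return payload

# Simulate a torn write: power fails after only 3 of the 8 sectors hit the platter.
with open("blocks.img", "w+b") as f:
    write_block(f, 0, b"A" * (BLOCK - 4))
    f.seek(0)
    f.write(b"B" * (3 * SECTOR))          # partial overwrite of the same block
    try:
        read_block(f, 0)
    except IOError as e:
        print(e)                          # detected, as block checksums would detect it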

In other words, this is not just an MD issue, it is entirely possible even with
non-MD devices.

Also, when you enable the write cache (MD or not) you are buffering multiple
MB's of data that can go away on power loss. Far greater (10x) the exposure that
the partial RAID rewrite case worries about.

ric

2009-09-03 14:28:40

by Florian Weimer

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

* Ric Wheeler:

> Note that even without MD raid, the file system issues IO's in file
> system block size (4096 bytes normally) and most commodity storage
> devices use a 512 byte sector size which means that we have to update
> 8 512b sectors.

Database software often attempts to deal with this phenomenon
(sometimes called "torn page writes"). For example, you can make sure
that the first time you write to a database page, you keep a full copy
in your transaction log. If the machine crashes, the log is replayed,
first completely overwriting the partially-written page. Only after
that, you can perform logical/incremental logging.

The log itself has to be protected with a different mechanism, so that
you don't try to replay bad data. But you haven't committed to this
data yet, so it is fine to skip bad records.

Therefore, sub-page corruption is a fundamentally different issue from
super-page corruption.
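
As a toy sketch of the full-page-image approach described above (the record
format and checksum choice here are invented for illustration, not any
particular database's):

import hashlib, struct

PAGE = 8192                     # typical DB page, larger than the 4KB fs block

class ToyLog:
    """Append-only log; each record carries its own checksum so replay
    can stop at the first torn/bad record instead of applying garbage."""
    def __init__(self):
        self.records = []
    def append(self, payload):
        self.records.append((hashlib.sha1(payload).digest(), payload))
    def replay(self):
        for digest, payload in self.records:
            if hashlib.sha1(payload).digest() != digest:
                break           # uncommitted/torn tail -- safe to skip
            yield payload

def first_write(log, pages, pageno, newpage):
    # First touch of a page: log the FULL page image, then write in place.
    log.append(struct.pack("<I", pageno) + newpage)
    pages[pageno] = newpage     # the in-place write that might later be torn

def crash_recover(log, pages):
    # Replay completely overwrites any partially-written page.
    for rec in log.replay():
        pageno, = struct.unpack("<I", rec[:4])
        pages[pageno] = rec[4:]

pages = {7: b"x" * PAGE}
log = ToyLog()
first_write(log, pages, 7, b"y" * PAGE)
pages[7] = b"y" * 4096 + b"x" * 4096    # simulate a torn in-place write
crash_recover(log, pages)
assert pages[7] == b"y" * PAGE          # the full image from the log wins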

BTW, older textbooks will tell you that mirroring requires that you
read from two copies of the data and compare it (and have some sort of
tie breaker if you need availability). And you also have to re-read
data you've just written to disk, to make sure it's actually there and
hit the expected sectors. We can't even do this anymore, thanks to
disk caches. And it doesn't seem to be necessary in most cases.

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

2009-09-03 14:36:17

by David Lang

Subject: Re: raid is dangerous but that's secret (was Re: [patch] ext2/3: document conditions when reliable operation is possible)

On Thu, 3 Sep 2009, Krzysztof Halasa wrote:

> Ric Wheeler <[email protected]> writes:
>
>>>> Just to add some support to this, all of the external RAID arrays that
>>>> I know of normally run with write cache disabled on the component
>>>> drives.
>>>
>>> Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
>>
>> Which drives the various vendors ship changes with specific products.
>> Usually, they ship drives that have carefully vetted firmware, etc.
>> but they are close to the same drives you buy on the open market.
>
> But they aren't the same, are they? If they are not, the fact they can
> run well with the write-through cache doesn't mean the off-the-shelf
> ones can do as well.

frequently they are exactly the same drives, with exactly the same
firmware.

you disable the write caches on the drives themselves, but you add a large
write cache (with battery backup) in the raid card/chassis

> Are they SATA (or PATA) at all? SCSI etc. are usually different
> animals, though there are SCSI and SATA models which differ only in
> electronics.

it depends on what raid array you use, some use SATA, some use SAS/SCSI

David Lang

2009-09-03 15:08:55

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/03/2009 10:26 AM, Florian Weimer wrote:
> * Ric Wheeler:
>
>> Note that even without MD raid, the file system issues IO's in file
>> system block size (4096 bytes normally) and most commodity storage
>> devices use a 512 byte sector size which means that we have to update
>> 8 512b sectors.
>
> Database software often attempts to deal with this phenomenon
> (sometimes called "torn page writes"). For example, you can make sure
> that the first time you write to a database page, you keep a full copy
> in your transaction log. If the machine crashes, the log is replayed,
> first completely overwriting the partially-written page. Only after
> that, you can perform logical/incremental logging.
>
> The log itself has to be protected with a different mechanism, so that
> you don't try to replay bad data. But you haven't committed to this
> data yet, so it is fine to skip bad records.

Yes - databases worry a lot about this. Another technique that they tend to use
is to have state bits at the beginning and end of their logical pages. For
example, the first byte and last byte toggle together from 1 to 0 to 1 to 0 as
you update.

If the bits don't match, that is a quick indication of a torn write.
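
A tiny sketch of that toggle-bit check (purely illustrative page layout, with
the generation bit stored in the first and last byte of a 4KB page):

def page_is_torn(page):
    # First and last byte carry the same generation bit and must agree.
    return page[0] != page[-1]

def update_page(old, payload):
    gen = (old[0] ^ 1) & 1          # toggle 1 -> 0 -> 1 on every rewrite
    return bytes([gen]) + payload + bytes([gen])

old = update_page(b"\x00", b"A" * 4094)
new = update_page(old, b"B" * 4094)
torn = new[:2048] + old[2048:]                 # head of new write, tail of old
print(page_is_torn(new), page_is_torn(torn))   # False True
# Note: a tear that leaves both edge sectors intact slips past this quick check.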

Even with the above scheme, you can still have data loss of course - you just
need an IO error in the log and in your db table that was recently updated. Not
entirely unlikely, especially if you use write cache enabled storage and don't
flush that cache :-)

>
> Therefore, sub-page corruption is a fundamentally different issue from
> super-page corruption.

We have to be careful to keep our terms clear since the DB pages are (usually)
larger than the FS block size which in turn is larger than non-RAID storage
sector size. At the FS level, we send down multiples of fs blocks (not
blocked/aligned at RAID stripe levels, etc).

In any case, we can get sub-FS block level "torn writes" even with a local S-ATA
drive in edge conditions.


>
> BTW, older textbooks will tell you that mirroring requires that you
> read from two copies of the data and compare it (and have some sort of
> tie breaker if you need availability). And you also have to re-read
> data you've just written to disk, to make sure it's actually there and
> hit the expected sectors. We can't even do this anymore, thanks to
> disk caches. And it doesn't seem to be necessary in most cases.
>

We can do something like this with the built in RAID in btrfs. If you detect an
IO error (or bad checksum) on a read, btrfs knows how to request/grab another copy.

Also note that SCSI T10 DIF/DIX has baked-in support for applications to
layer on extra data integrity (look for MKP's slide decks). This is really neat
since you can intercept bad IOs on the way down and prevent overwriting good data.

ric

2009-09-03 23:50:47

by Krzysztof Halasa

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Ric Wheeler <[email protected]> writes:

> The whole thread above is about software MD using commodity drives
> (S-ATA or SAS) without battery backed write cache.

Yes. However, you mentioned external RAID arrays disable disk caches.
That's why I asked if they are using SATA or SCSI/etc. disks, and if
they have battery-backed cache.

> Also, when you enable the write cache (MD or not) you are buffering
> multiple MB's of data that can go away on power loss. Far greater
> (10x) the exposure that the partial RAID rewrite case worries about.

The cache is flushed with working barriers. I guess it should be
superior to disabled WB cache, in both performance and expected disk
lifetime.
--
Krzysztof Halasa

2009-09-04 00:40:02

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/03/2009 07:50 PM, Krzysztof Halasa wrote:
> Ric Wheeler<[email protected]> writes:
>
>
>> The whole thread above is about software MD using commodity drives
>> (S-ATA or SAS) without battery backed write cache.
>>
> Yes. However, you mentioned external RAID arrays disable disk caches.
> That's why I asked if they are using SATA or SCSI/etc. disks, and if
> they have battery-backed cache.
>
>

Sorry for the confusion - they disable the write caches on the component
drives normally, but have their own write cache which is not disabled in
most cases.

>> Also, when you enable the write cache (MD or not) you are buffering
>> multiple MB's of data that can go away on power loss. Far greater
>> (10x) the exposure that the partial RAID rewrite case worries about.
>>
> The cache is flushed with working barriers. I guess it should be
> superior to disabled WB cache, in both performance and expected disk
> lifetime.
>

True - barriers (especially on big, slow s-ata drives) usually give you
an overall win. On SAS drives they seem to make less of an impact, but then
you always need to benchmark your workload on anything to get the only
numbers that really matter :-)

ric

2009-09-04 21:21:30

by Mark Lord

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Ric Wheeler wrote:
..
> You must disable the write cache on these commodity drives *if* the MD
> RAID level does not support barriers properly.
..

Rather than further trying to cripple Linux on the notebook,
(it's bad enough already)..

How about instead, *fixing* the MD layer to properly support barriers?
That would be far more useful, productive, and better for end-users.

Cheers

2009-09-04 21:30:04

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/04/2009 05:21 PM, Mark Lord wrote:
> Ric Wheeler wrote:
> ..
>> You must disable the write cache on these commodity drives *if* the
>> MD RAID level does not support barriers properly.
> ..
>
> Rather than further trying to cripple Linux on the notebook,
> (it's bad enough already)..

People using MD on notebooks (not sure there are that many using RAID5
MD) could leave their write cache enabled.

>
> How about instead, *fixing* the MD layer to properly support barriers?
> That would be far more useful, productive, and better for end-users.
>
> Cheers

Fixing MD would be great - not sure that it would end up still faster
(look at md1 devices with working barriers compared to md1 with
write cache disabled).

In the meantime, if you are using MD to make your data more reliable, I
would still strongly urge you to disable the write cache when you see
"barriers disabled" messages spit out in /var/log/messages :-)

ric

2009-09-05 12:57:59

by Mark Lord

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Ric Wheeler wrote:
> On 09/04/2009 05:21 PM, Mark Lord wrote:
..
>> How about instead, *fixing* the MD layer to properly support barriers?
>> That would be far more useful, productive, and better for end-users.
..
> Fixing MD would be great - not sure that it would end up still faster
> (look at md1 devices with working barriers compared to md1 with
> write cache disabled).
..

There's no inherent reason for it to be slower, except possibly
drives with b0rked FUA support.

So the first step is to fix MD to pass barriers to the LLDs
for most/all RAID types.

Then, if it has performance issues, those can be addressed
by more application of little grey cells. :)

Cheers

2009-09-05 13:41:45

by Ric Wheeler

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On 09/05/2009 08:57 AM, Mark Lord wrote:
> Ric Wheeler wrote:
>> On 09/04/2009 05:21 PM, Mark Lord wrote:
> ..
>>> How about instead, *fixing* the MD layer to properly support barriers?
>>> That would be far more useful, productive, and better for end-users.
> ..
>> Fixing MD would be great - not sure that it would end up still faster
>> (look at md1 devices with working barriers compared to md1 with
>> write cache disabled).
> ..
>
> There's no inherent reason for it to be slower, except possibly
> drives with b0rked FUA support.
>
> So the first step is to fix MD to pass barriers to the LLDs
> for most/all RAID types.
> Then, if it has performance issues, those can be addressed
> by more application of little grey cells. :)
>
> Cheers

The performance issue with MD is that the "simple" answer is to not only
pass on those downstream barrier ops, but also to block and wait until
all of those dependent barrier ops complete before ack'ing the IO.

When you do that implementation at least, you will see a very large
performance impact and I am not sure that you would see any less degradation
than by just turning off the write caches.

Sounds like we should actually do some testing and measure. I
do think that it will vary with the class of device quite a lot, just
like we see with single-disk barriers vs write cache disabled on SAS vs
S-ATA, etc...

ric

2009-09-05 21:44:01

by NeilBrown

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On Sat, September 5, 2009 10:57 pm, Mark Lord wrote:
> Ric Wheeler wrote:
>> On 09/04/2009 05:21 PM, Mark Lord wrote:
> ..
>>> How about instead, *fixing* the MD layer to properly support barriers?
>>> That would be far more useful, productive, and better for end-users.
> ..
>> Fixing MD would be great - not sure that it would end up still faster
>> (look at md1 devices with working barriers compared to md1 with
>> write cache disabled).
> ..
>
> There's no inherent reason for it to be slower, except possibly
> drives with b0rked FUA support.
>
> So the first step is to fix MD to pass barriers to the LLDs
> for most/all RAID types.

Having MD "pass barriers" to LLDs isn't really very useful.
The barrier needs to act with respect to all addresses of the device,
and once you pass it down, it can only act with respect to addresses
on that device.
What any striping RAID level needs to do when it sees a barrier
is:
  suspend all future writes
  drain and flush all queues
  submit the barrier write
  drain and flush all queues
  unsuspend writes

I guess "drain can flush all queues" can be done with an empty barrier
so maybe that is exactly what you meant.

The double flush which (I think) is required by the barrier semantic
is unfortunate. I wonder if it would actually make things slower than
necessary.
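
In rough Python-as-pseudocode, the sequence above looks like this (a
single-threaded model with invented helper names -- real MD does this
asynchronously and per request, of course):

class StripedArray:
    def __init__(self, members):
        self.members = members
        self.queues = {m: [] for m in members}   # per-device pending writes
        self.suspended = False

    def submit_write(self, member, bio):
        assert not self.suspended       # writes are held while a barrier runs
        self.queues[member].append(bio)

    def flush_cache(self, member):
        print("FLUSH", member)          # stand-in for a per-device cache flush

    def drain_and_flush(self):
        # Wait for every queued write on every member, then flush its cache
        # (an empty barrier per member would do the same job).
        for m in self.members:
            self.queues[m].clear()      # stand-in for "wait for completion"
            self.flush_cache(m)

    def barrier_write(self, member, bio):
        print("BARRIER WRITE", bio, "->", member)

    def barrier(self, member, bio):
        self.suspended = True           # 1. suspend all future writes
        self.drain_and_flush()          # 2. drain and flush all queues
        self.barrier_write(member, bio) # 3. submit the barrier write itself
        self.drain_and_flush()          # 4. drain and flush again
        self.suspended = False          # 5. unsuspend writes

md = StripedArray(["sda", "sdb", "sdc"])
md.submit_write("sda", "data-1")
md.barrier("sdb", "journal-commit")     # e.g. an fs journal commit block
md.submit_write("sdc", "data-2")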

NeilBrown

>
> Then, if it has performance issues, those can be addressed
> by more application of little grey cells. :)
>
> Cheers
>

2009-09-07 11:45:41

by Pavel Machek

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

Hi!

> Note that even without MD raid, the file system issues IO's in file
> system block size (4096 bytes normally) and most commodity storage
> devices use a 512 byte sector size which means that we have to update 8
> 512b sectors.
>
> Drives can (and do) have multiple platters and surfaces and it is
> perfectly normal to have contiguous logical ranges of sectors map to
> non-contiguous sectors physically. Imagine a 4KB write stripe that
> straddles two adjacent tracks on one platter (requiring a seek) or mapped
> across two surfaces (requiring a head switch). Also, a remapped sector
> can require more or less a full surface seek from wherever you are to
> the remapped sector area of the drive.

Yes, but ext3 was designed to handle the partial write (according to
tytso).

> These are all examples of ways that, after a power loss, even a local
> (non-MD) device can do a partial update of that 4KB write range of
> sectors.

Yes, but ext3 journal protects metadata integrity in that case.

> In other words, this is not just an MD issue, it is entirely possible
> even with non-MD devices.
>
> Also, when you enable the write cache (MD or not) you are buffering
> multiple MB's of data that can go away on power loss. Far greater (10x)
> the exposure that the partial RAID rewrite case worries about.

Yes, that's what barriers are for. Except that they are not there on
MD0/MD5/MD6. They actually work on local sata drives...

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-07 13:11:58

by Theodore Ts'o

Subject: Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage

On Mon, Sep 07, 2009 at 01:45:34PM +0200, Pavel Machek wrote:
>
> Yes, but ext3 was designed to handle the partial write (according to
> tytso).

I'm not sure what made you think that I said that. In practice things
usually work out, as a consequence of the fact that ext3 uses physical
block journaling, but it's not perfect, because...

> > Also, when you enable the write cache (MD or not) you are buffering
> > multiple MB's of data that can go away on power loss. Far greater (10x)
> > the exposure that the partial RAID rewrite case worries about.
>
> Yes, that's what barriers are for. Except that they are not there on
> MD0/MD5/MD6. They actually work on local sata drives...

Yes, but ext3 does not enable barriers by default (the patch has been
submitted but akpm has balked because he doesn't like the performance
degradation and doesn't believe that Chris Mason's "workload of doom"
is a common case). Note though that it is possible for dirty blocks
to remain in the track buffer for *minutes* without being written to
spinning rust platters without a barrier.

See Chris Mason's report of this phenomenon here:

http://lkml.org/lkml/2009/3/30/297

Here's Chris Mason "barrier test" which will corrupt ext3 filesystems
50% of the time after a power drop if the filesystem is mounted with
barriers disabled (which is the default; use the mount option
barrier=1 to enable barriers):

http://lkml.indiana.edu/hypermail/linux/kernel/0805.2/1518.html

(Yes, ext4 has barriers enabled by default.)
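
As a quick sanity check of what you are actually running with, something like
this works -- a sketch; whether ext3 echoes the barrier option back into
/proc/mounts depends on the kernel version:

# Warn about ext3 mounts that do not show barriers enabled.
with open("/proc/mounts") as f:
    for line in f:
        dev, mnt, fstype, opts = line.split()[:4]
        if fstype == "ext3" and "barrier=1" not in opts.split(","):
            print("%s on %s: no barrier=1 in mount options "
                  "(add barrier=1, or verify via dmesg)" % (dev, mnt))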

- Ted

2010-04-04 13:47:46

by Pavel Machek

Subject: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

Hi!

> > Yes, but ext3 was designed to handle the partial write (according to
> > tytso).
>
> I'm not sure what made you think that I said that. In practice things
> usually work out, as a consequence of the fact that ext3 uses physical
> block journaling, but it's not perfect, because...

Ok; so the journalling actually is not reliable on many machines --
not even disk drive manufacturers guarantee full block writes AFAICT.

Maybe there's time to revive the patch to increase the mount count by >1
when the journal is replayed, to do fsck more often when powerfails are
present?
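
Roughly this, perhaps, using the existing e2fsprogs tools rather than a kernel
patch (the device name is made up, and spotting "the journal was just replayed"
is left to the boot scripts, since the exact kernel message varies):

import re, subprocess

DEV = "/dev/sda1"                       # hypothetical ext3 root

def mount_count(dev):
    out = subprocess.check_output(["dumpe2fs", "-h", dev]).decode()
    return int(re.search(r"^Mount count:\s+(\d+)", out, re.M).group(1))

def bump_after_powerfail(dev, extra=10):
    # Make an unclean shutdown "cost" several mounts so the periodic
    # fsck triggers sooner; tune2fs -C sets the stored mount count.
    count = mount_count(dev)
    subprocess.check_call(["tune2fs", "-C", str(count + extra), dev])

# e.g. call bump_after_powerfail(DEV) from a boot script when the boot log
# shows the journal was replayed.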


> > > Also, when you enable the write cache (MD or not) you are buffering
> > > multiple MB's of data that can go away on power loss. Far greater (10x)
> > > the exposure that the partial RAID rewrite case worries about.
> >
> > Yes, that's what barriers are for. Except that they are not there on
> > MD0/MD5/MD6. They actually work on local sata drives...
>
> Yes, but ext3 does not enable barriers by default (the patch has been
> submitted but akpm has balked because he doesn't like the performance
> degradation and doesn't believe that Chris Mason's "workload of doom"
> is a common case). Note though that it is possible for dirty blocks
> to remain in the track buffer for *minutes* without being written to
> spinning rust platters without a barrier.

So we do the wrong thing by default. Another reason to do fsck more often
when powerfails are present?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-04 17:40:51

by Theodore Ts'o

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun, Apr 04, 2010 at 03:47:29PM +0200, Pavel Machek wrote:
> > Yes, but ext3 does not enable barriers by default (the patch has been
> > submitted but akpm has balked because he doesn't like the performance
> > degradation and doesn't believe that Chris Mason's "workload of doom"
> > is a common case). Note though that it is possible for dirty blocks
> > to remain in the track buffer for *minutes* without being written to
> > spinning rust platters without a barrier.
>
> So we do the wrong thing by default. Another reason to do fsck more often
> when powerfails are present?

Or migrate to ext4, which does use barriers by default, as well as
journal-level checksumming. :-)

As far as changing the default to enable barriers for ext3, you'll
need to talk to akpm about that; he's the one who has been against it
in the past.

- Ted

2010-04-04 17:59:27

by Rob Landley

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sunday 04 April 2010 08:47:29 Pavel Machek wrote:
> Maybe there's time to revive the patch to increase the mount count by >1
> when journal is replayed, to do fsck more often when powerfails are
> present?

Wow, you mean there are Linux users left who _don't_ rip that out?

The auto-fsck stuff is an instance of "we the developers know what you the
users need far more than you ever could, so let me ram this down your throat".
I don't know of a server anywhere that can afford an unscheduled extra four
hours of downtime due to the system deciding to fsck itself, and I don't know
a Linux laptop user anywhere who would be happy to fire up their laptop and
suddenly be told "oh, you can't do anything with it for two hours, and you
can't power it down either".

I keep my laptop backed up to an external terabyte USB drive and the volatile
subset of it to a network drive (rsync is great for both), and when it dies,
it dies. But I've never lost data due to an issue fsck would have fixed. I've
lost data to disks overheating, disks wearing out, disks being run undervolt
because the cat chewed on the power supply cord... I've copied floppy images to
/dev/hda instead of /dev/fd0... I even ran over my laptop with my car once.
(Amazingly enough, that hard drive survived.)

But fsck has never once protected any data of mine, that I am aware of, since
journaling was introduced.

I'm all for btrfs coming along and being able to fsck itself behind my back
where I don't have to care about it. (Although I want to tell it _not_ to do
that when on battery power.) But the "fsck lottery" at powerup is just
stupid.

> > > > Also, when you enable the write cache (MD or not) you are buffering
> > > > multiple MB's of data that can go away on power loss. Far greater
> > > > (10x) the exposure that the partial RAID rewrite case worries about.
> > >
> > > Yes, that's what barriers are for. Except that they are not there on
> > > MD0/MD5/MD6. They actually work on local sata drives...
> >
> > Yes, but ext3 does not enable barriers by default (the patch has been
> > submitted but akpm has balked because he doesn't like the performance
> > degradation and doesn't believe that Chris Mason's "workload of doom"
> > is a common case). Note though that it is possible for dirty blocks
> > to remain in the track buffer for *minutes* without being written to
> > spinning rust platters without a barrier.
>
> So we do the wrong thing by default. Another reason to do fsck more often
> when powerfails are present?

My laptop power fails all the time, due to battery exhaustion. Back under KDE
it was decent about suspending when it ran low on power, but ever since
KDE 4 came out and I had to switch to XFCE, it's using the gnome
infrastructure, which collects funky statistics and heuristics but can never
quite save them to disk because suddenly running out of power when it thinks
it's got 20 minutes left doesn't give it the opportunity to save its database.
So it'll never auto-suspend, just suddenly die if I don't hit the button.

As a result of one of these, two large media files in my "anime" subdirectory
are not only crosslinked, but the common sector they share is bad. (It ran
out of power in the act of writing that sector. I left it copying large files
to the drive and forgot to plug it in, and it did the loud click emergency
park and power down thing when the hardware voltage regulator tripped.)

This corruption has been there for a year now. Presumably if it overwrote
that sector it might recover (perhaps by allocating one of the spares), but
the drive firmware has proven unwilling to do so in response to _reading_ the
bad sector, and I'm largely ignoring it because it's by no means the worst
thing wrong with this laptop's hardware, and some glorious day I'll probably
break down and buy a macintosh. The stuff I have on it's backed up, and in the
year since it hasn't developed a second bad sector and I haven't deleted those
files. (Yes, I could replace the hard drive _again_ but this laptop's on its
third hard drive already and it's just not worth the effort.)

I'm much more comfortable living with this until I can get a new laptop than
with the idea of running fsck on the system and letting it do who knows what
in response to something that is not actually a problem.

> Pavel

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds

2010-04-04 18:46:02

by Pavel Machek

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun 2010-04-04 12:59:16, Rob Landley wrote:
> On Sunday 04 April 2010 08:47:29 Pavel Machek wrote:
> > Maybe there's time to revive the patch to increase the mount count by >1
> > when journal is replayed, to do fsck more often when powerfails are
> > present?
>
> Wow, you mean there are Linux users left who _don't_ rip that out?

Yes, there are. It actually helped pinpoint corruption here; 4 times it
was major corruption.

And yes, I'd like fsck more often when there are power failures and
less often when the shutdowns are orderly...

I'm not sure what the right intervals between checks are for you, but
I'd say that fsck once a year or every 100 mounts or every 10 power
failures is probably a good idea for everybody...

> The auto-fsck stuff is an instance of "we the developers know what you the
> users need far more than you ever could, so let me ram this down your throat".
> I don't know of a server anywhere that can afford an unscheduled extra four
> hours of downtime due to the system deciding to fsck itself, and I don't know
> a Linux laptop user anywhere who would be happy to fire up their laptop and
> suddenly be told "oh, you can't do anything with it for two hours, and you
> can't power it down either".

On a laptop the situation is easy. Pull the plug, hit reset, wait for fsck,
plug AC back in. Done that, too :-).

Yep, it would be nice if fsck had an "escape" button.

> I'm all for btrfs coming along and being able to fsck itself behind my back
> where I don't have to care about it. (Although I want to tell it _not_ to do
> that when on battery power.) But the "fsck lottery" at powerup is just
> stupid.

fsck lottery. :-).
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2010-04-04 19:30:46

by Theodore Ts'o

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote:
> I don't know of a server anywhere that can afford an unscheduled
> extra four hours of downtime due to the system deciding to fsck
> itself, and I don't know a Linux laptop user anywhere who would be
> happy to fire up their laptop and suddenly be told "oh, you can't do
> anything with it for two hours, and you can't power it down either".

So what I recommend for server class machines is to either turn off
the automatic fsck's (it's the default, but it's documented and there
are supported ways of turning it off --- that's hardly developers
"ramming" it down users' throats), or more preferably, to use LVM, and
then take a snapshot and run fsck on the snapshot.
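
For example, roughly (a sketch in the spirit of e2croncheck, with made-up VG/LV
names -- e2croncheck in the contrib directory of e2fsprogs is the polished
version):

import subprocess

VG, LV = "vg0", "root"                 # hypothetical volume group / LV
SNAP = LV + "-fsck-snap"

def check_via_snapshot():
    # 1. snapshot the live LV (needs some free extents in the VG)
    subprocess.check_call(["lvcreate", "-s", "-L", "1G",
                           "-n", SNAP, "/dev/%s/%s" % (VG, LV)])
    try:
        # 2. read-only fsck of the frozen image; the live fs stays mounted
        rc = subprocess.call(["e2fsck", "-fn", "/dev/%s/%s" % (VG, SNAP)])
        if rc != 0:
            print("fsck of snapshot found problems; schedule a real check")
    finally:
        # 3. throw the snapshot away
        subprocess.check_call(["lvremove", "-f", "/dev/%s/%s" % (VG, SNAP)])

check_via_snapshot()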

> I'm all for btrfs coming along and being able to fsck itself behind
> my back where I don't have to care about it. (Although I want to
> tell it _not_ to do that when on battery power.)

You can do this with ext3/ext4 today, now. Just take a look at
e2croncheck in the contrib directory of e2fsprogs. Changing it to not
do this when on battery power is a trivial exercise.

> My laptop power fails all the time, due to battery exhaustion. Back
> under KDE it was decent about suspending when it ran low on
> power, but ever since KDE 4 came out and I had to switch to XFCE,
> it's using the gnome infrastructure, which collects funky statistics
> and heuristics but can never quite save them to disk because
> suddenly running out of power when it thinks it's got 20 minutes
> left doesn't give it the opportunity to save its database. So it'll
> never auto-suspend, just suddenly die if I don't hit the button.

Hmm, why are you running on battery so often? I make a point of
running connected to the AC mains whenever possible, because a LiOn
battery only has about 200 full-cycle charge/discharges in it, and
given the cost of LiOn batteries, basically each charge/discharge
cycle costs a dollar each. So I only run on batteries when I
absolutely have to, and in practice it's rare that I dip below 30% or
so.

> As a result of one of these, two large media files in my "anime"
> subdirectory are not only crosslinked, but the common sector they
> share is bad. (It ran out of power in the act of writing that
> sector. I left it copying large files to the drive and forgot to
> plug it in, and it did the loud click emergency park and power down
> thing when the hardware voltage regulator tripped.)

So e2fsck would fix the cross-linking. We do need to have some better
tools to do forced rewrite of sectors that have gone bad in a HDD. It
can be done by using badblocks -n, but that requires translating the sector number
emitted by the device driver (which for some drivers is relative to
the beginning of the partition, and for others is relative to the
beginning of the disk). It is possible to run badblocks -w on the
whole disk, of course, but it's better to just run it on the specific
block in question.
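
The arithmetic itself is simple once you know which base the driver used; a
sketch (assuming 512-byte logical sectors):

def fs_block(error_lba, partition_start_lba, fs_block_size=4096,
             lba_relative_to_disk=True, sector_size=512):
    # Translate the LBA from the kernel error message into the filesystem
    # block number that badblocks -b 4096 (and e2fsck's badblocks list) expect.
    if lba_relative_to_disk:
        error_lba -= partition_start_lba
    return (error_lba * sector_size) // fs_block_size

# e.g. an error at disk LBA 1026 on a partition starting at LBA 63:
print(fs_block(1026, 63))   # -> filesystem block 120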

> I'm much more comfortable living with this until I can get a new laptop than
> with the idea of running fsck on the system and letting it do who knows what
> in response to something that is not actually a problem.

Well, it actually is a problem. And there may be other problems
hiding that you're not aware of. Running "badblocks -b 4096 -n" may
discover other blocks that have failed, and you can then decide
whether you want to let fsck fix things up. If you don't, though,
it's probably not fair to blame ext3 or e2fsck for any future
failures (not that it's likely to stop you :-).

- Ted

2010-04-04 19:35:31

by Theodore Ts'o

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sun, Apr 04, 2010 at 08:45:46PM +0200, Pavel Machek wrote:
>
> I'm not sure what the right intervals between checks are for you, but
> I'd say that fsck once a year or every 100 mounts or every 10 power
> failures is probably a good idea for everybody...

For people using e2croncheck, where you can check it when the system
is idle and without needing to do a power cycle, I'd recommend once a
week, actually.

> > hours of downtime due to the system deciding to fsck itself, and I
> > don't know a Linux laptop user anywhere who would be happy to fire
> > up their laptop and suddenly be told "oh, you can't do anything
> > with it for two hours, and you can't power it down either".
>
> On a laptop the situation is easy. Pull the plug, hit reset, wait for fsck,
> plug AC back in. Done that, too :-).

Some distributions will allow you to cancel an fsck; either by using
^C, or hitting escape. That's a matter for the boot scripts, which
are distribution specific. Ubuntu has a way of doing this, for
example, if I recall correctly --- although since I've started using
e2croncheck, I've never had an issue with an e2fsck taking place on
bootup. Also, ext4, fscks are so much much faster that even before I
upgraded to using an SSD, it's never been an issue for me. It's
certainly not hours any more....

> Yep, it would be nice if fsck had an "escape" button.

Complain to your distribution. :-)

Or this is Linux and open source; fix it yourself, and submit the
patches back to your distribution. If all you want to do is whine,
then maybe Rob's choice is the best way, go switch to the velvet-lined
closed system/jail which is the Macintosh. :-)

(I created e2croncheck to solve my problem; if that isn't good enough
for you, I encourage you to find/create your own fixes.)

- Ted

2010-04-04 23:58:51

by Rob Landley

Subject: Re: fsck more often when powerfail is detected (was Re: wishful thinking about atomic, multi-sector or full MD stripe width, writes in storage)

On Sunday 04 April 2010 14:29:12 [email protected] wrote:
> On Sun, Apr 04, 2010 at 12:59:16PM -0500, Rob Landley wrote:
> > I don't know of a server anywhere that can afford an unscheduled
> > extra four hours of downtime due to the system deciding to fsck
> > itself, and I don't know a Linux laptop user anywhere who would be
> > happy to fire up their laptop and suddenly be told "oh, you can't do
> > anything with it for two hours, and you can't power it down either".
>
> So what I recommend for server class machines is to either turn off
> the automatic fsck's (it's the default, but it's documented and there
> are supported ways of turning it off --- that's hardly developers
> "ramming" it down user's throats), or more preferably, to use LVM, and
> then use a snapshot and running fsck on the snapshot.

Turning off the automatic fsck is what I see people do, yes.

My point is that if you don't force the thing to run memtest86 overnight every
20 boots, forcing it to run fsck seems a bit silly.

> > I'm all for btrfs coming along and being able to fsck itself behind
> > my back where I don't have to care about it. (Although I want to
> > tell it _not_ to do that when on battery power.)
>
> You can do this with ext3/ext4 today, now. Just take a look at
> e2croncheck in the contrib directory of e2fsprogs. Changing it to not
> do this when on battery power is a trivial exercise.
>
> > My laptop power fails all the time, due to battery exhaustion. Back
> > under KDE it was decent about suspending when it was ran low on
> > power, but ever since KDE 4 came out and I had to switch to XFCE,
> > it's using the gnome infrastructure, which collects funky statistics
> > and heuristics but can never quite save them to disk because
> > suddenly running out of power when it thinks it's got 20 minutes
> > left doesn't give it the opportunity to save its database. So it'll
> > never auto-suspend, just suddenly die if I don't hit the button.
>
> Hmm, why are you running on battery so often?

Personal working style?

When I was in Pittsburgh, I used the laptop on the bus to and from work every
day. Here in Austin, my laundromat has free wifi. It also gets usable free
wifi from the coffee shop to the right, the Japanese restaurant to the left, and
the ice cream shop across the street. (And when I'm not in a wifi area, my
cell phone can bluetooth associate to give me net access too.)

I like coffee shops. (Of course the fact that if I try to work from home I
have to fight off the affections of four cats might have something to do with it
too...)

> I make a point of
> running connected to the AC mains whenever possible, because a LiOn
> battery only has about 200 full-cycle charge/discharges in it, and
> given the cost of LiOn batteries, basically each charge/discharge
> cycle costs a dollar each.

Actually the battery's about $50, so that would be 25 cents each.

My laptop is on its third battery. It's also on its third hard drive.

> So I only run on batteries when I
> absolutely have to, and in practice it's rare that I dip below 30% or
> so.

Actually I find the suckers die just as quickly from simply being plugged in
and kept hot by the electronics, and never used so they're pegged at 100% with
slight trickle current beyond that constantly overcharging them.

> > As a result of one of these, two large media files in my "anime"
> > subdirectory are not only crosslinked, but the common sector they
> > share is bad. (It ran out of power in the act of writing that
> > sector. I left it copying large files to the drive and forgot to
> > plug it in, and it did the loud click emergency park and power down
> > thing when the hardware voltage regulator tripped.)
>
> So e2fsck would fix the cross-linking. We do need to have some better
> tools to do forced rewrite of sectors that have gone bad in a HDD. It
> can be done by using badblocks -n, but translating the sector number
> emitted by the device driver (which for some drivers is relative to
> the beginning of the partition, and for others is relative to the
> beginning of the disk). It is possible to run badblocks -w on the
> whole disk, of course, but it's better to just run it on the specific
> block in question.

The point I was trying to make is that running "preemptive" fsck is imposing a
significant burden on users in an attempt to find purely theoretical problems,
with the expectation that a given run will _not_ find them. I've had systems
taken out by actual hardware issues often enough that keeping good backups and
being prepared to lose the entire laptop at any time is just common sense.

I knocked my laptop into the bathtub last month. Luckily there wasn't any
water in the thing at the time, but it made a very loud bang when it hit, and
it was on at the time. (Checked dmesg several times over the next few days
and it didn't start spitting errors at me, so that's something...)

> > I'm much more comfortable living with this until I can get a new laptop
> > than with the idea of running fsck on the system and letting it do who
> > knows what in response to something that is not actually a problem.
>
> Well, it actually is a problem. And there may be other problems
> hiding that you're not aware of. Running "badblocks -b 4096 -n" may
> discover other blocks that have failed, and you can then decide
> whether you want to let fsck fix things up. If you don't, though,
> it's probably not fair to blame ext3 or e2fsck for any future
> failures (not that it's likely to stop you :-).

I'm not blaming ext2. I'm saying I've spilled sodas into my working machines
on so many occasions over the years I've lost _track_. (The vast majority of
'em survived, actually.)

Random example of current cascading badness: The latch sensor on my laptop is
no longer debounced. That happened when I upgraded to Ubuntu 9.04 but I'm not
sure how that _can_ screw that up, you'd think the bios would be in charge of
that. So anyway, it now has a nasty habit of waking itself up in the nice
insulated pocket in my backpack and then shutting itself down hard five minutes
later when the thermal sensors trip (at the bios level I think, not in the
OS). So I now regularly suspend to disk instead of to ram because that way it
can't spuriously wake itself back up just because it got jostled slightly.
Except that when it resumes from disk, the console it suspended in is totally
misprogrammed (vertical lines on what it _thinks_ is text mode), and sometimes
the chip is so horked I can hear the sucker making a screeching noise. The
easy workarond is to ctrl-alt-F1 and suspend from a text console, then Ctrl-
alt-f7 gets me back to the desktop. But going back to that text console
remembers the misprogramming, and I get vertical lines and an audible whine
coming from something that isn't a speaker. (Luckily cursor-up and enter works
to re-suspend, so I can just sacrifice one console to the suspend bug.)

The _fun_ part is that the last system I had where X11 regularly misprogrammed
it so badly I could _hear_ the video chip, said video chip eventually
overheated and melted bits of the motherboard. (That was a toshiba laptop.
It took out the keyboard controller first, and I used it for a few months with
an external keyboard until the whole thing just went one day. The display you
get when your video chip finally goes can be pretty impressive. Way prettier
than the time I was caught in a thunderstorm and my laptop got soaked and two
vertical sections of the display were flickering white while the rest was
displaying normally -- that system actually started working again when it dried
out...)

It just wouldn't be a Linux box to me if I didn't have workarounds for the
side effects of my workarounds.

Anyway, this is the perspective from which I say that the fsck to look for
purely theoretical badness on my otherwise perfect system is not worth 2 hours
to never find anything wrong.

If Ubuntu's little upgrade icon had a "recommend fsck" thing that lights up
every 3 months which I could hit some weekend when I was going out anyway,
that would be one thing. But "Ah, Ubuntu 9.04 moved DRM from X11 into the
kernel and the Intel 945 3D driver is now psychotic and it froze your machine
for the second time this week. Since you're rebooting anyway, you won't mind
if I add an extra 3 hours to the process"...? That stopped really being a
viable assumption some time before hard drives were regularly measured in
terabytes.

> - Ted

Rob
--
Latency is more important than throughput. It's that simple. - Linus Torvalds