LinuxLists.cc - Journaling pointless with today's hard disks?

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sat, 24 Nov 2001, Phil Howard wrote:

> Now I can see a problem if the drive can't flush a write-back cache
> during the "power fade". With some pretty big caches many drives
> have these days (although I wonder just how useful that is with OS
> caches being as good as they are), the time it takes to flush could
> be long (a few seconds ... and lights are out by then). I sure hope
> all my drives do write-through caching or don't cache writes at all.

Well, the DTLA drives ship with their writeback cache ENABLED and
transparent remapping DISABLED by default, so putting /sbin/hdparm -W0
/dev/hdX into your boot sequence before mounting the first filesystem
r/w and before calling upon fsck is certainly not a bad idea with
those. Alternatively, you can use IBM's feature tool to reconfigure the
drive.

On a related issue, I asked a person with access to DARA OEM (2.5" HDD)
data to look up caching specifications, and IBM does not guarantee data
integrity for cached blocks that have not yet made it to the disk,
although the drives start to flush their caches immediately. So up to
(cache size / block size) blocks may be lost. With the write cache
turned off, the data loss is at most 1 block.

> I would think that as fast as these drives spin these days, they
> could finish a sector between the time the power fade is detected
> and the time the voltage is too low to have the correct write
> current and servo speed. Obviously one problem with lighter weight
> platters is the momentum advantage is reduced for keeping the speed
> right as the power is declining (if the speed is an issue, which I
> am not sure of at all).

Well, I never saw big capacitors on disks, so they just go park and
that's about it. If DTLA corrupt their blocks in a way that low-level
formatting becomes necessary, those disk drives must be phased out at
once unless IBM update their firmware so to be able "this is a hard
checksum error, but actually, we can safely overwrite this block".

> OOC, do you think there is any real advantage to the 1m to 4m cache
> that drives have these days, considering the effective caching in
> the OS that all OSes these days have ... over adding that much
> memory to your system RAM? The only use for caching I can see in
> a drive is if it has physical sector sizes greater than the logical
> sector write granularity size which would require a read-mod-write
> kind of operation internally. But that's not really "cache" anyway.

Yes, these caches allow for bigger write requests or less latency
(didn't figure), doubling throughput on linear writes at least with IBM
DTLA and DJNA drives.

However, if it's really true that DTLA drives and their successor
corrupt blocks (generate bad blocks) on power loss during block writes,
these drives are crap.

HTH,
Matthias

2001-11-24 19:20:31

by Florian Weimer

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Matthias Andree <[email protected]> writes:

> However, if it's really true that DTLA drives and their successor
> corrupt blocks (generate bad blocks) on power loss during block writes,
> these drives are crap.

They do, even IBM admits that (on

http://www.cooling-solutions.de/dtla-faq

you find a quote from IBM confirming this). IBM says it's okay, you
have to expect this to happen. So much for their expertise in making
hard disks. This makes me feel rather dizzy (lots of IBM drives in
use).

--
Florian Weimer [email protected]
University of Stuttgart http://cert.uni-stuttgart.de/
RUS-CERT +49-711-685-5973/fax +49-711-685-5898

2001-11-24 19:30:31

by Rik van Riel

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On 24 Nov 2001, Florian Weimer wrote:

> They do, even IBM admits that (on
>
> http://www.cooling-solutions.de/dtla-faq
>
> you find a quote from IBM confirming this). IBM says it's okay,

That quote is priceless. I know I'll be avoiding IBM
disks from now on ;)

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-11-24 22:28:31

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Followup to: <[email protected]>
By author: Florian Weimer <[email protected]>
In newsgroup: linux.dev.kernel
>
> > However, if it's really true that DTLA drives and their successor
> > corrupt blocks (generate bad blocks) on power loss during block writes,
> > these drives are crap.
>
> They do, even IBM admits that (on
>
> http://www.cooling-solutions.de/dtla-faq
>
> you find a quote from IBM confirming this). IBM says it's okay, you
> have to expect this to happen. So much for their expertise in making
> hard disks. This makes me feel rather dizzy (lots of IBM drives in
> use).
>

No sh*t. I have always been favouring IBM drives, and I had a RAID
system with these drives bought. It will be a LONG time before I buy
another IBM drive, that's for sure. I can't believe they don't even
have the decency of saying "we fucked".

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-24 22:52:22

by John Alvord

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sat, 24 Nov 2001, Rik van Riel wrote:

> On 24 Nov 2001, Florian Weimer wrote:
>
> > They do, even IBM admits that (on
> >
> > http://www.cooling-solutions.de/dtla-faq
> >
> > you find a quote from IBM confirming this). IBM says it's okay,
>
> That quote is priceless. I know I'll be avoiding IBM
> disks from now on ;)

It could be true for many disks and only IBM has admitted it...

john

2001-11-24 23:05:45

by Pedro M. Rodrigues

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

I've always favoured IBM disks in all my hardware, from enterprise external
scsi raid hardware to small ide hardware raid devices (3ware fyi). At home all
my four disks are IBM (two DTLA). But with your information it seems i have
been bitten by that problem twice at the same time. Several months ago a less
zealous system administrator, while shutting down a couple servers for
maintenance at night, made a mistake in the console kvm switch, and pushed
the red button on a live server with four DTLA IBM disks plugged to a 3ware
raid card. On recovery, and after some time, one of the volumes started
complaining about errors, and went into degraded mode. One of the disks was
clearly broken we thought. So we exchanged it, but alas a couple hours later
another one in another volume complained. We also exchanged that one and
rebuilt everything. After checking the disks with IBM drive fitness software both
presented bad blocks that were recovered with a low level format. I dismissed
the events as something weird, but with some logical explanation beyond my
grasp. Now all makes sense.

/Pedro

On 24 Nov 2001 at 20:20, Florian Weimer wrote:

> Matthias Andree <[email protected]> writes:
>
> > However, if it's really true that DTLA drives and their successor
> > corrupt blocks (generate bad blocks) on power loss during block
> > writes, these drives are crap.
>
> They do, even IBM admits that (on
>
> http://www.cooling-solutions.de/dtla-faq
>
> you find a quote from IBM confirming this). IBM says it's okay, you
> have to expect this to happen. So much for their expertise in making
> hard disks. This makes me feel rather dizzy (lots of IBM drives in
> use).
>
> --
> Florian Weimer [email protected]
> University of Stuttgart http://cert.uni-stuttgart.de/
> RUS-CERT +49-711-685-5973/fax
> +49-711-685-5898 - To unsubscribe from this list: send the line
> "unsubscribe linux-kernel" in the body of a message to
> [email protected] More majordomo info at
> http://vger.kernel.org/majordomo-info.html Please read the FAQ at
> http://www.tux.org/lkml/
>

2001-11-24 23:24:06

by Stephen Satchell

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

At 02:28 PM 11/24/01 -0800, H. Peter Anvin wrote:
> > > However, if it's really true that DTLA drives and their successor
> > > corrupt blocks (generate bad blocks) on power loss during block writes,
> > > these drives are crap.
> >
> > They do, even IBM admits that (on
> >
> > http://www.cooling-solutions.de/dtla-faq
> >
> > you find a quote from IBM confirming this). IBM says it's okay, you
> > have to expect this to happen. So much for their expertise in making
> > hard disks. This makes me feel rather dizzy (lots of IBM drives in
> > use).
> >
>
>No sh*t. I have always been favouring IBM drives, and I had a RAID
>system with these drives bought. It will be a LONG time before I buy
>another IBM drive, that's for sure. I can't believe they don't even
>have the decency of saying "we fucked".

It is the responsibility of the power monitor to detect a power-fail event
and tell the drive(s) that a power-fail event is occurring. If power goes
out of specification before the drive completes a commanded write, what do
you expect the poor drive to do? ANY glitch in the write current will
corrupt the current block no matter what -- the final CRC isn't
recorded. Most drives do have a panic-stop mode when they detect voltage
going out of range so as to minimize the damage caused by an
out-of-specification power-down event, and more importantly use the energy
in the spinning platter to get the heads moved to a safe place before the
drive completely spins down. The panic-stop mode is EXACTLY like a Linux
OOPS -- it's a catastrophic event that SHOULD NOT OCCUR.

Most power supplies are not designed to hold up for more than 30-60 ms at
full load upon removal of mains power. Power-fail detect typically
requires 12 ms (three-quarters cycle average at 60 Hz) or 15 ms
(three-quarters cycle average at 50 Hz) to detect that mains power has
failed, leaving your system a very short time to abort that long queue of
disk write commands. It's very possible that by the time the system wakes
up to the fact that its electron feeding tube is empty it has already
started a write operation that cannot be completed before power goes out of
specification. It's a race condition.

Fix your system.

If you don't have a UPS on that RAID, and some means of shutting down the
RAID gracefully when mains power goes, you are sealing your own doom,
regardless of the maker of the hard drive you use in that RAID. Even the
original CDC disk drives, some of the best damn drives ever manufactured in
the world, would corrupt data when power failed during a write.

Satch

2001-11-24 23:29:36

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Stephen Satchell wrote:

>
> It is the responsibility of the power monitor to detect a power-fail
> event and tell the drive(s) that a power-fail event is occurring. If
> power goes out of specification before the drive completes a commanded
> write, what do you expect the poor drive to do? ANY glitch in the write
> current will corrupt the current block no matter what -- the final CRC
> isn't recorded. Most drives do have a panic-stop mode when they detect
> voltage going out of range so as to minimize the damage caused by an
> out-of-specification power-down event, and more importantly use the
> energy in the spinning platter to get the heads moved to a safe place
> before the drive completely spins down. The panic-stop mode is EXACTLY
> like a Linux OOPS -- it's a catastrophic event that SHOULD NOT OCCUR.
>

There is no "power monitor" in a PC system (at least not that is visible
to the drive) -- if the drive needs it, it has to provide it itself.

It's definitely the responsibility of the drive to recover gracefully
from such an event, which means that it writes anything that it has
committed to the host to write; anything it hasn't gotten committed to
write (but has received) can be written or not written, but must not
cause a failure of the drive.

A drive is a PERSISTENT storage device, and as such has responsibilities
the other devices don't.

Anything else is brainless rationalization.

-hpa

2001-11-24 23:42:01

by Phil Howard

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sat, Nov 24, 2001 at 02:51:38PM -0800, John Alvord wrote:

| On Sat, 24 Nov 2001, Rik van Riel wrote:
|
| > On 24 Nov 2001, Florian Weimer wrote:
| >
| > > They do, even IBM admits that (on
| > >
| > > http://www.cooling-solutions.de/dtla-faq
| > >
| > > you find a quote from IBM confirming this). IBM says it's okay,
| >
| > That quote is priceless. I know I'll be avoiding IBM
| > disks from now on ;)
|
| It could be true for many disks and only IBM has admitted it...

Only the IBM drives are having the high return rates. And IBM seems
to be blaming this on powering off during writes. But why would the
other brands not be having this situation? Is it because they don't
get powered off?

It could be that other drives have the capability to detect and write
over sectors made bad by power off. Or maybe they lock out the sector
and map to a spare. They might even have enough spin left to finish
the sector correctly in more cases.

So I doubt the issue is present in other drives, unless the issue is
not really as big of one as we might think and the problems with IBM
drives are something else.

I do worry that the lighter the platters are, the faster they try to
make the drives spin with smaller motors, and the quicker they slow
down when power is lost.

--
-----------------------------------------------------------------
| Phil Howard - KA9WGN | Dallas | http://linuxhomepage.com/ |
| [email protected] | Texas, USA | http://phil.ipal.org/ |
-----------------------------------------------------------------

2001-11-25 00:25:06

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

<snip>
> It could be that other drives have the capability to detect and write
> over sectors made bad by power off. Or maybe they lock out the sector
> and map to a spare. They might even have enough spin left to finish
> the sector correctly in more cases.
>
> So I doubt the issue is present in other drives, unless the issue is
> not really as big of one as we might think and the problems with IBM
> drives are something else.
>
> I do worry that the lighter the platters are, the faster they try to
> make the drives spin with smaller motors, and the quicker they slow
> down when power is lost.

Utterly unimportant.
Let's say for the sake of argument that the drives spins down to a stop
in 1 second.
Now, the datarate for this 40G IDE drive I've got in my box is about
25 megabytes per second, or about 50K sectors per second.
Slowing down isn't a problem.

Somewhere I've got a databook, ca 85 I think, for a motor driver chip,
to drive spindle motors on hard disks, with integrated
diodes that rectify the power coming from the disk when the power fails,
to give a little grace.

If written by people with a clue, the drive does not need to do much
seeking to write the data from a write-cache to dics, just one seek
to a journal track, and a write.
This needs maybe 3 revs to complete, at most.

2001-11-25 00:53:41

by Phil Howard

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sun, Nov 25, 2001 at 12:24:28AM +0000, Ian Stirling wrote:

| <snip>
| > It could be that other drives have the capability to detect and write
| > over sectors made bad by power off. Or maybe they lock out the sector
| > and map to a spare. They might even have enough spin left to finish
| > the sector correctly in more cases.
| >
| > So I doubt the issue is present in other drives, unless the issue is
| > not really as big of one as we might think and the problems with IBM
| > drives are something else.
| >
| > I do worry that the lighter the platters are, the faster they try to
| > make the drives spin with smaller motors, and the quicker they slow
| > down when power is lost.
|
| Utterly unimportant.
| Let's say for the sake of argument that the drives spins down to a stop
| in 1 second.
| Now, the datarate for this 40G IDE drive I've got in my box is about
| 25 megabytes per second, or about 50K sectors per second.
| Slowing down isn't a problem.

If it takes 1 second to spin down to a stop, the it probably will
have slowed to a point where serialization writing a sector cannot
be kept in sync within 1 to 5 milliseconds. Once they _start_
slowing down, time is an extremely precious resource. That data
pattern has to be read back at full speed.

|
| Somewhere I've got a databook, ca 85 I think, for a motor driver chip,
| to drive spindle motors on hard disks, with integrated
| diodes that rectify the power coming from the disk when the power fails,
| to give a little grace.
|
| If written by people with a clue, the drive does not need to do much
| seeking to write the data from a write-cache to dics, just one seek
| to a journal track, and a write.
| This needs maybe 3 revs to complete, at most.

By the time the seek completes, the speed is probably too slow to do a
good write. Options to deal with this include special handling for the
emergency track to allow reading it back by intentionally slowing down
the drive for that recovery. Another option is flash disk.

The apparent problem in the IBM DTLA is the write didn't have enough
time to complete with the platter still spinning within spec. That
means the sector gets compressed at the end and the bit density is
increased beyond readable levels (if it could go higher reliably, they
would just record everything that way). That and the end of the sector
doesn't fall off into the gap between sectors where there is probably
some low level stuff. So on readback, some bits are in error due to
the clocking rate rising due to the compression, and the trailing edge
hits the previous sector occupant's un-erased end before the gap.

--
-----------------------------------------------------------------
| Phil Howard - KA9WGN | Dallas | http://linuxhomepage.com/ |
| [email protected] | Texas, USA | http://phil.ipal.org/ |
-----------------------------------------------------------------

2001-11-25 01:20:51

by dnu478nt5w

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Stephen Satchell wrote:

> Most power supplies are not designed to hold up for more than 30-60 ms at
> full load upon removal of mains power. Power-fail detect typically
> requires 12 ms (three-quarters cycle average at 60 Hz) or 15 ms
> (three-quarters cycle average at 50 Hz) to detect that mains power has
> failed, leaving your system a very short time to abort that long queue of
> disk write commands. It's very possible that by the time the system wakes
> up to the fact that its electron feeding tube is empty it has already
> started a write operation that cannot be completed before power goes out of
> specification. It's a race condition.
>
> If power goes out of specification before the drive completes a commanded
> write, what do you expect the poor drive to do?

I expect it to have enough capacitor power and rotational inertia that
it can decide before it *starts* a given sector write whether it will
be able, barring a disaster rather less likely than instantaneous loss
of DC power, to complete it.

It doesn't need that long. Take, as an example, a Really Old drive...
an original 20 MB MFM drive. 3600 RPM, 17 sectors/track.

That's 60*17 = 1020 sectors per second passing under the head. So the
actual duration of a sector write is 1 ms.

In a more modern hard drive (IBM 40GV) will spin faster and have
from 370 (inner tracks) to 792 (outer tracks) sectors per track.
(http://www.storagereview.com/guide2000/ref/hdd/geom/tracksZBR.html)

Even at 5400 rpm, on the innermost track, that's 90*370 = 33300
sectors/second passing under the head, or 30 *microseconds* per sector.

I think it's reasonable to expect a drive to keep functioning for 30
microseconds between when it notices the power is dropping and when it
really can't continue. Heck, even 1 ms isn't unreasonable.

We exepct a drive to look at the power supply shortly before the write,
and decide if it's "go for launch" or not.

Given that modern drives already save enough power to unload the heads
before the platter slows to the point that they'd touch down, this doesn't
seem like a big problem. (Of course, unloading the heads doesn't require
that drive RPM, head position, or anything else be within spec.)

What we'd *like* is for the drive to have enough power to be able to
seek to a reserved location and dump out the entire write-behind cache
before dying, but that possibly requires a full-bore seek (longer than
the typical 9 ms "average" seek) plus head settle time (it's okay to
start *reading* before you're sure the head is in place; the CRC will
tell you if you didn't make it), plus writing 4000 sectors (5+ rotations,
with head switching and extra settle time between, for a 2 MB buffer),
but that's adding up to a good fraction of 100 ms, which *is* a bit long
for power-loss operation.

2001-11-25 01:26:41

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Followup to: <[email protected]>
By author: Phil Howard <[email protected]>
In newsgroup: linux.dev.kernel
>
> By the time the seek completes, the speed is probably too slow to do a
> good write. Options to deal with this include special handling for the
> emergency track to allow reading it back by intentionally slowing down
> the drive for that recovery. Another option is flash disk.
>

And yet another option is to dynamically adjust the data speed fed to
the head, to match the rotation speed of the platter. This assumes
that the rotation speed can be measured, which should be trivial if
they use the rotation to power the drive electronics during shutdown.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-25 01:45:11

by Sven.Riedel

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sat, Nov 24, 2001 at 06:53:21PM -0600, Phil Howard wrote:
> If it takes 1 second to spin down to a stop, the it probably will
> have slowed to a point where serialization writing a sector cannot
> be kept in sync within 1 to 5 milliseconds. Once they _start_
> slowing down, time is an extremely precious resource. That data
> pattern has to be read back at full speed.

Makes you wonder why drive manufacturers don't use some kind of NVRAM to
simply remember the sectornumber that is being written as power fails -
a capacitor, or even a small rechargeable battery (for the truely
paranoid), could supply the writing voltage. No (further) writing to the
sector would be needed during spindown. And when the drive initializes
again at boottime, it could check to see if the contents of the NVRAM is
not an "all OK" pattern, and simply rewrite the CRC of the sector in
question, unless that sector is already present in the bad-sector list
of the drive.
Yes, this would be a bit more complex, and presents one more possible
point of failure, but the current situation seems rather abysmal...
And the data in that sector is as good as lost, anyway.

Regs,
Sven
--
Sven Riedel [email protected]
Osteroeder Str. 6 / App. 13 [email protected]
38678 Clausthal "Call me bored, but don't call me boring."
- Larry Wall

2001-11-25 04:20:23

by Pete Zaitcev

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

>[...]
> It is the responsibility of the power monitor to detect a power-fail event
> and tell the drive(s) that a power-fail event is occurring.

> Most power supplies are not designed to hold up for more than 30-60 ms at
> full load upon removal of mains power. Power-fail detect typically
> requires 12 ms (three-quarters cycle average at 60 Hz) or 15 ms
> (three-quarters cycle average at 50 Hz) to detect that mains power has
> failed, leaving your system a very short time to abort that long queue of
> disk write commands.

This is a total crap argument, because you invent an impossible
request, pretend that your opponent made that request, then show
that it's impossible to fulfill the impossible requesti. No shit,
sherlock! Of course it's "a very short time to abort that long
queue of disk write commands".

However, what is asked here is entirely different: disks must
complete writes of sectors that they started writing, this is all.
They do not need to report _anything_ to the host, in fact they
may ignore the host interface completely the moment the power
failure sequence is triggered. Neither they need to do anything
about queued commands: abort them, discard in any way, or whatever.
Just complete the sector, and start head landing sequence.

IBM Deskstar is completely broken, and that's a fact.

BTW, hpa went on how he was buying IBM drives, how good they were,
and what a surprise it was that IBM fucked Deskstar. Hardly a
surprise. The first time I heard of IBM drive was a horror story.
Our company was making RAID arrays, and we sourced new IBM SCSI disks.
They were qualified through a rigorous testing as it was the
procedure in the company. So, after a while they started to fail.
It turned out that bearings leaked grease to platters. Of course,
we shipped tens of thousands of those when IBM explained to us
that every one of them will die in a year. We shipped Seagates
ever after.

-- Pete

2001-11-25 04:51:32

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On 24 Nov 2001, H. Peter Anvin wrote:

> Followup to: <[email protected]>
> By author: Florian Weimer <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > > However, if it's really true that DTLA drives and their successor
> > > corrupt blocks (generate bad blocks) on power loss during block writes,
> > > these drives are crap.
> >
> > They do, even IBM admits that (on
> >
> > http://www.cooling-solutions.de/dtla-faq
> >
> > you find a quote from IBM confirming this). IBM says it's okay, you
> > have to expect this to happen. So much for their expertise in making
> > hard disks. This makes me feel rather dizzy (lots of IBM drives in
> > use).
> >
>
> No sh*t. I have always been favouring IBM drives, and I had a RAID
> system with these drives bought. It will be a LONG time before I buy
> another IBM drive, that's for sure. I can't believe they don't even
> have the decency of saying "we fucked".

Peter,

Remember my soon to be famous quote.
Everything about storage is a LIE, and that is the only true I stand by.

Andre Hedrick
Linux Disk Certification Project Linux ATA Development

2001-11-25 09:13:04

by Chris Wedgwood

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sat, Nov 24, 2001 at 02:03:11PM +0100, Florian Weimer wrote:

When the drive is powered down during a write operation, the
sector which was being written has got an incorrect checksum
stored on disk. So far, so good---but if the sector is read
later, the drive returns a *permanent*, *hard* error, which can
only be removed by a low-level format (IBM provides a tool for
it). The drive does not automatically map out such sectors.

AVOID SUCH DRIVES... I have both Seagate and IBM SCSI drives which a
are hot-swappable in a test machine that I used for testing various
journalling filesystems a while back for reliability.

Some (many) of those tests involved removed the disk during writes
(literally) and checking the results afterwards.

The drives were set not to write-cache (they don't by default, but all
my IDE drives do, so maybe this is a SCSI thing?)

At no point did I ever see a partial write or corrupted sector; nor
have I seen any appear in the grown table, so as best as I can tell
even under removal with sustain writes there are SOME DRIVES WHERE
THIS ISN'T A PROBLEM.

Now, since EMC, NetApp, Sun, HP, Compaq, etc. all have products which
presumable depend on this behavior, I don't think it's going to go
away, it perhaps will just become important to know which drives are
brain-damaged and list them so people can avoid them.

As this will affect the Windows world too consumer pressure will
hopefully rectify this problem.

--cw

P.S. Write-caching in hard-drives is insanely dangerous for
journalling filesystems and can result in all sorts of nasties.
I recommend people turn this off in their init scripts (perhaps I
will send a patch for the kernel to do this on boot, I just
wonder if it will eat some drives).

2001-11-25 12:30:44

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sat, 24 Nov 2001, Florian Weimer wrote:

> > However, if it's really true that DTLA drives and their successor
> > corrupt blocks (generate bad blocks) on power loss during block writes,
> > these drives are crap.
>
> They do, even IBM admits that (on
>
> http://www.cooling-solutions.de/dtla-faq
>
> you find a quote from IBM confirming this). IBM says it's okay, you
> have to expect this to happen. So much for their expertise in making
> hard disks. This makes me feel rather dizzy (lots of IBM drives in
> use).

Well, claiming the OS to cause hard errors? Design fault.
Claiming DC loss to cause hard errors? Design fault.

IBM would really better shed some real light on this issue, and if they
spoiled their firmware (heck, there ARE firmware updates for OEM disks
of the 75GXP series) or electronics, they'd better admit that so as to
reinstore the trust people had before DTLA drives were sold.

FUD works its way, so personally, I'm not buying IBM drives until this
issue is FULLY resolved, so I presume, I won't buy any DTLA or and
IC35Lxx drives of the current series. This is not a recommendation, just
what I'm doing.

2001-11-25 13:53:15

by Pedro M. Rodrigues

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

With those Seagates you probably just got yourself something else to
worry about, maybe even more sneaky than disks failing completely after
one year. I've had a $40.000+ external raid system (brand withhold),
promising reliability and data security at all levels, and with enough bells
and whistles to bore a "geek". It came with Hitachi disks and that surprised
me, because that box was replacing a same brand one that was sold with
IBM disks - the best and the only thing they used, i was told. I thought
maybe they knew something we don't, or maybe they were really special.
Anyway, some time later we started having complete disk lockups in the
device. Honest, the hardware would find a bad block in one of the disks
with parity that weren't remaped. And for some reason the hardware would
just freeze after some time. After checking with support we were sent a
new batch of disks to replace the current ones, with a different firmware
level. It did the trick. After backing up and restoring 360GB of data, of
course. But this begs for some questions. And it really makes me worry
about where the industry is going. Is it the increasing complexity of the
technology? Are they cutting too many corners on trying to reach the
market sooner? Or just cost cutting with old fashioned second source
suppliers? I am more and more worried about what passes as "enterprise
level storage" these days.

/Pedro

On 24 Nov 2001 at 23:20, Pete Zaitcev wrote:

>
> IBM Deskstar is completely broken, and that's a fact.
>
> BTW, hpa went on how he was buying IBM drives, how good they were, and
> what a surprise it was that IBM fucked Deskstar. Hardly a surprise.
> The first time I heard of IBM drive was a horror story. Our company
> was making RAID arrays, and we sourced new IBM SCSI disks. They were
> qualified through a rigorous testing as it was the procedure in the
> company. So, after a while they started to fail. It turned out that
> bearings leaked grease to platters. Of course, we shipped tens of
> thousands of those when IBM explained to us that every one of them
> will die in a year. We shipped Seagates ever after.
>
> -- Pete
> -
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2001-11-25 15:04:15

by Barry K. Nathan

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

> Claiming DC loss to cause hard errors? Design fault.
>
> IBM would really better shed some real light on this issue, and if they
> spoiled their firmware (heck, there ARE firmware updates for OEM disks
> of the 75GXP series) or electronics, they'd better admit that so as to
> reinstore the trust people had before DTLA drives were sold.

"Power off during write operations may make an incomplete sector which
will report hard data error when read. The sector can be recovered by a
rewrite operation."

http://www-3.ibm.com/storage/hdd/tech/techlib.nsf/techdocs/85256AB8006A31E587256A77006E0E91/$file/D60gxp_sp21.pdf

Deskstar 60GXP specifications, section 6.0

The above quote and URL are IBM's official word, from their OEM
specification manual. FWIW, I checked the OEM manual for the 73LZX as
well (not that that drive is available anywhere, but I wanted to see
what IBM did/is doing for that drive), and the corresponding section in
that manual mentions nothing about incomplete sectors causing hard
errors. I just checked the 36LZX OEM spec as well and that also omits
the same clause.

OTOH, A few hours ago I checked the specs for several TravelStars and they
mentioned this incomplete sector thing. So, I guess IBM's position on
this is that this failure mode is OK for IDE drives but not for SCSI.

Here's a starting point for finding the IBM manuals:
http://www-3.ibm.com/storage/hdd/tech/techlib.nsf/pages/main?OpenDocument

(Just for my curiosity, I checked for the microdrives too. The phrasing
is different there: "There is a possibility that power off during a
write operation might make a maximum of 1 sector of data unreadable.
This state can be recovered by a rewrite operation.")

-Barry K. Nathan <[email protected]>

2001-11-25 16:31:46

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

> "Power off during write operations may make an incomplete sector which
> will report hard data error when read. The sector can be recovered by a
> rewrite operation."

So the proper defect management would be to simply initialize the broken
sector once a fsck hits it (still, I've never seen disks develop so many
bad blocks so quickly as those failed DTLA-307045 drives I had).

Note, the specifications say that the write cache setting is ignored
when the drive runs out of spare blocks for reassignment after defects
(so that the drive can return the error code right away when it cannot
guarantee the write actually goes to disk).

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-11-25 22:53:32

by Daniel Phillips

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On November 25, 2001 10:14 am, Chris Wedgwood wrote:
> On Sat, Nov 24, 2001 at 02:03:11PM +0100, Florian Weimer wrote:
> Now, since EMC, NetApp, Sun, HP, Compaq, etc. all have products which
> presumable depend on this behavior, I don't think it's going to go
> away, it perhaps will just become important to know which drives are
> brain-damaged and list them so people can avoid them.
>
> As this will affect the Windows world too consumer pressure will
> hopefully rectify this problem.

Andre Hedrik has put together a site with exactly this intention, check out:

http://linuxdiskcert.org/

Of course, there's a lot of hard work between here and having a useful
database, but, hey, well begun and all that...

According to Andre:

"the requirements are they apply a patch run a series of tests and then I
will submit to the OEM for rebutal and if there is no resolution the drive
and the test procedure on how to reproduce the error will be posted"

--
Daniel

2001-11-26 17:15:10

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

While I am not familiar with the IBM drives in particular, I am
familar with this particular problem.

The problem is that half of a sector gets new data, then when power is
dropped, the old data+CRC/ECC is left on second part of that sector,
and a subsequent read on the whole sector will detect the CRC/ECC
mismatch, and determine the error burst is larger than what it can
correct with retries, and ECC, and report it as a HARD ERROR. (03-1100
in the SCSI World)

Since the error is non-recoverable, the disk drive should not
auto-reassign the sector, since it cannot succeed at moving good data
to the newly assigned sector.

This type of error does not require a low-level format. Just writing
any data to the sector in error should give the sector a CRC/ECC field
that matches the data in the sector, and you should not get hard
errors when reading that sector anymore.

This was more of a problem with older disk drives (8-Inch platters, or
older), because the time required to finish any given sector was more
than the amount of time the electronics would run reliably. All that
could be guranteed on these older drives was that a power loss would
not corrupt any adjacent data, ie write gate must be crow-bared
inactive before the heads start retracting, emergency-style, to the
landing zone.

I believe that the time to complete a sector is so short on current
drives, that they should be able to complete writing their current
sector, but I do not believe that there are any drive manufacturers
out there that gurrantee this. Thus, there is probably a window, on
all disk drives out there, where a loss of power durring an active
write will end up causing a hard error when that sector is
subsequently read (I haven't looked though, and could be wrong).
Writing to the sector with the error should clear the hard-error when
that sector is read. A low-level format should not be required to fix
this, and if it is, the drive is definitely broken in design.

This is basic power-economics, and one of the reasons for UPS's

Steve Brueggeman

On 24 Nov 2001 14:03:11 +0100, you wrote:

>In the German computer community, a statement from IBM[1] is
>circulating which describes a rather peculiar behavior of certain IBM
>IDE hard drivers (the DTLA series):
>
>When the drive is powered down during a write operation, the sector
>which was being written has got an incorrect checksum stored on disk.
>So far, so good---but if the sector is read later, the drive returns a
>*permanent*, *hard* error, which can only be removed by a low-level
>format (IBM provides a tool for it). The drive does not automatically
>map out such sectors.
>
>IBM claims this isn't a firmware error, but thinks that this explains
>the failures frequently observed with DTLA drivers (which might
>reflect reality or not, I don't know, but that's not the point
>anyway).
>
>Now my question: Obviously, journaling file systems do not work
>correctly on drivers with such behavior. In contrast, a vital data
>structure is frequently written to (the journal), so such file systems
>*increase* the probability of complete failure (with a bad sector in
>the journal, the file system is probably unusable; for non-journaling
>file systems, only a part of the data becomes unavailable). Is the
>DTLA hard disk behavior regarding aborted writes more common among
>contemporary hard drives? Wouldn't this make journaling pretty
>pointless?
>
>
>1. http://www.cooling-solutions.de/dtla-faq (German)

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

2001-11-26 18:05:51

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sat, 24 Nov 2001 15:29:05 -0800, you wrote:

>Stephen Satchell wrote:
>
>>
>> It is the responsibility of the power monitor to detect a power-fail
>> event and tell the drive(s) that a power-fail event is occurring. If
>> power goes out of specification before the drive completes a commanded
>> write, what do you expect the poor drive to do? ANY glitch in the write
>> current will corrupt the current block no matter what -- the final CRC
>> isn't recorded. Most drives do have a panic-stop mode when they detect
>> voltage going out of range so as to minimize the damage caused by an
>> out-of-specification power-down event, and more importantly use the
>> energy in the spinning platter to get the heads moved to a safe place
>> before the drive completely spins down. The panic-stop mode is EXACTLY
>> like a Linux OOPS -- it's a catastrophic event that SHOULD NOT OCCUR.
>>
>
Correct, sort-of. The storage is not allowed to corrupt any data that
is unrelated to the currently active operation, (ie adjacent tracks or
sectors). Of course write-caching is asking for trouble.
>
>There is no "power monitor" in a PC system (at least not that is visible
>to the drive) -- if the drive needs it, it has to provide it itself.
>
>It's definitely the responsibility of the drive to recover gracefully
>from such an event, which means that it writes anything that it has
>committed to the host to write;
Correct. If a write gets interrupted in the middle of it's operation,
it has not yet returned any completion status, (unless you've enabled
write-caching, in which case, you're already asking for trouble) A
subsequent read of this half-written sector can return uncorrectable
status though, which would be unfortunate if this sector was your
allocation table, and the write was a read-modify-write.

>anything it hasn't gotten committed to
>write (but has received) can be written or not written, but must not
>cause a failure of the drive.
Reading a sector that was a partial-write because of a power-loss, and
returning UNCORRECTABLE status, is not a failure of the drive.

>
>A drive is a PERSISTENT storage device, and as such has responsibilities
>the other devices don't.
>
>Anything else is brainless rationalization.
>
> -hpa

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

2001-11-26 20:02:43

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Sunday 25 November 2001 04:14, Chris Wedgwood wrote:

>
> P.S. Write-caching in hard-drives is insanely dangerous for
> journalling filesystems and can result in all sorts of nasties.
> I recommend people turn this off in their init scripts (perhaps I
> will send a patch for the kernel to do this on boot, I just
> wonder if it will eat some drives).

Anybody remember back when hard drives didn't reliably park themselves when
they cut power? This isn't something drive makers seem to pay much attention
to until customers scream at them for a while...

Having no write caching on the IDE side isn't a solution either. The problem
is the largest block of data you can send to an ATA drive in a single command
is smaller than modern track sizes (let alone all the tracks under the heads
on a multi-head drive), so without any sort of cacheing in the drive at all
you add rotational latency between each write request for the point you left
off writing to come back under the head again. This will positively KILL
write performance. (I suspect the situation's more or less the same for read
too, but nobody's objecting to read cacheing.)

The solution isn't to avoid write cacheing altogether (performance is 100%
guaranteed to suck otherwise, for reasons unrelated to how well your hardware
works but to legacy request size limits in the ATA specification), but to
have a SMALL write buffer, the size of one or two tracks to allow linear ATA
write requests to be assembled into single whole-track writes, and to make
sure the disks' electronics has enough capacitance in it to flush this buffer
to disk. (How much do capacitors cost? We're talking what, an extra 20
miliseconds? The buffer should be small enough you don't have to do that
much seeking.)

Just add an off-the-shelf capacitor to your circuit. The firmware already
has to detect power failure in order to park the head sanely, so make it
flush the buffers along the way. This isn't brain surgery, it just wasn't a
design criteria on IBM's checklist of features approved in the meeting.
(Maybe they ran out of donuts and adjourned the meeting early?)

Rob

2001-11-26 20:34:12

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

2001-11-26 20:39:11

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Steve,

Dream on fellow, it is SOP that upon media failure the device logs the
failure and does an internal re-allocation in the slip-sector stream.
If the media is out of slip-sectors then it does an out-of-bounds
re-allocation. Once the total number of out-of-bounds sectors are gone
you need to deal with getting new media or exectute a seek and purge
operation; however, if the badblock list is full you are toast.

That is what is done - knowledge is first hand.

Regards,

Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project

On Mon, 26 Nov 2001, Steve Brueggeman wrote:

> While I am not familiar with the IBM drives in particular, I am
> familar with this particular problem.
>
> The problem is that half of a sector gets new data, then when power is
> dropped, the old data+CRC/ECC is left on second part of that sector,
> and a subsequent read on the whole sector will detect the CRC/ECC
> mismatch, and determine the error burst is larger than what it can
> correct with retries, and ECC, and report it as a HARD ERROR. (03-1100
> in the SCSI World)
>
> Since the error is non-recoverable, the disk drive should not
> auto-reassign the sector, since it cannot succeed at moving good data
> to the newly assigned sector.
>
> This type of error does not require a low-level format. Just writing
> any data to the sector in error should give the sector a CRC/ECC field
> that matches the data in the sector, and you should not get hard
> errors when reading that sector anymore.
>
> This was more of a problem with older disk drives (8-Inch platters, or
> older), because the time required to finish any given sector was more
> than the amount of time the electronics would run reliably. All that
> could be guranteed on these older drives was that a power loss would
> not corrupt any adjacent data, ie write gate must be crow-bared
> inactive before the heads start retracting, emergency-style, to the
> landing zone.
>
> I believe that the time to complete a sector is so short on current
> drives, that they should be able to complete writing their current
> sector, but I do not believe that there are any drive manufacturers
> out there that gurrantee this. Thus, there is probably a window, on
> all disk drives out there, where a loss of power durring an active
> write will end up causing a hard error when that sector is
> subsequently read (I haven't looked though, and could be wrong).
> Writing to the sector with the error should clear the hard-error when
> that sector is read. A low-level format should not be required to fix
> this, and if it is, the drive is definitely broken in design.
>
> This is basic power-economics, and one of the reasons for UPS's
>
> Steve Brueggeman
>
>
>
> On 24 Nov 2001 14:03:11 +0100, you wrote:
>
> >In the German computer community, a statement from IBM[1] is
> >circulating which describes a rather peculiar behavior of certain IBM
> >IDE hard drivers (the DTLA series):
> >
> >When the drive is powered down during a write operation, the sector
> >which was being written has got an incorrect checksum stored on disk.
> >So far, so good---but if the sector is read later, the drive returns a
> >*permanent*, *hard* error, which can only be removed by a low-level
> >format (IBM provides a tool for it). The drive does not automatically
> >map out such sectors.
> >
> >IBM claims this isn't a firmware error, but thinks that this explains
> >the failures frequently observed with DTLA drivers (which might
> >reflect reality or not, I don't know, but that's not the point
> >anyway).
> >
> >Now my question: Obviously, journaling file systems do not work
> >correctly on drivers with such behavior. In contrast, a vital data
> >structure is frequently written to (the journal), so such file systems
> >*increase* the probability of complete failure (with a bad sector in
> >the journal, the file system is probably unusable; for non-journaling
> >file systems, only a part of the data becomes unavailable). Is the
> >DTLA hard disk behavior regarding aborted writes more common among
> >contemporary hard drives? Wouldn't this make journaling pretty
> >pointless?
> >
> >
> >1. http://www.cooling-solutions.de/dtla-faq (German)

2001-11-26 20:56:31

by Richard B. Johnson

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, 26 Nov 2001, Rob Landley wrote:

> On Sunday 25 November 2001 04:14, Chris Wedgwood wrote:
>
> >
> > P.S. Write-caching in hard-drives is insanely dangerous for
> > journalling filesystems and can result in all sorts of nasties.
> > I recommend people turn this off in their init scripts (perhaps I
> > will send a patch for the kernel to do this on boot, I just
> > wonder if it will eat some drives).
>
> Anybody remember back when hard drives didn't reliably park themselves when
> they cut power? This isn't something drive makers seem to pay much attention
> to until customers scream at them for a while...
>
> Having no write caching on the IDE side isn't a solution either. The problem
> is the largest block of data you can send to an ATA drive in a single command
> is smaller than modern track sizes (let alone all the tracks under the heads
> on a multi-head drive), so without any sort of cacheing in the drive at all
> you add rotational latency between each write request for the point you left
> off writing to come back under the head again. This will positively KILL
> write performance. (I suspect the situation's more or less the same for read
> too, but nobody's objecting to read cacheing.)
>
> The solution isn't to avoid write cacheing altogether (performance is 100%
> guaranteed to suck otherwise, for reasons unrelated to how well your hardware
> works but to legacy request size limits in the ATA specification), but to
> have a SMALL write buffer, the size of one or two tracks to allow linear ATA
> write requests to be assembled into single whole-track writes, and to make
> sure the disks' electronics has enough capacitance in it to flush this buffer
> to disk. (How much do capacitors cost? We're talking what, an extra 20
> miliseconds? The buffer should be small enough you don't have to do that
> much seeking.)
>
> Just add an off-the-shelf capacitor to your circuit. The firmware already
> has to detect power failure in order to park the head sanely, so make it
> flush the buffers along the way. This isn't brain surgery, it just wasn't a
> design criteria on IBM's checklist of features approved in the meeting.
> (Maybe they ran out of donuts and adjourned the meeting early?)
>
> Rob

It isn't that easy! Any kind of power storage within the drive would
have to be isolated with diodes so that it doesn't try to run your
motherboard as well as the drive. This means that +5 volt logic supply
would now be 5.0 - 0.6 = 4.4 volts at the drive, well below the design
voltage. Use of a Schottky diode (0.34 volts) would help somewhat, but you
have now narrowed the normal power design-margin by 90 percent, not good.

There is supposed to be a "power good" line out of your power supply
which is supposed to tell equipment when the main power has failed or
is about to fail. There isn't a "power good" line in SCSI so that
doesn't help.

Basically, when the power fails, all bets are off. A write in progress
may not succeed any more than a seek in progress would. Seeks take a
lot of power, usually from the +12 volt line. Typically, if a write
is in progress, when low power is sensed by the drive, write current
is terminated. At one time, there was a electromagnet that was
released to move the heads to a landing zone. Now there is none.
The center of radius of the head arm is slightly forward of the
center of rotation of the disk so that when the heads "land", they
skate to the inside of the platter, off the active media. The media
is supposed to be able to take this abuse for quite some time.

When a partially written sector is read with a bad CRC, the host
(not the drive) can rewrite the sector. As long as the sector
header, which is ahead of the write-splice, isn't destroyed
the disk doesn't need to be re-formatted. In the remote case where
the sector header is destroyed, the bad sector may be re-mapped by
the drive if there are any spare sectors still available. The first
error returned to the host is the bad CRC. Subsequent reads will
not return a bad CRC if the sector was re-mapped. However, the data
is invalid! Therefore, the drivers can't retry reads expecting that
a bad CRC got fixed so the data is okay. The driver needs to read
all the sense data and try to figure it out.

The solution is an UPS. When the UPS power gets low, shut down
the computer, preferably automatically.

Also, if your computer is on all day long as is typical at a
workplace, never shut it off. Just turn off the monitor when you
go home. Your disk drives will last until you decide to replace
then because they are too small or too slow.

And beware when you finally do turn off the computer. The disks
may not spin up the next time you start the computer. It's a good
idea to back up everything before shutting down a computer that
has been running for a year or two.

Of course you can re-boot as much as you want. Just don't kill the power!

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

I was going to compile a list of innovations that could be
attributed to Microsoft. Once I realized that Ctrl-Alt-Del
was handled in the BIOS, I found that there aren't any.

2001-11-26 23:36:49

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Monday 26 November 2001 15:30, Andre Hedrick wrote:
> On Mon, 26 Nov 2001, Rob Landley wrote:

> > Just add an off-the-shelf capacitor to your circuit. The firmware
> > already has to detect power failure in order to park the head sanely, so
> > make it flush the buffers along the way. This isn't brain surgery, it
> > just wasn't a design criteria on IBM's checklist of features approved in
> > the meeting. (Maybe they ran out of donuts and adjourned the meeting
> > early?)
>
> Rob,
>
> Send me an outline/discription and I will present it during the Dec T13
> meeting for a proposal number for inclusion into ATA-7.

What kind of write-up do you want? (How formal?)

The trick here is limiting the scope of the problem. Your buffer can't be
larger than you can reliably write back on a sudden power failure. (This
should be obvious.) So the obvious answer is to make your writeback cache
SMALL. The problems that go with flushing it are then correspondingly small.

Your READ cache can be as large as you like, but when the disk accepts data
written to it, a journaling FS assumes it will be committed to disk.
Explicit flush requests are largely trying to get the filesystem to know
about disk implementation issues: that's unnecessary complexity. (And
something vendors hate to implement because it kills performance.) But
either the drive can flush cache when the power goes out, or it's not
reliable.

So how big of a cache is useful? Well, what does the cache DO? A small
1-track cache helps write full tracks at a time, and if the cache can hold a
second track it can start its seek immediately upon finishing the first track
without worrying about latency of the OS getting back to it with more data.
But more than 2 tracks gives you no benefit: cacheing beyond that is the
operating system's job. No benefit, and the liability of more to flush on
power failure. The answer is simple: Don't Do That Then.

You need stored power to flush the stored data, and a capacitor's better than
a battery for several reasons. It's cheaper, it lasts longer (no repeated
charge/discharge fatigue), it can provide a LOT of power very quickly (we're
actuating motors here), and we're only asking for a fraction of a second's
extra power here which isn't what most batteries are designed for anyway.
Capacitors are.

The people talking about batteries are trying to do battery backed up cache,
which is silly and overkill. We want the data to go to a disk which is
ALREADY spinning at full speed when we lose power. Current designs already
try to flush the cache as they're losing power (a write cache is always in
the process of being flushed, barring contention with read requests), and
sometimes they even manage to do it. We just need a little extra power to
guarantee we can shut down gracefully.

A capacitor can provide a few miliseconds worth of power to keep the platters
spinning at full speed, power the logic, do a maximum of two seeks, and of
course feed power to the write head. Conceptually, our volatile ram cache
needs a power cache to flush it on power failure, and cacheing amperage is
what a capacitor DOES.

Now let's get back to cache size. You need 1 track of cache to get full
track writes with ATA. Being able to feed it a second track might be a good
idea to avoid latency at the OS between one track finishing and the next
starting. (If nothing else, you tell it where to seek to next.) But more
than 2 tracks serves no purpose if the OS has a backlog of work for the disk
to do, and if it doesn't we're not optimizing anything anyway.

Now a cache large enough to hold 2 full tracks could also hold dozens of
individual sectors scattered around the disk, which could take a full second
to write off and power down. This is a "doctor, it hurts when I do this"
question. DON'T DO THAT.

The drive should block when it's fed sectors living on more than 2 tracks.
Don't bother having the drive implement an elevator algorithm: the OS already
has one. Just don't cache sectors living on more than 2 tracks at a time:
treat it as a "cache full" situation and BLOCK.

And further, don't cache anything for a SECOND track until you've already
seeked to the first track. This is to limit the number of potential seeks
the capacitor has to power. Reads work into this too: any time you get a
"seek request", for read or for write, finish with the track you're on before
moving. Accept new write requests into the buffer for the track you're
currently on, and for ONE other track. If you're not currently on a track
you have anything to write to, you can buffer stuff for only one other track.
Anything else blocks just as if the buffer was full. (Because it is.)

That way, the power down problem is strictly limited:

1) write out the track you're over
2) seek to the second track
3) write that out too
4) park the head

You're done. You can measure this in the lab, determine exactly how much
power your capacitor needs to supply to guarantee that, and implement it.
Your worst case scenario is a full track write next to where the head
normally parks, a full track write at the far end of the disk, and then
seeking back to the landing zone. This is two seeks including the park,
which should still be easily measured in miliseconds.

There's no elevator algorithm (that's the OS's job), no battery backed up
cache (not needed, the platters are already persistent), just a cheap
solution for a cheap ATA drive, arrived at by limiting the size of the
problem you're handling.

What new hardware is involved?

Add a capacitor.

Add a power level sensor. (Drives may already have this to know when to park
the head.)

Firmware to manage the cache (limiting its data intake, and flushing right
before parking).

I think that's it. Did I miss anything? Oh yeah, on power fail stop
worrying about read requests. (They can theoretically starve the write
requests on this capacitor-powered guaranteed seek thing, although if the
power IS failing there shouldn't be too many more of them coming in, should
there? But they may be queued.) But that's fairly obvious, and there has to
be logic for this already or else the read head would run out of power and
crash into the disk before it got a chance to park...

Again, just unload the real write cacheing on the OS because the purpose of
the drive's cacheing is to batch requests to the track level and to disguise
seek latency a bit, and if that's ALL it does it's easy to reliably flush
that on power down with just a capacitor. Any cacheing beyond 2 tracks worth
(not 2 tracks worth of individual sectors scattered all over the disk but
"the current track and 1 other track") just gets in the way of reliability.
Yes the drive maker may be wasting DRAM by doing that. Tell them they can
dedicate that other ram to a read cache, but writes need to block to maintain
the implicit guarantee that if the drive accepted the write the data will
still be there after a power off..

Rob

2001-11-26 21:14:13

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Well, since you don't clearify what part you object to, I'll have to
assume that you object to my statement that the disk drive will not
auto-reallocate when it cannot recover the data.

If you think that a disk drive should auto-reallocate a sector (ARRE
enabled in the mode pages) that it cannot recover the original data
from, than you can dream on. I seriously hope this is not what you're
recommending for ATA. If a disk drive were to auto-reallocate a
sector that it couldn't get valid data from, you'd have serious
corruptions probelms!!! Tell me, what data should exist in the sector
that gets reallocated if it cannot retrieve the data the system
believes to be there??? If the reallocated sector has random data,
and the next read to it doesn't return an error, than the system will
get no indication that it should not be using that data.

If the unrecoverable error happens durring a write, the disk drive
still has the data in the buffer, so auto-reallocation on writes (AWRE
enabled in the mode pages), is usually OK

That said, it'd be my bet that most disk drives still have a window of
opportunity durring the reallocation operation, where if the drive
lost power, they'd end up doing bad things.

You can force a reallocation, but the data you get when you first read
that unreadable reallocated sector is usually undefined, and often is
the data pattern written when the drive was low-level formatted.

That IS what is done, my knowledge is also first hand.

I have no descrepency with your description of how spare sectors are
dolled out.

Steve Brueggeman

On Mon, 26 Nov 2001 12:36:02 -0800 (PST), you wrote:

>
>Steve,
>
>Dream on fellow, it is SOP that upon media failure the device logs the
>failure and does an internal re-allocation in the slip-sector stream.
>If the media is out of slip-sectors then it does an out-of-bounds
>re-allocation. Once the total number of out-of-bounds sectors are gone
>you need to deal with getting new media or exectute a seek and purge
>operation; however, if the badblock list is full you are toast.
>
>That is what is done - knowledge is first hand.
>
>Regards,
>
>Andre Hedrick
>CEO/President, LAD Storage Consulting Group
>Linux ATA Development
>Linux Disk Certification Project
>
>On Mon, 26 Nov 2001, Steve Brueggeman wrote:

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

2001-11-26 21:40:23

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, 26 Nov 2001, Steve Brueggeman wrote:

> Well, since you don't clearify what part you object to, I'll have to
> assume that you object to my statement that the disk drive will not
> auto-reallocate when it cannot recover the data.
>
> If you think that a disk drive should auto-reallocate a sector (ARRE
> enabled in the mode pages) that it cannot recover the original data
> from, than you can dream on. I seriously hope this is not what you're

One has to go read the general purpose error logs to determine the
location of the original platter assigned sector of the relocated LBA.

Reallocation generally occurs on write to media not read, and you should
know that point.

> recommending for ATA. If a disk drive were to auto-reallocate a
> sector that it couldn't get valid data from, you'd have serious
> corruptions probelms!!! Tell me, what data should exist in the sector
> that gets reallocated if it cannot retrieve the data the system
> believes to be there??? If the reallocated sector has random data,
> and the next read to it doesn't return an error, than the system will
> get no indication that it should not be using that data.
>
> If the unrecoverable error happens durring a write, the disk drive
> still has the data in the buffer, so auto-reallocation on writes (AWRE
> enabled in the mode pages), is usually OK

By the time an ATA device gets to generating this message, either the bad
block list is full or all reallocation sectors are used. Unlike SCSI
which has to be hand held, 90% of all errors are handle by the device.
Good or Bad -- that is how it does it.

Well there is an additional problem in all of storage, that drives do
reorder and do not always obey the host-driver. Thus if the device is
suffering from performance and you have disabled WB-Cache, it may elect to
self enable. Now you have the device returning ack to platter that may
not be true. Most host-drivers (all of Linux, mine include) release and
dequeue the request once the ack has been presented. This is dead wrong.
If a flush cache fails I get back the starting lba of the write request,
and if the request is dequeued -- well you know -- bye bye data! SCSI
will do the same, even with TCQ. Once the sense is cleared to platter and
the request is dequeued, and a hiccup happens -- bye bye data!

> That said, it'd be my bet that most disk drives still have a window of
> opportunity durring the reallocation operation, where if the drive
> lost power, they'd end up doing bad things.

That is a given.

> You can force a reallocation, but the data you get when you first read
> that unreadable reallocated sector is usually undefined, and often is
> the data pattern written when the drive was low-level formatted.
>
> That IS what is done, my knowledge is also first hand.

Excellent to see another Storage Industry person here.

> I have no descrepency with your description of how spare sectors are
> dolled out.

Cool.

Question -- are you up to fixing the low-level drivers and all the stuff
above ?

> Steve Brueggeman
>
>
> On Mon, 26 Nov 2001 12:36:02 -0800 (PST), you wrote:
>
> >
> >Steve,
> >
> >Dream on fellow, it is SOP that upon media failure the device logs the
> >failure and does an internal re-allocation in the slip-sector stream.
> >If the media is out of slip-sectors then it does an out-of-bounds
> >re-allocation. Once the total number of out-of-bounds sectors are gone
> >you need to deal with getting new media or exectute a seek and purge
> >operation; however, if the badblock list is full you are toast.
> >
> >That is what is done - knowledge is first hand.

Regards,

Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project

2001-11-26 23:49:30

by Martin Eriksson

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

----- Original Message -----
From: "Steve Brueggeman" <[email protected]>
To: <[email protected]>
Sent: Monday, November 26, 2001 7:05 PM
Subject: Re: Journaling pointless with today's hard disks?

<snip>

> >There is no "power monitor" in a PC system (at least not that is visible
> >to the drive) -- if the drive needs it, it has to provide it itself.
> >
> >It's definitely the responsibility of the drive to recover gracefully
> >from such an event, which means that it writes anything that it has
> >committed to the host to write;
> Correct. If a write gets interrupted in the middle of it's operation,
> it has not yet returned any completion status, (unless you've enabled
> write-caching, in which case, you're already asking for trouble) A
> subsequent read of this half-written sector can return uncorrectable
> status though, which would be unfortunate if this sector was your
> allocation table, and the write was a read-modify-write.
>
> >anything it hasn't gotten committed to
> >write (but has received) can be written or not written, but must not
> >cause a failure of the drive.
> Reading a sector that was a partial-write because of a power-loss, and
> returning UNCORRECTABLE status, is not a failure of the drive.

I sure think the drives could afford the teeny-weeny cost of a power failure
detection unit, that when a power loss/sway is detected, halts all
operations to the platters except for the writing of the current sector.

_____________________________________________________
| Martin Eriksson <[email protected]>
| MSc CSE student, department of Computing Science
| Ume? University, Sweden

2001-11-27 00:03:31

by Andreas Dilger

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Nov 26, 2001 15:35 -0500, Rob Landley wrote:
> The drive should block when it's fed sectors living on more than 2 tracks.
> Don't bother having the drive implement an elevator algorithm: the OS already
> has one. Just don't cache sectors living on more than 2 tracks at a time:
> treat it as a "cache full" situation and BLOCK.

The other thing that concerns a journaling fs is write ordering. If you
can _guarantee_ that an entire track (or whatever) can be written to disk
in _all_ cases, then it is OK to reorder write requests within that track
AS LONG AS YOU DON'T REORDER WRITES WHERE YOU SKIP BLOCKS THAT ARE NOT
GUARANTEED TO COMPLETE.

Generally, in Linux, ext3 will wait on all of the journal transaction
blocks to be written before it writes a commit record, which is its way
of guaranteeing that everything before the commit is valid. If you start
write cacheing the transaction blocks, return, and then write the commit
record to disk before the other transaction blocks are written, this is
SEVERELY BROKEN. If it was guaranteed that the commit record would hit
the platters at the same time as the other journal transaction blocks,
that would be the minimum acceptable behaviour.

Obviously a working TCQ or write barrier would also allow you to optimize
all writes before the commit block is written, but that should be an
_enhancement_ above the basic write operations, only available if you
start using this feature.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-27 00:08:42

by Andreas Dilger

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Nov 27, 2001 00:49 +0100, Martin Eriksson wrote:
> I sure think the drives could afford the teeny-weeny cost of a power failure
> detection unit, that when a power loss/sway is detected, halts all
> operations to the platters except for the writing of the current sector.

What happens if you have a slightly bad power supply? Does it immediately
go read only all the time? It would definitely need to be able to
recover operations as soon as the power was "normal" again, even if this
caused basically "sync" I/O to the disk. Maybe it would be able to
report this to the user via SMART, I don't know.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-27 00:19:03

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, 26 Nov 2001, Andreas Dilger wrote:

> On Nov 27, 2001 00:49 +0100, Martin Eriksson wrote:
> > I sure think the drives could afford the teeny-weeny cost of a power failure
> > detection unit, that when a power loss/sway is detected, halts all
> > operations to the platters except for the writing of the current sector.
>
> What happens if you have a slightly bad power supply? Does it immediately
> go read only all the time? It would definitely need to be able to
> recover operations as soon as the power was "normal" again, even if this
> caused basically "sync" I/O to the disk. Maybe it would be able to
> report this to the user via SMART, I don't know.

ATA/SCSI SMART is already DONE!

To bad most people have not noticed.

Regards,

Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project

2001-11-27 00:21:24

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks? [wandering OT]

On Monday 26 November 2001 15:53, Richard B. Johnson wrote:
>
>
> It isn't that easy! Any kind of power storage within the drive would
> have to be isolated with diodes so that it doesn't try to run your
> motherboard as well as the drive. This means that +5 volt logic supply
> would now be 5.0 - 0.6 = 4.4 volts at the drive, well below the design
> voltage. Use of a Schottky diode (0.34 volts) would help somewhat, but you
> have now narrowed the normal power design-margin by 90 percent, not good.

At this point I have to hand the conversation over to either my father (a
professional electrical engineer), my grandfather (ditto for 50 years,
including helping GE debug its early vacuum tube lines), or my friend chip
(who got a 4.0 from a technical college and who modifies playstations with a
soldering iron for fun).

Me, I'm mostly a software person, but this strikes me as a fairly
straightforward voltage regulation and switching problem. Must admit I was
considering transistors sealing off the rest of the world's power supply when
the sensor says it's going bye-bye, but I can't say I'm familiar with the
kind of load you can hit one of them with. (I remember using one to drive a
motor once, but that was smoke signals lab back in college and a significant
number of the components I used gave up their magic smoke along the way. I
ran an awful lot of current through the big evil black three-prong
transistors, though. That's a problem they solved back in the 1960's, isn't
it?)

> There is supposed to be a "power good" line out of your power supply
> which is supposed to tell equipment when the main power has failed or
> is about to fail. There isn't a "power good" line in SCSI so that
> doesn't help.

Shouldn't be too hard to fake something up to detect a current fluctuation.
Sheesh, in a way that's what the whole high/low logic gates reading the data
bus do, isn't it? And the cache dump logic is more or less constant (you
WANT it to go to disk), it's not so much triggering it as making sure you
limit what it has to do to what you can guarantee it'll have time to do, and
then adding a few miliseconds of extra power to guarantee it'll have time to
do it.

Maybe I'm oversimplifying. I'm a software person. We do that with
hardware...

> Basically, when the power fails, all bets are off. A write in progress
> may not succeed any more than a seek in progress would.

Currently, sure. But nobody said this was a GOOD thing.

> Seeks take a
> lot of power, usually from the +12 volt line.

I've seen capacitors melt screws. (And in one instance, a screwdriver.)
Admittedly those were the monster big ones (the screw melter was about 10
cubic centimeters, the screwdriver got melted by a friend poking around in
the back of an unplugged television set; he lived), but saying a capacitor
doesn't have enough power to do something without specifying the capacitor in
question...

My grandfather has capacitors that simulate lightning strikes to stress-test
equipment against electromagnetic pulse interference during thrunderstorms.
(They're a little larger than a printer paper box, and he hooks a half-dozen
of them up in series.)

> Typically, if a write
> is in progress, when low power is sensed by the drive, write current
> is terminated. At one time, there was a electromagnet that was
> released to move the heads to a landing zone. Now there is none.
> The center of radius of the head arm is slightly forward of the
> center of rotation of the disk so that when the heads "land", they
> skate to the inside of the platter, off the active media. The media
> is supposed to be able to take this abuse for quite some time.

I'd heard the parking these days was sometimes done centrifugally, but didn't
know it skipped in...

> The solution is an UPS. When the UPS power gets low, shut down
> the computer, preferably automatically.

I admit that laptops are driving desktops into the "workstation" market, so
we'll all have battery backup automatically anyway, but saying a piece of
equipment that doesn't gracefully deal with a condition CAN'T gracefully deal
with that condition...

If current processors ate their microcode on an unclean loss of power, or
flashable bioses glitched themselves on an unclean loss of power, would you
consider this behavior justifiable because you should have been using a UPS?

We're not talking server side hosted RAID systems here. (Although this could
easily take out multiple drives from a raid simultaneously.) We're talking a
college student's home desktop system went bye-bye because his roommate hit
the light switch that the computer's outlet was plugged into, and his
journaling FS did no good.

You're arguing that there's no real world demand for journaling filesystems.
You realise this, don't you? (If an unclean shutdown can create hard errors
on your drive as well as eating who knows how much write-cached data that the
journal thought was committed, what's the point of journaling?)

> Also, if your computer is on all day long as is typical at a
> workplace, never shut it off.

I don't.

> Just turn off the monitor when you
> go home. Your disk drives will last until you decide to replace
> then because they are too small or too slow.

They do. However, I have power failures from time to time. Even with a UPS,
the power cord has been knocked out of the back of the box (or the switch got
hit by somebody's foot) on more than one occasion. And then there was the
time an entire Dr. Pepper went flying all over the machine and a very quick
power down was required before liquid could drip down onto the electronics.
(Not a server room scenario, no. But more common than you'd think in
desktops and workstations.)

> And beware when you finally do turn off the computer. The disks
> may not spin up the next time you start the computer. It's a good
> idea to back up everything before shutting down a computer that
> has been running for a year or two.

Why wait until you shut the box down?

http://content.techweb.com/wire/story/TWB20010409S0012

If you have 3 year old data you still care about and you haven't backed it up
yet, something is wrong. Forget the drive going bad, I had lighting cause
one of the chips in my modem to explode once. (Literally. Strangely, the
rest of the system, an old 386, worked fine after a reboot, but there was no
reason to expect that.) Or the power supply filling up with dust and doing
all SORTS of fun things to the rest of the system.

> Of course you can re-boot as much as you want. Just don't kill the power!

Worst case scenario this is what data recovery services are for. Assuming
you can budget $10k for them to crack open your drive in their cleanroom. :)

Also, sticking the drive in the freezer for a bit often works long enough to
get the data off. Several theories on why (lower the resistance of stuff in
the motor, contract and bring worn contacts closer together, stop the
lubrication from acting like glue) but it's a good "the drive's hosed, what
do we do" hail mary pass. Just don't think it's a fix longer than it takes
the drive to warm up. (Oh yeah, put it in a plastic bag first.
Condensation, you know. Bad for electronics.)

In my personal experience the drive's bearings seem to go before the motor,
but I know that's not a general rule...

> Cheers,
> Dick Johnson

Rob

2001-11-27 00:19:21

by Jonathan Lundell

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

At 12:49 AM +0100 11/27/01, Martin Eriksson wrote:
>I sure think the drives could afford the teeny-weeny cost of a power failure
>detection unit, that when a power loss/sway is detected, halts all
>operations to the platters except for the writing of the current sector.

That's hard to do. You really need to do the power-fail detection on
the AC line, or have some sort of energy storage and a dc-dc
converter, which is expensive. If you simply detect a drop in dc
power, there simply isn't enough margin to reliably write a block.

Years (many years) back, Diablo had a short-lived model (400, IIRC)
that had an interesting twist on this. On a power failure, the
spinning disk (this was in the days of 14" platters, so plenty of
energy) drove the spindle motor as a generator, providing power to
the drive electronics for several seconds before it spun down to
below operating speed.

Of course, that was in the days of thousands of dollars for maybe
20MB of storage....
--
/Jonathan Lundell.

2001-11-27 00:24:54

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Followup to: <[email protected]>
By author: Andreas Dilger <[email protected]>
In newsgroup: linux.dev.kernel
>
> The other thing that concerns a journaling fs is write ordering. If you
> can _guarantee_ that an entire track (or whatever) can be written to disk
> in _all_ cases, then it is OK to reorder write requests within that track
> AS LONG AS YOU DON'T REORDER WRITES WHERE YOU SKIP BLOCKS THAT ARE NOT
> GUARANTEED TO COMPLETE.
>
> Generally, in Linux, ext3 will wait on all of the journal transaction
> blocks to be written before it writes a commit record, which is its way
> of guaranteeing that everything before the commit is valid. If you start
> write cacheing the transaction blocks, return, and then write the commit
> record to disk before the other transaction blocks are written, this is
> SEVERELY BROKEN. If it was guaranteed that the commit record would hit
> the platters at the same time as the other journal transaction blocks,
> that would be the minimum acceptable behaviour.
>
> Obviously a working TCQ or write barrier would also allow you to optimize
> all writes before the commit block is written, but that should be an
> _enhancement_ above the basic write operations, only available if you
> start using this feature.
>

Indeed; having explicit write barriers would be a very useful feature,
but the drives MUST default to strict ordering unless reordering (with
write barriers) have been enabled explicitly by the OS.

Furthermore, I would like to add the following constraint to your
writeup:

** For each individual sector, a write MUST either complete or not
take place at all. In other words, writes are guaranteed to be
atomic on a sector-by-sector basis.

-hpa

P.S. Thanks, Andre, for taking the initiative of getting an actual
commit model into the standardized specification. Otherwise we'd be
doomed to continue down the path where what operating systems need for
sane operation and what disk drives provide are increasingly
divergent.
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-27 00:32:44

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Followup to: <[email protected]>
By author: "Richard B. Johnson" <[email protected]>
In newsgroup: linux.dev.kernel
>
> It isn't that easy! Any kind of power storage within the drive would
> have to be isolated with diodes so that it doesn't try to run your
> motherboard as well as the drive. This means that +5 volt logic supply
> would now be 5.0 - 0.6 = 4.4 volts at the drive, well below the design
> voltage. Use of a Schottky diode (0.34 volts) would help somewhat, but you
> have now narrowed the normal power design-margin by 90 percent, not good.
>

Hardly a big deal since most logic is 3.3V these days (remember, you
don't need to maintain VccIO since the bus is dead anyway).

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-27 00:53:44

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Followup to: <[email protected]>
By author: "H. Peter Anvin" <[email protected]>
In newsgroup: linux.dev.kernel
>
> Indeed; having explicit write barriers would be a very useful feature,
> but the drives MUST default to strict ordering unless reordering (with
> write barriers) have been enabled explicitly by the OS.
>

On the subject of write barriers... such a setup probably should have
a serial number field for each write barrier command, and a "WAIT FOR
WRITE BARRIER NUMBER #" command -- which will wait until all writes
preceeding the specified write barrier has been committed to stable
storage. It might also be worthwhile to have the equivalent
nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-27 01:05:44

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

>
> At 12:49 AM +0100 11/27/01, Martin Eriksson wrote:
> >I sure think the drives could afford the teeny-weeny cost of a power failure
<snip>
> converter, which is expensive. If you simply detect a drop in dc
> power, there simply isn't enough margin to reliably write a block.
>
> Years (many years) back, Diablo had a short-lived model (400, IIRC)
> that had an interesting twist on this. On a power failure, the
> spinning disk (this was in the days of 14" platters, so plenty of
> energy) drove the spindle motor as a generator, providing power to
> the drive electronics for several seconds before it spun down to
> below operating speed.

I have a (IIRC) elantec databook from 1985 or so, that I've found chips in
disks from the MFM/RLL PC era.
These are motor driver chips aimed at PCs, which support generation
using the motor.

2001-11-27 01:13:06

by Andrew Morton

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

"H. Peter Anvin" wrote:
>
> Followup to: <[email protected]>
> By author: "H. Peter Anvin" <[email protected]>
> In newsgroup: linux.dev.kernel
> >
> > Indeed; having explicit write barriers would be a very useful feature,
> > but the drives MUST default to strict ordering unless reordering (with
> > write barriers) have been enabled explicitly by the OS.
> >
>
> On the subject of write barriers... such a setup probably should have
> a serial number field for each write barrier command, and a "WAIT FOR
> WRITE BARRIER NUMBER #" command -- which will wait until all writes
> preceeding the specified write barrier has been committed to stable
> storage. It might also be worthwhile to have the equivalent
> nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED.
>

For ext3 at least, all that is needed is a barrier which says
"don't reorder writes across here". Asynchronous behaviour
beyond that is OK - the disk is free to queue multiple transactions
internally as long as the barriers are observed. If the power
goes out we'll just recover up to and including the last-written
commit block.

-

2001-11-27 01:16:04

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Andrew Morton wrote:

> "H. Peter Anvin" wrote:
>
>>Followup to: <[email protected]>
>>By author: "H. Peter Anvin" <[email protected]>
>>In newsgroup: linux.dev.kernel
>>
>>>Indeed; having explicit write barriers would be a very useful feature,
>>>but the drives MUST default to strict ordering unless reordering (with
>>>write barriers) have been enabled explicitly by the OS.
>>>
>>>
>>On the subject of write barriers... such a setup probably should have
>>a serial number field for each write barrier command, and a "WAIT FOR
>>WRITE BARRIER NUMBER #" command -- which will wait until all writes
>>preceeding the specified write barrier has been committed to stable
>>storage. It might also be worthwhile to have the equivalent
>>nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED.
>>
>>
>
> For ext3 at least, all that is needed is a barrier which says
> "don't reorder writes across here". Asynchronous behaviour
> beyond that is OK - the disk is free to queue multiple transactions
> internally as long as the barriers are observed. If the power
> goes out we'll just recover up to and including the last-written
> commit block.
>

Waiting for write barriers to clear is key to implementing fsync()
efficiently and correctly.

-hpa

2001-11-27 01:28:57

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

>
> On Monday 26 November 2001 15:30, Andre Hedrick wrote:
> > On Mon, 26 Nov 2001, Rob Landley wrote:
>
> > > Just add an off-the-shelf capacitor to your circuit. The firmware
> > > already has to detect power failure in order to park the head sanely, so

> > Send me an outline/discription and I will present it during the Dec T13
> > meeting for a proposal number for inclusion into ATA-7.
>
> What kind of write-up do you want? (How formal?)
>
> The trick here is limiting the scope of the problem. Your buffer can't be
> larger than you can reliably write back on a sudden power failure. (This
> should be obvious.) So the obvious answer is to make your writeback cache
> SMALL. The problems that go with flushing it are then correspondingly small.
<snip>
>
> Now a cache large enough to hold 2 full tracks could also hold dozens of
> individual sectors scattered around the disk, which could take a full second
> to write off and power down. This is a "doctor, it hurts when I do this"
> question. DON'T DO THAT.

Or, to seek to a journal track, and write the cache to it.
Errors are a problem, writing twice may help.
This avoids having to block on bad write patterns, for example, if you
are writing mixed blocks that go to tracks 1 and 88, you can't start to
write blocks that would go to track 44.
Performance would rise if it can do the writes in elevator order.

<snip>
> That way, the power down problem is strictly limited:
>
> 1) write out the track you're over
> 2) seek to the second track
> 3) write that out too
> 4) park the head

Or 2) optionally seek to the journal track, and write the journal.

>
> What new hardware is involved?
>
> Add a capacitor.
>
> Add a power level sensor. (Drives may already have this to know when to park
> the head.)
Most drives I've taken apart recently seem to have passive means,
a spring to move the head to the side, and a magnet to hold it there.
<Snip>>
> I think that's it. Did I miss anything? Oh yeah, on power fail stop

It needs a power switch to stop back-feeding the computer.

2001-11-27 01:34:37

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Followup to: <[email protected]>
By author: Ian Stirling <[email protected]>
In newsgroup: linux.dev.kernel
>
> I have a (IIRC) elantec databook from 1985 or so, that I've found chips in
> disks from the MFM/RLL PC era.
> These are motor driver chips aimed at PCs, which support generation
> using the motor.
>

This is still being done, AFAIK. There is quite some amount of energy
in a 7200 rpm platter set.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-27 01:52:38

by Steve Underwood

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Jonathan Lundell wrote:

> At 12:49 AM +0100 11/27/01, Martin Eriksson wrote:
>
>> I sure think the drives could afford the teeny-weeny cost of a power
>> failure
>> detection unit, that when a power loss/sway is detected, halts all
>> operations to the platters except for the writing of the current sector.
>
>
> That's hard to do. You really need to do the power-fail detection on the
> AC line, or have some sort of energy storage and a dc-dc converter,
> which is expensive. If you simply detect a drop in dc power, there
> simply isn't enough margin to reliably write a block.
>
> Years (many years) back, Diablo had a short-lived model (400, IIRC) that
> had an interesting twist on this. On a power failure, the spinning disk
> (this was in the days of 14" platters, so plenty of energy) drove the
> spindle motor as a generator, providing power to the drive electronics
> for several seconds before it spun down to below operating speed.
>
> Of course, that was in the days of thousands of dollars for maybe 20MB
> of storage....

Quite true. The drives really need to get an "oh heck, the power's about
to die. Quick, tidy up" signal from the outside world (like down the
ribbon). Cheap, at the limit, PSUs probably couldn't give enough notice
to be very helpful. Server grade ones should - they can usually ride
over brief hiccups in the power, so they should be able to give a few
10s of ms notice before the regulated power lines start to droop.
Perhaps the ATA command set should include such a feature, so the OS
could take instruction from the hardware on the power situation, and
tell the drives what to do.

Regards,
Steve

2001-11-27 02:02:18

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Monday 26 November 2001 20:23, Ian Stirling wrote:

> > Now a cache large enough to hold 2 full tracks could also hold dozens of
> > individual sectors scattered around the disk, which could take a full
> > second to write off and power down. This is a "doctor, it hurts when I
> > do this" question. DON'T DO THAT.
>
> Or, to seek to a journal track, and write the cache to it.

Except that at most you have one seek to write out all the pending cache data
anyway, so what exactly does seeking to a journal track buy you?

Now modern drives have this fun little thing where they remap bad sectors so
writing to one logical track can involve a seek, and the idea here is to cap
seeks, so the drive has to keep track how where sectors ACTUALLY are and
block based on their physical position rather than the logical position they
present to the system. Which could be fairly evil. But oh well...

(And in theory, if you're doing a linear write on a sector by sector basis,
the discontinuous portions of a damaged track (the first half of the track,
with one sector out of line, followed by the rest of track) could still be
written in one go assuming the system unblocks when it physically seeks to
the track in question, allowing the system to write the rest of the data to
that track before it seeks away from it...)

> Errors are a problem, writing twice may help.
> This avoids having to block on bad write patterns, for example, if you
> are writing mixed blocks that go to tracks 1 and 88, you can't start to
> write blocks that would go to track 44.
> Performance would rise if it can do the writes in elevator order.

The elevator is the operating system's problem. To reliably write stuff back
you can't have an unlimited number of different tracks in cache, or the seeks
to write it all out will kill any reasonable finite power reserve you'd want
to put in a disk.

> <snip>
>
> > That way, the power down problem is strictly limited:
> >
> > 1) write out the track you're over
> > 2) seek to the second track
> > 3) write that out too
> > 4) park the head
>
> Or 2) optionally seek to the journal track, and write the journal.

Possibly. I still don't see what it gets you if you only have one track
other than the one you're over to write to. (is the journal track near the
area the head parks in? That could be a power saving method, I suppose. But
it's also wasting disk space that would probably otherwise be used for
storage or a block remapping, and how do you remap a bad sector out of the
journal track if that happens?)

> > What new hardware is involved?
> >
> > Add a capacitor.
> >
> > Add a power level sensor. (Drives may already have this to know when to
> > park the head.)
>
> Most drives I've taken apart recently seem to have passive means,
> a spring to move the head to the side, and a magnet to hold it there.

Yeah, I'd heard that. That's why the word "may" was involved. :) (That and
just trusting the inertia of the platter to aerodynamically keep the head
airborne before it can snap back to the parking position.)

You could still do this, by the way. It reduces the power requirements to
only one seek. And with the journal track hack, that seek could be in the
direction the spring pulls. Still not too thrilled about that, though...

> <Snip>>
>
> > I think that's it. Did I miss anything? Oh yeah, on power fail stop
>
> It needs a power switch to stop back-feeding the computer.

Yup.

Rob

2001-11-27 02:42:40

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Followup to: <[email protected]>
By author: Rob Landley <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Monday 26 November 2001 20:23, Ian Stirling wrote:
>
> > > Now a cache large enough to hold 2 full tracks could also hold dozens of
> > > individual sectors scattered around the disk, which could take a full
> > > second to write off and power down. This is a "doctor, it hurts when I
> > > do this" question. DON'T DO THAT.
> >
> > Or, to seek to a journal track, and write the cache to it.
>
> Except that at most you have one seek to write out all the pending cache data
> anyway, so what exactly does seeking to a journal track buy you?
>

It limits the amount you need to seek to exactly one seek.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-27 03:21:21

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Monday 26 November 2001 21:41, H. Peter Anvin wrote:
> Followup to: <[email protected]>
> By author: Rob Landley <[email protected]>
> In newsgroup: linux.dev.kernel
>
> > On Monday 26 November 2001 20:23, Ian Stirling wrote:
> > > > Now a cache large enough to hold 2 full tracks could also hold dozens
> > > > of individual sectors scattered around the disk, which could take a
> > > > full second to write off and power down. This is a "doctor, it hurts
> > > > when I do this" question. DON'T DO THAT.
> > >
> > > Or, to seek to a journal track, and write the cache to it.
> >
> > Except that at most you have one seek to write out all the pending cache
> > data anyway, so what exactly does seeking to a journal track buy you?
>
> It limits the amount you need to seek to exactly one seek.
>
> -hpa

But it's already exactly one seek in the scheme I proposed. Notice how of
the two tracks you can be write-cacheing data for, one is the track you're
currently over (no seek required, you're there). You flush to that track,
there's one more seek to flush to the second track (which you were only
cacheing data for to avoid latency, so the seek could start immediately
without waiting for the OS to provide data), and then park.

Now a journal track that's next to where the head parks could combine the
"park" sweep with that one seek, and presumably be spring powered and hence
save capacitor power. But I'm not 100% certain it would be worth it. (Are
normal with-power-on seeks towards the park area powered by the spring, or
the... I keep wanting to say "stepper motor" but I don't think those are what
drives use anymore, are they? Sigh...)

Rob

2001-11-27 03:47:52

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

>
> On Monday 26 November 2001 20:23, Ian Stirling wrote:
>
> > > Now a cache large enough to hold 2 full tracks could also hold dozens of
> > > individual sectors scattered around the disk, which could take a full
> > > second to write off and power down. This is a "doctor, it hurts when I
> > > do this" question. DON'T DO THAT.
> >
> > Or, to seek to a journal track, and write the cache to it.
>
> Except that at most you have one seek to write out all the pending cache data
> anyway, so what exactly does seeking to a journal track buy you?

The ability to possibly dramatically improve performance by allowing
more than one or two tracks to be write cached at once.
Yes, in theory, the system should be able to elevator all seeks, but
it may not know that track 400 has really been remapped to 200, the
drive does.

With write-caching on, the system doesn't know where the head is,
the drive does.
And, it's nearly free (an extra meg of space)
<snip>
> Possibly. I still don't see what it gets you if you only have one track
> other than the one you're over to write to. (is the journal track near the
> area the head parks in? That could be a power saving method, I suppose. But
> it's also wasting disk space that would probably otherwise be used for
> storage or a block remapping, and how do you remap a bad sector out of the
> journal track if that happens?)

You simply pick another track for the journal, the same as you would
if an ordinary track goes bad.
(it's tested on boot)
The waste of disk space is utterly trivial.
A meg in drives where the entry level is 40G?

2001-11-27 05:04:39

by Stephen Satchell

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

At 09:57 AM 11/27/01 +0800, Steve Underwood wrote:
>Quite true. The drives really need to get an "oh heck, the power's about
>to die. Quick, tidy up" signal from the outside world (like down the
>ribbon). Cheap, at the limit, PSUs probably couldn't give enough notice to
>be very helpful. Server grade ones should - they can usually ride over
>brief hiccups in the power, so they should be able to give a few 10s of ms
>notice before the regulated power lines start to droop. Perhaps the ATA
>command set should include such a feature, so the OS could take
>instruction from the hardware on the power situation, and tell the drives
>what to do.

Looking at the various interface specifications, both SCSI and ATA have the
ability to signal to the drive that the power is going, and do it in such a
way that the drive would have at least 10 milliseconds from the time the
hardware signal is received by the drive before +5 and +12 go out of
specification.

This time is based on the specifications for ATX power supplies, as I
assume most modern boxes that are used for production applications would be
using an ATX power supply or similar. Lest you think this lets older
systems off the hook, the 1981 IBM PC Technical Reference describes (in
looser language) a similar requirement.

The question remains whether (1) modern motherboards and SCSI controllers
pass through the POWER-OK signal to the RESET- line (IDE/ATA) and RSET
(SCSI), and (2) the hard drives respond intelligently to power-failure
indications.

Telling the difference between a bus-reset event and a panic reset would be
easy: if the reset signal is asserted for more than a millisecond or two
(such as when the POWER-OK signal from the power supply goes away) then the
box is in a power panic situation. Preventing spurious power panics is the
responsibility of the power supply designer, particularly if the supply
uses a large energy-storage capacitor designed to let the system ride out
power-switching events without hiccup.

Suggestion to the people contributing to ATA-7: write some language that
talks specifically about power-failure scenarios, and define a power-crisis
state based on the signals available to the drives from ATA interfaces to
determine that a power-crisis event has occurred. If the committee would
sit still for it, make it a separate section that appears in the table of
contents.

Suggestion to the people contributing to SCSI standards: ditto.

Satch

2001-11-27 07:04:22

by Ville Herva

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, Nov 26, 2001 at 03:35:07PM -0500, you [Rob Landley] claimed:
>
> What kind of write-up do you want? (How formal?)
>

(...)

> That way, the power down problem is strictly limited:
>
> 1) write out the track you're over
> 2) seek to the second track
> 3) write that out too
> 4) park the head

(...)

A stupid question. Instead of adding there electric components and smart
features to drive logic, couldn't the problem be simply be taken care of by
adding an acknowledge message to the ATA protocol (unless it already has
one)?

So _after_ the data has been 100% committed to _disk_, the disk would
acknowledge the OS. The OS wouldn't have to wait on the command (unless it
wants to -- think of write ordering barrier!), and the disk could have as
large cache as it needs. It would simply accept the write command to its
cache and send the ACKs even half a second later. The OS wouldn't consider
anything as committed to disk before its gets the ACK.

Again, I know nothing of ATA so this can be impossible to do (strict ordered
command-reply protocol?), or already implemented but not enough. Please
correct me. I must be missing something.

-- v --

[email protected]

2001-11-27 07:39:55

by Andreas Dilger

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Nov 26, 2001 16:16 -0800, Andre Hedrick wrote:
> On Mon, 26 Nov 2001, Andreas Dilger wrote:
> > What happens if you have a slightly bad power supply? Does it immediately
> > go read only all the time? It would definitely need to be able to
> > recover operations as soon as the power was "normal" again, even if this
> > caused basically "sync" I/O to the disk. Maybe it would be able to
> > report this to the user via SMART, I don't know.
>
> ATA/SCSI SMART is already DONE!
>
> To bad most people have not noticed.

Oh, I know SMART is implemented, although I haven't actually seen/used a
tool which takes advantage of it (do you have such a thing?). It would
be nice if there were messages appearing in my syslog (just like the
AIX days) which said "there were 10 temporary read errors at block M on
drive X yesterday" and "1 permanent write error at block M, block remapped
on drive X yesterday", so I would know _before_ my drive craps out
after all of the remapping table is full, or the temporary read errors
become permanent. (I have a special interest in that because my laptop
hard drive sounds like a jet engine at times... ;-).

What I was originally suggesting is that it have a field which can report
to the user that "there were 800 sync/reset operations because of power
drops that were later found not to be power failures". That is what
I was suggesting SMART report in this case (actual power failures are
not interesting). Note also, that this is purely hypothetical, based
on only a vague understanding on what actually happens when the drive
thinks it is losing power, and only ever having seen the hex output
of /proc/ide/hda/smart_{values,thresholds}.

Being able to get a number back from the hard drive that it is performing
poorly (i.e. synchronous I/O + lots of resets) because of a bad power supply
is exactly what SMART was designed to do - predictive failure analysis.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-27 11:49:37

by Ville Herva

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Tue, Nov 27, 2001 at 12:38:43AM -0700, you [Andreas Dilger] claimed:
>
> Oh, I know SMART is implemented, although I haven't actually seen/used a
> tool which takes advantage of it (do you have such a thing?). It would
> be nice if there were messages appearing in my syslog (just like the
> AIX days) which said "there were 10 temporary read errors at block M on
> drive X yesterday" and "1 permanent write error at block M, block remapped
> on drive X yesterday", so I would know _before_ my drive craps out

There are packaged smartsuite and ide-smart at linux-ide.org. I think smartd
from smartsuite does just that.

At least smartctl does read the values in understandable format.

BTW: does anyone know if it is supposed to understand the temperature
sensors supposedly present in newer IBM drives?

-- v --

[email protected]

2001-11-27 16:36:41

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

My experience is with SCSI not ATA, so adjust accordingly...

On Mon, 26 Nov 2001 13:36:06 -0800 (PST), you wrote:

>On Mon, 26 Nov 2001, Steve Brueggeman wrote:
>
Snip-out my stuff
>
>One has to go read the general purpose error logs to determine the
>location of the original platter assigned sector of the relocated LBA.
>
>Reallocation generally occurs on write to media not read, and you should
>know that point.

Actually, it has been my experience that most reallocations occur on
reads. The reasons for this are two fold.

1) most systems out there do an order of magnitude more reads than
writes,
2) the amount of data read (servo/headers, sync...) for a write
operation is an order of magnitude less than it is for a read
operation (same servo/header/sync plus the whole data-field and
CRC/ECC).

Note: no media related errors can be detected while write-gate is
active. Only servo positioning errors, and even that's not likely
with the current drives using embeded servo.

>
Snip some more of my stuff
>
>By the time an ATA device gets to generating this message, either the bad
>block list is full or all reallocation sectors are used. Unlike SCSI
>which has to be hand held, 90% of all errors are handle by the device.
>Good or Bad -- that is how it does it.

I think what you meant is, 90% of all errors are handled silently by
the device. I don't like silent errors.

>
>Well there is an additional problem in all of storage, that drives do
>reorder and do not always obey the host-driver. Thus if the device is
>suffering from performance and you have disabled WB-Cache, it may elect to
>self enable. Now you have the device returning ack to platter that may
>not be true. Most host-drivers (all of Linux, mine include) release and
>dequeue the request once the ack has been presented. This is dead wrong.
>If a flush cache fails I get back the starting lba of the write request,
>and if the request is dequeued -- well you know -- bye bye data! SCSI
>will do the same, even with TCQ. Once the sense is cleared to platter and
>the request is dequeued, and a hiccup happens -- bye bye data!
>

It is my convicted opinion that any device that automatically enables
write-caching is broken and anyone who enables write caching probably
doesn't know what they're doing. A system simply cannot get reliable
error reporting with write caching enabled.

Without write-caching, if you get good completion status, the data is
GURANTEED to be on the platter. `With` write-caching, the best you
can hope for are deferred errors, but I am yet to see a system that
can properly cope with deferred errors, so at best, they're
informational only.

I once had to write some drive test software that ran with
write-caching enabled, on a drive in degraded mode. The only option I
could come up with was to maintain a history of the last 2 X queue
depth commands sent to each device, and do a look-up for the LBA in
the deferred error for all commands in the history that had a range
that covered the LBA in error. Unfortunately, this was under DOS and
this was not an option because the memory was too tight. What I ended
up with was better than nothing, but still could not catch 100% of the
deferred errors.

(More snippage)

>
>Question -- are you up to fixing the low-level drivers and all the stuff
>above ?
>

Probably not, as my plate's pretty full.

Though, I would like to understand more specifically what you're
talking about...

I see the following opportunities..

1) Read returns unrecoverable error.
write to bad sector and re-read
if re-read returns unrecoverable error, manually
reallocate

This should not be done automatically, since there is no easy way to
determine whether that sector is in a free list, and we are only
allowed to write to sectors in a free list. This would best be done
by the badblocks utility, in combination with fsck. Maybe it already
does this, I haven't looked.

2) At device initialization, and after device resets, force
write-cache to be disabled. (not very nice to those that would rather
have write cache enabled... poor souls)

3) Set the Force Unit Access bit for all write commands. (again, not
very nice to those poor souls that would rather have write cache
enabled)

4) Reordering of commands is rather unrelated to the problem at hand,
but it is a concern for anything that needs ordered transactions. The
Linux SCSI layers only inject an ordered command every so-often to
prevent command starvation, but for ordered transactions, the SCSI
layer should probably be forcing the sending of Ordered command queue
messages with the CDB. I'd rather hate to see every SCSI request
become an ordered command queue, since the disk drive really does know
best how to reorder it's queue of commands. The SCSI block layer
really needs some clues from the upper layers in my opinion, about
whether a given request needs to be ordered or not. But I digress.
This is a whole other topic.

Steve Brueggeman

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

2001-11-27 16:39:31

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, 26 Nov 2001, Rob Landley wrote:

> Having no write caching on the IDE side isn't a solution either. The problem
> is the largest block of data you can send to an ATA drive in a single command
> is smaller than modern track sizes (let alone all the tracks under the heads
> on a multi-head drive), so without any sort of cacheing in the drive at all
> you add rotational latency between each write request for the point you left
> off writing to come back under the head again. This will positively KILL
> write performance. (I suspect the situation's more or less the same for read
> too, but nobody's objecting to read cacheing.)
>
> The solution isn't to avoid write cacheing altogether (performance is 100%
> guaranteed to suck otherwise, for reasons unrelated to how well your hardware
> works but to legacy request size limits in the ATA specification), but to
> have a SMALL write buffer, the size of one or two tracks to allow linear ATA
> write requests to be assembled into single whole-track writes, and to make
> sure the disks' electronics has enough capacitance in it to flush this buffer
> to disk. (How much do capacitors cost? We're talking what, an extra 20
> miliseconds? The buffer should be small enough you don't have to do that
> much seeking.)

Two things:

1- power loss. Fixing things to write to disk is bound to fail in
adverse conditions. If the drive suffers from write problems and the
write takes longer than the charge of your capacitor lasts, your
data is still toasted. nonvolatile memory with finite write time
(like NVRAM/Flash) will help to save the Cache. I don't think vendors
will do that soon.

2- error handling with good power: with automatic remapping turned on,
there's no problem, the drive can re-write a block it has taken
responsibility of, IBM DTLA drives will automatically switch off the
write cache when the number of spare block gets low.

with automatic remapping turned off, write errors with enabled write
cache get a real problem because the way it is now, when the drive
reports the problem, the block has already expired from the write
queue and is no longer available to be rescheduled. That may mean
that although fsync() completed OK your block is gone.

Tagged queueing may help, as would locking a block with write faults in
the drive and sending it back along with the error condition to the
host.

(*) of course, journal data must be written in an ordered fashion to
prevent trouble in case of power loss.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-11-27 16:50:54

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Please fix your domain in your mailer, localhost.localdomain is prone
for Message-ID collisions.

Note, the power must RELIABLY last until all of the data has been
writen, which includes reassigning, seeking and the like, just don't do
it if you cannot get a real solution. battery-backed CMOS,
NVRAM/Flash/whatever which lasts a couple of months should be fine
though, as long as documents are publicly available that say how long
this data lasts. Writing to disk will not work out unless you can keep
the drive going for several seconds which will require BIG capacitors,
so that's no option, you must go for NVRAM/Flash or something.

OTOH, the OS must reliably know when something went wrong (even with
good power it has a right to know), and preferably this scheme should
not involve disabling the write cache, so TCQ or something mandatory
would be useful (not sure if it's mandatory in current ATA standards).

If a block has first been reported written OK and the disk later reports
error, it must send the block back (incompatible with any current ATA
draft I had my hands on), so I think tagged commands which are marked
complete only after write+verify are the way to go.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-11-27 16:56:53

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, 26 Nov 2001, H. Peter Anvin wrote:

> On the subject of write barriers... such a setup probably should have
> a serial number field for each write barrier command, and a "WAIT FOR
> WRITE BARRIER NUMBER #" command -- which will wait until all writes
> preceeding the specified write barrier has been committed to stable
> storage. It might also be worthwhile to have the equivalent
> nonblocking operation -- QUERY LAST WRITE BARRIER COMMITTED.

A query model is not useful, because it involves polling, which is not
what you want because it clogs up the CPU.

Write barriers may be fun, however, they impose ordering constraints on
the host side, which is not too useful. Real tagged commands and tagged
completion will be really useful for performance, with write barriers,
for example:

data000 group A
data001 group B
data254 group A
data253 group A
data274 group B
barrier group A
data002 group B

or something, and the drive could reorder anything, but it would only
have to guarantee that all group-A data sent before the barrier would
have made it to disk when the barrier command completed.

2001-11-27 17:01:13

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, 26 Nov 2001, H. Peter Anvin wrote:

> Waiting for write barriers to clear is key to implementing fsync()
> efficiently and correctly.

Well, all you want is a feature to write a set of blocks and be
acknowledged of completion of the write before you send more data, but
OTOH you would not want to serialize fsync() operations, see my "groups"
that I told previously. That would probably involve tagging data blocks
in the long run. Not sure if the current tag command API of ATA can
already provide that, if so, all is there, and the barrier can be
implemented in the driver rather than the drive firmware.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-11-27 17:42:25

by Martin Eriksson

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

----- Original Message -----
From: "Matthias Andree" <[email protected]>
To: <[email protected]>
Sent: Tuesday, November 27, 2001 5:39 PM
Subject: Re: Journaling pointless with today's hard disks?

<snip>
>
> Two things:
>
> 1- power loss. Fixing things to write to disk is bound to fail in
> adverse conditions. If the drive suffers from write problems and the
> write takes longer than the charge of your capacitor lasts, your
> data is still toasted. nonvolatile memory with finite write time
> (like NVRAM/Flash) will help to save the Cache. I don't think vendors
> will do that soon.
>
> 2- error handling with good power: with automatic remapping turned on,
> there's no problem, the drive can re-write a block it has taken
> responsibility of, IBM DTLA drives will automatically switch off the
> write cache when the number of spare block gets low.
>
> with automatic remapping turned off, write errors with enabled write
> cache get a real problem because the way it is now, when the drive
> reports the problem, the block has already expired from the write
> queue and is no longer available to be rescheduled. That may mean
> that although fsync() completed OK your block is gone.

I think we have gotten away from the original subject. The problem (as I
understood it) wasn't that we don't have time to write the whole cache...
the problem is that the hard disk stops in the middle of a write, not
updating the CRC of the sector, thus making it report as a bad sector when
trying to recover from the failure. No?

I think most people here are convinced that there is not time to write a
several-MB (worst case) cache to the platters in case of a power failure.
Special drives for this case could of course be manufactured, and here's for
a theory of mine: Wouldn't a battery backed-up SRAM cache do the thing?

Anyway, maybe it is just me who have been thrown off-track? Are we
discussing something else now maybe?

<snap>

_____________________________________________________
| Martin Eriksson <[email protected]>
| MSc CSE student, department of Computing Science
| Ume? University, Sweden

2001-11-27 20:11:38

by Bill Davidsen

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Tue, 27 Nov 2001, Steve Brueggeman wrote:

> 2) At device initialization, and after device resets, force
> write-cache to be disabled. (not very nice to those that would rather
> have write cache enabled... poor souls)
>
> 3) Set the Force Unit Access bit for all write commands. (again, not
> very nice to those poor souls that would rather have write cache
> enabled)

I don't have a problem with setting things to "most likely to succeed"
values, and (2) fits that. Those who really want w/c can enable in
rc.local. However, practice (3) is something I would associate with other
operating systems which believe that the computer knows best. You may
personally believe that you will trade any amount of performance for a
slight increase in reliability, but other may want to take the risk of
losing data and have the computer fast enough to do their work. I don't
think it's remotely Linux policy to do things like that, and I certainly
hope it doesn't happen.

Both decent disk drives and UPS systems are available, and having been in
the position of having systems which can't quite keep up with the load, I
want the option of doing what seems best. We have gotten along for years
without doing something to force bypass of w/c, it seems that hdparm is up
to continuing to allow people to make their own choices.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2001-11-27 21:29:10

by Wayne Whitney

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

In mailing-lists.linux-kernel, Andre Hedrick wrote:

> By the time an ATA device gets to generating this message, either the bad
> block list is full or all reallocation sectors are used. Unlike SCSI
> which has to be hand held, 90% of all errors are handle by the device.

Perhaps you or one of the other gurus could explain the following
observations, which I am sure that I and many other readers would find
very enlightening:

I have an IBM-DTLA-307045 drive connected to a PDC20265 controller on
an ia32 machine running 2.4.16. After reading this discussion and
hearing about the problems that others have had with the IBM 75GXP
series, I thought that I should test out my drive to see if it is OK.
So I ran 'dd if=/dev/hde of=/dev/null bs=128k'. Every thing went
fine, except for about five seconds in the middle, when the disk made
a lot of grinding sounds and the system was unresponsive. That
generated the log messages messages appended below.

However, running the dd command again (after a reboot) produced no
errors. So the drive remapped some bad sectors the first time
through? The common wisdom here is that once you get to the point
where the drive reports a bad sector, you are in trouble. If so, why
did the second dd command work OK? I have had no other problems with
this drive.

Thanks, Wayne

hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939804
end_request: I/O error, dev 21:00 (hde), sector 12939804
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939806
end_request: I/O error, dev 21:00 (hde), sector 12939806
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939808
end_request: I/O error, dev 21:00 (hde), sector 12939808
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939810
end_request: I/O error, dev 21:00 (hde), sector 12939810
hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939812
end_request: I/O error, dev 21:00 (hde), sector 12939812

2001-11-27 21:54:58

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

I strongly suggest you execute the extend tests in the smart-suite
authored by a friend of mine that I have listed on my http://www.linux-ide.org.

What you have done is trigger a process to have the device go into a
selftest mode to perform a block test. I would tell you more but I may
have exposed myself already.

Regards, you need to execute an extend smart offline test.
Also be sure to query the smart logs.

Respectfully,

Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project

On Tue, 27 Nov 2001, Wayne Whitney wrote:

> In mailing-lists.linux-kernel, Andre Hedrick wrote:
>
> > By the time an ATA device gets to generating this message, either the bad
> > block list is full or all reallocation sectors are used. Unlike SCSI
> > which has to be hand held, 90% of all errors are handle by the device.
>
> Perhaps you or one of the other gurus could explain the following
> observations, which I am sure that I and many other readers would find
> very enlightening:
>
> I have an IBM-DTLA-307045 drive connected to a PDC20265 controller on
> an ia32 machine running 2.4.16. After reading this discussion and
> hearing about the problems that others have had with the IBM 75GXP
> series, I thought that I should test out my drive to see if it is OK.
> So I ran 'dd if=/dev/hde of=/dev/null bs=128k'. Every thing went
> fine, except for about five seconds in the middle, when the disk made
> a lot of grinding sounds and the system was unresponsive. That
> generated the log messages messages appended below.
>
> However, running the dd command again (after a reboot) produced no
> errors. So the drive remapped some bad sectors the first time
> through? The common wisdom here is that once you get to the point
> where the drive reports a bad sector, you are in trouble. If so, why
> did the second dd command work OK? I have had no other problems with
> this drive.
>
> Thanks, Wayne
>
> hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939804
> end_request: I/O error, dev 21:00 (hde), sector 12939804
> hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939806
> end_request: I/O error, dev 21:00 (hde), sector 12939806
> hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939808
> end_request: I/O error, dev 21:00 (hde), sector 12939808
> hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939810
> end_request: I/O error, dev 21:00 (hde), sector 12939810
> hde: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hde: dma_intr: error=0x40 { UncorrectableError }, LBAsect=12939888, sector=12939812
> end_request: I/O error, dev 21:00 (hde), sector 12939812
>

2001-11-27 23:48:03

by Andreas Bombe

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, Nov 26, 2001 at 07:19:54PM -0500, Rob Landley wrote:
> Now a journal track that's next to where the head parks could combine the
> "park" sweep with that one seek, and presumably be spring powered and hence
> save capacitor power. But I'm not 100% certain it would be worth it.

When time if of essence it should be worth it (drive makers will use the
smallest possible capacitor, of course). Given that current 7200 RPM
disks have marketed seek times of 8 or 9 ms worst case seeks can be much
longer.

That 8ms is average and likely read seeks are weighted higher than write
seeks. Writes have to be exact, but reads can be seeked sloppier
(without waiting for the head to stop oscillating after braking) and
error correction will take care of the rest. This would gives us what
in worst case? 15ms (just a guess)?

A journal track could be near parking track and have directly adjacent
tracks left free to allow for slightly sloppier/faster seeking. An
expert could probably tell us whether this is complete BS or even
feasible.

> (Are
> normal with-power-on seeks towards the park area powered by the spring, or
> the... I keep wanting to say "stepper motor" but I don't think those are what
> drives use anymore, are they? Sigh...)

A simple spring is too slow, I guess. Also, it should not be so hard
that it would slow down seeks against the spring.

--
Andreas Bombe <[email protected]> DSA key 0x04880A44

2001-11-28 11:54:22

by Pedro M. Rodrigues

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Just curious but what can a selfttest mode and consequent block
test do to inspire you such worry? Are we dealing with the mob or
something of the sort when we buy an IBM 75GXP disk?

/Pedro

On 27 Nov 2001 at 13:52, Andre Hedrick wrote:

>
>
> What you have done is trigger a process to have the device go into a
> selftest mode to perform a block test. I would tell you more but I
> may have exposed myself already.
>

2001-11-28 14:37:35

by Galappatti, Kishantha

[permalink] [raw]

Subject: RE: Journaling pointless with today's hard disks?

im curious also.. what do you mean you've "exposed" yourself already? is
this a trade secret?

--kish

-----Original Message-----
From: Pedro M. Rodrigues [mailto:[email protected]]
Sent: Wednesday, November 28, 2001 6:53 AM
To: Wayne Whitney; Andre Hedrick
Cc: LKML
Subject: Re: Journaling pointless with today's hard disks?

Just curious but what can a selfttest mode and consequent block
test do to inspire you such worry? Are we dealing with the mob or
something of the sort when we buy an IBM 75GXP disk?

/Pedro

On 27 Nov 2001 at 13:52, Andre Hedrick wrote:

>
>
> What you have done is trigger a process to have the device go into a
> selftest mode to perform a block test. I would tell you more but I
> may have exposed myself already.
>

2001-11-28 16:37:22

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

>
> ----- Original Message -----
> =46rom: "Matthias Andree" <[email protected]>
> To: <[email protected]>
> Sent: Tuesday, November 27, 2001 5:39 PM
> Subject: Re: Journaling pointless with today's hard disks?
>
<snip>

> I think most people here are convinced that there is not time to write a
> several-MB (worst case) cache to the platters in case of a power failure.
> Special drives for this case could of course be manufactured, and here's
> a theory of mine: Wouldn't a battery backed-up SRAM cache do the thing?

No.
Sram is expensive, as are batteries (they also tend to have poor
cycle life, and mean that you only keep the data until the battery dies.

Numbers...

Taking again as an example, something that's in my machine:
The Fujitsu MPG3409AT, a bargain basement 40G drive.
2 platters, 5400RPM.
It has (at the high end) 798 sec/track.
Worst case, to write a journal track takes a full seek, and at least one
complete rev.
Assuming that we want to write it over two tracks,
This is 18ms + 11*2ms = 40ms.
Now, how much power?
6.3W is needed, so that's .252J

Assuming that the 12V line can be allowed to sag to 10V, that'll take 20%^2
of the energy of the cap out, so we need a cap that stores about .7J, or
a 2500uF cap.

12V 2500uf aluminium electrolytic is rather large, 25mm long *10mm diameter.
There is space for this, in the overall package, but it would need a slight
redesign.
The cost of the component is about 10 cents US.
Another 20-80 cents may be needed for the power switch.
This assumes that no power can be used from the spindle motor, which may
well be wrong.

2001-11-28 17:13:45

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Tuesday 27 November 2001 11:50, Matthias Andree wrote:
> Please fix your domain in your mailer, localhost.localdomain is prone
> for Message-ID collisions.

I'm using Kmail talking to @home's mail server (to avoid the evil behavior
sendmail has behind an IP masquerading firewall that triggers every spam
filter in existence), so if either one of them cares about the hostname of my
laptop ("driftwood", but apparently not being set right by Red Hat's
scripts), then something's wrong anyway.

But let's see...

Ah fun, if you change the hostname of the box, either X or KDE can't pop up
any more new applicatons until you exit X and restart it. Brilliant.
Considering how many Konqueror windows I have open at present on my 6
desktops, I think I'll leave fixing this until later in the evening. But
thanks for letting me know something's up...

>
> Note, the power must RELIABLY last until all of the data has been
> writen, which includes reassigning, seeking and the like, just don't do
> it if you cannot get a real solution.

A) At most 1 seek to a track other than the one you're on.

B) If sectors have been reassigned outside of this track to a "recovery"
track, then that counts as a seperate track. Tough.

The point of the buffer is to let the OS feed data to the write head as fast
as it can write it (which unbuffered ATA can't do because individual requests
are smaller than individual tracks). You need a small buffer to avoid
blocking between each and every ATA write while the platter rotates back into
position. So you always let it have a little more data so it knows what to
do next and can start work on it immediately (doing that next seek, writing
that next sector as it passes under the head without having to wait for it to
rotate around again.)

That's it. No more buffer than does good at the hardware level for request
merging and minimizing seek latency. Any buffering over and above that is
the operating system's job.

Yes the hardware can do a slightly better job with its own elevator algorithm
using intimate knowledge of on-disk layout, but the OS can do a fairly decent
job as long as logical linear sectors are linearly arranged on disk too.
(Relocating bad sectors breaks this, but not fatally. It causes extra seeks
in linear writes anyway where the elevator ISN'T involved, so you've already
GOT a performance hit. And it just screws up the OS's elevator, not the rest
of the scheme. You still have the current track written as one lump and an
immediate seek to the other track, at which point the drive electronics can
be accepting blocks destined for the track you seek back to.)

The advantage of limiting the amount of data buffered to current track plus
one other is you have a fixed amount of work to do on a loss of power. One
seek, two track writes, and a spring-driven park. The amount of power this
takes has a deterministic upper bound. THAT is why you block before
accepting more data than that.

> battery-backed CMOS,
> NVRAM/Flash/whatever which lasts a couple of months should be fine
> though, as long as documents are publicly available that say how long
> this data lasts. Writing to disk will not work out unless you can keep
> the drive going for several seconds which will require BIG capacitors,
> so that's no option, you must go for NVRAM/Flash or something.

You dont' need several seconds. You need MILISECONDS. Two track writes and
one seek. This is why you don't accept more data than that before blocking.
Your worst case scenario is a seek from near where the head parks to the
other end of the disk, then the spring can pull it back. This should be well
under 50 miliseconds. Your huge ram cache is there for reads. For writes
you don't accept more than you can reliably flush if you want anything
approaching reliability. If you're only going to spring for a capacitor as
your power failure hedge, than the amount of write cache you can accept is
small, but it turns out you only need a tiny amount of cache to get 90% of
the benefit of write cacheing (merging writes into full tracks and seeking
immediately to the next track).

> OTOH, the OS must reliably know when something went wrong (even with
> good power it has a right to know), and preferably this scheme should
> not involve disabling the write cache, so TCQ or something mandatory
> would be useful (not sure if it's mandatory in current ATA standards).

We're talking about what happens to the drive on a catastrophic power
failure. (Even with a UPS, this can happen if your case fan jams and your
power supply catches fire and burns through a wire, Although most server side
hosting facilities aren't that dusty, there's always worn bearings and other
such fun things. And in a desktop environment, spilled sodas.) Currently,
there are drives out there that stop writing a sector in the middle, leaving
a bad CRC at the hardware level. This isn't exactly graceful. At the other
end, drives with huge caches discard the contents of cache which a journaling
filesystem thinks are already on disk. This isn't graceful either.

> If a block has first been reported written OK and the disk later reports
> error, it must send the block back (incompatible with any current ATA
> draft I had my hands on), so I think tagged commands which are marked
> complete only after write+verify are the way to go.

If a block goes bad WHILE power is failing, you're screwed. This is just a
touch unlikely. It will happen to somebody out there someday, sure. So will
alpha particle decay corrupting a sector that was long ago written to the
drive correctly. Designing for that is not practical. Recovering after the
fact might be, but that doesn't mean you get your data back.

Rob

2001-11-28 17:23:25

by David Balazic

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Chris Wedgwood ([email protected]) wrote :

> On Sat, Nov 24, 2001 at 02:03:11PM +0100, Florian Weimer wrote:
>
> When the drive is powered down during a write operation, the
> sector which was being written has got an incorrect checksum
> stored on disk. So far, so good---but if the sector is read
> later, the drive returns a *permanent*, *hard* error, which can
> only be removed by a low-level format (IBM provides a tool for
> it). The drive does not automatically map out such sectors.
>
> AVOID SUCH DRIVES... I have both Seagate and IBM SCSI drives which a
> are hot-swappable in a test machine that I used for testing various
> journalling filesystems a while back for reliability.
>
> Some (many) of those tests involved removed the disk during writes
> (literally) and checking the results afterwards.

What do you mean by "removed the disk" ?

- rm /dev/hda ? :-)
- disconnect the disk from the SCSI or ATA bus ?
- from the power supply ?
- both ?
- something else ?

>
> The drives were set not to write-cache (they don't by default, but all
> my IDE drives do, so maybe this is a SCSI thing?)
>
> At no point did I ever see a partial write or corrupted sector; nor
> have I seen any appear in the grown table, so as best as I can tell
> even under removal with sustain writes there are SOME DRIVES WHERE
> THIS ISN'T A PROBLEM.
>
> Now, since EMC, NetApp, Sun, HP, Compaq, etc. all have products which
> presumable depend on this behavior, I don't think it's going to go
> away, it perhaps will just become important to know which drives are
> brain-damaged and list them so people can avoid them.
>
> As this will affect the Windows world too consumer pressure will
> hopefully rectify this problem.
>
> --cw
>
> P.S. Write-caching in hard-drives is insanely dangerous for
> journalling filesystems and can result in all sorts of nasties.
> I recommend people turn this off in their init scripts (perhaps I
> will send a patch for the kernel to do this on boot, I just
> wonder if it will eat some drives).

--
David Balazic
--------------
"Be excellent to each other." - Bill S. Preston, Esq., & "Ted" Theodore Logan
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

2001-11-28 17:34:27

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Tuesday 27 November 2001 18:35, Andreas Bombe wrote:
> On Mon, Nov 26, 2001 at 07:19:54PM -0500, Rob Landley wrote:
> > Now a journal track that's next to where the head parks could combine the
> > "park" sweep with that one seek, and presumably be spring powered and
> > hence save capacitor power. But I'm not 100% certain it would be worth
> > it.
>
> When time if of essence it should be worth it (drive makers will use the
> smallest possible capacitor, of course). Given that current 7200 RPM
> disks have marketed seek times of 8 or 9 ms worst case seeks can be much
> longer.
>
> That 8ms is average and likely read seeks are weighted higher than write

Sure. The time to seek halfway across the disk, probably.

> seeks. Writes have to be exact, but reads can be seeked sloppier
> (without waiting for the head to stop oscillating after braking) and
> error correction will take care of the rest. This would gives us what
> in worst case? 15ms (just a guess)?

I'd been thinking more like 20, but it really depends on the manufacturer.
(And fun little detail, faster seeks can take MORE power, driving the coil
thingy harder...)

> A journal track could be near parking track and have directly adjacent
> tracks left free to allow for slightly sloppier/faster seeking. An
> expert could probably tell us whether this is complete BS or even
> feasible.
>
> > (Are
> > normal with-power-on seeks towards the park area powered by the spring,
> > or the... I keep wanting to say "stepper motor" but I don't think those
> > are what drives use anymore, are they? Sigh...)
>
> A simple spring is too slow, I guess. Also, it should not be so hard
> that it would slow down seeks against the spring.

I.E. they've already dealt with this problem in existing designs that use
some variant of a spring to park, this is Not Our Problem.

No, the "not worth it" above, in addition to the extra logic to unjournal the
stuff on the next boot (and possibly lose power again during bootup and
hopefully not wind up with a brick) , is that the platter slows down if you
don't keep it spinning. If the spring is seeking slowly, the capacitor has
to keep the platter spinning longer, which could easily eat the power you're
trying to avoid seeking with. Add in the extra complexity and it doesn't
seem worth it, but that's for the lab guys to decide with measurements...

Oh, and one other fun detail. One reason I don't like the "battery backed up
SRAM cache", apart from being another way the disk dies of old age, is that
it doesn't fix the "we lost power in the middle of writing a sector, so we
just created a CRC error on the disk" problem, which is what started this
thread.

If you're going to fix THAT (which you seem to need a capacitor to do
anyway), then you might as well do it right.

Rob

2001-11-28 19:55:39

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Tue, 27 Nov 2001, Rob Landley wrote:

> On Tuesday 27 November 2001 11:50, Matthias Andree wrote:
> > Note, the power must RELIABLY last until all of the data has been
> > writen, which includes reassigning, seeking and the like, just don't do
> > it if you cannot get a real solution.
>
> A) At most 1 seek to a track other than the one you're on.

Not really, assuming drives don't write to multiple heads concurrently,
2 MB hardly fit on a track. We can assume several hundred sectors, say
1,000, so we need four track writes, four verifies, and not a single
block may be broken. We need even more time if we need to rewrite.

> That's it. No more buffer than does good at the hardware level for request
> merging and minimizing seek latency. Any buffering over and above that is
> the operating system's job.

Effectively, that's what tagged command queueing is all about, send a
batch of requests that can be acknowledged individually and possibly out
of order (which can lead to a trivial write barrier as suggested
elsewhere, because all you do is wait with scheduling until the disk is
idle, then send the past-the-barrier block).

> (Relocating bad sectors breaks this, but not fatally. It causes extra seeks
> in linear writes anyway where the elevator ISN'T involved, so you've already
> GOT a performance hit.

On modern drives, bad sectors are reassigned within the same track to
evade seeks for a single bad block. If the spare block area within that
track is exhausted, bad luck, you're going to seek.

> The advantage of limiting the amount of data buffered to current track plus
> one other is you have a fixed amount of work to do on a loss of power. One
> seek, two track writes, and a spring-driven park. The amount of power this
> takes has a deterministic upper bound. THAT is why you block before
> accepting more data than that.

It has not, you don't know in advance how many blocks on your journal
track are bad.

> You dont' need several seconds. You need MILISECONDS. Two track writes and
> one seek. This is why you don't accept more data than that before blocking.

No, you must verify the write, so that's one seek (say 35 ms, slow
drive ;) and two revolutions per track at least, and, as shown, more
than one track usually, so any bets of upper bounds are off. In the
average case, say 70 ms should suffice, but in adverse conditions, that
does not suffice at all. If writing the journal in the end fails because
power is failing, the data is lost, so nothing is gained.

> under 50 miliseconds. Your huge ram cache is there for reads. For writes
> you don't accept more than you can reliably flush if you want anything
> approaching reliability.

Well, that's the point, you don't know in advance how reliable your
journal track is. Worst case means: you need to consume every single
spare block until the cache is flushed. Your point about write caching
is valid, and IBM documentation for DTLA drives (minus their apparent
other issues) declares that the write cache will be ignored when the
spare block count is low.

> such fun things. And in a desktop environment, spilled sodas.) Currently,
> there are drives out there that stop writing a sector in the middle, leaving
> a bad CRC at the hardware level. This isn't exactly graceful. At the other
> end, drives with huge caches discard the contents of cache which a journaling
> filesystem thinks are already on disk. This isn't graceful either.

No-one said bad things cannot happen, but that is what actually happens.
Where we started from, fsck would be able to "repair" a bad block by
just zeroing and writing it, data that used to be there will be lost
after short write anyhow.

> If a block goes bad WHILE power is failing, you're screwed. This is just a
> touch unlikely. It will happen to somebody out there someday, sure. So will
> alpha particle decay corrupting a sector that was long ago written to the
> drive correctly. Designing for that is not practical. Recovering after the
> fact might be, but that doesn't mean you get your data back.

Alpha particles still need to fight against inner (bit-wise) and outer
(symbol- and blockwise) error correction codes, and Alpha particles
don't usually move Bloch walls or get near the coercivity otherwise.
We're talking about magnetic media, not E?PROMs or something.

Assuming that write errors on an emergency cache flush just won't happen
is just as wrong as assuming 640 kB will suffice or there's an upper
bound of write time. You just don't know.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-11-28 21:47:58

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

This is wandering far enough off topic that I'm not going to CC l-k after
this message...

On Wednesday 28 November 2001 13:43, Matthias Andree wrote:
> On Tue, 27 Nov 2001, Rob Landley wrote:
> > On Tuesday 27 November 2001 11:50, Matthias Andree wrote:
> > > Note, the power must RELIABLY last until all of the data has been
> > > writen, which includes reassigning, seeking and the like, just don't do
> > > it if you cannot get a real solution.
> >
> > A) At most 1 seek to a track other than the one you're on.
>
> Not really, assuming drives don't write to multiple heads concurrently,

Not my area of expertise. Depends how cheap they're being, I'd guess.
Writing multiple tracks concurrently is probably more of a current drain than
writing a single track at a time anyway, by the way.

> 2 MB hardly fit on a track. We can assume several hundred sectors, say
> 1,000, so we need four track writes, four verifies, and not a single
> block may be broken. We need even more time if we need to rewrite.

A 7200 RPM drive does 120 RPS, which means one revolution is 8.3 miliseconds.
We're still talking a deterministic number of miliseconds with a
double-digit total.

And again, it depends on how you define "track". If you talk about the two
tracks you can buffer as living on seperate sides of platters you can't write
to concurrently (not necessarily separated by a seek), then there is still no
problem. (After the first track writes and it starts on the second track,
the system still has 8.3 ms later to buffer another track before it drops
below full writing speed.

It's all a question of limiting how much you buffer to what you can flush
out. Artificial objections about "I could have 8 zillion platters I can only
write to one at a time" just means you're buffering too much to write out
then.

> > That's it. No more buffer than does good at the hardware level for
> > request merging and minimizing seek latency. Any buffering over and
> > above that is the operating system's job.
>
> Effectively, that's what tagged command queueing is all about, send a
> batch of requests that can be acknowledged individually and possibly out
> of order (which can lead to a trivial write barrier as suggested
> elsewhere, because all you do is wait with scheduling until the disk is
> idle, then send the past-the-barrier block).

Doesn't stop the "die in the middle of a write=crc error" problem. And I'm
not saying tagged command queueing is a bad idea, I'm just saying the idea's
been out there forever and not everybody's done it yet, and this is a
potentially simpler alternative focusing on the minimal duct-tape approach to
reliability by reducing the level of guarantees you have to make.

> > (Relocating bad sectors breaks this, but not fatally. It causes extra
> > seeks in linear writes anyway where the elevator ISN'T involved, so
> > you've already GOT a performance hit.
>
> On modern drives, bad sectors are reassigned within the same track to
> evade seeks for a single bad block. If the spare block area within that
> track is exhausted, bad luck, you're going to seek.

Cool then.

> > The advantage of limiting the amount of data buffered to current track
> > plus one other is you have a fixed amount of work to do on a loss of
> > power. One seek, two track writes, and a spring-driven park. The amount
> > of power this takes has a deterministic upper bound. THAT is why you
> > block before accepting more data than that.
>
> It has not, you don't know in advance how many blocks on your journal
> track are bad.

Another reason to not worry about an explicit dedicatedjournal track and just
buffer one extra normal data track and budget in the power for a seek to it
if necessary.

There are circumstances where this will break down, sure. Any disk that has
enough bad sectors on it will stop working. But that shouldn't be the normal
case on a fresh drive, as is happening now with IBM.

> > You dont' need several seconds. You need MILISECONDS. Two track writes
> > and one seek. This is why you don't accept more data than that before
> > blocking.
>
> No, you must verify the write, so that's one seek (say 35 ms, slow
> drive ;) and two revolutions per track at least, and, as shown, more
> than one track usually

So don't buffer 4 tracks and call it one track. That's an artificial
objection.

An extra revolution is less than a seek, and noticeably less in power terms.

>, so any bets of upper bounds are off. In the
> average case, say 70 ms should suffice, but in adverse conditions, that
> does not suffice at all. If writing the journal in the end fails because
> power is failing, the data is lost, so nothing is gained.
>
> > under 50 miliseconds. Your huge ram cache is there for reads. For
> > writes you don't accept more than you can reliably flush if you want
> > anything approaching reliability.
>
> Well, that's the point, you don't know in advance how reliable your
> journal track is.

We don't knkow in advance that the drive won't fail completely due to
excessive bad blocks. I'm trying to move the failure point, not pretending
to eliminate it. Right now we've got something that could easily take out
multiple drives in a RAID 5, and something that desktop users are likely to
see more noticeably more often than they upgrade their system.

> > such fun things. And in a desktop environment, spilled sodas.)
> > Currently, there are drives out there that stop writing a sector in the
> > middle, leaving a bad CRC at the hardware level. This isn't exactly
> > graceful. At the other end, drives with huge caches discard the contents
> > of cache which a journaling filesystem thinks are already on disk. This
> > isn't graceful either.
>
> No-one said bad things cannot happen, but that is what actually happens.
> Where we started from, fsck would be able to "repair" a bad block by
> just zeroing and writing it, data that used to be there will be lost
> after short write anyhow.

Assuming the drive's inherent bad-block detection mechanisms don't find it
and remap it on a read first, rapidly consuming the spare block reserve. But
that's a firmware problem...

> Assuming that write errors on an emergency cache flush just won't happen
> is just as wrong as assuming 640 kB will suffice or there's an upper
> bound of write time. You just don't know.

I don't assume they won't happen. They're actually more LIKELY to happen as
the power level gradually drops as the capacitor discharges. I'm just saying
there's a point beyond which any given system can't recover, and a point of
diminishing returns trying to fix things.

I'm proposing a cheap and easy improvement over the current system. I'm not
proposing a system hardened to military specifications, just something that
shouldn't fail noticeably for the majority of its users on a regular basis.
(Powering down without flushing the cache is a bad thing. It shouldn't
happen often. This is a last ditch deal-with-evil safety net system that has
a fairly good chance of saving the data without extensively redesigning the
whole system. Never said it was perfect. If a "1 in 2" failure rate drops
to "1 in 100,000", it'll still hit people. But it's a distinct improvement.
Maybe it can be improved beyond that. That would be nice. What's the
effort, expense, and inconvenience involved?)

Rob

2001-11-28 22:19:51

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Wed, 28 Nov 2001, Rob Landley wrote:

> Not my area of expertise. Depends how cheap they're being, I'd guess.
> Writing multiple tracks concurrently is probably more of a current drain than
> writing a single track at a time anyway, by the way.

Yes, and you need multiple write amplifiers and programmable filters
(remember we do zoned recording nowadays) rather than just a set of
switches.

> > Effectively, that's what tagged command queueing is all about, send a
> > batch of requests that can be acknowledged individually and possibly out
> > of order (which can lead to a trivial write barrier as suggested
> > elsewhere, because all you do is wait with scheduling until the disk is
> > idle, then send the past-the-barrier block).
>
> Doesn't stop the "die in the middle of a write=crc error" problem. And I'm

Not quite, but once you start journalling the buffer you can also write
tag data -- or you know to discard the journal block when it has a CRC
error and just rewrite it.

> been out there forever and not everybody's done it yet, and this is a
> potentially simpler alternative focusing on the minimal duct-tape approach to
> reliability by reducing the level of guarantees you have to make.

Yup.

> > On modern drives, bad sectors are reassigned within the same track to
> > evade seeks for a single bad block. If the spare block area within that
> > track is exhausted, bad luck, you're going to seek.
>
> Cool then.

I did a complete read-only benchmark of an old IBM DCAS which had like
300 grown defects and which I low-level formatted. Around the errors, it
would seek, and the otherwise good performance would drop to the floor
almost. Not sure whether that already had a strategy similar to that of
the DTLAs or just too many blocks went boom.

> Assuming the drive's inherent bad-block detection mechanisms don't find it
> and remap it on a read first, rapidly consuming the spare block reserve. But
> that's a firmware problem...

Drives should never reassign blocks on read operations, because they'd
take away the chance to try to read that block for say four hours.

> I'm proposing a cheap and easy improvement over the current system. I'm not
> proposing a system hardened to military specifications, just something that
> shouldn't fail noticeably for the majority of its users on a regular basis.
> (Powering down without flushing the cache is a bad thing. It shouldn't
> happen often. This is a last ditch deal-with-evil safety net system that has
> a fairly good chance of saving the data without extensively redesigning the
> whole system. Never said it was perfect. If a "1 in 2" failure rate drops
> to "1 in 100,000", it'll still hit people. But it's a distinct improvement.
> Maybe it can be improved beyond that. That would be nice. What's the
> effort, expense, and inconvenience involved?)

As always, the first 90% to perfection consume 10% of the efforts, but
the last 10% to perfection consume the other 90% of the efforts :-)

I'm just proposing to make sure that the margin is not too narrow when
you're writing your last blocks to the disk when you know power is
failing. I'm still wondering if flash memory is really more effort than
saving all the energy to keep this expensive mechanics going properly.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-11-28 23:26:53

by Frank de Lange

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Mon, 26 Nov 2001, Richard B. Johnson wrote:

> It isn't that easy! Any kind of power storage within the drive would
> have to be isolated with diodes so that it doesn't try to run your
> motherboard as well as the drive. This means that +5 volt logic supply
> would now be 5.0 - 0.6 = 4.4 volts at the drive, well below the design
> voltage. Use of a Schottky diode (0.34 volts) would help somewhat, but you
> have now narrowed the normal power design-margin by 90 percent, not good.

Another interesting possibility would be to use the momentum of the spinning
platters and motor assembly to power the drive electronics, simply by using the
motor as a generator. When power fails during a write, use the current
generated by the motor to finish the write.

Just a wild idea...

Cheers//Frank
--
WWWWW _______________________
## o o\ / Frank de Lange \
}# \| / \
##---# _/ <Hacker for Hire> \
#### \ +31-320-252965 /
\ [email protected] /
-------------------------
[ "Omnis enim res, quae dando non deficit, dum habetur
et non datur, nondum habetur, quomodo habenda est." ]

2001-11-29 01:52:28

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Thu, 29 Nov 2001, Frank de Lange wrote:

> Another interesting possibility would be to use the momentum of the spinning
> platters and motor assembly to power the drive electronics, simply by using the
> motor as a generator. When power fails during a write, use the current
> generated by the motor to finish the write.

Non-trivial because you'd have to adjust filters and code rate to the
actual platter speed (which goes down as energy is drained and moved
elsewhere) and design that beast to actually be a generator.

Probably too much development effort for a situation which should not
happen too often.

2001-12-01 10:40:13

by Pavel Machek

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Hi!

> > Assuming the drive's inherent bad-block detection mechanisms don't find it
> > and remap it on a read first, rapidly consuming the spare block reserve. But
> > that's a firmware problem...
>
> Drives should never reassign blocks on read operations, because they'd
> take away the chance to try to read that block for say four hours.

Why not? If drive gets ECC-correctable read error, it seems to me like
good time to reassign.
Pavel
--
"I do not steal MS software. It is not worth it."
-- Pavel Kankovsky

2001-12-02 20:30:53

by Pavel Machek

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Hi!

> > "Power off during write operations may make an incomplete sector which
> > will report hard data error when read. The sector can be recovered by a
> > rewrite operation."
>
> So the proper defect management would be to simply initialize the broken
> sector once a fsck hits it (still, I've never seen disks develop so many
> bad blocks so quickly as those failed DTLA-307045 drives I had).
>
> Note, the specifications say that the write cache setting is ignored
> when the drive runs out of spare blocks for reassignment after defects
> (so that the drive can return the error code right away when it cannot
> guarantee the write actually goes to disk).

They should turn off write-back after number-of-spare-block < cache-size,
otherwise they are not safe.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2001-12-01 10:51:33

by Jeff V. Merkey

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

Check out the hotfixing code in NWFS. It handles exactly what this
long and drawn out thread has discussed, and it's already in
Linux. The code is contained in nwvp.c. I can tell you
that in the past three years of running NWFS on Linux and all
the time I worked at Novell from about 1996 on, I never once saw
a server hotfix data after the newer "data guard" drive technologies
came out. In fact, by default, I make the hotfix area on the drive
about .1 % of the total space, since it;s probably just wasted
space at this point.

It's just wasted space these days, but it is a good idea to keep it
around, just in case the "pointless" argument turns out not to
be pointless and someone gets eaten by a shark (1 in 100,000,000) at
the same instant they are struck by lightening (1 in 200,000,000).

:-)
Jeff

On Thu, Nov 29, 2001 at 11:21:57PM +0100, Pavel Machek wrote:
> Hi!
>
> > > Assuming the drive's inherent bad-block detection mechanisms don't find it
> > > and remap it on a read first, rapidly consuming the spare block reserve. But
> > > that's a firmware problem...
> >
> > Drives should never reassign blocks on read operations, because they'd
> > take away the chance to try to read that block for say four hours.
>
> Why not? If drive gets ECC-correctable read error, it seems to me like
> good time to reassign.
> Pavel
> --
> "I do not steal MS software. It is not worth it."
> -- Pavel Kankovsky
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2001-12-02 00:08:42

[permalink] [raw]

Subject: Re: Journaling pointless with today's hard disks?

On Thu, 29 Nov 2001, Pavel Machek wrote:

> > Drives should never reassign blocks on read operations, because they'd
> > take away the chance to try to read that block for say four hours.
>
> Why not? If drive gets ECC-correctable read error, it seems to me like
> good time to reassign.

Because you don't know if it's just some slipped bits, a shutdown during
write, or an actual fault. When that happens on a verify after write,
that's indeed reasonable. Otherwise the drive should just mark that
block as "watch closely on next write".

2001-12-04 00:21:03