Hello all,
While investigating how various disks handle power-loss during writes, I
came across something *very* strange.
It seems that
*) Either the disk writes backwards (no I don't believe that)
*) Or the kernel is writing 256 B blocks (AFAIK it can't)
*) The disk has some internal magic that causes a power-loss during
a full block write to leave the first half of the block intact with
old data, and update the second half of a block correctly with new
data. (And I don't believe that either).
The scenario is: I wrote a program that will write a 50 MB block with
O_SYNC to /dev/hdc. The block is full of 32-bit integers, initialized
to 0. For every full block write (the block is written with one single
write() call), the integers are incremented once.
So first I have 50 MB of 0's. Then 50 MB of 1's. etc.
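For anyone who wants to reproduce this, here is a minimal sketch of such a writer. It is not the original test program (which was not posted); the function name write_generations, its parameters, the little-endian layout and the choice of Python are all assumptions made for illustration. Pointing it at a real device destroys whatever is stored there.

```python
import os
import struct


def write_generations(path, size_bytes, passes):
    """Rewrite the first size_bytes of `path`, `passes` times over.

    Pass n fills the region with the 32-bit little-endian integer n.
    Hypothetical helper for illustration, not the original test program.
    """
    if size_bytes % 4:
        raise ValueError("size_bytes must be a multiple of 4")
    # O_SYNC asks the kernel to return from write() only once the data
    # has been handed to the device; fall back to 0 where it is undefined.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        for value in range(passes):
            buf = struct.pack("<I", value) * (size_bytes // 4)
            os.lseek(fd, 0, os.SEEK_SET)
            view = memoryview(buf)
            # Normally one write() takes the whole buffer, but the call
            # may accept fewer bytes than requested, so loop until done.
            while view:
                written = os.write(fd, view)
                view = view[written:]
    finally:
        os.close(fd)
```

A call such as write_generations("/dev/hdc", 50 * 1024 * 1024, 1000) would approximate the run described here, with the power pulled somewhere in the middle.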
During this write cycle, I pull the power cable. I get the machine
back online and I dump the 50 MB block.
What I found was a 50 MB block holding:
11668992 times "0x00000002"
231168 times "0x00000003"
1174528 times "0x00000002"
32512 times "0x00000003"
Please note that 32512 is *not* a multiple of 512. And please note that
the 3's are written *after* the 2's, so actually there is a 512 byte
block on the disk which contains 2's in the first half, and 3's in the
second half!
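A summary like the one above can be produced from the dump with a short script. The sketch below is only an illustration: the names run_lengths and mixed_sectors, the 512-byte sector size and the little-endian layout are assumptions. It counts runs in 32-bit integers, and separately lists every sector that holds more than one distinct value, which is the direct way to check for a sector that was only partly updated.

```python
import struct
from itertools import groupby

SECTOR_BYTES = 512  # assumed sector size


def run_lengths(data):
    """Return [(value, count), ...] for runs of equal 32-bit LE integers."""
    usable = len(data) - len(data) % 4  # ignore a trailing partial integer
    words = struct.iter_unpack("<I", data[:usable])
    return [(key[0], sum(1 for _ in group)) for key, group in groupby(words)]


def mixed_sectors(data, sector_bytes=SECTOR_BYTES):
    """Return the indices of sectors holding more than one distinct value."""
    mixed = []
    for start in range(0, len(data) - sector_bytes + 1, sector_bytes):
        sector = data[start:start + sector_bytes]
        words = struct.unpack("<%dI" % (sector_bytes // 4), sector)
        if len(set(words)) > 1:
            mixed.append(start // sector_bytes)
    return mixed
```

Reading the first 50 MB of the device into a bytes object and passing it to both functions gives the run summary and the list of suspect sectors.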
How on earth could that happen ?
Why does the kernel not write from beginning to end ? Or why doesn't
the disk ?
And does the elevator cause the writes to be shuffled around like that -
I would have expected the kernel to write from beginning to end every
single time...
The kernel is 2.4.18 on some i686 box
The disk is a Quantum Fireball 1GB IDE (from way back then ;)
The IDE chipset is an I820 Camino 2
I can submit the test program or do further tests, if anyone is
interested.
Thank you,
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
On Mon, 2002-08-05 at 19:49, Jakob Oestergaard wrote:
> *) Either the disk writes backwards (no I don't believe that)
> *) Or the kernel is writing 256 B blocks (AFAIK it can't)
> *) The disk has some internal magic that causes a power-loss during
> a full block write to leave the first half of the block intact with
> old data, and update the second half of a block correctly with new
> data. (And I don't believe that either).
You forgot to add
*) or the disk internal logic bears no resemblance to the antiquated API
it fakes for the convenience of interface hardware and software
Linux also won't necessarily do write outs in order.
On Mon, Aug 05, 2002 at 09:17:12PM +0100, Alan Cox wrote:
> On Mon, 2002-08-05 at 19:49, Jakob Oestergaard wrote:
> > *) Either the disk writes backwards (no I don't believe that)
> > *) Or the kernel is writing 256 B blocks (AFAIK it can't)
> > *) The disk has some internal magic that causes a power-loss during
> > a full block write to leave the first half of the block intact with
> > old data, and update the second half of a block correctly with new
> > data. (And I don't believe that either).
>
> You forgot to add
>
> *) or the disk internal logic bears no resemblance to the antiquated API
> it fakes for the convenience of interface hardware and software
Fair enough - that seems like a reasonable explanation.
On a side note - what guarantees does one have ? Any pointers to papers
or other material about this ?
For example, when updating a 3 to a 4 on the disk, could I end up with a
7 ? (having 00000011 on the platter, starting a write of 00000100, but
after the one high bit has been written power fails and I now have 00000111).
The above example is simple - I doubt that it would happen - but how
much can and cannot happen ? I bet the Phase Tree (Tux2) people must
have thought about this at some point... I haven't had much luck with
Google on this one...
>
> Linux also won't necessarily do write outs in order.
But in this case, I wonder why ?
It's one huge sequential write, from the beginning of a device and 50 MB
onwards. The write is submitted in one single write() every single
time. Why start going semi-backwards and chopping things up ?
I'm *very* certain that Linux does this non-sequentially: the disk
might be causing the half-block oddity, which really surprised me,
but it is certainly not caching 20 MB of data internally.
Is this an elevator deficiency in 2.4.18, or am I just moaning for no
reason at all ? ;)
Thanks for the quick reply !
Cheers,
--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:
Jakob Oestergaard wrote:
>
> I'm *very* certain that Linux does this non-sequentially: the disk
> might be causing the half-block oddity, which really surprised me,
> but it is certainly not caching 20 MB of data internally.
Maybe you shouldn't consider the power failure as happening at one
single point in time, but rather as happening over a short period of
time. Maybe it is possible that, during this period, there is at some
moments enough power to actually write to the disk, and at other
moments there is not.
I think it should be possible for the firmware on a good disk to
prevent such artifacts. But I think you can find disks that just
keep trying to write even while power is failing.
--
Kasper Dupont -- der bruger for meget tid på usenet.
For sending spam use mailto:[email protected]
or mailto:[email protected]
Kasper Dupont wrote:
> I think it should be possible for the firmware on a good disk to
> prevent such artifacts. But I think you can find disks that just
> keep trying to write even while power is failing.
>
That might give you some (sub)blocks out of order, if the disk
firmware falsely believes that it is free to reorder anything
that reaches its internal cache. Writing to the bitter end
will turn at least one block to garbage as the write current fails
in the middle.
Alan Cox wrote:
> *) or the disk internal logic bears no resemblance to the antiquated API
> it fakes for the convenience of interface hardware and software
One may then wonder if journalling is a safe thing to do,
with out-of-order writes exposed by a power failure...
Helge Hafting
On Monday 05 August 2002 21:49 pm, Jakob Oestergaard wrote:
> Hello all,
>
> While investigating how various disks handle power-loss during writes, I
> came across something *very* strange.
>
> It seems that
>
> *) Either the disk writes backwards (no I don't believe that)
> *) Or the kernel is writing 256 B blocks (AFAIK it can't)
> *) The disk has some internal magic that causes a power-loss during
> a full block write to leave the first half of the block intact with
> old data, and update the second half of a block correctly with new
> data. (And I don't believe that either).
>
> The scenario is: I wrote a program that will write a 50 MB block with
> O_SYNC to /dev/hdc. The block is full of 32-bit integers, initialized
> to 0. For every full block write (the block is written with one single
> write() call), the integers are incremented once.
>
> So first I have 50 MB of 0's. Then 50 MB of 1's. etc.
>
> During this write cycle, I pull the power cable. I get the machine
> back online and I dump the 50 MB block.
>
> What I found was a 50 MB block holding:
> 11668992 times "0x00000002"
> 231168 times "0x00000003"
> 1174528 times "0x00000002"
> 32512 times "0x00000003"
>
> Please note that 32512 is *not* a multiple of 512. And please note that
> the 3's are written *after* the 2's, so actually there is a 512 byte
> block on the disk which contains 2's in the first half, and 3's in the
> second half!
Integers are 32 bit, so a 512 byte disk block contains 128 such integers...
Indeed, all the values above are divisible by 128, so you have:
11668992/128 = 91164 blocks of "0x00000002"
231168/128 = 1806 blocks of "0x00000003"
1174528/128 = 9176 blocks of "0x00000002"
32512/128 = 254 blocks of "0x00000003"
This neither proves nor disproves anything about your
main concern, that writes are non-atomic at the block level.
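The division above is easy to check mechanically. The following is just a sketch of that arithmetic; the helper name runs_in_sectors is made up, and the only inputs are the four run lengths quoted from the dump and the assumption of 512-byte sectors. It also confirms that the runs add up to exactly 50 MB.

```python
INTS_PER_SECTOR = 512 // 4  # 128 four-byte integers per 512-byte sector

# (value, number of consecutive integers) as reported in the dump
runs = [(2, 11668992), (3, 231168), (2, 1174528), (3, 32512)]


def runs_in_sectors(runs, ints_per_sector=INTS_PER_SECTOR):
    """Convert (value, integer_count) runs to (value, sectors, leftover)."""
    return [(v, n // ints_per_sector, n % ints_per_sector) for v, n in runs]


for value, sectors, leftover in runs_in_sectors(runs):
    print("0x%08x: %d sectors, %d integers left over" % (value, sectors, leftover))

total_bytes = 4 * sum(n for _, n in runs)
print("total is 50 MB:", total_bytes == 50 * 1024 * 1024)
```

Every leftover is zero, and since the runs are counted from offset 0, each change of value falls on a sector boundary.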
>
> How on earth could that happen ?
>
> Why does the kernel not write from beginning to end ? Or why doesn't
> the disk ?
>
> And does the elevator cause the writes to be shuffled around like that -
> I would have expected the kernel to write from beginning to end every
> single time...
>
I would not expect writes to be in order.
A simple elevator algorithm could write fragments (cylinder sized?)
in reverse order. On-disk write scheduling could start writing at any
sector (to minimize rotational latency).
Knowing the disk geometry and parameters could help with understanding
your results.
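To illustrate how out-of-order write-out can produce an alternating pattern, here is a toy model. It is purely hypothetical: the function simulate_power_cut, the chunk size and the issue order are invented for the example and do not describe the real 2.4.18 elevator or the firmware of any disk.

```python
from itertools import groupby


def simulate_power_cut(total_sectors, chunk_order, chunk_sectors,
                       sectors_done, old=2, new=3):
    """Return [(value, sectors), ...] after an interrupted rewrite.

    chunk_order lists chunk indices in the order a (hypothetical)
    scheduler issues them; within a chunk, sectors are written in
    ascending order. Power fails after sectors_done sectors.
    """
    image = [old] * total_sectors
    remaining = sectors_done
    for chunk in chunk_order:
        start = chunk * chunk_sectors
        end = min(start + chunk_sectors, total_sectors)
        for sector in range(start, end):
            if remaining == 0:
                break
            image[sector] = new
            remaining -= 1
        if remaining == 0:
            break
    return [(v, sum(1 for _ in group)) for v, group in groupby(image)]


# 16 sectors in chunks of 4, issued last chunk first, power lost after 6
print(simulate_power_cut(16, [3, 2, 1, 0], 4, 6))
# -> [(2, 8), (3, 2), (2, 2), (3, 4)]
```

With the chunks issued in reverse, the interrupted pass leaves old, new, old, new runs, the same shape as the dump, while issuing them in order [0, 1, 2, 3] leaves a single new run followed by a single old one.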
> The kernel is 2.4.18 on some i686 box
> The disk is a Quantum Fireball 1GB IDE (from way back then ;)
> The IDE chipset is an I820 Camino 2
>
> I can submit the test program or do further tests, if anyone is
> interested.
>
> Thank you,
-- Itai