2002-09-24 07:19:45

by Jakob Oestergaard

[permalink] [raw]
Subject: ReiserFS buglet


First:

In linux-2.4.19, I found the following:

fs/reiserfs/super.c:707

s->s_blocksize = sb_blocksize(rs);
s->s_blocksize_bits = 0;
while ((1 << s->s_blocksize_bits) != s->s_blocksize)
s->s_blocksize_bits ++;


What happens if there's a bit-flip on the disk so that s->s_blocksize is
not a power of two ?

I would suggest replacing the '!=' with a '<' in the while loop and
adding a sanity check afterwards.

After all, a single bit-flip in the root fs superblock would cause the
system to hang silently (but spinning really fast!) at boot.


Second:

As I see it, the ReiserFS journal has the same problems as jbd wrt. to
atomicity of write operations of indexes. Please see my recent mail
about the jbd problems.


--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:


2002-09-24 09:16:04

by Oleg Drokin

[permalink] [raw]
Subject: Re: ReiserFS buglet

Hello!

On Tue, Sep 24, 2002 at 09:24:55AM +0200, Jakob Oestergaard wrote:

> In linux-2.4.19, I found the following:
> fs/reiserfs/super.c:707
> s->s_blocksize = sb_blocksize(rs);
> s->s_blocksize_bits = 0;
> while ((1 << s->s_blocksize_bits) != s->s_blocksize)
> s->s_blocksize_bits ++;
> What happens if there's a bit-flip on the disk so that s->s_blocksize is
> not a power of two ?

FS will refuse to mount. Or kernel will panic (depends on some stuff).
If s->s_blocksize is zeroed, kernel will hang, as you correctly noticed.

> I would suggest replacing the '!=' with a '<' in the while loop and
> adding a sanity check afterwards.

What if overheated CPU will cause a bitflip exactly after such checks?
You cannot protect against broken hardware. Such problems should be
fixed by fsck.

> As I see it, the ReiserFS journal has the same problems as jbd wrt. to
> atomicity of write operations of indexes. Please see my recent mail
> about the jbd problems.

journal header in reiserfs only occupies first 20 bytes of the block,
since this fells within 1st 512 bytes hardware sector, it will be written
atomically, I presume.

Bye,
Oleg

2002-09-24 09:22:09

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: ReiserFS buglet

On Tue, Sep 24, 2002 at 01:21:10PM +0400, Oleg Drokin wrote:
> Hello!
>
...
>
> > I would suggest replacing the '!=' with a '<' in the while loop and
> > adding a sanity check afterwards.
>
> What if overheated CPU will cause a bitflip exactly after such checks?
> You cannot protect against broken hardware. Such problems should be
> fixed by fsck.

Disk errors are common. Software can also flip that bit.

But I agree that it shouldn't be something that commonly happens, and I
suppose it is acceptable for a filesystem to barf if it has been
corrupted by malicious software.

>
> > As I see it, the ReiserFS journal has the same problems as jbd wrt. to
> > atomicity of write operations of indexes. Please see my recent mail
> > about the jbd problems.
>
> journal header in reiserfs only occupies first 20 bytes of the block,
> since this fells within 1st 512 bytes hardware sector, it will be written
> atomically, I presume.

You presume wrong.

I posted to LKML about a month ago with some questions regarding exactly
this issue. I had a disk that worked on 128 byte atomic writes - a
standard IDE disk.

The conclusion was something like "we know jack about the disk's
internal logic" so we need consistency measures instead of relying on
anything from the disk.

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-09-24 09:37:13

by Hans Reiser

[permalink] [raw]
Subject: Re: ReiserFS buglet

Jakob Oestergaard wrote:

>
>
>You presume wrong.
>
>I posted to LKML about a month ago with some questions regarding exactly
>this issue. I had a disk that worked on 128 byte atomic writes - a
>standard IDE disk.
>
>The conclusion was something like "we know jack about the disk's
>internal logic" so we need consistency measures instead of relying on
>anything from the disk.
>
>
>
What was the model and brand?


2002-09-24 09:42:17

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: ReiserFS buglet

On Tue, Sep 24, 2002 at 01:42:25PM +0400, Hans Reiser wrote:
> Jakob Oestergaard wrote:
>
..
> >
> >
> What was the model and brand?
>

Quantum Fireball 1GB IDE

Old one, I know :) However, I doubt that newer disks are emulating
antiquated IDE closer than this one... I guess we will see stranger
write patterns on disks the more modern they get.

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-09-24 09:43:06

by Oleg Drokin

[permalink] [raw]
Subject: Re: ReiserFS buglet

Hello!

On Tue, Sep 24, 2002 at 11:27:20AM +0200, Jakob Oestergaard wrote:
> > > I would suggest replacing the '!=' with a '<' in the while loop and
> > > adding a sanity check afterwards.
> > What if overheated CPU will cause a bitflip exactly after such checks?
> > You cannot protect against broken hardware. Such problems should be
> > fixed by fsck.
> Disk errors are common. Software can also flip that bit.

Not only disk errors are common, but also CPU/memory/chipset/wiring errors are.

> > > As I see it, the ReiserFS journal has the same problems as jbd wrt. to
> > > atomicity of write operations of indexes. Please see my recent mail
> > > about the jbd problems.
> > journal header in reiserfs only occupies first 20 bytes of the block,
> > since this fells within 1st 512 bytes hardware sector, it will be written
> > atomically, I presume.
> You presume wrong.

Oh, I missed your original email, thanks for noticing me.

> I posted to LKML about a month ago with some questions regarding exactly
> this issue. I had a disk that worked on 128 byte atomic writes - a
> standard IDE disk.

Hm. This is still much larger than 20 bytes we use.

> The conclusion was something like "we know jack about the disk's
> internal logic" so we need consistency measures instead of relying on
> anything from the disk.

Actually we submit data to disk in 512 byte chunks (4k blocksize case),
and disk should not write any data before it receives all of it and
checks the integrity (hm, this is only true for UDMA, though.)
Still I do not see why any sane disk would start to write a sector before fully
receiving new sector's content (thinking of disk drives of course, solid state
stuff should take its own measures in this direction too).
This is even more insane than ACKing data and putting it in not battery
backed cache to be lost on power loss (Yes, I know this is a common
practice now. At least there is a way either to turn such feature off
or to flush a cache on demand).

Thanks for bringing our attention to such issues, though changing disk format
is our of questions for reiser3 now, some kind of verifying single instance
on-disk structures may/will be incorporated in reiser4.

Bye,
Oleg

2002-09-24 09:58:28

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: ReiserFS buglet

On Tue, Sep 24, 2002 at 01:48:16PM +0400, Oleg Drokin wrote:
> Hello!
...
> > Disk errors are common. Software can also flip that bit.
>
> Not only disk errors are common, but also CPU/memory/chipset/wiring errors are.

It's a question of which errors one wishes to handle, and which you
simply choose to ignore.

It's a compromise, and I understand that.

For example, BK uses checksums on all it's files (AFAIK). This allows
you to at least discover hardware errors. CVS doesn't - but CVS is
still "good enough" for a lot of people.

Some hardware problems cannot be detected, much less recovered from,
without adding some significant cost (run-time performance wise,
complexity wise, or ...).

This problem, however, you can both detect and recover from perfectly,
with little extra effort. Whether you want to do so or not is of course
up to you.

I'm not going to push - I just wanted to point it out :)

...
> > I posted to LKML about a month ago with some questions regarding exactly
> > this issue. I had a disk that worked on 128 byte atomic writes - a
> > standard IDE disk.
>
> Hm. This is still much larger than 20 bytes we use.

Assuming your 20 bytes do not span a 128 byte boundary ;)

Perhaps you're safe on current LVM/RAID/partition layers (which may
guarantee a coarser alignment - today).

And perhaps there is no disk out there with less than 128 byte atomic
writes. Maybe. Do you know? I certanly don't.

>
> > The conclusion was something like "we know jack about the disk's
> > internal logic" so we need consistency measures instead of relying on
> > anything from the disk.
>
> Actually we submit data to disk in 512 byte chunks (4k blocksize case),
> and disk should not write any data before it receives all of it and
> checks the integrity (hm, this is only true for UDMA, though.)
> Still I do not see why any sane disk would start to write a sector before fully
> receiving new sector's content (thinking of disk drives of course, solid state
> stuff should take its own measures in this direction too).

Please read the original mails about the IDE disk writing.

The date is 5th of August this year, the subject was "Disk (block) write
strangeness".

The conclusion really was that there is no such thing as a 512 byte
sector. Not in the real world. The disk interface may emulate it, but
that doesn't mean that the disk is internally working with a concept
even remotely close to that.

> This is even more insane than ACKing data and putting it in not battery
> backed cache to be lost on power loss (Yes, I know this is a common
> practice now. At least there is a way either to turn such feature off
> or to flush a cache on demand).

I was pretty surprised about all this myself, and I just wanted to bring
the issue to your attention.

The real world just sucks sometimes ;)

>
> Thanks for bringing our attention to such issues, though changing disk format
> is our of questions for reiser3 now, some kind of verifying single instance
> on-disk structures may/will be incorporated in reiser4.

Of course - I look forward to seeing how/if you will deal with the
problem.

Cheers,

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-09-24 10:20:09

by Oleg Drokin

[permalink] [raw]
Subject: Re: ReiserFS buglet

Hello!

On Tue, Sep 24, 2002 at 12:03:38PM +0200, Jakob Oestergaard wrote:

> It's a question of which errors one wishes to handle, and which you
> simply choose to ignore.

Yes, that's true. Reiserfs chose to not handle any HW errors, this is even
written somewhere in our FAQ.

> > > this issue. I had a disk that worked on 128 byte atomic writes - a
> > > standard IDE disk.
> >
> > Hm. This is still much larger than 20 bytes we use.
> Assuming your 20 bytes do not span a 128 byte boundary ;)

True. But they are located at the beginning of a block.

> And perhaps there is no disk out there with less than 128 byte atomic
> writes. Maybe. Do you know? I certanly don't.

Actually I read your email and I do not see why do you think your disk
does 128 bytes atomic writes. I think this is just impossible on IDE
(for UDMA at least).

> > Actually we submit data to disk in 512 byte chunks (4k blocksize case),
> > and disk should not write any data before it receives all of it and
> > checks the integrity (hm, this is only true for UDMA, though.)
> > Still I do not see why any sane disk would start to write a sector before fully
> > receiving new sector's content (thinking of disk drives of course, solid state
> > stuff should take its own measures in this direction too).
> Please read the original mails about the IDE disk writing.
> The date is 5th of August this year, the subject was "Disk (block) write
> strangeness".

Ok, I did.
You math was wrong, you forgot to account that your number should be divided
by 128, not 512, since integer itself is 4 bytes long on x86.
(See message from Itai Nahshon <[email protected]>, August 7th
Message-Id: <[email protected]> if you missed that originally.)

> The conclusion really was that there is no such thing as a 512 byte
> sector. Not in the real world. The disk interface may emulate it, but

Probably I mssed that part of converstion, then.
As I see the IDE thing, you tell the hardware that you want to write some amount
of _sectors_ to the hard drive, and then feed the controller with necessary
amount of data. If it writes these sectors from the start of the data flow,
what will it do on data transefer timeout?
So I still think that data is written on disk in 512 bytes atomic blocks
until I see IDE device that does otherwise (and then I will probably
dig some IDE datasheet and find out they are violating some spec ;) )

> that doesn't mean that the disk is internally working with a concept
> even remotely close to that.

If they still maintain atomic sector writing, that's their right.

Bye,
Oleg

2002-09-24 10:34:12

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: ReiserFS buglet

On Tue, Sep 24, 2002 at 02:25:21PM +0400, Oleg Drokin wrote:
> Hello!
>
> On Tue, Sep 24, 2002 at 12:03:38PM +0200, Jakob Oestergaard wrote:
>
> > It's a question of which errors one wishes to handle, and which you
> > simply choose to ignore.
>
> Yes, that's true. Reiserfs chose to not handle any HW errors, this is even
> written somewhere in our FAQ.

Fair enough.

...
> Ok, I did.
> You math was wrong, you forgot to account that your number should be divided
> by 128, not 512, since integer itself is 4 bytes long on x86.
> (See message from Itai Nahshon <[email protected]>, August 7th
> Message-Id: <[email protected]> if you missed that originally.)

Darn !

Thanks for pointing that one out - I missed it.

And you (well [email protected]) is right and I am wrong.

...
> Probably I mssed that part of converstion, then.
> As I see the IDE thing, you tell the hardware that you want to write some amount
> of _sectors_ to the hard drive, and then feed the controller with necessary
> amount of data. If it writes these sectors from the start of the data flow,
> what will it do on data transefer timeout?
> So I still think that data is written on disk in 512 bytes atomic blocks
> until I see IDE device that does otherwise (and then I will probably
> dig some IDE datasheet and find out they are violating some spec ;) )

That would be really interesting.

I mean, my point still stands even though my proof was wrong, unless
someone can come up with a "proof" (a spec would be close enough) that
writes must be 512 bytes.

Thanks, :)

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob ?stergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2002-09-24 10:49:10

by Oleg Drokin

[permalink] [raw]
Subject: Re: ReiserFS buglet

Hello!

On Tue, Sep 24, 2002 at 12:39:24PM +0200, Jakob Oestergaard wrote:
> > amount of data. If it writes these sectors from the start of the data flow,
> > what will it do on data transefer timeout?
> > So I still think that data is written on disk in 512 bytes atomic blocks
> > until I see IDE device that does otherwise (and then I will probably
> > dig some IDE datasheet and find out they are violating some spec ;) )
> That would be really interesting.
> I mean, my point still stands even though my proof was wrong, unless
> someone can come up with a "proof" (a spec would be close enough) that
> writes must be 512 bytes.

Ok, here's what I have quckly found, it seems to proove that my assumptions are
right:
http://www.repairfaq.org/filipg/LINK/F_IDE-tech.html
Section: "9. Command Descriptions":
Here the quote:
The recommended procedure for executing a command on the selected
drive is:

1. Wait for drive to clear BUSY.
2. Load required parameters in the Command Block Registers.
3. Activate the Interrupt Enable (nIEN) bit.
4. Wait for drive to set DRDY.
5. Write the command code to the Command Register.
Execution of the command begins as soon as the drive loads the Command
Block Register.

Commands:
E8h: Write Buffer
This command enables the host to overwrite the contents of the
drive's sector buffer with any data pattern. On receipt of this
comand the drive sets BUSY within 400 nsec, sets up the sector
buffer for a write operation, sets DRQ and clears BUSY. The
host then writes up to 512 bytes of data to the buffer. The
Read Buffer and Write Buffer are synchronized so that back to
back Read Buffer and Write Buffer commands access the same 512
bytes within the buffer.
30h: Write Sectors with Retry
31h: Write Sectors without Retry
This command writes from 1 to 256 sectors beginning at the
specified sector and as stated earlier, a sector count of 0 in
the Command Block Register will request 256 sectors. When the
drive accepts this command it sets DRQ and wait for the host to
fill the sector buffer with the data to be written to disk. No
interrupt is generated to start the first buffer fill operation
and once the buffer is full the drive clears the DRQ, sets BUSY
and begins execution of the command.

For single sector Write operations, the drive sets DRQ upon
receipt of the command and waits for the host to fill the
sector buffer. Once a sector has been transferred, the drive
sets BUSY and clears DRQ. If the drive is not on the requested
cylinder and head an inplied seek and/or head switch is
performed. Once the desired track is reached the drive searches
for the appropriate ID field.

....
End of quote

So no write should be even started until entire sector buffer is filled.

Bye,
Oleg

2002-09-24 11:25:02

by Hans Reiser

[permalink] [raw]
Subject: Re: ReiserFS buglet

Oleg Drokin wrote:

>Hello!
>
>On Tue, Sep 24, 2002 at 12:03:38PM +0200, Jakob Oestergaard wrote:
>
>
>
>>It's a question of which errors one wishes to handle, and which you
>>simply choose to ignore.
>>
>>
>
>Yes, that's true. Reiserfs chose to not handle any HW errors, this is even
>written somewhere in our FAQ.
>
No, this is not true. We handle IO errors. We do not attempt to handle
every imaginable hardware error because no one can. If your CPU
overheats, our code makes no effort to compensate for it.

Hans



2002-09-24 11:30:29

by Oleg Drokin

[permalink] [raw]
Subject: Re: ReiserFS buglet

Hello!

On Tue, Sep 24, 2002 at 03:30:10PM +0400, Hans Reiser wrote:

> >>It's a question of which errors one wishes to handle, and which you
> >>simply choose to ignore.
> >Yes, that's true. Reiserfs chose to not handle any HW errors, this is even
> >written somewhere in our FAQ.
> No, this is not true. We handle IO errors. We do not attempt to handle

Ok, IO errors is not kind of HW errors I meant.
I meant broken HW that does not behave like it should behave according
to specs (i.e. writing other data than we asked it, damaging memory
content and stuff).

> every imaginable hardware error because no one can. If your CPU
> overheats, our code makes no effort to compensate for it.

Bye,
Oleg