2001-04-19 11:55:55

by Alexander Viro

[permalink] [raw]
Subject: ext2 inode size (on-disk)

Erm... Folks, can ->s_inode_size be not a power of 2? Both
libext2fs and kernel break in that case. Example:

dd if=/dev/zero of=foo bs=1024 count=20480
mkfs -I 192 foo

corrupts memory and segfaults. Reason: ext2_read_inode() (same problem
is present in the kernel version of said beast) finds inode offset within
cylinder group piece of inode table, splits it into block*block_size+offset,
reads the block and works with the structure at given offset.

I.e. it does
group = (ino-1) / inodes_per_group;
number_in_group = (ino-1) % inodes_per_group;
offset_in_group = number_in_group * inode_size;
block_number = inode_table_base[group] + offset_in_group/block_size;
offset_in_block = offset_in_group % block_size

Guess what happens if inode crosses block boundary? Exactly.

AFAICS we have two sane solutions:

a) require inode size to be a power of 2

b) switch to

group = (ino-1) / inodes_per_group;
number_in_group = (ino-1) % inodes_per_group;
block_in_group = number_in_group / inodes_per_block;
number_in_block = number_in_group % inodes_per_block;
block = inode_table_fragments[group] + block_in_group;
offset_in_block = number_in_block * inode_size;

i.e. instead of current "pack inodes into piece of inode table and
pad it in the end" do "pack inodes into blocks padding the end of every block".

Something has to be done - right now mke2fs effectively mandates "inode size
is a power of 2" and as far as I'm concerned it's OK, but segfaulting is
a bit too drastic way of telling user "don't do it"...
Al

PS: can we assume that inodes_per_group is a multiple of inodes_per_block
or it isn't guaranteed?


2001-04-19 17:43:29

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext2 inode size (on-disk)

Al, you write:
> Erm... Folks, can ->s_inode_size be not a power of 2? Both
> libext2fs and kernel break in that case. Example:
>
> dd if=/dev/zero of=foo bs=1024 count=20480
> mkfs -I 192 foo

I had always assumed that it would be a power-of-two size, but since it
is an undocumented option to mke2fs, I suppose it was never really
intended to be used. It appears, however, that the mke2fs code
doesn't do ANY checking on the parameter, so you could concievably make
the inode size SMALLER than the current size, and this would DEFINITELY
be bad as well.

> corrupts memory and segfaults. Reason: ext2_read_inode() (same problem
> is present in the kernel version of said beast) finds inode offset within
> cylinder group piece of inode table, splits it into block*block_size+offset,
> reads the block and works with the structure at given offset.

However, we _should_ be safe against this, because ext2_read_super() does:

if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV) {
sb->u.ext2_sb.s_inode_size = EXT2_GOOD_OLD_INODE_SIZE;
sb->u.ext2_sb.s_first_ino = EXT2_GOOD_OLD_FIRST_INO;
} else {
sb->u.ext2_sb.s_inode_size = le16_to_cpu(es->s_inode_size);
sb->u.ext2_sb.s_first_ino = le32_to_cpu(es->s_first_ino);
if (sb->u.ext2_sb.s_inode_size != EXT2_GOOD_OLD_INODE_SIZE) {
printk ("EXT2-fs: unsupported inode size: %d\n",
sb->u.ext2_sb.s_inode_size);
goto failed_mount;
}
}

Are you talking about user-space code or kernel code when you say it is
segfaulting and corrupting memory?

> a) require inode size to be a power of 2

This would be my preferred solution, makes the math in the kernel a lot
faster. Should probably be put into mke2fs for safety, and also checked
when/if we ever allow the kernel to mount a filesystem with non-power-of-2
inode sizes.

> PS: can we assume that inodes_per_group is a multiple of inodes_per_block
> or it isn't guaranteed?

mke2fs will always set up the filesystem this way, and there is no real
reason NOT to do that. If you are using a filesystem block for the inode
table, it is pointless to leave part of it empty, because you can't use
it for anything else anyways.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-19 18:04:29

by Alexander Viro

[permalink] [raw]
Subject: Re: ext2 inode size (on-disk)



On Thu, 19 Apr 2001, Andreas Dilger wrote:

> Al, you write:
> > Erm... Folks, can ->s_inode_size be not a power of 2? Both
> > libext2fs and kernel break in that case. Example:
> >
> > dd if=/dev/zero of=foo bs=1024 count=20480
> > mkfs -I 192 foo
>
> I had always assumed that it would be a power-of-two size, but since it
> is an undocumented option to mke2fs, I suppose it was never really
> intended to be used. It appears, however, that the mke2fs code
> doesn't do ANY checking on the parameter, so you could concievably make
> the inode size SMALLER than the current size, and this would DEFINITELY
> be bad as well.

In some sense it does - it dies if you've passed it not a power of two ;-)
I don't think that segfault is a good way to report the problem, though...

Problem with mkfs is obvious, but kernel side is also shady - we could
have cleaner code if we assumed that inode size is power of 2. As it
is, we have a code in read_super() that checks for size == 128 _and_
code that was apparently writen in assumption that it can be not a
power of 2. However, if that was the really the goal, we fail - code
in ext2_read_inode() actually would break with such sizes.

In other words, the real question is what the hell are we trying to
do there. If we want code that deals with sizes that are not powers of 2
we need to change ext2_read_inode() and friends. It wouldn't be
hard. OTOH, if we guarantee that inode size will always remain a power of
2 we can simplify the thing. In any case current situation doesn't
make much sense. The only question is direction of fix.

Could those who introduced ->s_inode_size tell what use had been intended?

> mke2fs will always set up the filesystem this way, and there is no real
> reason NOT to do that. If you are using a filesystem block for the inode
> table, it is pointless to leave part of it empty, because you can't use
> it for anything else anyways.

It's not that simple - if you need 160 bytes per inode rounding it up
to the next power of two will lose a lot. On 4Kb fs it will be
16 inodes per block instead of 25 - 36% loss...

2001-04-19 19:26:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext2 inode size (on-disk)

Al writes:
> > I had always assumed that it would be a power-of-two size, but since it
> > is an undocumented option to mke2fs, I suppose it was never really
> > intended to be used. It appears, however, that the mke2fs code
> > doesn't do ANY checking on the parameter, so you could concievably make
> > the inode size SMALLER than the current size, and this would DEFINITELY
> > be bad as well.
>
> In some sense it does - it dies if you've passed it not a power of two ;-)
> I don't think that segfault is a good way to report the problem, though...

Strange, I run "mke2fs -I 192 /dev/hdc2" and do not have a segfault or any
problems with e2fsck or debugfs on the resulting filesystem. I'm running
1.20-WIP, but I don't think anything was changed in this area for some time.

However, looking at the output from dumpe2fs shows it is packing inodes
across block boundaries (inode_size = 192, inodes_per_group = 16144,
inode_blocks_per_group = 757). It is also not filling the last inode
table block (which would give us 16149 inodes).

Also, looking at the disk, it shows garbage data in the space after the
end of the normal inode, and ext2 should always zero-fill unused fields.

Basically, packing inodes across block boundaries is TOTALLY broken.
This can lead to all sorts of data corruption issues if one block is
written to disk, and the other is not. For that matter, the same would
hold true with any not-power-of-two inode size. In this case, the
inode will still be crossing a hard disk sector boundary, and have the
(small) possibility that part of the inode is written and part is not.

In this light, the safe inode sizes are 128 (current size), 256, and 512.

> > mke2fs will always set up the filesystem this way, and there is no real
> > reason NOT to do that. If you are using a filesystem block for the inode
> > table, it is pointless to leave part of it empty, because you can't use
> > it for anything else anyways.
>
> It's not that simple - if you need 160 bytes per inode rounding it up
> to the next power of two will lose a lot. On 4Kb fs it will be
> 16 inodes per block instead of 25 - 36% loss...

What I was getting at, is that no matter what size of inode we are using,
we should ALWAYS fill the last inode table block as full as possible. Once
you have allocated a block to the inode table, it should be filled with
as many inodes as fit in a single block. To do anything else is a waste.

For example, with 160 byte inodes, and a 4k inode table block, we can fit

4096 / 160 = 25 inodes into this block (with 96 bytes remaining)

There is no reason to only have 1 or 2 or 17 inodes in this block. If we
assume we are not crossing a block boundary with the inode, then
inodes_per_group = n * inodes_per_block, where n = number of inode blocks.

Cheers, Andreas

PS - is this a code cleanup issue, or do you have some reason that you want
to increase the inode size?
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-04-20 02:13:18

by Alexander Viro

[permalink] [raw]
Subject: Re: ext2 inode size (on-disk)



On Thu, 19 Apr 2001, Andreas Dilger wrote:

> Strange, I run "mke2fs -I 192 /dev/hdc2" and do not have a segfault or any
> problems with e2fsck or debugfs on the resulting filesystem. I'm running
> 1.20-WIP, but I don't think anything was changed in this area for some time.

May depend on the libc version/size of device/phase of the moon. I've
got segfaults with 1.18, 1.19 and 1.20-WIP on a Debian box with glibc
2.1.3-18 and 20Mb image. What really happens is memory corruption in
libe2fs (ext2_write_inode()), segfault comes later (usually in free()).

> Basically, packing inodes across block boundaries is TOTALLY broken.
> This can lead to all sorts of data corruption issues if one block is
> written to disk, and the other is not. For that matter, the same would

Yup.

> PS - is this a code cleanup issue, or do you have some reason that you want
> to increase the inode size?

Code cleanup
Al