LinuxLists.cc - [patch 5/5] Disallow in-inode attributes for reserved inodes

2005-01-20 03:27:00

Subject: [patch 5/5] Disallow in-inode attributes for reserved inodes

When creating a filesystem with inodes bigger than 128 bytes, mke2fs
fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
it creates (the journal, the filesystem root, and lost+found). We would
require a zeroed-out i_extra_isize field but we don't get it, so
disallow in-inode attributes for those inodes.

Add an i_extra_isize sanity check.

Signed-off-by: Andreas Gruenbacher <[email protected]>

Index: linux-2.6.11-latest/fs/ext3/inode.c
===================================================================
--- linux-2.6.11-latest.orig/fs/ext3/inode.c
+++ linux-2.6.11-latest/fs/ext3/inode.c
@@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);

- ei->i_extra_isize =
- (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
- le16_to_cpu(raw_inode->i_extra_isize) : 0;
- if (ei->i_extra_isize) {
- __le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE +
- ei->i_extra_isize;
- if (le32_to_cpu(*magic))
- ei->i_state |= EXT3_STATE_XATTR;
- }
+ if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
+ EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
+ /*
+ * When mke2fs creates big inodes it does not zero out
+ * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE,
+ * so ignore those first few inodes.
+ */
+ ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
+ if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
+ EXT3_INODE_SIZE(inode->i_sb))
+ goto bad_inode;
+ if (ei->i_extra_isize == 0) {
+ /* The extra space is currently unused. Use it. */
+ ei->i_extra_isize = sizeof(struct ext3_inode) -
+ EXT3_GOOD_OLD_INODE_SIZE;
+ } else {
+ __le32 *magic = (void *)raw_inode +
+ EXT3_GOOD_OLD_INODE_SIZE +
+ ei->i_extra_isize;
+ if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC))
+ ei->i_state |= EXT3_STATE_XATTR;
+ }
+ } else
+ ei->i_extra_isize = 0;

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext3_file_inode_operations;

--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

2005-01-20 12:16:28

by Andreas Dilger

[permalink] [raw]

Subject: Re: [patch 5/5] Disallow in-inode attributes for reserved inodes

On Jan 20, 2005 03:01 +0100, Andreas Gruenbacher wrote:
> When creating a filesystem with inodes bigger than 128 bytes, mke2fs
> fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
> it creates (the journal, the filesystem root, and lost+found). We would
> require a zeroed-out i_extra_isize field but we don't get it, so
> disallow in-inode attributes for those inodes.
>
> Add an i_extra_isize sanity check.

I'm not sure I agree with this patch. It definitely resolves a problem
I had with the previous patch (4/5), namely that for inodes that were
created in a large-inode filesystem using a kernel w/o the large-inode
patch we didn't save EAs into the large-inode space at all. With this
patch we will start using that space.

It is debatable whether we should mark inodes bad if the i_extra_isize
field is bad, or if we should just initialize i_extra_isize in that case.
On one hand there is very little else we can use to validate the on-disk
inodes, but on the other hand the EA data isn't critical to reading the
inode so this would seem to introduce additional failure cases.

I'm not sure whether an old e2fsck will also clear out the space in a
large inode that it writes or not (I know that the patched one we use
validates i_extra_isize). If it doesn't it may be that there would be a
largish number of inodes that are marked bad because of this, so having
the kernel fix this would be good. If the old e2fsck zeroes the large
inodes properly it would be prudent to have an ext3_error() here so that
the disk corruption triggers an e2fsck the next time the system starts.

For the root and lost+found inodes it looks like we can never store an
EA in the extra part of the inode regardless of whether i_extra_isize is
good or not. If a bad value is found we could just initialize it and
start using that space (though not print an ext3_error() in that case,
an ext3_warning() if anything since this is probably the fault of mke2fs).

> Index: linux-2.6.11-latest/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.11-latest.orig/fs/ext3/inode.c
> +++ linux-2.6.11-latest/fs/ext3/inode.c
> @@ -2493,15 +2493,30 @@ void ext3_read_inode(struct inode * inod
> ei->i_data[block] = raw_inode->i_block[block];
> INIT_LIST_HEAD(&ei->i_orphan);
>
> - ei->i_extra_isize =
> - (EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) ?
> - le16_to_cpu(raw_inode->i_extra_isize) : 0;
> - if (ei->i_extra_isize) {
> - __le32 *magic = (void *)raw_inode + EXT3_GOOD_OLD_INODE_SIZE +
> - ei->i_extra_isize;
> - if (le32_to_cpu(*magic))
> - ei->i_state |= EXT3_STATE_XATTR;
> - }
> + if (inode->i_ino >= EXT3_FIRST_INO(inode->i_sb) + 1 &&
> + EXT3_INODE_SIZE(inode->i_sb) > EXT3_GOOD_OLD_INODE_SIZE) {
> + /*
> + * When mke2fs creates big inodes it does not zero out
> + * the unused bytes above EXT3_GOOD_OLD_INODE_SIZE,
> + * so ignore those first few inodes.
> + */
> + ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> + if (EXT3_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> + EXT3_INODE_SIZE(inode->i_sb))
> + goto bad_inode;
> + if (ei->i_extra_isize == 0) {
> + /* The extra space is currently unused. Use it. */
> + ei->i_extra_isize = sizeof(struct ext3_inode) -
> + EXT3_GOOD_OLD_INODE_SIZE;
> + } else {
> + __le32 *magic = (void *)raw_inode +
> + EXT3_GOOD_OLD_INODE_SIZE +
> + ei->i_extra_isize;
> + if (*magic == cpu_to_le32(EXT3_XATTR_MAGIC))
> + ei->i_state |= EXT3_STATE_XATTR;
> + }
> + } else
> + ei->i_extra_isize = 0;
>
> if (S_ISREG(inode->i_mode)) {
> inode->i_op = &ext3_file_inode_operations;

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/

Attachments:

(No filename) (3.75 kB)
(No filename) (189.00 B)
Download all attachments

2005-01-20 13:29:31

by Andreas Gruenbacher

[permalink] [raw]

Subject: Re: [patch 5/5] Disallow in-inode attributes for reserved inodes

On Thursday 20 January 2005 13:16, Andreas Dilger wrote:
> On Jan 20, 2005 03:01 +0100, Andreas Gruenbacher wrote:
> > When creating a filesystem with inodes bigger than 128 bytes, mke2fs
> > fails to clear out bytes beyond EXT3_GOOD_OLD_INODE_SIZE in all inodes
> > it creates (the journal, the filesystem root, and lost+found). We would
> > require a zeroed-out i_extra_isize field but we don't get it, so
> > disallow in-inode attributes for those inodes.
> >
> > Add an i_extra_isize sanity check.
>
> I'm not sure I agree with this patch. It definitely resolves a problem
> I had with the previous patch (4/5), namely that for inodes that were
> created in a large-inode filesystem using a kernel w/o the large-inode
> patch we didn't save EAs into the large-inode space at all. With this
> patch we will start using that space.

The ea-in-inode patch totally relies on getting all the available inode space
cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for
any inode we find, then i_extra_isize may contain a random number, and we've
just lost, period: There is no way of sanitizing a random i_extra_isize; we
cannot know what the right number would be.

We have two options for going forward from here: (a) depend on the fact that
all inodes have been cleared out for us, or (b) don't rely on that, fix
e2fsck before going any further, and have the fixed mke2fs and e2fsck set a
filesystem feature flag that tells us that all the extra space in all inodes
is "sane".

For (a) we can make an (ugly) exception because we know which inodes mke2fs
created. As far as I can tell we can also fully rely on the kernel having
cleared out the extra space for us. e2fsck may currently fail to move the
extra space around when moving inodes, but that did not matter without
ea-in-inode, and if we fix it now, that's probably good enough. We don't
really care which inode's zeroed extra space we look at without ea-in-inode.

> It is debatable whether we should mark inodes bad if the i_extra_isize
> field is bad, or if we should just initialize i_extra_isize in that case.

IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply
changing it to something we think is better is a really bad idea. We might as
well find an unusual but still valid i_extra_isize and use it, then a future
patch allocates a few more fields in the inode, and suddenly we start to see
garbage.

> On one hand there is very little else we can use to validate the on-disk
> inodes, but on the other hand the EA data isn't critical to reading the
> inode so this would seem to introduce additional failure cases.

You may have an access acl on the inode. Not being able to read an access acl
is a clear sign of trouble. The same applies for everything else in the
system.* and security.* namespaces, at least.

> I'm not sure whether an old e2fsck will also clear out the space in a
> large inode that it writes or not (I know that the patched one we use
> validates i_extra_isize). If it doesn't it may be that there would be a
> largish number of inodes that are marked bad because of this, so having
> the kernel fix this would be good.

Repairing the filesystem has no place in the kernel code. If what you assume
is the case (and I sure would have assumed you to check before proposing the
kernel patch for inclusion), we have to go for the filesystem feature flag
approach.

> If the old e2fsck zeroes the large
> inodes properly it would be prudent to have an ext3_error() here so that
> the disk corruption triggers an e2fsck the next time the system starts.

That's even better, yes.

> For the root and lost+found inodes it looks like we can never store an
> EA in the extra part of the inode regardless of whether i_extra_isize is
> good or not. If a bad value is found we could just initialize it and
> start using that space (though not print an ext3_error() in that case,
> an ext3_warning() if anything since this is probably the fault of mke2fs).

I disagree. We cannot just use the space when we think the inode is corrupted.

Cheers,
--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH

2005-01-20 23:05:39

by Andreas Dilger

[permalink] [raw]

Subject: Re: [patch 5/5] Disallow in-inode attributes for reserved inodes

On Jan 20, 2005 14:29 +0100, Andreas Gruenbacher wrote:
> The ea-in-inode patch totally relies on getting all the available inode space
> cleared out by the kernel (or mke2fs, or e2fsck). If this is not the case for
> any inode we find, then i_extra_isize may contain a random number, and we've
> just lost, period: There is no way of sanitizing a random i_extra_isize; we
> cannot know what the right number would be.

The large-inode support is designed to allow different amounts of
"fixed" optional data (i.e. what is stored inside i_extra_isize),
so it is valid to set this to 4 (i.e. just enough to hold i_extra_isize
itself) and store the EA data after that. Any code which reads "fixed"
fields from a large inode (e.g. i_mtime_nsec) needs to validate that
i_extra_isize on that inode is large enough for that data to actually
be in the fixed area in the large inode.

If the kernel is setting i_extra_isize > 4 (i.e. it is storing optional
fields there like i_mtime_msb_and_ns) it should/is-able-to also initialize
those values since it should know what they are or they shouldn't be in
struct ext3_inode.

The whole point of i_extra_isize is that it is possible for inodes
to have different amounts of the optional fixed fields in each large
inode, depending on what the kernel that wrote the inode knew about.
So any value for i_extra_isize is valid as long as those fields
are initialized. If we arbitrarily set i_extra_isize = 4 instead of
leaving the bad value this is no different than waiting for e2fsck to
do the same.

> > It is debatable whether we should mark inodes bad if the i_extra_isize
> > field is bad, or if we should just initialize i_extra_isize in that case.
>
> IMHO it's not debatable. Taking an i_extra_isize that looks odd and simply
> changing it to something we think is better is a really bad idea.
>
> You may have an access acl on the inode. Not being able to read an access acl
> is a clear sign of trouble. The same applies for everything else in the
> system.* and security.* namespaces, at least.

Well, I said it was debatable and we're having a debate ;-). I don't have
a strong opinion either way. If we ext3_error() in this case at least we
will check the fs on the next boot (which will just zero i_extra_isize)
instead of never doing anything to resolve the situation.

> > For the root and lost+found inodes it looks like we can never store an
> > EA in the extra part of the inode regardless of whether i_extra_isize is
> > good or not. If a bad value is found we could just initialize it and
> > start using that space (though not print an ext3_error() in that case,
> > an ext3_warning() if anything since this is probably the fault of mke2fs).
>
> I disagree. We cannot just use the space when we think the inode is corrupted.

But as your patch stands it doesn't ever check if i_extra_isize is valid
for the root or lost+found inode. It just always sets i_extra_isize = 0
and never uses it. Given that the root inode is fairly high-traffic it
makes sense to use the faster EA space if it is available.

If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
not so keen to have an ext3_error() there. If the user doesn't have an
e2fsck with ea-in-inode support there isn't anything they can do to fix
it and they will get a full e2fsck on each boot.

Even so, for the effort of setting i_extra_isize = 4 (or larger if we
initialize the fixed fields) we can do the equivalent of what e2fsck will
do when it finds a bogus value.

The good news is that we can still apply your patch as-is and address my
concerns later since this is a transient issue. Also, given that there
are probably only a handful of filesystems in the world using large inodes
(excluding Lustre filesystems which aren't affected by this) I don't think
it is a pressing issue yet. I'm going to be away for 2 weeks, so I'll say
accept this patch as is and we can look at it again when I get back, and
maybe Ted and Stephen will have weighed in on this issue also.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/

Attachments:

(No filename) (4.08 kB)
(No filename) (189.00 B)
Download all attachments

2005-01-21 00:36:19

by Andreas Gruenbacher

[permalink] [raw]

Subject: Re: [patch 5/5] Disallow in-inode attributes for reserved inodes

On Friday 21 January 2005 00:05, Andreas Dilger wrote:
> [...]
> But as your patch stands it doesn't ever check if i_extra_isize is valid
> for the root or lost+found inode. It just always sets i_extra_isize = 0
(that's the in-memory i_extra_isize)
> and never uses it. Given that the root inode is fairly high-traffic it
> makes sense to use the faster EA space if it is available.

It's only a single block we're talking about, not all the overhead you run
into with huge amounts of attributes in many xattr disk blocks. It sure would
be much cleaner to use the root inode's in-inode space like with all other
inodes, but performance wise I don't think it matters.

> If these inodes have a BAD i_extra_isize it is OK to skip it, but I'm
> not so keen to have an ext3_error() there. If the user doesn't have an
> e2fsck with ea-in-inode support there isn't anything they can do to fix
> it and they will get a full e2fsck on each boot.

Agreed, that would be really bad. We should get e2fsck fixed ASAP.

> Even so, for the effort of setting i_extra_isize = 4 (or larger if we
> initialize the fixed fields) we can do the equivalent of what e2fsck will
> do when it finds a bogus value.

We cannot ask the user, and we don't have the kind of global view that e2fsck
has. Something different may be messed up, and may have lead to the
corruption. It's unlikely, but not impossible.

Cheers,
--
Andreas Gruenbacher <[email protected]>
SUSE Labs, SUSE LINUX PRODUCTS GMBH