2006-03-16 12:11:26

by Takashi Sato

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Hi,

> You changed most of the affected variables from "int" to "unsigned int",
> which seems to allow block numbers to address 2^32. It is probably a good
> idea to consider changing the variables to sector_t type, so that when we
> want to support 64-bit block numbers, we don't have to redo
> similar work. Laurent did very similar work on this before.

sector_t is 8 bytes in a normal configuration, and ext2/3 has many
variables that hold block numbers. I thought widening those variables might
affect performance, so I didn't change them.

> Besides these limitations, I think there is one more to limit ext3
> filesystem size to 8TB
>
> - The superblock format currently stores the number of block groups as a
> 16-bit integer, and because (on a 4 KB blocksize filesystem) the maximum
> number of blocks in a block group is 32,768 , a combination of these
> constraints limits the maximum size of the filesystem to 8 TB

Is it s_block_group_nr in ext3_super_block?
mke2fs sets the field to 65535 if the number of block groups is greater
than 65535. The current kernel ignores the field and recalculates the
group count from other fields. The findsuper command is its only user, and
it simply prints the value. So it does not limit the maximum size of the
filesystem to 8 TB. I confirmed that mke2fs with my change could make a
filesystem with more than 65536 groups and that it could be mounted.
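
For illustration, here is a minimal sketch of the recalculation described
above, assuming the group count is derived from the block count, the first
data block, and blocks per group. The function names are made up for this
example and are not the kernel's; the second helper only shows how a 16-bit
field saturates at 65535.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of block groups derived from the block
 * count rather than read from s_block_group_nr. */
static uint32_t group_count(uint32_t blocks_count, uint32_t first_data_block,
                            uint32_t blocks_per_group)
{
    assert(blocks_per_group != 0 && blocks_count > first_data_block);
    /* Round up so that a partial last group is still counted. */
    return (blocks_count - first_data_block - 1) / blocks_per_group + 1;
}

/* A 16-bit on-disk field cannot hold more than 65535, so it saturates. */
static uint16_t clamp_group_nr(uint32_t groups)
{
    return groups > UINT16_MAX ? UINT16_MAX : (uint16_t)groups;
}
```

With 2^32-1 blocks and 32768 blocks per group this gives 131072 groups,
well beyond what a 16-bit field can express, which is why the field cannot
be what determines the filesystem size.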

> I noticed that the first patch set combines changes to ext2 filesystem
> and changes to ext3 filesystem. It would be nice to split the changes to
> two different filesystems.

Ok, I'll split it later.

> But that doesn't fix all the problems. We still have places in ext3 block
> reservation code that use int for system-wide block numbers. For e.g.,
> alloc_new_reservation(), group_first_block, group_end_block, start_block
> are all filesystem wide block numbers, they need to be changed. I will
> check the code more closely tomorrow to see if the changes will break
> any assumptions.

Thank you, I missed it. I'm looking forward to seeing your report.

> Also, I noticed that in your first patch, you changed a few variables
> for logical block number from "long" to "unsigned int". Just want to
> point out that's a separate issue: that's for enlarging the file size, not
> for expanding the max filesystem size.

Ok, I'll remove them when I update the patch next time.
They are left because I'm considering enlarging the file size max too...

>> -static int ext3_alloc_block (handle_t *handle,
>> - struct inode * inode, unsigned long goal, int *err)
>> +static unsigned int ext3_alloc_block (handle_t *handle,
>> + struct inode * inode, unsigned int goal, int *err)
>> {
>
> I did some changes in the same code to support ext3 multiple block
> allocation. Those patches removed this function ext3_alloc_block(). The
> patches are sitting in mm tree now.
>
> BTW, why we change from unsigned long back to unsigned int here?

Because ext3_alloc_branch calls ext3_alloc_block with int type for the
block number and ext3_alloc_blocks returns int type.

>> struct ext3_block_alloc_info *block_i = EXT3_I(inode)->i_block_alloc_info;
>> @@ -505,21 +505,21 @@ static unsigned long ext3_find_goal(stru
>> static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
>> int num,
>> unsigned long goal,
>> - int *offsets,
>> + unsigned int *offsets,
>> Indirect *branch)
>
> The offsets[] array here stores the index position within an indirect
> block where the physical block number is stored. An indirect block takes
> a 4k block and holds up to 1K entries of physical block numbers, so int
> type for the index is good enough.

Ok, I'll update them too.
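
To make the point about offsets[] concrete, here is a rough sketch of how a
logical block number can be split into per-level indexes for the classic
direct/indirect layout. It assumes 12 direct pointers and 4-byte block
pointers; block_to_path() is a simplified stand-in written for this example,
not the kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define NDIR_BLOCKS 12  /* direct pointers held in the inode (assumed) */

/* Split a logical block number into one index per level of the tree.
 * Returns the depth (1 to 4), or 0 if the block is out of range. */
static int block_to_path(uint64_t lblk, unsigned int block_size, int offsets[4])
{
    const uint64_t ptrs = block_size / 4;  /* 32-bit entries per indirect block */

    assert(ptrs > 0);
    if (lblk < NDIR_BLOCKS) {
        offsets[0] = (int)lblk;
        return 1;
    }
    lblk -= NDIR_BLOCKS;
    if (lblk < ptrs) {
        offsets[0] = NDIR_BLOCKS;          /* single-indirect slot */
        offsets[1] = (int)lblk;
        return 2;
    }
    lblk -= ptrs;
    if (lblk < ptrs * ptrs) {
        offsets[0] = NDIR_BLOCKS + 1;      /* double-indirect slot */
        offsets[1] = (int)(lblk / ptrs);
        offsets[2] = (int)(lblk % ptrs);
        return 3;
    }
    lblk -= ptrs * ptrs;
    if (lblk < ptrs * ptrs * ptrs) {
        offsets[0] = NDIR_BLOCKS + 2;      /* triple-indirect slot */
        offsets[1] = (int)(lblk / (ptrs * ptrs));
        offsets[2] = (int)((lblk / ptrs) % ptrs);
        offsets[3] = (int)(lblk % ptrs);
        return 4;
    }
    return 0;
}
```

With a 4k block every index stays below 1024, so a plain int is more than
wide enough for the offsets, whatever type the block numbers themselves use.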


2006-03-16 13:53:50

by Theodore Ts'o

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Thu, Mar 16, 2006 at 09:11:17PM +0900, Takashi Sato wrote:
> >You changed most of the affected variables from "int" to "unsigned int",
> >which seems to allow block numbers to address 2^32. It is probably a good
> >idea to consider changing the variables to sector_t type, so that when we
> >want to support 64-bit block numbers, we don't have to redo
> >similar work. Laurent did very similar work on this before.
>
> sector_t is 8 bytes in a normal configuration, and ext2/3 has many
> variables that hold block numbers. I thought widening those variables might
> affect performance, so I didn't change them.

It would be interesting to do a CPU overhead benchmark to see how much
of the overhead is actually measurable on an x86 system. If it's only
a small percent, it might be acceptable given that x86_64 machines are
going to be gradually taking over, and sector_t only exists if
CONFIG_LBD is enabled. So for smaller systems where LBD isn't
enabled, we won't see performance overhead since sector_t won't exist
and so the code is going to have to use a typedef for ext2_blk_t which
is either __u32 or sector_t as necessary.
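
A sketch of what such a typedef could look like, purely for illustration.
Here CONFIG_LBD is just a compile-time switch standing in for the kernel
option, and uint64_t stands in for a 64-bit sector_t; the helper function is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: pick the block-number width at compile time. */
#ifdef CONFIG_LBD
typedef uint64_t ext2_blk_t;   /* stands in for a 64-bit sector_t */
#else
typedef uint32_t ext2_blk_t;   /* stands in for __u32 */
#endif

/* Byte offset of a block. Widen before multiplying so the product does
 * not wrap when ext2_blk_t is only 32 bits. */
static uint64_t block_to_byte_offset(ext2_blk_t block, uint32_t block_size)
{
    assert(block_size != 0);
    return (uint64_t)block * block_size;
}
```

Code written against ext2_blk_t then pays for 64-bit arithmetic only in
configurations that ask for it.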

> >- The superblock format currently stores the number of block groups as a
> >16-bit integer, and because (on a 4 KB blocksize filesystem) the maximum
> >number of blocks in a block group is 32,768 , a combination of these
> >constraints limits the maximum size of the filesystem to 8 TB
>
> Is it s_block_group_nr in ext3_super_block?
> mke2fs sets the field to 65535 if the number of block groups is greater
> than 65535. The current kernel ignores the field and recalculates the
> group count from other fields. The findsuper command is its only user, and
> it simply prints the value. So it does not limit the maximum size of the
> filesystem to 8 TB.

s_block_group_nr is *not* the number of block groups in the
filesystem. As Takashi-san properly pointed out, the kernel
calculates the number of block groups by dividing the number of blocks
by the blocks_per_group field. s_block_group_nr is used to identify
the block group of a particular backup superblock.

So for the backup superblock located at block group #3,
s_block_group_nr is 3, and for the backup superblock located at block
group #5, s_block_group_nr is 5, and so on. It is used only as a hint so
that programs like findsuper and gpart can be more intelligent about
finding the start of the filesystem when trying to recover from a smashed
partition table.

- Ted

2006-03-16 18:35:56

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mar 16, 2006 21:11 +0900, Takashi Sato wrote:
> >Also, I noticed that in your first patch, you changed a few variables
> >for logical block number from "long" to "unsigned int". Just want to
> >point out that's a separate issue: that's for enlarging the file size, not
> >for expanding the max filesystem size.
>
> Ok, I'll remove them when I update the patch next time.
> They are left because I'm considering enlarging the file size max too...

There was previously a patch by Goldwyn Rodrigues in linux-kernel:
"[PATCH] Pushing ext3 file size limits beyond 2TB", which at least
got as far as 4TB for the file size (for 4kB blocks).

Beyond that, we need a format change and may as well have something
like extents, but even extents still need to allow a larger i_blocks,
so that patch would be useful in any case... though it needs some
cleanup to remove all users of i_frag and i_faddr (which have never
ever been used).

Laurent, do your 64-bit patches include support for larger i_blocks?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-03-16 21:26:37

by Theodore Ts'o

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Thu, Mar 16, 2006 at 11:35:49AM -0700, Andreas Dilger wrote:
> Beyond that, we need a format change and may as well have something
> like extents, but even extents still need to allow a larger i_blocks,

As a side note, one of the things that we've been talking about doing
is bundling a number of small changes together into a single INCOMPAT
flag. Changing i_blocks so its units are in blocks rather than
512-byte sectors was one such change.
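
A back-of-the-envelope sketch of why the unit matters, assuming i_blocks
stays a 32-bit count. The helper name is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Largest allocation, in bytes, that a 32-bit i_blocks can describe
 * when each unit is unit_size bytes. */
static uint64_t max_bytes_for_i_blocks(uint32_t unit_size)
{
    assert(unit_size != 0);
    return (uint64_t)UINT32_MAX * unit_size;
}
```

With 512-byte units the result is just under 2 TB; counting in 4 KB
filesystem blocks instead raises it to just under 16 TB, an eightfold gain
without widening the field.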

Another was guaranteeing that for large inodes (> 128 bytes) that at
least some number of bytes (probably on the order of 32 bytes or so)
would be reserved for things like the high resolution portion of
ctime/mtime/atime, high watermark, and other inode extensions. (One
of the problems with doing high res timestamps right is how to handle
the case where you can't make room for the high res timestamps, due to
too much space being taken up by extended attributes. The make(1)
program gets really confused unless all files are either using or not
using high res timestamps.)

The idea was to do a quick easy strike of all of the ideas which could
be implemented quickly, and perhaps try to get them done before RHEL 5
snapshots. Even if RHEL5 doesn't enable use of these features by
default, having it supported by RHEL5 would be extremely convenient.

> so that patch would be useful in any case... though it needs some
> cleanup to remove all users of i_frag and i_faddr (which have never
> ever been used).

One of the things which we need to consider is whether we think we
will never support tail packing or other forms of fragments, which is
related to whether we think we will ever support large blocks (i.e.,
32k, 64k, and up). If we do, we might want to keep those fields
around.

- Ted

2006-03-16 22:59:45

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mar 16, 2006 16:26 -0500, Theodore Ts'o wrote:
> On Thu, Mar 16, 2006 at 11:35:49AM -0700, Andreas Dilger wrote:
> > Beyond that, we need a format change and may as well have something
> > like extents, but even extents still need to allow a larger i_blocks,
>
> As a side note, one of the things that we've been talking about doing
> is bundling a number of small changes together into a single INCOMPAT
> flag. Changing i_blocks so its units are in blocks rather than
> 512-byte sectors was one such change.

> Another was guaranteeing that for large inodes (> 128 bytes) that at
> least some number of bytes (probably on the order of 32 bytes or so)
> would be reserved for things like the high resolution portion of
> ctime/mtime/atime, high watermark, and other inode extensions. (One
> of the problems with doing high res timestamps right is how to handle
> the case where you can't make room for the high res timestamps, due to
> too much space being taken up by extended attributes. The make(1)
> program gets really confused unless all files are either using or not
> using high res timestamps.)
>
> The idea was to do a quick easy strike of all of the ideas which could
> be implemented quickly, and perhaps try to get them done before RHEL 5
> snapshots. Even if RHEL5 doesn't enable use of these features by
> default, having it supported by RHEL5 would be extremely convenient.

While I agree with that in theory, in practise we never end up doing
this and it just ends up delaying the acceptance of the trivial patches.
It may also be a burden later on when some of the features that could
be e.g. ROCOMPAT are bundled with an INCOMPAT change and we then
make the filesystem gratuitously INCOMPAT.

In the end, I don't think having a couple of separate flags is any more
effort than having a single one. As yet we only have about 6 of each 32
feature bits used, and if we get close to running out we can make an
EXT3_FEATURE_{,RO,IN}COMPAT_NEXT_WORD flag to continue it on.

Note that I'm not against this in practise, but I wouldn't hold up any
feature for this reason. How long has large i_blocks been pending,
and usecond timestamps? Many years already, even though they are trivial
to implement, so I'm hesitant to tie them together and delay further.

I think i_blocks can be considered an ROCOMPAT feature, and the large
inode reservation for usecond timestamps could be COMPAT I think (since
an unsupporting kernel would still update all the timestamps consistently
even if the useconds on disk would be some constant instead of 0).

> > so that patch would be useful in any case... though it needs some
> > cleanup to remove all users of i_frag and i_faddr (which have never
> > ever been used).
>
> One of the things which we need to consider is whether we think we
> will never support tail packing or other forms of fragments, which is
> related to whether we think we will ever support large blocks (i.e.,
> 32k, 64k, and up). If we do, we might want to keep those fields
> around.

I thought the long-term plan for small files was to just store them
in an EA? That way, we can efficiently pack them inside the inode
or up to blocksize (space willing) without any usage of inode fields
(maybe with a flag to indicate that there is such a fragment to avoid
gratuitous EA searching). This would be a net win on performance since
it avoids an IO for the in-inode case at least.

I think testing with reiserfs showed that tail packing was a net loss in
most cases, since basically every benchmark I've ever seen with reiserfs
disables tail packing or suffers. For space constrained systems (if
there ever exists such a thing again ;-) it would probably be better to
go to compressed files.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-03-17 09:35:33

by Laurent Vivier

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Thu 16/03/2006 at 19:35, Andreas Dilger wrote:
[...]
> Laurent, do your 64-bit patches include support for larger i_blocks?

No, I only work on extending the filesystem size. Extending the file
size will be the next step...

Cheers,
Laurent
--
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4



2006-03-18 17:07:34

by Theodore Ts'o

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Thu, Mar 16, 2006 at 03:59:13PM -0700, Andreas Dilger wrote:
> While I agree with that in theory, in practise we never end up doing
> this and it just ends up delaying the acceptance of the trivial patches.
> It may also be a burden later on when some of the features that could
> be e.g. ROCOMPAT are bundled with an INCOMPAT change and we then
> make the filesystem gratuitously INCOMPAT.
>
> In the end, I don't think having a couple of separate flags is any more
> effort than having a single one. As yet we only have about 6 of each 32
> feature bits used, and if we get close to running out we can make an
> EXT3_FEATURE_{,RO,IN}COMPAT_NEXT_WORD flag to continue it on.

The overhead is not running out of feature bit flags. After all, it's
easy to add more if we need to; we just define the MSB as meaning
"check the auxiliary features compat/rocompat/incompat mask", and then
define a new 32-bit extension bitmask in the superblock.
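
For readers following the compat/rocompat/incompat discussion, here is a
minimal sketch of the mount-time decision those three masks imply: unknown
compat bits are harmless, unknown rocompat bits allow only a read-only
mount, and unknown incompat bits refuse the mount. The struct, enum, and
FEATURE_AUX_WORD names are hypothetical, and the MSB convention is only the
idea floated above, not an existing flag.

```c
#include <assert.h>
#include <stdint.h>

enum mount_result { MOUNT_RW, MOUNT_RO_ONLY, MOUNT_REFUSED };

/* The three feature words kept in the superblock. */
struct features {
    uint32_t compat;
    uint32_t ro_compat;
    uint32_t incompat;
};

/* Hypothetical: top bit means "an auxiliary feature word follows". */
#define FEATURE_AUX_WORD 0x80000000u

static int has_aux_word(uint32_t mask)
{
    return (mask & FEATURE_AUX_WORD) != 0;
}

/* Decide how a kernel supporting 'supported' may mount 'on_disk'. */
static enum mount_result check_features(const struct features *on_disk,
                                        const struct features *supported)
{
    assert(on_disk != 0 && supported != 0);
    if (on_disk->incompat & ~supported->incompat)
        return MOUNT_REFUSED;      /* unknown incompat bit: do not touch */
    if (on_disk->ro_compat & ~supported->ro_compat)
        return MOUNT_RO_ONLY;      /* safe to read, not to write */
    return MOUNT_RW;               /* unknown compat bits are ignored */
}
```

This is why the choice of flag class matters to users: the same on-disk
change can be invisible, force read-only, or lock out older kernels.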

What I'm trying to simplify is the overhead of users trying to
understand a tangled mess of features, some compat, some incompat,
etc.

> Note that I'm not against this in practise, but I wouldn't hold up any
> feature for this reason. How long has large i_blocks been pending,
> and usecond timestamps? Many years already, even though they are trivial
> to implement, so I'm hesitant to tie them together and delay further.

i_blocks has been pending because the people who could push it haven't
had the time, and usec timestamps because the trivial way (without
at least an ROCOMPAT flag) does not work.

> I think i_blocks can be considered an ROCOMPAT feature, and the large
> inode reservation for usecond timestamps could be COMPAT I think (since
> an unsupporting kernel would still update all the timestamps consistently
> even if the useconds on disk would be some constant instead of 0).

i_blocks can be ROCOMPAT only if it is acceptable for stat(2) to return
erroneous i_blocks values. I'm not entirely convinced that's a
good thing, and at the very least it would be extremely confusing, but
maybe.

usecond timestamps must be at least ROCOMPAT, because of the
requirement that all newly created inodes must reserve extra space and
guarantee that i_extra_isize must be at least n bytes (where n is the
size of the guaranteed extra inode fields). If you don't do that,
then when the filesystem is mounted once again on a kernel that does
understand usec timestamps, some inodes will have room for the usec
time fields, and other inodes won't (because they have too much of the
space used for EA's), and that will cause serious problems for make(1).

> I think testing with reiserfs showed that tail packing was a net loss in
> most cases, since basically every benchmark I've ever seen with reiserfs
> disables tail packing or suffers. For space constrained systems (if
> there ever exists such a thing again ;-) it would probably be better to
> go to compressed files.

I have to wonder if that's because of the way reiserfs implemented
tail-packing more than anything else. I don't believe fragments hurt
performance on UFS systems quite as much as it does on reiserfs
systems. I'm not worried about this as much for space constrained
systems, but for cases where we find that increasing the blocksize to
8k or even larger (32k? 64k) really helps, but we don't want to pay
the internal fragmentation penalty for small files. There are other
ways to solve the problem, yes, such as by assuming that we can use a
different filesystem for database or video streams separate from the
/, /usr, and/or /var filesystems, for example.

If we are ready to forever forswear wanting to use large block sizes,
then maybe we don't need to worry about fragmentation support (or
maybe the 1.8" petabyte disk drives will show up and be cheap enough
that we just won't care about wasting space on small files). But
I think that's a decision which we need to formally make.

- Ted

2006-03-20 06:37:32

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mar 18, 2006 12:07 -0500, Theodore Ts'o wrote:
> What I'm trying to simplify is the overhead of users trying to
> understand a tangled mess of features, some compat, some incompat,
> etc.

I think the real goal is that 99% of users never really see this
in the first place. Add in support for these features now with the
most permissive COMPAT flag possible, and chances are that most
normal users won't use these features for a few years. Most of
them are for the 1% of systems that are pushing the current filesystem
limits, and the majority of these users are also more sophisticated.
You won't have Joe Average dual-booting their Linux system with a
16TB filesystem.

> > I think i_blocks can be considered an ROCOMPAT feature, and the large
> > inode reservation for usecond timestamps could be COMPAT I think (since
> > an unsupporting kernel would still update all the timestamps consistently
> > even if the useconds on disk would be some constant instead of 0).

NB - my reference for i_blocks was the use of i_frag|i_fsize for use
by files > 2TB, not the recent proposal for i_blocks in fs blocksize.

> i_blocks can be ROCOMPAT only if it is acceptable for stat(2) to return
> erroneous i_blocks values. I'm not entirely convinced that's a
> good thing, and at the very least it would be extremely confusing, but
> maybe.

I think most cases where someone is ro-mounting their filesystem are
when they need emergency access to the filesystem with an older kernel.
Allowing that access IMHO is more important than exact correctness.
I'm also not aware of any tools that depend on i_blocks being correct,
though I suspect e.g. "cp --sparse" will use a discrepancy in
i_size vs (i_blocks << 9) to determine sparseness.
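
The kind of heuristic meant here can be sketched as follows: a file looks
sparse when the bytes implied by its block count (in 512-byte units, as
stat(2) reports them) fall short of its size. This is only an illustration
of the comparison, not the actual logic of cp; looks_sparse() is a
hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if fewer bytes are allocated than the file size claims.
 * blocks_512 is a count of 512-byte units, so shifting left by 9
 * converts it to bytes. */
static int looks_sparse(uint64_t size_bytes, uint64_t blocks_512)
{
    assert(blocks_512 <= (UINT64_MAX >> 9));   /* keep the shift from wrapping */
    return (blocks_512 << 9) < size_bytes;
}
```

A tool relying on this would misjudge files if an older kernel reported
i_blocks in the wrong unit, which is the risk being weighed against
emergency read-only access.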

> usecond timestamps must be at least ROCOMPAT, because of the
> requirement that all newly created inodes must reserve extra space and
> guarantee that i_extra_isize must be at least n bytes (where n is the
> size of the guaranteed extra inode fields). If you don't do that,
> then when the filesystem is mounted once again on a kernel that does
> understand usec timestamps, some inodes will have room for the usec
> time fields, and other inodes won't (because they have too much of the
> space used for EA's), and that will cause serious problems for make(1).

What happens to existing filesystems with large inodes that don't have
enough space for the extra timestamps in the first place? Also, if files
are created while the filesystem is mounted without usecond timestamps
they would get no usecond fields anyways. I agree that there are some
unlikely corner conditions that might be hit (large inode filesystem, on
older kernel without usec support, fills both the in-inode and external
block so much that there isn't 12 bytes left for the usecond timestamps,
and that file happens to depend on the exact accuracy of the timestamp).
IMHO the inconvenience of the ROCOMPAT outweighs the benefits.

We have previously restricted ROCOMPAT and INCOMPAT flags for changes
that would cause corruption or crashes on older kernels. In the end,
I'm not dead-set against making it ROCOMPAT, just trying to maintain
the maximal compatibility possible.

> for cases where we find that increasing the blocksize to
> 8k or even larger (32k? 64k) really helps, but we don't want to pay
> the internal fragmentation penalty for small files. There are other
> ways to solve the problem, yes, such as by assuming that we can use a
> different filesystem for database or video streams separate from the
> /, /usr, and/or /var filesystems, for example.
>
> If we are ready to forever forswear wanting to use large block sizes,
> then maybe we don't need to worry about fragmentation support (or
> maybe the 1.8" petabyte disk drives will show up and be cheap enough
> that we just won't care about wasting space on small files). But
> I think that's a decision which we need to formally make.

I'm not against large block support (in fact I was hoping this would be
standard by now), or fragment support. Rather, I think that nobody
cares enough about it to actually implement it, and given the growth
of disks the demand will never materialize, just like ext2 filesystem
compression missed the window when the benefits outweighed the costs.

If and when large-page/block support makes it into a commodity CPU
(sadly, x86_64 missed the mark) or kernel we can re-evaluate it then.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-03-20 22:39:21

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Hi,

On Sun, 2006-03-19 at 23:36 -0700, Andreas Dilger wrote:

> What happens to existing filesystems with large inodes that don't have
> enough space for the extra timestamps in the first place?

Sadly, they are basically out of luck, unless we change the way that
space in the extended inode is used.

In retrospect, perhaps we goofed. We added that space into the inode,
but there is no guarantee that it can be used on demand for anything
other than xattrs --- precisely because xattrs can grow to use all
available space both in the external xattr block *and* in the inode.

We could have defined things such that you could either use the in-inode
space, OR the external space, for xattrs, but not both. But that would
be a performance compromise at best, for some of the most important
xattrs (like SELinux labels, which are always there and are always
needed) really want to be accelerated in the inode.

We really ought to have reserved *some* space in the extended inode for
non-xattr fields, for compatibility purposes.

But it's probably not too late. I would expect that the vast majority
of filesystems won't have any inodes that have fully-occupied xattr
space. It would be easy enough to define a new flag that indicates that
there is always X amount of space reserved for inode fields, and to set
that in fsck if all inodes on the fs obey that restriction. Then it
just comes down to picking a number X that is likely to satisfy all the
short-term demands for new inode fields.
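
The fsck-side check could be as simple as the sketch below, assuming each
inode's i_extra_isize has already been read into an array. The function name
and the array representation are made up for this example; a real
implementation would walk the inode table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every inode reserves at least min_extra bytes for inode
 * fields, i.e. the new flag could safely be set on this filesystem. */
static int all_inodes_reserve(const uint16_t *extra_isize, size_t n,
                              uint16_t min_extra)
{
    assert(extra_isize != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (extra_isize[i] < min_extra)
            return 0;   /* this inode's xattrs start too early */
    return 1;
}
```

Only if the check passes for every inode would fsck set the flag; otherwise
it would leave the filesystem as it is.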

> Also, if files
> are created while the filesystem is mounted without usecond timestamps
> they would get no usecond fields anyways. I agree that there are some
> unlikely corner conditions that might be hit (large inode filesystem, on
> older kernel without usec support, fills both the in-inode and external
> block so much that there isn't 12 bytes left for the usecond timestamps,
> and that file happens to depend on the exact accuracy of the timestamp).
> IMHO the inconvenience of the ROCOMPAT outweighs the benefits.

That's precisely the corner case that concerns me. The question is, do
we want the filesystem to behave correctly in all cases, or do we take
short-cuts?

I think we're probably early enough in the adoption of large inodes that
we don't have to make that compromise, and we can reserve some space for
guaranteed use by inode fields with a single minimally-invasive compat
change (say, a flag enabling a field in the superblock which defines how
many bytes we can always safely use for extended inode fields.)

--Stephen


2006-03-20 23:48:57

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mar 20, 2006 17:38 -0500, Stephen C. Tweedie wrote:
> On Sun, 2006-03-19 at 23:36 -0700, Andreas Dilger wrote:
> > What happens to existing filesystems with large inodes that don't have
> > enough space for the extra timestamps in the first place?
>
> Sadly, they are basically out of luck, unless we change the way that
> space in the extended inode is used.
>
> We could have defined things such that you could either use the in-inode
> space, OR the external space, for xattrs, but not both. But that would
> be a performance compromise at best, for some of the most important
> xattrs (like SELinux labels, which are always there and are always
> needed) really want to be accelerated in the inode.

The fast EA space is the only reason we implemented this at all. We
also still need the external EA space for the overflow case.

> I would expect that the vast majority of filesystems won't have any
> inodes that have fully-occupied xattr space.

I would agree. The number of affected files is likely infinitesimal,
given that large inodes are not enabled by default (not sure if they
are even documented), Lustre doesn't use more than a single EA on a file
except in the just-released version, and Samba4 doesn't care because it
stores its timestamps in EAs anyway due to the lack of usecond timestamps.

> It would be easy enough to define a new flag that indicates that
> there is always X amount of space reserved for inode fields, and to set
> that in fsck if all inodes on the fs obey that restriction. Then it
> just comes down to picking a number X that is likely to satisfy all the
> short-term demands for new inode fields.

We could change ext3_new_inode() today to reserve, say, 12 or 16 more
bytes for timestamps, even if they are not implemented yet. Having a
field in the superblock (tunable by the admin, conceivably) to reserve
a total of X bytes for i_extra_isize has some appeal though.

At a rough guess I'd want to have timestamps (27 bits for 10s of nanoseconds,
with the high 5 bits left for growing the number of seconds). If we put the
fields in "priority" order, then on those inodes that don't have much space
left we at least get the more important one(s) (primarily mtime, I think).

__u32 i_mtime_extended;
__u32 i_ctime_extended;
__u32 i_atime_extended;
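
One possible encoding of such a 32-bit field, as a sketch only: the low 27
bits hold the sub-second part in units of 10 ns (at most 99,999,999, which
fits below 2^27), and the top 5 bits extend the seconds counter. The macro
and function names are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SUBSEC_BITS 27u
#define SUBSEC_MASK ((1u << SUBSEC_BITS) - 1u)

/* Pack 5 extra bits of seconds and a nanosecond value into one word.
 * Resolution is 10 ns, so the last decimal digit of nsec is dropped. */
static uint32_t pack_time_extended(uint32_t epoch_hi, uint32_t nsec)
{
    assert(nsec < 1000000000u && epoch_hi < 32u);
    return (epoch_hi << SUBSEC_BITS) | (nsec / 10u);
}

/* Recover the nanosecond part, rounded down to a multiple of 10. */
static uint32_t unpack_nsec(uint32_t ext)
{
    return (ext & SUBSEC_MASK) * 10u;
}

/* Recover the high bits of the seconds counter. */
static uint32_t unpack_epoch_hi(uint32_t ext)
{
    return ext >> SUBSEC_BITS;
}
```

An old kernel that leaves the word at zero simply yields zero sub-seconds
and no epoch extension.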

Are there any other needs right now? My thought on the "extra" fields
in the inode is that they would always be on an "as available" basis,
so if e.g. we only had 8 bytes reserved we would get i_mtime_extended,
and i_ctime_extended, and not i_atime_extended. If there is some added
field that is so important that the kernel/filesystem can't live without
it, it would need its own {RO,IN}COMPAT flag anyways.
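
The "as available" rule can be sketched like this, assuming the extra area
holds only the three fields above in that priority order. The struct, the
macro, and the function are hypothetical and ignore any other bytes a real
extra area might carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the extra area, most important field first. */
struct extra_fields {
    uint32_t i_mtime_extended;
    uint32_t i_ctime_extended;
    uint32_t i_atime_extended;
};

/* A field is usable only if it lies wholly inside the reserved bytes. */
#define FIELD_FITS(field, reserved) \
    (offsetof(struct extra_fields, field) + \
     sizeof(((struct extra_fields *)0)->field) <= (size_t)(reserved))

/* How many of the three fields a given reservation makes available. */
static unsigned usable_fields(unsigned reserved)
{
    assert(sizeof(struct extra_fields) == 3 * sizeof(uint32_t));
    return FIELD_FITS(i_atime_extended, reserved) ? 3u
         : FIELD_FITS(i_ctime_extended, reserved) ? 2u
         : FIELD_FITS(i_mtime_extended, reserved) ? 1u : 0u;
}
```

With 8 bytes reserved this yields mtime and ctime but not atime, matching
the case described above.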

I think with the advent of large inodes we can be less worried about
cannibalizing the other "unused" inode fields like i_faddr, i_frag,
i_fsize. Perhaps an i_blocks_high field (even in the face of Takashi's
recently proposed patch we would still want another 16 or 32 bits for larger
files, maybe at the same time as his patch is implemented), a 32-bit
inode checksum, more bits for i_nlinks?

It would also be good to understand what HURD is actually doing with
those other fields (if anything, does it even exist anymore?), since
it is literally holding TB of space unusable on Linux ext3 filesystems
that could better be put to use. There are i_translator, i_mode_high,
and i_author held hostage by HURD, and I certainly have never seen or
heard of any good description of what they do or if Linux would/could
ever use them, or if HURD could live without them.

> > Also, if files
> > are created while the filesystem is mounted without usecond timestamps
> > they would get no usecond fields anyways. I agree that there are some
> > unlikely corner conditions that might be hit (large inode filesystem, on
> > older kernel without usec support, fills both the in-inode and external
> > block so much that there isn't 12 bytes left for the usecond timestamps,
> > and that file happens to depend on the exact accuracy of the timestamp).
> > IMHO the inconvenience of the ROCOMPAT outweighs the benefits.
>
> That's precisely the corner case that concerns me. The question is, do
> we want the filesystem to behave correctly in all cases, or do we take
> short-cuts?
>
> I think we're probably early enough in the adoption of large inodes that
> we don't have to make that compromise, and we can reserve some space for
> guaranteed use by inode fields with a single minimally-invasive compat
> change (say, a flag enabling a field in the superblock which defines how
> many bytes we can always safely use for extended inode fields.)

I'm fully in the "the chance of any real problem is vanishingly small"
camp, even though Lustre is one of the few users of large inodes. The
presence of the COMPAT field would not really be any different than just
changing ext3_new_inode() to make i_extra_isize 16 by default, except to
cause breakage against the older e2fsprogs. I don't think it is a bad
idea to implement kernel support for such a flag, but not actually set
it in the superblock unless done so by tune2fs.

Hmm, another "forward looking" change may be to add some masking of bits
in the inode i_flags word. Ted did this with great success for the
EXT2_INDEX_FL. Would it be prudent to do the same with, say, the top 4
bits of i_flags?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-03-21 04:03:44

by Theodore Ts'o

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mon, Mar 20, 2006 at 05:38:02PM -0500, Stephen C. Tweedie wrote:
> But it's probably not too late. I would expect that the vast majority
> of filesystems won't have any inodes that have fully-occupied xattr
> space. It would be easy enough to define a new flag that indicates that
> there is always X amount of space reserved for inode fields, and to set
> that in fsck if all inodes on the fs obey that restriction. Then it
> just comes down to picking a number X that is likely to satisfy all the
> short-term demands for new inode fields.

Yes, that's what I'm proposing that we do. My original plan was to
use an incompat flag that would guarantee that there would be enough
space for likely short-term new inode fields, but perhaps it doesn't
have to be an incompat flag. At least in theory it could be a compat
flag, and then we release a new e2fsprogs which enforces the guarantee
that at least that much space is reserved in every single inode, and
offers to remove one or more EA's in order to satisfy that guarantee.

There is a chance that someone who has a filesystem with the compat
feature enabled, a kernel that has support for high-resolution
time-stamps, and an old e2fsprogs will get screwed, but only if the EA
space is totally filled up. But maybe that's an acceptable risk, and
the worst that will happen is that make(1) will get confused.

> I think we're probably early enough in the adoption of large inodes that
> we don't have to make that compromise, and we can reserve some space for
> guaranteed use by inode fields with a single minimally-invasive compat
> change (say, a flag enabling a field in the superblock which defines how
> many bytes we can always safely use for extended inode fields.)

Ah, it sounds like you're thinking the same thing I am. OK, that
seems like a reasonable compromise. We are taking a bit of a
shortcut, but it seems reasonable to assume that distros will have
the right version of e2fsprogs if they want to use this feature; if
they don't, users won't be able to enable the new compat flag anyway,
which means the chance of the user noticing a real problem is pretty
low.

- Ted

2006-03-21 17:06:22

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Hi,

On Mon, 2006-03-20 at 16:48 -0700, Andreas Dilger wrote:

> > It would be easy enough to define a new flag that indicates that
> > there is always X amount of space reserved for inode fields, and to set
> > that in fsck if all inodes on the fs obey that restriction. Then it
> > just comes down to picking a number X that is likely to satisfy all the
> > short-term demands for new inode fields.
>
> We could change ext3_new_inode() today to reserve, say, 12 or 16 more
> bytes for timestamps, even if they are not implemented yet. Having a
> field in the superblock (tunable by the admin, conceivably) to reserve
> a total of X bytes for i_extra_isize has some appeal though.

Exactly, because it's more than just the timestamps that we'd like to
grow.

> __u32 i_(m|c|a)time_extended;

> Are there any other needs right now?

Potentially, yes. If we want to go 64-bits, then the extent maps can
take care of indirect blocks, but we would still need:

__le32 i_blocks; /* Blocks count */
__le32 i_file_acl; /* File ACL */
__le32 i_dir_acl; /* Directory ACL */

to get an extra 32 bits. And there's the old favourite,

__le16 i_links_count; /* Links count */

which is a completely unnecessary limit on subdirs, which would be great
to eliminate at the same time.

We're not talking about a huge amount of space, here; I'd hate to
reserve too little space for the next year or so and force people to go
through a full forced fsck more than once to flag just a few more bytes
as available.

> My thought on the "extra" fields
> in the inode is that they would always be on an "as available" basis,
> so if e.g. we only had 8 bytes reserved we would get i_mtime_extended,
> and i_ctime_extended, and not i_atime_extended.

Yes, that's basically what we're already set up for with i_extra_isize.
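
The "as available" rule amounts to a bounds check against i_extra_isize.
A minimal sketch, not the kernel's code: the struct layout, the field
order, and the fits_in_inode() helper are assumptions made for
illustration, with offsets measured from the start of the extra area
that follows the original 128-byte inode.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the area that follows the original 128-byte inode. */
struct extra_area {
    uint16_t i_extra_isize;     /* bytes of this area that hold valid fields */
    uint16_t i_pad1;
    uint32_t i_ctime_extended;  /* illustrative names taken from the thread */
    uint32_t i_mtime_extended;
    uint32_t i_atime_extended;
};

/* A field is usable only if it lies entirely inside the first
 * i_extra_isize bytes of the extra area. */
static bool fits_in_inode(uint16_t i_extra_isize, size_t offset, size_t size)
{
    return offset + size <= (size_t)i_extra_isize;
}

/* Convenience wrapper that derives offset and size from the struct. */
#define FIELD_FITS(isize, field) \
    fits_in_inode((isize), offsetof(struct extra_area, field), \
                  sizeof(((struct extra_area *)0)->field))
```

With i_extra_isize = 12 in this layout (4 bytes of header plus 8 bytes
of fields), FIELD_FITS() holds for the ctime and mtime fields but not
for the atime field, which is the partial case described above.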

The problem is that by the time we find we can't grow the inode fields,
it may be too late. That's especially true with timestamps: ENOSPC is a
bad return code for sys_utimes()! It's perhaps a little more reasonable
to expect to have to deal with ENOSPC when we do a mkdir() or write().
But a per-sb reserved-inode-growth field that fsck can always set, and
that the overwhelming majority of filesystems will be able to satisfy,
simply gets rid of *all* the edge cases by guaranteeing enough space.

> ...a 32-bit
> inode checksum...?

Not something that anyone is using right now, but it's exactly the sort
of thing that a superblock field would be ideal for.

> It would also be good to understand what HURD is actually doing with
> those other fields (if anything, does it even exist anymore?), since
> it is literally holding TB of space unusable on Linux ext3 filesystems
> that could better be put to use. There are i_translator, i_mode_high,
> and i_author held hostage by HURD, and I certainly have never seen or
> heard of any good description of what they do or if Linux would/could
> ever use them, or if HURD could live without them.

If they really are 100% necessary for hurd, it might be that we could
relegate them to an xattr. There's the slight problem of testing,
though; does anyone on ext2-devel actually run hurd, ever?

> I'm fully in the "the chance of any real problem is vanishingly small"
> camp, even though Lustre is one of the few users of large inodes. The
> presence of the COMPAT field would not really be any different than just
> changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> cause breakage against the older e2fsprogs.

Setting i_extra_isize will break older e2fsprogs anyway, won't it?
e2fsck needs to have full knowledge of all fs fields in order to
maintain consistency; if it doesn't know about some of the fields whose
presence is implied by i_extra_isize, then doesn't it have to abort?

So for future-proofing, we do need some distinction between the fields
actually *used* in i_extra_isize, and those simply reserved there. And
that has to be per-inode, if we want to allow easy dynamic migration to
newer fields.

So a per-superblock field guaranteeing that there's at least $N bytes of
usable *potential* i_extra_isize in each inode, and a per-inode
i_extra_isize which shows which fields are *actively* used, gives us
both pieces of information that we need.
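
A rough sketch of how the two values would combine, assuming a
hypothetical superblock value sb_reserved (the guaranteed $N bytes) and
a field identified by the offset of its last byte plus one within the
extra area. None of these names exist in ext3 today.

```c
#include <stdint.h>

enum field_state {
    FIELD_ACTIVE,      /* covered by this inode's i_extra_isize */
    FIELD_RESERVED,    /* unused so far, but the superblock guarantees the space */
    FIELD_UNAVAILABLE  /* beyond the guarantee; the bytes may hold EAs */
};

/* sb_reserved:   per-superblock guaranteed extra bytes (hypothetical field)
 * i_extra_isize: per-inode count of extra bytes actively in use
 * field_end:     offset just past the field, from the start of the extra area */
static enum field_state classify_field(uint16_t sb_reserved,
                                       uint16_t i_extra_isize,
                                       uint16_t field_end)
{
    if (field_end <= i_extra_isize)
        return FIELD_ACTIVE;
    if (field_end <= sb_reserved)
        return FIELD_RESERVED;
    return FIELD_UNAVAILABLE;
}
```

Only the FIELD_UNAVAILABLE case can run into ENOSPC when the inode has
to grow; a FIELD_RESERVED field can be activated by raising
i_extra_isize in that one inode.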

--Stephen


2006-03-21 18:38:48

by Theodore Ts'o

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Tue, Mar 21, 2006 at 12:05:22PM -0500, Stephen C. Tweedie wrote:
> > It would also be good to understand what HURD is actually doing with
> > those other fields (if anything, does it even exist anymore?), since
> > it is literally holding TB of space unusable on Linux ext3 filesystems
> > that could better be put to use. There are i_translator, i_mode_high,
> > and i_author held hostage by HURD, and I certainly have never seen or
> > heard of any good description of what they do or if Linux would/could
> > ever use them, or if HURD could live without them.

Hurd is definitely using the translator field, and I only recently
discovered they are using it to point at a disk block where the name
of the translator program is stored (I'm not 100% sure, but I think
it's a generic, out-of-band, #! sort of functionality). I don't know
about the other fields, but I can find out.

> If they really are 100% necessary for hurd, it might be that we could
> relegate them to an xattr. There's the slight problem of testing,
> though; does anyone on ext2-devel actually run hurd, ever?

Relegating them to an xattr would break compatibility with existing
hurd filesystems. We could take the arrogant "Linux is the only thing
that matters", and just screw them, and the net result will probably
be that Hurd will never implement some of the advanced features we've
been talking about. They might not anyways, though. A real problem
is that as far as I know, the hurd ext2 developers aren't on the
ext2-devel mailing list.

I've cc'ed two people that sent me a request to add some additional
debugfs functionality to support hurd; maybe they can help by telling
us whether or not hurd is using i_mode_high and i_author, and whether
or not hurd has any likelihood of tracking new ext3 features that we
might add in the future.

> > I'm fully in the "the chance of any real problem is vanishingly small"
> > camp, even though Lustre is one of the few users of large inodes. The
> > presence of the COMPAT field would not really be any different than just
> > changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> > cause breakage against the older e2fsprogs.
>
> Setting i_extra_isize will break older e2fsprogs anyway, won't it?
> e2fsck needs to have full knowledge of all fs fields in order to
> maintain consistency; if it doesn't know about some of the fields whose
> presence is implied by i_extra_isize, then doesn't it have to abort?

E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and didn't
check whether or not the EA's in the inode were valid. Starting in
e2fsprogs 1.37, e2fsck understands i_extra_isize and in fact does
validate the EA's in the inode. If we add new i_extra fields, then
currently e2fsprogs will ignore them, and that's OK for things like
the high precision time fields. But if they are fields where e2fsck
does need to know about them, then obviously we would need a COMPAT
feature flag to signal that fact (since e2fsck will refuse to operate
on a filesystem if there is a COMPAT feature that it doesn't
understand.)
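
For reference, the usual ext2/3 feature-flag rules can be sketched as
below. This is a simplified model, not e2fsprogs or kernel source: the
feature_sets struct and both function names are made up for the
example, and the bit values used by callers are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

enum mount_mode { MOUNT_REFUSED, MOUNT_READ_ONLY, MOUNT_READ_WRITE };

/* The three feature words: on disk (fs) or understood by the code (known). */
struct feature_sets {
    uint32_t compat;
    uint32_t ro_compat;
    uint32_t incompat;
};

/* Kernel side: unknown COMPAT bits are ignored, unknown RO_COMPAT bits
 * force a read-only mount, unknown INCOMPAT bits refuse the mount. */
static enum mount_mode kernel_mount_mode(struct feature_sets fs,
                                         struct feature_sets known)
{
    if (fs.incompat & ~known.incompat)
        return MOUNT_REFUSED;
    if (fs.ro_compat & ~known.ro_compat)
        return MOUNT_READ_ONLY;
    return MOUNT_READ_WRITE;
}

/* e2fsck side: any feature bit it does not understand, COMPAT
 * included, stops it from operating on the filesystem. */
static bool e2fsck_will_run(struct feature_sets fs, struct feature_sets known)
{
    return ((fs.compat & ~known.compat) |
            (fs.ro_compat & ~known.ro_compat) |
            (fs.incompat & ~known.incompat)) == 0;
}
```

So a new COMPAT bit leaves old kernels mounting read-write while an old
e2fsck declines to touch the filesystem.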

> So for future-proofing, we do need some distinction between the fields
> actually *used* in i_extra_isize, and those simply reserved there. And
> that has to be per-inode, if we want to allow easy dynamic migration to
> newer fields.
>
> So a per-superblock field guaranteeing that there's at least $N bytes of
> usable *potential* i_extra_isize in each inode, and a per-inode
> i_extra_isize which shows which fields are *actively* used, gives us
> both pieces of information that we need.

The easiest way to do future-proofing is to state that they must be
initialized to zero. That's how we handle unused fields in the
superblock, after all, and it means that it's relatively easy to add
new superblock fields without needing to cause compatibility
problems. If you absolutely, positively need e2fsck to abort if it
doesn't understand a particular field, that's what a COMPAT feature
flag is for. Otherwise, new kernels can simply check to see if the
field is non-zero, and if so, honor it, and old-kernels will simply
ignore the new information. In many cases, that's more than
sufficient.

- Ted

2006-03-21 19:50:17

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Hi,

On Tue, 2006-03-21 at 13:38 -0500, Theodore Ts'o wrote:

> Hurd is definitely using the translator field, and I only recently
> discovered they are using it to point at a disk block where the name
> of the translator program (I'm not 100% sure, but I think it's a
> generic, out-of-band, #! sort of functionality).

..

> > If they really are 100% necessary for hurd, it might be that we could
> > relegate them to an xattr. There's the slight problem of testing,
> > though; does anyone on ext2-devel actually run hurd, ever?
>
> Relegating them to an xattr would break compatibility with existing
> hurd filesystems.

This would be an incompat change, but one that would not be hard to
maintain. The translator stuff looks like the kind of thing that would
_easily_ suit xattrs.

> We could take the arrogant "Linux is the only thing
> that matters"

I'm not proposing breaking any compatibility. The idea was simply that
if we wanted to add new fields to that space in the inode struct, it
would be an incompat change on *all* platforms, not just hurd; and that
on hurd, an extra side-effect of that incompat flag would be that we now
look for translation etc. in an xattr.

Do you know how large the translation data is, btw? If it's typically
just a small string, then we may actually get far better efficiency by
lumping it into the xattr blocks than by keeping it out-of-band.

> E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and didn't
> check whether or not the EA's in the inode were valid. Starting in
> e2fsprogs 1.37, e2fsck understands i_extra_isize and in fact does
> validate the EA's in the inode. If we add new i_extra fields, then
> currently e2fsprogs will ignore them, and that's OK for things like
> the high precision time fields. But if they are fields where e2fsck
> does need to know about them, then obviously we would need a COMPAT
> feature flag to signal that fact (since e2fsck will refuse to operate
> on a filesystem if there is a COMPAT feature that it doesn't
> understand.)

The timestamps are about the only things I can think of that would be
safe to ignore. Everything else --- i_nlinks, i_blocks, checksums,
highwatermarking --- has consistency implications and e2fsck would need
to be aware of it.

> > So for future-proofing, we do need some distinction between the fields
> > actually *used* in i_extra_isize, and those simply reserved there. And
> > that has to be per-inode, if we want to allow easy dynamic migration to
> > newer fields.
...
> The easiest way to do future-proofing is to state that they must be
> initialized to zero.

Hmm, that should work. It certainly works nicely for overflow fields.
It might complicate things like highwatermarking: a simple HWM
implementation would record the amount of the file that is actually
initialised in the HWM field, so "0" would actually be an unusual,
important special case. And "0" would be a potentially valid checksum
if we use CRC32, too. Using the per-sb field for reserved space, and
the in-inode one to determine which fields are actively in use, would
avoid such ambiguous cases.
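
A minimal sketch of that disambiguation, assuming a hypothetical 32-bit
field (HWM or checksum) whose end offset within the extra area is
field_end. The helper name and signature are invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Presence is decided by i_extra_isize, never by the stored value,
 * so a stored 0 remains an ordinary value rather than "unset". */
static bool read_extra_u32(uint16_t i_extra_isize, uint16_t field_end,
                           uint32_t raw, uint32_t *value)
{
    if (field_end > i_extra_isize)
        return false;   /* field not in use: caller applies its own default */
    *value = raw;       /* field in use: 0 means exactly 0 */
    return true;
}
```

With the zero-means-unset convention, both cases would collapse into
the same result; here the return value tells them apart.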

--Stephen


2006-03-21 20:17:07

by Alfred M. Szmidt

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Adding Roland McGrath to the CC.

> > > It would also be good to understand what HURD is actually doing
> > > with those other fields (if anything, does it even exist
> > > anymore?), since it is literally holding TB of space unusable
> > > on Linux ext3 filesystems that could better be put to use.
> > > There are i_translator, i_mode_high, and i_author held hostage
> > > by HURD, and I certainly have never seen or heard of any good
> > > description of what they do or if Linux would/could ever use
> > > them, or if HURD could live without them.
>
> Hurd is definitely using the translator field, and I only recently
> discovered they are using it to point at a disk block where the
> name of the translator program (I'm not 100% sure, but I think it's
> a generic, out-of-band, #! sort of functionality). I don't know
> about the other fields, but I can find out.

Something like that. The author field is akin to gid/uid. I don't
recall the exact usage of i_mode_high, but it has something to do with
translators.

> > If they really are 100% necessary for hurd, it might be that we
> > could relegate them to an xattr. There's the slight problem of
> > testing, though; does anyone on ext2-devel actually run hurd,
> > ever?
>
> Relegating them to an xattr would break compatibility with
> existing hurd filesystems. We could take the arrogant "Linux is
> the only thing that matters", and just screw them, and the net
> result will probably be that Hurd will never implement some of the
> advanced features we've been talking about. They might not
> anyways, though. A real problem is that as far as I know, the hurd
> ext2 developers aren't on the ext2-devel mailing list.
>
> I've cc'ed two people that sent me a request to add some additional
> debugfs functionality to support hurd; maybe they can help by
> telling us whether or not hurd is using i_mode_high and i_author,
> and whether or not hurd has any likelihood of tracking new ext3
> features that we might add in the future or not.

Both i_mode_high and i_author are used in the Hurd. But they are used
if and only if the creator of the file-system is the Hurd; the same
goes for the translator fields.

> > > I'm fully in the "the chance of any real problem is vanishingly
> > > small" camp, even though Lustre is one of the few users of
> > > large inodes. The presence of the COMPAT field would not
> > > really be any different than just changing ext3_new_inode() to
> > > make i_extra_isize 16 by default, except to cause breakage
> > > against the older e2fsprogs.
> >
> > Setting i_extra_isize will break older e2fsprogs anyway, won't
> > it? e2fsck needs to have full knowledge of all fs fields in
> > order to maintain consistency; if it doesn't know about some of
> > the fields whose presence is implied by i_extra_isize, then
> > doesn't it have to abort?
>
> E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and
> didn't check whether or not the EA's in the inode were valid.
> Starting in e2fsprogs 1.37, e2fsck understands i_extra_isize and in
> fact does validate the EA's in the inode. If we add new i_extra
> fields, then currently e2fsprogs will ignore them, and that's OK
> for things like the high precision time fields. But if they are
> fields where e2fsck does need to know about them, then obviously we
> would need a COMPAT feature flag to signal that fact (since e2fsck
> will refuse to operate on a filesystem if there is a COMPAT feature
> that it doesn't understand.)
>
> > So for future-proofing, we do need some distinction between the
> > fields actually *used* in i_extra_isize, and those simply
> > reserved there. And that has to be per-inode, if we want to
> > allow easy dynamic migration to newer fields.
> >
> > So a per-superblock field guaranteeing that there's at least $N
> > bytes of usable *potential* i_extra_isize in each inode, and a
> > per-inode i_extra_isize which shows which fields are *actively*
> > used, gives us both pieces of information that we need.
>
> The easiest way to do future-proofing is to state that they must be
> initialized to zero. That's how we handle unused fields in the
> superblock, after all, and it means that it's relatively easy to
> add new superblock fields without needing to cause compatibility
> problems. If you absolutely, positively need e2fsck to abort if
> it doesn't understand a particular field, that's what a COMPAT
> feature flag is for. Otherwise, new kernels can simply check to
> see if the field is non-zero, and if so, honor it, and old-kernels
> will simply ignore the new information. In many cases, that's more
> than sufficient.

2006-03-21 20:27:58

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mar 21, 2006 12:05 -0500, Stephen C. Tweedie wrote:
> If we want to go 64-bits, then the extent maps can
> take care of indirect blocks, but we would still need:
>
> __le32 i_blocks; /* Blocks count */

This has partially been addressed by Takashi's patch for fs-blocksize
i_blocks, and there was also a patch to use either i_frag|i_fsize or
i_faddr as the high bits of this value. If we think that 2^48 * blocksize
is enough for a file (2^60 bytes for 4kB blocks, 2^64 bytes for 64kB blocks)
then it would be prudent to use (i_frag|i_fsize) as i_blocks_high. AFAIK,
those fields have never, ever been used, and adding such a change along
with Takashi's patch makes a lot of sense.
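
The arithmetic behind those figures, as a small sketch. It assumes
i_blocks is counted in filesystem blocks (as in Takashi's patch) and
that the 16 high bits come from a hypothetical i_blocks_high; the
function names are made up for the example.

```c
#include <stdint.h>

/* Join the existing 32-bit i_blocks with 16 high bits to form a
 * 48-bit block count. */
static uint64_t i_blocks_combine(uint32_t i_blocks_lo, uint16_t i_blocks_high)
{
    return ((uint64_t)i_blocks_high << 32) | i_blocks_lo;
}

/* log2 of the byte range a 48-bit count of blocks can describe, for a
 * block size of (1 << blocksize_bits) bytes. */
static unsigned max_file_size_bits(unsigned blocksize_bits)
{
    return 48u + blocksize_bits;
}
```

For 4kB blocks (blocksize_bits = 12) this gives 2^60 bytes, and for
64kB blocks (blocksize_bits = 16) it gives 2^64 bytes.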

> __le32 i_file_acl; /* File ACL */

This needs another 32 bits for sure. We might conceivably also fix up the
EA code to improve the external-block EA format (e.g. allow pointing at an
extent index block or another inode to allow storing larger EAs).

> __le32 i_dir_acl; /* Directory ACL */

This is i_size_high for regular files, and I propose that it also become
i_size_high for directories as well, because CFS at least is hitting
limits of 2GB directories already (that's around 25M files). It also
doesn't take into account that we need to increase the dirent size to
accommodate larger inode numbers and possibly some other attribute data,
which I propose we flag with the high 5 bits of the d_type field.

> to get an extra 32 bits. And there's the old favourite,
>
> __le16 i_links_count; /* Links count */
>
> which is a completely unnecessary limit on subdirs, which would be great
> to eliminate at the same time.

CFS has a patch that has been working for ages that changes the i_links_count
handling to be the same as reiserfs - namely, if we overflow 65000 links
the directory i_nlinks becomes 1 (disables "find" heuristic to only recurse
into i_nlinks subdirectories), and ext3_dir_empty() is trusted to tell us
when the directory is empty (which it does already, i_nlinks only used
to print out a warning in any case). Even an unpatched e2fsck and kernel
handle this gracefully.
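
Roughly, the link-count handling looks like the sketch below. This is
an illustration of the idea rather than the CFS patch itself: the
helper names are invented, and whether the switch happens at exactly
65000 or just past it is an assumption.

```c
#include <stdint.h>

#define DIR_LINK_MAX 65000u  /* threshold named above; exact boundary assumed */

/* Called when a subdirectory is added. A count of 1 means "no longer
 * tracked", and once there the count stays there. */
static uint32_t dir_inc_nlink(uint32_t nlink)
{
    if (nlink == 1)
        return 1;
    nlink++;
    return nlink >= DIR_LINK_MAX ? 1 : nlink;
}

/* Called when a subdirectory is removed. An untracked count is left
 * alone; emptiness is then decided by scanning the directory. */
static uint32_t dir_dec_nlink(uint32_t nlink)
{
    return nlink == 1 ? 1 : nlink - 1;
}
```

A count of 1 can never occur for a normally tracked directory (which
always has at least "." and its parent's entry), so it is free to serve
as the marker.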

> We're not talking about a huge amount of space, here; I'd hate to
> reserve too little space for the next year or so and force people to go
> through a full forced fsck more than once to flag just a few more bytes
> as available.

At the same time, if we reserve too much space, it hurts EAs fitting
into the large inode space (which is at least CFS's and Samba's primary
requirement for large inodes).

> The problem is that by the time we find we can't grow the inode fields,
> it may be too late. That's especially true with timestamps: ENOSPC is a
> bad return code for sys_utimes()!

I'd rather return success and truncate the timestamp.
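
Something along these lines, as a sketch only; the disk_time struct and
store_timestamp() are hypothetical names, and extra_field_fits stands
for whatever check decides that the nanosecond field is available in
this inode.

```c
#include <stdbool.h>
#include <stdint.h>

struct disk_time {
    uint32_t seconds;
    uint32_t nanoseconds;   /* meaningful only when has_nanoseconds is set */
    bool     has_nanoseconds;
};

/* Always succeeds: when the extended field is unavailable, the
 * sub-second part is dropped instead of reporting ENOSPC. */
static int store_timestamp(bool extra_field_fits, uint32_t sec, uint32_t nsec,
                           struct disk_time *out)
{
    out->seconds = sec;
    out->has_nanoseconds = extra_field_fits;
    out->nanoseconds = extra_field_fits ? nsec : 0;
    return 0;
}
```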

> It's perhaps a little more reasonable
> to expect to have to deal with ENOSPC when we do a mkdir() or write().
> But a per-sb reserved-inode-growth field that fsck can always set, and
> that the overwhelming majority of filesystems will be able to satisfy,
> simply gets rid of *all* the edge cases by guaranteeing enough space.

In the end, I don't think we can ever have "enough" space reserved for
all needs, so the code will have to have a graceful fallback strategy
in any case.

> > I'm fully in the "the chance of any real problem is vanishingly small"
> > camp, even though Lustre is one of the few users of large inodes. The
> > presence of the COMPAT field would not really be any different than just
> > changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> > cause breakage against the older e2fsprogs.
>
> Setting i_extra_isize will break older e2fsprogs anyway, won't it?
> e2fsck needs to have full knowledge of all fs fields in order to
> maintain consistency; if it doesn't know about some of the fields whose
> presence is implied by i_extra_isize, then doesn't it have to abort?

Like Ted said, and I had said when the large inode patch was first proposed,
if there is something added in the large inode space that is absolutely
mandatory, then it can be covered by an appropriate *COMPAT flag. I don't
think that the inode timestamps even warrant that protection.

> So for future-proofing, we do need some distinction between the fields
> actually *used* in i_extra_isize, and those simply reserved there. And
> that has to be per-inode, if we want to allow easy dynamic migration to
> newer fields.

The concept of "reserving" space in i_extra_isize wasn't considered. Instead
the design was that i_extra_isize would be large enough to cover the valid
fields, and if more space is needed, i_extra_isize would be grown to cover
this space, as applicable.

As Ted says, we could just initialize unused fields to zero and depend on
that. It works for the timestamps and e.g. checksum at least, and new
code can be made to live with this also.

> So a per-superblock field guaranteeing that there's at least $N bytes of
> usable *potential* i_extra_isize in each inode, and a per-inode
> i_extra_isize which shows which fields are *actively* used, gives us
> both pieces of information that we need.

I thought of this also, though plain "reservation" fails in two regards:
- if the older kernel doesn't understand "s_extra_isize_min", it may still
  consume that space if the filesystem is mounted there
- if not all fields in i_extra_isize are used (e.g. i_checksum is disabled,
  but a later i_dac is enabled), we still need a way to know if the field
  is in use

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-03-21 20:40:57

by Andreas Dilger

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mar 21, 2006 14:47 -0500, Stephen C. Tweedie wrote:
> On Tue, 2006-03-21 at 13:38 -0500, Theodore Ts'o wrote:
> > Hurd is definitely using the translator field, and I only recently
> > discovered they are using it to point at a disk block where the name
> > of the translator program (I'm not 100% sure, but I think it's a
> > generic, out-of-band, #! sort of functionality).

Argh, sounds fragile in any case.

> I'm not proposing breaking any compatibility. The idea was simply that
> if we wanted to add new fields to that space in the inode struct, it
> would be an incompat change on *all* platforms, not just hurd; and that
> on hurd, an extra side-effect of that incompat flag would be that we now
> look for translation etc. in an xattr.

I would rather propose that we maintain as much compatibility as possible,
given that we don't even know what those extra fields might be, and would
likely need to have yet another compatibility flag on the feature itself.
Remember that large inodes themselves are incompatible with older kernels
(maybe predating 2.6.9) so we don't need to worry about 2.4 kernels at all.

> The timestamps are about the only things I can think of that would be
> safe to ignore. Everything else --- i_nlinks, i_blocks, checksums,
> highwatermarking --- has consistency implications and e2fsck would need
> to be aware of it.

Which would get their own superblock flags if needed.

> Hmm, that should work. It certainly works nicely for overflow fields.
> It might complicate things like highwatermarking: a simple HWM
> implementation would record the amount of the file that is actually
> initialised in the HWM field, so "0" would actually be an unusual,
> important special case.

The HWM feature would fall under an INCOMPAT flag then, and possibly
also set a flag in the inode to indicate validity (similar to my
proposal for the i_blocks change).

> And "0" would be a potentially valid checksum if we use CRC32, too.

Hmm, is that true? I thought that 0 was impossible for CRC32, since
even for a zero-length file the initial value should be 0xffffffff,
though I'm not 100% sure of that.
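
This is easy to check with a small bitwise CRC32. The sketch assumes
the common zlib/IEEE 802.3 convention (reflected polynomial 0xEDB88320,
initial value 0xffffffff, final XOR with 0xffffffff); which convention
an inode checksum would use is not settled here, and crc32_ieee() is a
name chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32, zlib/IEEE 802.3 convention. */
static uint32_t crc32_ieee(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xffffffffu;                 /* initial value */

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial if the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xffffffffu;                   /* final inversion */
}
```

Under that convention the empty input yields 0, because the final
inversion cancels the initial value. Without the final XOR the empty
input yields 0xffffffff instead, which may be the case being
remembered. Either way, 0 cannot safely double as "no checksum" unless
the chosen convention is shown to exclude it.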

> Using the per-sb field for reserved space, and
> the in-inode one to determine which fields are actively in use, would
> avoid such ambiguous cases.

But, doesn't help if i_hwm comes before some other field that is put into
use, so it has to be handled anyways.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-03-21 23:05:19

by Olivier Galibert

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Tue, Mar 21, 2006 at 01:38:22PM -0500, Theodore Ts'o wrote:
> Hurd is definitely using the translator field, and I only recently
> discovered they are using it to point at a disk block where the name
> of the translator program (I'm not 100% sure, but I think it's a
> generic, out-of-band, #! sort of functionality).

Translators on directories are a combo of automount+userland
filesystem, with the addition of having them saved in the mounted-on
filesystem. Rather nice actually. Replacing /etc/fstab with
local-to-the-mountpoint information has some charm. I'm not sure if
translator-on-files actually exist.

Note that in hurd all filesystems are userland. Whether it is a good
thing is left as an exercise to the benchmarker and the deadlock
chaser.

OG.

2006-03-21 23:35:29

by Alfred M. Szmidt

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

> > Hurd is definitely using the translator field, and I only
> > recently discovered they are using it to point at a disk block
> > where the name of the translator program (I'm not 100% sure, but
> > I think it's a generic, out-of-band, #! sort of functionality).
>
> Translators on directories are a combo of automount+userland
> filesystem, with the addition of having them saved in the
> mounted-on filesystem. Rather nice actually. Replacing /etc/fstab
> with local-to-the-mountpoint information has some charm. I'm not
> sure if translator-on-files actually exist.

You can set a translator on a file or a directory, it doesn't matter.
Anything that is accessed through the file-system is a translator.
/dev/null is a translator, symbolic links can be[0] translators,
/dev/hd0s1 (/dev/hda1 in GNU/Linux) is a translator, ...

[0]: They are usually implemented directly in the file-system so you
don't end up spawning a new process for each symlink. But if the
file-system in question doesn't support symlinks you can always use
the symlink translator to get symlinks. This will work for all
file-systems as long as you do not need it to be persistent across
reboots; for that you need passive translator support (which is what
those fields in ext2 are for, among other things).

Happy hacking.

Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Tue, Mar 21, 2006 at 01:38:22PM -0500, Theodore Ts'o wrote:
> On Tue, Mar 21, 2006 at 12:05:22PM -0500, Stephen C. Tweedie wrote:
> > > It would also be good to understand what HURD is actually doing with
> > > those other fields (if anything, does it even exist anymore?), since
> > > it is literally holding TB of space unusable on Linux ext3 filesystems
> > > that could better be put to use. There are i_translator, i_mode_high,
> > > and i_author held hostage by HURD, and I certainly have never seen or
> > > heard of any good description of what they do or if Linux would/could
> > > ever use them, or if HURD could live without them.
>
> Hurd is definitely using the translator field, and I only recently
> discovered they are using it to point at a disk block where the name
> of the translator program (I'm not 100% sure, but I think it's a
> generic, out-of-band, #! sort of functionality). I don't know about
> the other fields, but I can find out.
>
> > If they really are 100% necessary for hurd, it might be that we could
> > relegate them to an xattr. There's the slight problem of testing,
> > though; does anyone on ext2-devel actually run hurd, ever?
>
> Relegating them to an xattr would break compatibility with existing
> hurd filesystems. We could take the arrogant "Linux is the only thing
> that matters", and just screw them, and the net result will probably
> be that Hurd will never implement some of the advanced features we've
> been talking about. They might not anyways, though. A real problem
> is that as far as I know, the hurd ext2 developers aren't on the
> ext2-devel mailing list.
>
> I've cc'ed two people that sent me a request to add some additional
> debugfs functionality to support hurd; maybe they can help by telling
> us whether or not hurd is using i_mode_high and i_author, and whether
> or not hurd has any likelihood of tracking new ext3 features that we
> might add in the future or not.
>

As AMS has pointed out, the filesystem creator must be set to Hurd for
these inode fields to be used. Since ext2 seems to be the most
supported filesystem on Hurd, most of the ext2 fs used have the fs
creator set to Hurd.

Regarding compatibility, there are plans to support xattr in Hurd and
use them for these fields, translator and author. (I can't recall what
i_mode_high is used for.) With respect to that, I'd appreciate if
there is a recommendation to every ext2 implementation (not only
Linux) that supports xattr, to support gnu.translator and gnu.author
(I'll check about the i_mode_high and post about it asap.). There is a
patch by Roland McGrath for Linux that supports those besides the
reserved fields in case the fs creator is Hurd.

> > > I'm fully in the "the chance of any real problem is vanishingly small"
> > > camp, even though Lustre is one of the few users of large inodes. The
> > > presence of the COMPAT field would not really be any different than just
> > > changing ext3_new_inode() to make i_extra_isize 16 by default, except to
> > > cause breakage against the older e2fsprogs.
> >
> > Setting i_extra_isize will break older e2fsprogs anyway, won't it?
> > e2fsck needs to have full knowledge of all fs fields in order to
> > maintain consistency; if it doesn't know about some of the fields whose
> > presence is implied by i_extra_isize, then doesn't it have to abort?
>
> E2fsprogs previous to e2fsprogs 1.37 ignored i_extra_isize and didn't
> check whether or not the EA's in the inode were valid. Starting in
> e2fsprogs 1.37, e2fsck understands i_extra_size and in fact does
> validate the EA's in the inode. If we add new i_extra fields, then
> currently e2fsprogs will ignore them, and that's OK for things like
> the high precision time fields. But if they are fields where e2fsck
> does need to know about them, then obviously we would need a COMPAT
> feature flag to signal that fact (since e2fsck will refuse to operate
> on a filesystem if there is a COMPAT feature that it doesn't
> understand.)
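
The rule quoted above (e2fsck refuses a filesystem that carries a COMPAT
feature bit it does not know) can be sketched as a simple mask test. This
is only an illustration: the bit values and the names FEATURE_COMPAT_KNOWN_A,
FEATURE_COMPAT_KNOWN_B, FSCK_SUPPORTED_COMPAT and fsck_may_operate are
hypothetical and are not taken from e2fsprogs.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: illustrative bit values, not real ext2/3 feature flags. */
#define FEATURE_COMPAT_KNOWN_A 0x0001u
#define FEATURE_COMPAT_KNOWN_B 0x0002u

/* Every COMPAT bit this hypothetical checker understands. */
#define FSCK_SUPPORTED_COMPAT (FEATURE_COMPAT_KNOWN_A | FEATURE_COMPAT_KNOWN_B)

/* Returns true only if no unknown COMPAT bit is set in the superblock. */
static bool fsck_may_operate(uint32_t s_feature_compat)
{
    return (s_feature_compat & ~FSCK_SUPPORTED_COMPAT) == 0;
}
```

New i_extra fields that the checker can safely ignore would need no bit
at all; only fields it must understand would get one.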

Regarding userland tools, it would be wise if they would still support
old format filesystems, including those with fs creator set to
Hurd. That would include supporting the oob block for translator when
counting used/free blocks and other operations like copying a file
using debugfs, for example.

>
[...]
>
> - Ted

Regards,
Thadeu Cascardo.
--



2006-03-26 16:27:48

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mar 25, 2006 11:51 -0300, [email protected] wrote:
> As AMS has pointed out, the filesystem creator must be set to Hurd for
> these inode fields to be used. Since ext2 seems to be the most
> supported filesystem on Hurd, most of the ext2 fs used have the fs
> creator set to Hurd.

So, if fs creator is Linux then HURD doesn't try to use those fields?
That would allow Linux to start using them, and if such a filesystem
is used on HURD then it could store the translator/author/mode_high
in the xattr space. Does it even make sense to add translator/author
to existing files, or only at file creation time?

That would mean that Linux would just need to check the fs creator
field before using any of the HURD-reserved fields.
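
A minimal sketch of that check, assuming a cut-down superblock structure.
The names struct sb_sketch and may_use_hurd_reserved_fields are
hypothetical; the creator codes 0 and 1 are the EXT2_OS_LINUX and
EXT2_OS_HURD values from the ext2 on-disk format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Creator OS codes from the ext2 on-disk format. */
#define EXT2_OS_LINUX 0
#define EXT2_OS_HURD  1

/* Hypothetical, cut-down superblock: only the field needed here. */
struct sb_sketch {
    uint32_t s_creator_os; /* assumed already converted to CPU byte order */
};

/*
 * Linux may reuse the Hurd-reserved inode fields only when the
 * filesystem was not created for the Hurd.
 */
static bool may_use_hurd_reserved_fields(const struct sb_sketch *sb)
{
    return sb->s_creator_os != EXT2_OS_HURD;
}
```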

> Regarding compatibility, there are plans to support xattr in Hurd and
> use them for these fields, translator and author. (I can't recall what
> i_mode_high is used for.) With respect to that, I'd appreciate if
> there is a recommendation to every ext2 implementation (not only
> Linux) that supports xattr, to support gnu.translator and gnu.author
> (I'll check about the i_mode_high and post about it asap.).

Not that we will be in a rush to use these fields, but it would be good
to know what i_mode_high is used for; in case it ever becomes relevant
for Linux, we would want to keep the same meaning as HURD.

> There is a
> patch by Roland McGrath for Linux that supports those besides the
> reserved fields in case the fs creator is Hurd.

I'm not sure what is required for supporting such EAs? I don't think
any kernel would remove existing EAs, even if it doesn't understand
them.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-03-27 19:55:48

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Hi,

On Sat, 2006-03-25 at 11:51 -0300, [email protected] wrote:

> Regarding compatibility, there are plans to support xattr in Hurd and
> use them for these fields, translator and author. (I can't recall what
> i_mode_high is used for.) With respect to that, I'd appreciate if
> there is a recommendation to every ext2 implementation (not only
> Linux) that supports xattr, to support gnu.translator and gnu.author
> (I'll check about the i_mode_high and post about it asap.).

What do you mean by "support", exactly?

There are 3 different bits of xattr design which matter here. There's
the namespace exported to users via the *attr syscalls; there's the
encoding used on disk for those different namespaces; and there's the
exact semantics surrounding interpretation of the xattr contents.

Now, a non-Hurd system is not going to have any use for the gnu.* xattr
semantics, as translator is a Hurd-specific concept. The user "gnu.*"
namespace is easy enough to teach to Linux: to simply reserve that
namespace, without actually implementing any part of it, I think it would be
sufficient simply to claim the name in include/linux/xattr.h.

For ext2/3, though, the key is how to store gnu.* on disk. Right now
the different namespaces that ext* stores on disk are enumerated in

fs/ext[23]/xattr.h

which, for ext2, currently contains:

/* Name indexes */
#define EXT2_XATTR_INDEX_USER 1
#define EXT2_XATTR_INDEX_POSIX_ACL_ACCESS 2
#define EXT2_XATTR_INDEX_POSIX_ACL_DEFAULT 3
#define EXT2_XATTR_INDEX_TRUSTED 4
#define EXT2_XATTR_INDEX_LUSTRE 5
#define EXT2_XATTR_INDEX_SECURITY 6

If you want to reserve a new semantically-significant portion of the
namespace for use in the Hurd by gnu.* xattrs, then you'd need to submit
an authoritative Linux patch to register a new name index on ext2;
reservation of such an xattr namespace index is in effect an on-disk
format decision so needs to be agreed between implementations.

> Regarding userland tools, it would be wise if they would still support
> old format filesystems, including those with fs creator set to
> Hurd. That would include supporting the oob block for translator when
> counting used/free blocks and other operations like copying a file
> using debugfs, for example.

Certainly; I don't think anybody is arguing against that, and I regard
such backwards compatibility as an absolute requirement.

--Stephen


2006-03-27 19:59:34

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Hi,

On Sun, 2006-03-26 at 09:27 -0700, Andreas Dilger wrote:

> I'm not sure what is required for supporting such EAs? I don't think
> any kernel would remove existing EAs, even if it doesn't understand
> them.

Right --- reservation in fs/ext[23]/xattr.h is sufficient, I think, as
all we need is to make sure that the gnu.* on-disk namespace is reserved
against reuse by any new namespaces in the future.

--Stephen


2006-03-27 20:05:27

by Alfred M. Szmidt

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

> Now, a non-Hurd system is not going to have any use for the gnu.*
> xattr semantics, as translator is a Hurd-specific concept.

gnu.* doesn't just concern itself with translators, it can also be
gnu.author (or some such) which is a normal UID, which GNU/Linux can
support without any problems.

2006-03-27 20:36:21

by Alfred M. Szmidt

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

> So, if fs creator is Linux then HURD doesn't try to use those
> fields?

As I recall it (don't have access to the source code here), the
file-system translator will return EOPNOTSUPP if you try and set a
passive translator on a non-Hurd owned file-system. Passive
translators are the only kind of translators which write any kind of
data back to the actual file-system.

For the case of st_author/i_author, when the file is created on a
non-Hurd owned file-system, it will simply return whatever
i_uid/st_uid is.
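
The behaviour recalled above can be sketched as follows. This is not
taken from the Hurd sources: struct node_sketch and both function names
are hypothetical, and only the two rules described in this message are
modelled.

```c
#include <errno.h>
#include <stdint.h>

#define EXT2_OS_HURD 1 /* creator OS code from the ext2 on-disk format */

/* Hypothetical in-memory view of a node and its filesystem's creator. */
struct node_sketch {
    uint32_t creator_os; /* s_creator_os of the containing filesystem */
    uint32_t uid;        /* i_uid */
    uint32_t author;     /* i_author, meaningful only on Hurd-owned fs */
};

/* Setting a passive translator is refused on a non-Hurd filesystem. */
static int set_passive_translator_sketch(const struct node_sketch *n)
{
    if (n->creator_os != EXT2_OS_HURD)
        return EOPNOTSUPP;
    return 0; /* real code would write the translator record here */
}

/* On a non-Hurd filesystem the author is reported as the owner's uid. */
static uint32_t get_author_sketch(const struct node_sketch *n)
{
    return n->creator_os == EXT2_OS_HURD ? n->author : n->uid;
}
```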

> Not that we will be in a rush to use these fields, but it would be
> good to know what i_mode_high is used for; in case it ever becomes
> relevant for Linux, we would want to keep the same meaning as
> HURD.

Once again, as I recall it (a bit better this time), i_mode_high is
used for the actual bits that define if there is a translator (and
what kind) on a node or not.

Cheers.

2006-03-27 20:40:49

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

Hi,

On Mon, 2006-03-27 at 22:05 +0200, Alfred M. Szmidt wrote:
> Now, a non-Hurd system is not going to have any use for the gnu.*
> xattr semantics, as translator is a Hurd-specific concept.
>
> gnu.* doesn't just concern itself with translators, it can also be
> gnu.author (or some such) which is a normal UID, which GNU/Linux can
> support without any problems.

OK, but would it have any active semantics on non-Hurd kernels? How
would the behaviour of ext3 change in the presence of a gnu.author
attribute on a file?

It would certainly be possible to add a generic ext2/3 namespace handler
to allow those fields to be set on, say, Linux hosts; but that would
just be a matter of matching the gnu.* syscall xattr encoding to the
EXT2_XATTR_INDEX_GNU on-disk encoding; it wouldn't actually deal with
any semantic expectations surrounding the use of those fields.
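
As a rough sketch of what such a mapping amounts to: the syscall-level
name is split into a prefix, which selects the on-disk name index, and
a suffix, which is what gets stored. The value 7 for
EXT2_XATTR_INDEX_GNU is an assumption made for illustration, not a
registered index, and struct prefix_map and xattr_name_to_index are
hypothetical names rather than kernel code.

```c
#include <stddef.h>
#include <string.h>

#define EXT2_XATTR_INDEX_USER     1
#define EXT2_XATTR_INDEX_TRUSTED  4
#define EXT2_XATTR_INDEX_SECURITY 6
#define EXT2_XATTR_INDEX_GNU      7 /* ASSUMPTION: placeholder value */

struct prefix_map {
    const char *prefix;
    int index;
};

static const struct prefix_map prefix_table[] = {
    { "user.",     EXT2_XATTR_INDEX_USER },
    { "trusted.",  EXT2_XATTR_INDEX_TRUSTED },
    { "security.", EXT2_XATTR_INDEX_SECURITY },
    { "gnu.",      EXT2_XATTR_INDEX_GNU },
};

/*
 * Returns the on-disk name index for a syscall-level name and points
 * *suffix at the part stored on disk. Returns 0 (no such index) when
 * the prefix is unknown or the suffix is empty.
 */
static int xattr_name_to_index(const char *name, const char **suffix)
{
    size_t count = sizeof(prefix_table) / sizeof(prefix_table[0]);

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(prefix_table[i].prefix);

        if (strncmp(name, prefix_table[i].prefix, len) == 0 &&
            name[len] != '\0') {
            *suffix = name + len;
            return prefix_table[i].index;
        }
    }
    return 0;
}
```

Nothing here interprets the value of the attribute, which is the point
made above: the mapping is pure encoding.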

--Stephen


Subject: Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)

On Mon, Mar 27, 2006 at 02:55:01PM -0500, Stephen C. Tweedie wrote:
> Hi,
>
> On Sat, 2006-03-25 at 11:51 -0300, [email protected] wrote:
>
> > Regarding compatibility, there are plans to support xattr in Hurd and
> > use them for these fields, translator and author. (I can't recall what
> > i_mode_high is used for.) With respect to that, I'd appreciate if
> > there is a recommendation to every ext2 implementation (not only
> > Linux) that supports xattr, to support gnu.translator and gnu.author
> > (I'll check about the i_mode_high and post about it asap.).
>
> What do you mean by "support", exactly?
>
> There are 3 different bits of xattr design which matter here. There's
> the namespace exported to users via the *attr syscalls; there's the
> encoding used on disk for those different namespaces; and there's the
> exact semantics surrounding interpretation of the xattr contents.
>

Listing the attributes of a file should return the "gnu.*"
ones. That's the first meaning of supporting. Storing them on ext2/3
is the second. This one is already implemented for Linux by Roland
McGrath. I don't know, however, if that patch was submitted to the
right people. Is anyone here responsible for that? I can send it to
the list or privately, including the number used to store them. AFAIK,
the Linux code does not blindly list all the attributes, but only
those "supported", as you pointed out below, because they require a
reservation.
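
To illustrate the listing behaviour I mean, here is a small sketch of
a list operation that reports only entries whose name index has a
known prefix. It is not the Linux code: struct entry_sketch,
index_to_prefix and list_xattrs_sketch are hypothetical, and index 7
stands in for an index that has no handler yet.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NAME_MAX_SKETCH 64

/* Hypothetical on-disk entry: a name index plus the stored suffix. */
struct entry_sketch {
    int name_index;
    const char *name;
};

/* Prefix for each index this sketch knows; NULL means "no handler". */
static const char *index_to_prefix(int name_index)
{
    switch (name_index) {
    case 1: return "user.";
    case 4: return "trusted.";
    case 6: return "security.";
    default: return NULL;
    }
}

/* Writes the full names of known entries into names[]; returns the count. */
static size_t list_xattrs_sketch(const struct entry_sketch *entries, size_t n,
                                 char names[][NAME_MAX_SKETCH],
                                 size_t max_names)
{
    size_t listed = 0;

    for (size_t i = 0; i < n && listed < max_names; i++) {
        const char *prefix = index_to_prefix(entries[i].name_index);

        if (prefix == NULL)
            continue; /* entry stays on disk but is not reported */
        if (strlen(prefix) + strlen(entries[i].name) >= NAME_MAX_SKETCH)
            continue; /* would not fit in the output buffer */
        snprintf(names[listed], NAME_MAX_SKETCH, "%s%s",
                 prefix, entries[i].name);
        listed++;
    }
    return listed;
}
```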

> Now, a non-Hurd system is not going to have any use for the gnu.* xattr
> semantics, as translator is a Hurd-specific concept. The user "gnu.*"
> namespace is easy enough to teach to Linux: to simply reserve that
> namespace, without actually implementing any part of it, I think it would be
> sufficient simply to claim the name in include/linux/xattr.h.
>

The semantics may not be supported, if they have no meaning to the
system. But star or cp should be able to keep those attributes if they
are written to do so. Does anyone know if cp can keep the xattr of a
file? Anyway, a patched cp that would keep the xattrs should keep the
"gnu.*" xattrs, and that's all (if both underlying filesystems support
them, which would be true for two ext2/3 filesystems).

> For ext2/3, though, the key is how to store gnu.* on disk. Right now
> the different namespaces that ext* stores on disk are enumerated in
>
> fs/ext[23]/xattr.h
>
> which, for ext2, currently contains:
>
> /* Name indexes */
> #define EXT2_XATTR_INDEX_USER 1
> #define EXT2_XATTR_INDEX_POSIX_ACL_ACCESS 2
> #define EXT2_XATTR_INDEX_POSIX_ACL_DEFAULT 3
> #define EXT2_XATTR_INDEX_TRUSTED 4
> #define EXT2_XATTR_INDEX_LUSTRE 5
> #define EXT2_XATTR_INDEX_SECURITY 6
>
> If you want to reserve a new semantically-significant portion of the
> namespace for use in the Hurd by gnu.* xattrs, then you'd need to submit
> an authoritative Linux patch to register a new name index on ext2;
> reservation of such an xattr namespace index is in effect an on-disk
> format decision so needs to be agreed between implementations.
>

That's just what I meant by saying that I'd like them to be supported
by every implementation of ext2/3 xattr. Sorry if that was not
clear. That would be 7, right? That's what Roland uses in his patch.

[...]
>
> --Stephen
>
>

Regards,
Thadeu Cascardo.
--

