This is the implementation for large EA support in ext4. Note that this also helps to have a larger number of EAs since large EAs get written out to a new inode instead of the EA block.
If value of an attribute is greater than 1/2 the blocksize, the value is not saved in the external EA block, instead it is saved in an inode. The EA entry saves the inode number in e_value_inum field (earlier this was
e_value_block that was unused). The EA inode has the same generation as the parent inode and the mtime of the EA inode is set to the inode number of the parent. A new EXT4_FEATURE_INCOMPAT_EA_INODE feature has been added for this.
This "large_xattr" feature is not enabled automatically and needs to be enabled by mkfs or using tune2fs.
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>
ext4.h | 7 -
xattr.c | 413 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
xattr.h | 11 +
3 files changed, 386 insertions(+), 45 deletions(-)
Thanks,
Kalpak
Sorry for not reviewing this patch earlier, but looking at the disk
format, I wonder if it's really necessary to allocate an inode for
each EA. Given that we have a fixed inode table, if the user creates
a large number of 2k EA's (on a 4k filesystem) or 512 byte EA's (on a
1k) filesystem, this could easily burn a huge number of inodes,
causing users to run out.
We don't actually *need* to use an inode; what if we make use
e_value_block and e_hash to be a 64-bit block number, and use
e_value_offs (if 0) to indicate whether the 64-bit block number
contains data, or (if 1) contains the root of an extent tree. This
adds a bit of complexity to the hash calculation if we want to support
sharing the EA block that contains pointers to Large EA's, but from
what I can tell the proposed patch doesn't support this anyway (and it
seems highly unlikely that multiple files with large EA's could be
able to be shared anyway).
The upsides of this patch (not needing to use inodes) seems to be
worth the downsides (slightly more complexity, and not being able to
share EA blocks).
What do folks think?
- Ted
On Wed, Nov 26, 2008 at 10:11 AM, Theodore Tso <[email protected]> wrote:
> Sorry for not reviewing this patch earlier, but looking at the disk
> format, I wonder if it's really necessary to allocate an inode for
> each EA. Given that we have a fixed inode table, if the user creates
> a large number of 2k EA's (on a 4k filesystem) or 512 byte EA's (on a
> 1k) filesystem, this could easily burn a huge number of inodes,
> causing users to run out.
>
> We don't actually *need* to use an inode;
One of the reasons we need to use an inode is that orphan EA inodes
can be linked into lost+found. If we just use an extent tree, I am not
sure how e2fsck can find out orphan EAs.
> what if we make use
> e_value_block and e_hash to be a 64-bit block number, and use
> e_value_offs (if 0) to indicate whether the 64-bit block number
> contains data, or (if 1) contains the root of an extent tree. This
> adds a bit of complexity to the hash calculation if we want to support
> sharing the EA block that contains pointers to Large EA's, but from
> what I can tell the proposed patch doesn't support this anyway (and it
> seems highly unlikely that multiple files with large EA's could be
> able to be shared anyway).
We shouldn't worry about sharing EA blocks once we have large EAs as
it will be too inefficient getting all the EA values (from EA inodes
or extent trees) and calculating a hash of all the data and when an EA
block cannot be shared we will need to create copies of all EA
inodes......
> The upsides of this patch (not needing to use inodes) seems to be
> worth the downsides (slightly more complexity, and not being able to
> share EA blocks).
>
> What do folks think?
>
> - Ted
Thanks,
Kalpak
On Wed, Nov 26, 2008 at 11:30:32AM +0530, Kalpak Shah wrote:
> On Wed, Nov 26, 2008 at 10:11 AM, Theodore Tso <[email protected]> wrote:
> > Sorry for not reviewing this patch earlier, but looking at the disk
> > format, I wonder if it's really necessary to allocate an inode for
> > each EA. Given that we have a fixed inode table, if the user creates
> > a large number of 2k EA's (on a 4k filesystem) or 512 byte EA's (on a
> > 1k) filesystem, this could easily burn a huge number of inodes,
> > causing users to run out.
> >
> > We don't actually *need* to use an inode;
>
> One of the reasons we need to use an inode is that orphan EA inodes
> can be linked into lost+found. If we just use an extent tree, I am not
> sure how e2fsck can find out orphan EAs.
It's already the case that if we have an orphaned EA block, we'll lose
it. The question is whether it's important to keep a large EA if it
gets orphaned, especially given that there are already plenty ways
that we can lose EA's (i.e., ftp, tar, NFSv3, etc.). So if someone is
going to store a multi-megabyte EA, and we lose it because the inode
it was attached to gets destroyed, or the inode gets corrupted to the
point where we can't find the root of the EA tree --- the question is
--- will we care? It's similar to losing the high-level node in the
EA tree, we lose all the leaf nodes below it. It can happen, but in
that case the user will just have to restore the entire file from
backup. I'd argue that losing the EA tree would be the same sort of
thing.
I can see the argument on the other side, where if someone attaches a
multi-megabyte EA to a file, that it might be important enough to be
worth saving. OTOH I'm not at all sure we would want to encourage
such a thing!
- Ted
On Nov 26, 2008 01:54 -0500, Theodore Ts'o wrote:
> It's already the case that if we have an orphaned EA block, we'll lose
> it. The question is whether it's important to keep a large EA if it
> gets orphaned, especially given that there are already plenty ways
> that we can lose EA's (i.e., ftp, tar, NFSv3, etc.). So if someone is
> going to store a multi-megabyte EA, and we lose it because the inode
> it was attached to gets destroyed, or the inode gets corrupted to the
> point where we can't find the root of the EA tree --- the question is
> --- will we care?
One benefit I think is that at least the orphaned EA inode can be
cleaned up instead of lingering in the middle of the shared EA tree.
The other issue is that I don't want to give up the e_hash field for
the EA, because that is a useful checksum of the EA contents.
Another benefit of having separate EAs is that it makes it tractable to
modify very large EAs. Otherwise, if there are a number of large
EAs shared in a single tree they would all have to be modified in order
to store a larger value for an EA in the middle of the tree.
To be honest, I don't think that it is worth a great deal of effort to
optimize this corner case. I would rather keep the EA structure simple,
and if running out of inodes is a problem we would be far better off to
have a much more widely useful solution like dynamic inode tables instead
of working around that limitation with complex EA code.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Wed, Nov 26, 2008 at 02:49:29PM -0700, Andreas Dilger wrote:
>
> One benefit I think is that at least the orphaned EA inode can be
> cleaned up instead of lingering in the middle of the shared EA tree.
>
> Another benefit of having separate EAs is that it makes it tractable to
> modify very large EAs. Otherwise, if there are a number of large
> EAs shared in a single tree they would all have to be modified in order
> to store a larger value for an EA in the middle of the tree.
I guess I didn't make myself clear. I was *not* suggesting that we
share EA's in one inode, or in one extent tree. Instead, what I
suggested was that instead of having a pointer to an inode, if the
value of the EA is less than half the blocksize, it is stored in the
EA block. If it is between 50% and 100% of the blocksize, instead of
pointing at inode, we point to a block. If it is greater than a
blocksize, we point at a block containing an EA tree. (Which means
for a large EA the average space overhead is 6k --- 4k for the extent
block, plus 2k for the fragmentation cost).
So this scheme very much uses separate EA's, and does not pack all of
the EA's into a single tree. It is deliberately kept simple precisely
because like you I don't think it's worth it to optimize EA's. On the
other hand, running out of inodes is a big problem, and dynamic inodes
is far more complicated an issue, especially if we don't have 64-bit
inode support in the kernel and in userspace, and we need to worry
about locality issues and how dynamic inodes work with online
resizing.
The tradeoff is that my scheme doesn't burn an inode for each large
EA, but for EA's greater than a blocksize, we chew an extra block's
worth of overhead. Personally, I think it's a worthwhile tradeoff ---
- Ted
On Nov 26, 2008 19:35 -0500, Theodore Ts'o wrote:
> I guess I didn't make myself clear. I was *not* suggesting that we
> share EA's in one inode, or in one extent tree. Instead, what I
> suggested was that instead of having a pointer to an inode, if the
> value of the EA is less than half the blocksize, it is stored in the
> EA block. If it is between 50% and 100% of the blocksize, instead of
> pointing at inode, we point to a block. If it is greater than a
> blocksize, we point at a block containing an EA tree. (Which means
> for a large EA the average space overhead is 6k --- 4k for the extent
> block, plus 2k for the fragmentation cost).
>
> So this scheme very much uses separate EA's, and does not pack all of
> the EA's into a single tree. It is deliberately kept simple precisely
> because like you I don't think it's worth it to optimize EA's. On the
> other hand, running out of inodes is a big problem, and dynamic inodes
> is far more complicated an issue, especially if we don't have 64-bit
> inode support in the kernel and in userspace, and we need to worry
> about locality issues and how dynamic inodes work with online
> resizing.
>
> The tradeoff is that my scheme doesn't burn an inode for each large
> EA, but for EA's greater than a blocksize, we chew an extra block's
> worth of overhead. Personally, I think it's a worthwhile tradeoff ---
The other issue is that if we are pointing to a direct extent tree
instead of a relative block in the inode then all of the normal IO
functions are not usable (ext3_getblk(), etc), and they would have
to be re-implemented.
It might be possible to fake out the extent handling functions and do
this by iterating over the blocks directly by virtue of passing in the
parent inode, but it seems prone to breakage.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Since we need to make sure that inodes are not used very frequently for
storing EAs, the following design was discussed on the ext4 concall:
xattrs of size blocksize/2 < ea_size <= blocksize are stored by
referencing the block number directly from the ext4_xattr_entry (using
some unique combination of bits to encode that this is referencing a
block instead of an inode, and also finding space to store 48-bit block
numbers) and then ea_size > blocksize is referenced directly by an
inode.
During discussion Andreas suggested another idea using which we can
avoid the need to point at blocks from the ext4_xattr_entry:
Use mballoc to try and find up to 64kB of contiguous blocks to store
smaller xattrs. Looking at the ext4_xattr_header it has an h_blocks
field which we can use to indicate the number of blocks in a row that
are allocated for this inode's xattrs.
The ext4_xattr_entry has a 16-bit block offset that can be used to
point anywhere within a 64kB block. This not only allows many more
small xattrs to be stored efficiently, but also mid-sized xattrs (<=
blocksize) can be handled efficiently because the data will be packed
into the single group of blocks. It also avoids the need to reference
block numbers from the ext4_xattr_entry directly, which is ugly.
Comments?
Thanks,
Kalpak
On Wed, 2008-11-26 at 19:35 -0500, Theodore Tso wrote:
> On Wed, Nov 26, 2008 at 02:49:29PM -0700, Andreas Dilger wrote:
> >
> > One benefit I think is that at least the orphaned EA inode can be
> > cleaned up instead of lingering in the middle of the shared EA tree.
> >
> > Another benefit of having separate EAs is that it makes it tractable to
> > modify very large EAs. Otherwise, if there are a number of large
> > EAs shared in a single tree they would all have to be modified in order
> > to store a larger value for an EA in the middle of the tree.
>
> I guess I didn't make myself clear. I was *not* suggesting that we
> share EA's in one inode, or in one extent tree. Instead, what I
> suggested was that instead of having a pointer to an inode, if the
> value of the EA is less than half the blocksize, it is stored in the
> EA block. If it is between 50% and 100% of the blocksize, instead of
> pointing at inode, we point to a block. If it is greater than a
> blocksize, we point at a block containing an EA tree. (Which means
> for a large EA the average space overhead is 6k --- 4k for the extent
> block, plus 2k for the fragmentation cost).
>
> So this scheme very much uses separate EA's, and does not pack all of
> the EA's into a single tree. It is deliberately kept simple precisely
> because like you I don't think it's worth it to optimize EA's. On the
> other hand, running out of inodes is a big problem, and dynamic inodes
> is far more complicated an issue, especially if we don't have 64-bit
> inode support in the kernel and in userspace, and we need to worry
> about locality issues and how dynamic inodes work with online
> resizing.
>
> The tradeoff is that my scheme doesn't burn an inode for each large
> EA, but for EA's greater than a blocksize, we chew an extra block's
> worth of overhead. Personally, I think it's a worthwhile tradeoff ---
>
> - Ted
On Wed, Dec 3, 2008 at 4:08 PM, Kalpak Shah <[email protected]> wrote:
> Since we need to make sure that inodes are not used very frequently for
> storing EAs, the following design was discussed on the ext4 concall:
>
> xattrs of size blocksize/2 < ea_size <= blocksize are stored by
> referencing the block number directly from the ext4_xattr_entry (using
> some unique combination of bits to encode that this is referencing a
> block instead of an inode, and also finding space to store 48-bit block
> numbers) and then ea_size > blocksize is referenced directly by an
> inode.
>
> During discussion Andreas suggested another idea using which we can
> avoid the need to point at blocks from the ext4_xattr_entry:
>
> Use mballoc to try and find up to 64kB of contiguous blocks to store
> smaller xattrs. Looking at the ext4_xattr_header it has an h_blocks
> field which we can use to indicate the number of blocks in a row that
> are allocated for this inode's xattrs.
>
> The ext4_xattr_entry has a 16-bit block offset that can be used to
> point anywhere within a 64kB block. This not only allows many more
> small xattrs to be stored efficiently, but also mid-sized xattrs (<=
> blocksize) can be handled efficiently because the data will be packed
> into the single group of blocks. It also avoids the need to reference
> block numbers from the ext4_xattr_entry directly, which is ugly.
>
> Comments?
Hi Ted,
Did you get a chance to think about this? It would be great if you can
let me know which design is more preferable to you, so I can go ahead
with the implementation. I understand that including this work in ext4
isn't a priority right now, but it would be great if we can register a
feature flag and also what all the flag will include (EA inodes, EA
entries pointing to blocks or larger no of EA blocks).
Thanks,
Kalpak
>
> Thanks,
> Kalpak
>
> On Wed, 2008-11-26 at 19:35 -0500, Theodore Tso wrote:
>> On Wed, Nov 26, 2008 at 02:49:29PM -0700, Andreas Dilger wrote:
>> >
>> > One benefit I think is that at least the orphaned EA inode can be
>> > cleaned up instead of lingering in the middle of the shared EA tree.
>> >
>> > Another benefit of having separate EAs is that it makes it tractable to
>> > modify very large EAs. Otherwise, if there are a number of large
>> > EAs shared in a single tree they would all have to be modified in order
>> > to store a larger value for an EA in the middle of the tree.
>>
>> I guess I didn't make myself clear. I was *not* suggesting that we
>> share EA's in one inode, or in one extent tree. Instead, what I
>> suggested was that instead of having a pointer to an inode, if the
>> value of the EA is less than half the blocksize, it is stored in the
>> EA block. If it is between 50% and 100% of the blocksize, instead of
>> pointing at inode, we point to a block. If it is greater than a
>> blocksize, we point at a block containing an EA tree. (Which means
>> for a large EA the average space overhead is 6k --- 4k for the extent
>> block, plus 2k for the fragmentation cost).
>>
>> So this scheme very much uses separate EA's, and does not pack all of
>> the EA's into a single tree. It is deliberately kept simple precisely
>> because like you I don't think it's worth it to optimize EA's. On the
>> other hand, running out of inodes is a big problem, and dynamic inodes
>> is far more complicated an issue, especially if we don't have 64-bit
>> inode support in the kernel and in userspace, and we need to worry
>> about locality issues and how dynamic inodes work with online
>> resizing.
>>
>> The tradeoff is that my scheme doesn't burn an inode for each large
>> EA, but for EA's greater than a blocksize, we chew an extra block's
>> worth of overhead. Personally, I think it's a worthwhile tradeoff ---
>>
>> - Ted
>
>