2008-08-25 16:05:26

by Kalpak Shah

[permalink] [raw]
Subject: [PATCH] Large EAs in ext4

Hi,

This is the implementation for large EA support in ext4. Note that this
also helps to have a larger number of EAs since large EAs get written
out to a new inode instead of the EA block.

If value of an attribute is greater than 2048 bytes the value is not
saved in the external EA block, instead it is saved in an inode. The EA
entry saves the inode number in e_value_inum field (earlier this was
e_value_block that was unused). A new EXT4_FEATURE_INCOMPAT_EA_INODE
feature has been added for this.

These inodes are not linked into any directory since a single directory
per filesystem will cause a bottleneck. Instead a "goal" argument has
been added to the ext4_new_inode() function to help a localized
selection of the EA inode. Since ext4_new_inode() only used the dir
argument to choose the group, we use goal to do the same.

This "large_xattr" feature is not enabled automatically and needs to
enabled during mkfs or using tune2fs. I have also attached the e2fsprogs
patch for the same.

Your feedback/review/comments are appreciated.

Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>

Thanks,
Kalpak.


Attachments:
ext4-large-eas.patch (25.42 kB)
e2fsprogs-large-xattrs.patch (14.31 kB)
Download all attachments

2008-08-25 22:47:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] Large EAs in ext4

On Aug 25, 2008 21:30 +0530, Kalpak Shah wrote:
> This is the implementation for large EA support in ext4. Note that this
> also helps to have a larger number of EAs since large EAs get written
> out to a new inode instead of the EA block.
>
> If value of an attribute is greater than 2048 bytes the value is not
> saved in the external EA block, instead it is saved in an inode.

I just realized that this needs to be (blocksize / 2) instead of 2048,
or we will never get the EA inodes in case of 1kB/2kB block filesystem
where we need it the most.

> +struct inode *ext4_xattr_inode_iget(struct inode *parent, int ea_ino, int *err)
^^ extra space here

> + if (ea_inode->i_mtime.tv_sec != parent->i_ino ||

Do you think it makes sense to "#define i_xattr_inode_parent i_mtime.tv_sec"
in case there is a decision to change which field is used? Or do people
think that is more confusing than helpful?

Note to readers that this is a new patch, and Lustre doesn't use it yet,
but we'd like to in the relatively near future so feedback that affects
the on disk format is preferred sooner than later.

> +ext4_xattr_inode_set(handle_t *handle, struct inode *inode, int *ea_ino,
> + const void *value, size_t value_len)
> +{
> + /*
> + * Make sure that enough buffer credits are available else extend the
> + * transaction.
> + */
> + req_buffer_credits = (value_len / inode->i_sb->s_blocksize) + 4;

Can you please explain in the comment what the "+ 4" blocks are?
I suspect this will not be enough if the xattr is large, it should just
use one of the standard "transaction size" helper functions to determine
metadata size.

> static int
> -ext4_xattr_set_entry(struct ext4_xattr_info *i, struct ext4_xattr_search *s)
> +ext4_xattr_set_entry(struct ext4_xattr_info *i, struct ext4_xattr_search *s,
> + handle_t *handle, struct inode *inode)
> {
> + if (s->here->e_value_inum) {
> + ext4_xattr_inode_unlink(inode, s->here->e_value_inum);
> + s->here->e_value_inum = 0;
> + }

The transaction not have enough blocks reserved to do the unlink and
truncate of the old EA inode. It isn't really possible to know this
before having done the xattr header lookup, so it is difficult to compute
the transaction size without doing the lookup first. As an alternative,
it would be possible to only add this inode to the orphan list and do
the iput() after the handle is done, in a separate transaction maybe.

Also, what will happen with this inode? Will it be allocated again
for the ext4_xattr_inode_set() below, or will changing a large EA in
the same transaction cause the freed inode to be busy until after the
transaction commit and a new inode found for the new EA? That would
be sub-optimal, since rapid EA changing will mark a bunch of inodes
busy. Another option is to just overwrite the old inode, trusting
that the journal will keep the update atomic (enough blocks for the
overwrite were reserved at the start).

> +#define EXT4_EA_INODE_FL 0x00200000 /* Inode used for large EA */
> #define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */
>
> #define EXT4_FL_USER_VISIBLE 0x000BDFFF /* User visible flags */

Is there any reason not to make this flag visible?
Did you also verify that it does not clash with the "generic" flags?

> @@ -514,6 +518,42 @@ struct inode *ext4_new_inode(handle_t *h
> + if (goal) {
> + if (ext4_set_bit_atomic(sb_bgl_lock(sbi, group),
> + ino, bitmap_bh->b_data)) {
> + goto continue_allocation;
> + }

This should probably set the goal to the first inode in the current
inode table block, if the goal is not found. That will possibly help
avoid another block read if there is a free inode in the same block
(e.g. if xattr inode is being allocated long after initial inode and
maybe another inode was freed in the same block.

The patch is good enough to go into the unstable part of the patch
queue I think, though it can have a few tweaks still.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-26 00:12:47

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH] Large EAs in ext4


在 2008-08-25一的 16:47 -0600,Andreas Dilger写道:
> On Aug 25, 2008 21:30 +0530, Kalpak Shah wrote:
> > This is the implementation for large EA support in ext4. Note that this
> > also helps to have a larger number of EAs since large EAs get written
> > out to a new inode instead of the EA block.
> >
> > If value of an attribute is greater than 2048 bytes the value is not
> > saved in the external EA block, instead it is saved in an inode.
>
> I just realized that this needs to be (blocksize / 2) instead of 2048,
> or we will never get the EA inodes in case of 1kB/2kB block filesystem
> where we need it the most.
>
> > +struct inode *ext4_xattr_inode_iget(struct inode *parent, int ea_ino, int *err)
> ^^ extra space here
>
> > + if (ea_inode->i_mtime.tv_sec != parent->i_ino ||
>
> Do you think it makes sense to "#define i_xattr_inode_parent i_mtime.tv_sec"
> in case there is a decision to change which field is used? Or do people
> think that is more confusing than helpful?
>
> Note to readers that this is a new patch, and Lustre doesn't use it yet,
> but we'd like to in the relatively near future so feedback that affects
> the on disk format is preferred sooner than later.
>
> > +ext4_xattr_inode_set(handle_t *handle, struct inode *inode, int *ea_ino,
> > + const void *value, size_t value_len)
> > +{
> > + /*
> > + * Make sure that enough buffer credits are available else extend the
> > + * transaction.
> > + */
> > + req_buffer_credits = (value_len / inode->i_sb->s_blocksize) + 4;
>
> Can you please explain in the comment what the "+ 4" blocks are?
> I suspect this will not be enough if the xattr is large, it should just
> use one of the standard "transaction size" helper functions to determine
> metadata size.
>

ext4_meta_trans_blocks() will gives the metadata size, which also
account for block group bitmap and block group descriptor blocks,
superblock, inode block.

also here

> @@ -1059,10 +1352,23 @@ ext4_xattr_set(struct inode *inode, int
> const void *value, size_t value_len, int flags)
> {
> handle_t *handle;
> + int buffer_credits;
> int error, retries = 0;

> + buffer_credits = EXT4_DATA_TRANS_BLOCKS(inode->i_sb);
> + if ((value_len >= EXT4_XATTR_MIN_LARGE_EA_SIZE) &&
> + (EXT4_SB(inode->i_sb)->s_es->s_feature_incompat &
> + EXT4_FEATURE_INCOMPAT_EA_INODE)) {
> + /* For new inode */
> + buffer_credits += > > >EXT4_SINGLEDATA_TRANS_BLOCKS(inode->i_sb) +3;
> +
> + /* For the blocks to be written in the EA inode */
> + buffer_credits += (value_len + inode->i_sb->s_blocksize - 1) /
> + inode->i_sb->s_blocksize;
> + }
> +
> retry:
> - handle = ext4_journal_start(inode, > EXT4_DATA_TRANS_BLOCKS(inode->i_sb));
> + handle = ext4_journal_start(inode, buffer_credits);
> if (IS_ERR(handle)) {
> error = PTR_ERR(handle);
> } else {


the credits calculation could be replaced with something like this:

nrblocks = (value_len + inode->i_sb->s_blocksize - 1) / inode->i_sb->s_blocksize;
buffer_credits = ext4_meta_trans_blocks(inode, nrblocks, 1) + nrblocks;



BTW, I think we should update the xttars credits micro in ext4_jbd2.h,
that is based on 1 xattr block assumption...

/* Extended attribute operations touch at most two data buffers,
* two bitmap buffers, and two group summaries, in addition to the inode
* and the superblock, which are already accounted for. */

#define EXT4_XATTR_TRANS_BLOCKS 6U