2009-11-17 14:05:20

by Pavel Emelyanov

Subject: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

Hi.

We have a proposal to implement a 2-level disk quota on ext3 and ext4.

In two words - the aim is to have directories on ext3/4 partitions
which are limited in their disk usage and number of inodes. Further
the plan is to allow configuring uid and gid quotas within them.

The main usage of this is containers. When two or more of them are
located on one disk their roots will be marked with a unique tree id
and thus the disk consumption of each container will be limited. While
achieving this goal having an id of what tree an inode belongs to is
a key requirement.

So first we would like to ask to reserve a place on ext3 and ext4 inodes
for that ID.

Signed-off-by: Pavel Emelyanov <[email protected]>

---

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 26d3cf8..0fda97c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -471,7 +471,7 @@ struct ext4_inode {
__le16 l_i_file_acl_high;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
- __u32 l_i_reserved2;
+ __u32 l_i_tree_id; /* reserved for 2-level disk quota */
} linux2;
struct {
__le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
@@ -585,7 +585,7 @@ do { \
#define i_gid_low i_gid
#define i_uid_high osd2.linux2.l_i_uid_high
#define i_gid_high osd2.linux2.l_i_gid_high
-#define i_reserved2 osd2.linux2.l_i_reserved2
+#define i_tree_id osd2.linux2.l_i_tree_id

#elif defined(__GNU__)

diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 7499b36..d9f633d 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -320,7 +320,7 @@ struct ext3_inode {
__u16 i_pad1;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
- __u32 l_i_reserved2;
+ __u32 l_i_tree_id; /* reserved for 2-level disk quota */
} linux2;
struct {
__u8 h_i_frag; /* Fragment number */
@@ -351,7 +351,7 @@ struct ext3_inode {
#define i_gid_low i_gid
#define i_uid_high osd2.linux2.l_i_uid_high
#define i_gid_high osd2.linux2.l_i_gid_high
-#define i_reserved2 osd2.linux2.l_i_reserved2
+#define i_tree_id osd2.linux2.l_i_tree_id

#elif defined(__GNU__)



2009-11-17 17:06:05

by Andreas Dilger

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

On 2009-11-17, at 06:04, Pavel Emelyanov wrote:
> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
>
> In two words - the aim is to have directories on ext3/4 partitions
> which are limited in their disk usage and number of inodes. Further
> the plan is to allow configuring uid and gid quotas within them.
>
> The main usage of this is containers. When two or more of them are
> located on one disk their roots will be marked with a unique tree id
> and thus the disk consumption of each container will be limited. While
> achieving this goal having an id of what tree an inode belongs to is
> a key requirement.

How do you handle files with multiple links, if they are located in
different trees? The inode would need to have multiple tree ids.

You can instead just store this data in an xattr (which will normally
be stored in the inode, so no performance impact), and then you are
free to store multiple values per inode.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-11-17 17:12:21

by Jan Kara

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

Hi,

> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
>
> In two words - the aim is to have directories on ext3/4 partitions
> which are limited in their disk usage and number of inodes. Further
> the plan is to allow configuring uid and gid quotas within them.
If I understand it right, this is something like XFS's project quota,
right? Note that such a thing has implications, e.g. you have to forbid
hardlinks between different "quota trees", otherwise it just won't fly...
Also by 2-level, you mean it won't be possible to nest such subtrees?
I.e. have a quota on directories a/, b/, a/b, a/c?

> The main usage of this is containers. When two or more of them are
> located on one disk their roots will be marked with a unique tree id
> and thus the disk consumption of each container will be limited. While
> achieving this goal having an id of what tree an inode belongs to is
> a key requirement.
>
> So first we would like to ask to reserve a place on ext3 and ext4 inodes
> for that ID.
Do you really need to store tree ID on disk? I'd think that it should
be enough to keep some id / pointer in memory and initialize it when we
load inode into memory (from an id / pointer of parent directory). Then
it would be enough to store a fact that some directory is a root of
"quota tree" somewhere - either in extended attributes, as a flag in
the inode, or together with quota data.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-11-17 17:57:01

by Pavel Emelyanov

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

Jan Kara wrote:
> Hi,
>
>> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
>>
>> In two words - the aim is to have directories on ext3/4 partitions
>> which are limited in their disk usage and number of inodes. Further
>> the plan is to allow configuring uid and gid quotas within them.
> If I understand it right, this is something like XFS's project quota,
> right?

Not exactly. XFS tree quota actually replaces gid one. My proposal is
to add the 3rd id.

> Note that such a thing has implications, e.g. you have to forbid
> hardlinks between different "quota trees", otherwise it just won't fly...

Yes, I know. There are other things we'll have to disable as well, but
it is OK to live without them.

> Also by 2-level, you mean it won't be possible to nest such subtrees?

As I see it - nesting can be done on top of it. I mean - once we have
a tree id of an inode and if we say "id A is a sub-id of id B" we're done.
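
One way to read "id A is a sub-id of id B" is as a parent table over tree
ids, where a charge against a tree is also charged against each of its
ancestors. The following is a minimal userspace sketch of that reading, not
part of the proposal: the tables tree_parent, tree_usage and tree_limit and
the function tree_charge() are hypothetical names, and id 0 is assumed to
mean "no enclosing tree".

```c
#define MAX_TREES 8
#define NO_PARENT 0u   /* assumption: id 0 means "no enclosing tree" */

/* Hypothetical per-tree tables: parent id, blocks in use, block limit. */
static unsigned int  tree_parent[MAX_TREES];
static unsigned long tree_usage[MAX_TREES];
static unsigned long tree_limit[MAX_TREES];

/*
 * Charge `blocks` to tree `id` and to every ancestor of it.
 * Returns 0 on success, or -1 if the id is out of range or any tree in
 * the chain would exceed its limit; in that case nothing is charged.
 * The parent table is assumed to be acyclic.
 */
static int tree_charge(unsigned int id, unsigned long blocks)
{
    unsigned int t;

    if (id >= MAX_TREES)
        return -1;

    /* First pass: check every limit along the chain. */
    for (t = id; t != NO_PARENT; t = tree_parent[t])
        if (tree_usage[t] + blocks > tree_limit[t])
            return -1;

    /* Second pass: commit the charge. */
    for (t = id; t != NO_PARENT; t = tree_parent[t])
        tree_usage[t] += blocks;

    return 0;
}
```

With a single id per inode, nesting then costs one walk up the parent
chain per charge, and the inode itself never needs to carry more than
one id.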

As far as containers are concerned - we'll have to map container id to
quota tree id, since changing a container id is fast and simple, but
it's not so for tree id. That said, this tree id is just a way to distinguish
inodes of one sub-tree from the others.

> I.e. have a quota on directories a/, b/, a/b, a/c?
>
>> The main usage of this is containers. When two or more of them are
>> located on one disk their roots will be marked with a unique tree id
>> and thus the disk consumption of each container will be limited. While
>> achieving this goal having an id of what tree an inode belongs to is
>> a key requirement.
>>
>> So first we would like to ask to reserve a place on ext3 and ext4 inodes
>> for that ID.
> Do you really need to store tree ID on disk? I'd think that it should
> be enough to keep some id / pointer in memory and initialize it when we
> load inode into memory (from an id / pointer of parent directory). Then
> it would be enough to store a fact that some directory is a root of
> "quota tree" somewhere - either in extended attributes, as a flag in
> the inode, or together with quota data.

We can't do it inside ext4_nfs_get_inode unfortunately :(
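
The difficulty can be illustrated with a small userspace model. Deriving
the id at load time needs the parent directory, which a path walk provides,
whereas ext4_nfs_get_inode is handed an inode number and generation rather
than a parent. The structures and functions below (struct mem_inode,
struct disk_inode, load_via_parent(), load_by_ino_memory_only(),
load_by_ino_on_disk()) are hypothetical and only model that difference;
they are not ext4 code.

```c
#include <stdint.h>

#define TREE_ID_UNKNOWN UINT32_MAX   /* hypothetical "cannot be derived" marker */

struct disk_inode { uint32_t tree_id; };                    /* id persisted on disk */
struct mem_inode  { unsigned long ino; uint32_t tree_id; }; /* in-memory inode */

/* Path walk: the parent is already in memory, so the child can inherit. */
static void load_via_parent(struct mem_inode *inode, unsigned long ino,
                            const struct mem_inode *parent)
{
    inode->ino = ino;
    inode->tree_id = parent->tree_id;
}

/* Lookup by inode number alone, id kept only in memory: nothing to inherit from. */
static void load_by_ino_memory_only(struct mem_inode *inode, unsigned long ino)
{
    inode->ino = ino;
    inode->tree_id = TREE_ID_UNKNOWN;
}

/* Lookup by inode number alone, id stored in the on-disk inode: always available. */
static void load_by_ino_on_disk(struct mem_inode *inode, unsigned long ino,
                                const struct disk_inode *raw)
{
    inode->ino = ino;
    inode->tree_id = raw->tree_id;
}
```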

> Honza


2009-11-17 18:47:10

by Jan Kara

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

> Jan Kara wrote:
> > Hi,
> >
> >> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
> >>
> >> In two words - the aim is to have directories on ext3/4 partitions
> >> which are limited in their disk usage and number of inodes. Further
> >> the plan is to allow configuring uid and gid quotas within them.
> > If I understand it right, this is something like XFS's project quota,
> > right?
>
> Not exactly. XFS tree quota actually replaces gid one. My proposal is
> to add the 3rd id.
Yeah, OK, but it's quite similar :)

> > Also by 2-level, you mean it won't be possible to nest such subtrees?
>
> As I see it - nesting can be done on top of it. I mean - once we have
> a tree id of an inode and if we say "id A is a sub-id of id B" we're done.
But for implementation, it's kind of important whether there is going
to be just one "tree" limitation for each inode, or an arbitrary number of
them...

> > I.e. have a quota on directories a/, b/, a/b, a/c?
> >
> >> The main usage of this is containers. When two or more of them are
> >> located on one disk their roots will be marked with a unique tree id
> >> and thus the disk consumption of each container will be limited. While
> >> achieving this goal having an id of what tree an inode belongs to is
> >> a key requirement.
> >>
> >> So first we would like to ask to reserve a place on ext3 and ext4 inodes
> >> for that ID.
> > Do you really need to store tree ID on disk? I'd think that it should
> > be enough to keep some id / pointer in memory and initialize it when we
> > load inode into memory (from an id / pointer of parent directory). Then
> > it would be enough to store a fact that some directory is a root of
> > "quota tree" somewhere - either in extended attributes, as a flag in
> > the inode, or together with quota data.
> We can't do it inside ext4_nfs_get_inode unfortunately :(
Right, that's nasty. OK, but as Andreas suggested, extended attributes
are more flexible for this - most notably every fs supporting them would
be able to support your tree quota extension.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-11-17 21:19:10

by Dmitry Monakhov

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

Andreas Dilger <[email protected]> writes:

> On 2009-11-17, at 06:04, Pavel Emelyanov wrote:
>> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
>>
>> In two words - the aim is to have directories on ext3/4 partitions
>> which are limited in their disk usage and number of inodes. Further
>> the plan is to allow configuring uid and gid quotas within them.
>>
>> The main usage of this is containers. When two or more of them are
>> located on one disk their roots will be marked with a unique tree id
>> and thus the disk consumption of each container will be limited. While
>> achieving this goal having an id of what tree an inode belongs to is
>> a key requirement.
>
> How do you handle files with multiple links, if they are located in
> different trees? The inode would need to have multiple tree ids.
The short answer is "no": an inode cannot belong to multiple trees.
Containers have some non-obvious specifics.
Each container is isolated from the others as much as possible.
A container has its own root tree. This tree is exported inside the
CT in one of several possible ways (name-space, virtual-stack-fs, chroot).

So a container's root is an independent tree or several trees.
Usually they are organized as follows: /ct_root/CT_${ID}/${tree_content}
There are many reasons to keep these trees separate from one another:
- inode attr:
If an inode has links in trees A and B, and A's user calls chown() on
this inode, then B's owner will be surprised.
The only way to overcome this is to virtualize inode attributes
(for each tree), which is madness IMHO.
- checkpoint/restore/online-backup:
This is like suspend/resume for a VM, but in this case only the
container's processes are stopped (frozen) for some time. After the CT's
processes are stopped we may back up the CT's tree without freezing
the FS as a whole.
As I already said, there are many ways to accomplish this task, but each
has strong disadvantages:
Virtual block devices (qemu-like): problems with consistency and performance.
ext3/4 + stack-fs (unionfs/vzfs): bad failure resistance. It is
impossible to support a journalled quota file at the stack-fs level.
XFS with proj quota: lack of quota file journalling. XFS itself
(please don't blame me, but I'm really not a huge XFS fan).

So the only way to implement journalled quota for containers is to
implement it at the native fs level.

"Containers directory tree-id" assumptions:
(1) Tree id is embedded inside the inode
(2) Tree id is inherited from the parent dir
(3) An inode cannot belong to different directory trees

The default directory tree (with id == 0) has a special meaning.
A directory which belongs to the default tree may contain roots of
other trees. The default tree is used for subtree manipulation.
->rename restriction:
if (S_ISDIR(old_inode->i_mode)) {
    if ((new_dir->i_tree_id == 0) || /* move to default tree */
        (new_dir->i_tree_id == old_inode->i_tree_id)) /* same tree */
        goto good;
    return -EXDEV;
} else {
    /* If an entry has more than one link then it is a bad idea to allow
       renaming it to a different tree (even the default tree),
       because this results in a violation of rule (3). */
    if ((old_inode->i_nlink > 1) &&
        (new_dir->i_tree_id != old_inode->i_tree_id))
        return -EXDEV;
}
->link restriction: /* Links may belong to only one tree */
if (new_dir->i_tree_id != old_inode->i_tree_id)
    return -EXDEV;
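
The rename and link rules can also be written as two small predicates.
This is a userspace sketch under stated assumptions, not the
proof-of-concept patch: struct tq_inode, tq_may_rename() and tq_may_link()
are hypothetical names, and the struct carries only the three fields the
checks read. As in the pseudocode, a singly-linked file may be renamed
into another tree; re-tagging it with the new tree id is not modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the inode fields used by the checks (not the kernel's struct inode). */
struct tq_inode {
    bool         is_dir;
    unsigned int nlink;
    unsigned int tree_id;   /* 0 is the default tree */
};

/* Returns 0 if the rename may proceed, -EXDEV otherwise. */
static int tq_may_rename(const struct tq_inode *old_inode,
                         const struct tq_inode *new_dir)
{
    if (old_inode->is_dir) {
        /* Directories may move within their tree or into the default tree. */
        if (new_dir->tree_id == 0 ||
            new_dir->tree_id == old_inode->tree_id)
            return 0;
        return -EXDEV;
    }

    /* A multiply-linked file must stay in its tree, or rule (3) is violated. */
    if (old_inode->nlink > 1 && new_dir->tree_id != old_inode->tree_id)
        return -EXDEV;

    return 0;
}

/* Returns 0 if a new hard link may be created in new_dir, -EXDEV otherwise. */
static int tq_may_link(const struct tq_inode *old_inode,
                       const struct tq_inode *new_dir)
{
    return new_dir->tree_id == old_inode->tree_id ? 0 : -EXDEV;
}
```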

>
> You can instead just store this data in an xattr (which will normally
> be stored in the inode, so no performance impact), and then you are
> free to store multiple values per inode.
Yes, an xattr is possible, but struct ext4_xattr_entry is quite big, plus
space for attr_name ..., and we only want 4 bytes.
In fact I've made a proof-of-concept patch that contains everything necessary
for tree quota support. I'll post it if you are interested.

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

2009-11-17 21:19:56

by Dmitry Monakhov

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

Jan Kara <[email protected]> writes:

>> Jan Kara wrote:
>> > Hi,
>> >
>> >> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
>> >>
>> >> In two words - the aim is to have directories on ext3/4 partitions
>> >> which are limited in their disk usage and number of inodes. Further
>> >> the plan is to allow configuring uid and gid quotas within them.
>> > If I understand it right, this is something like XFS's project quota,
>> > right?
>>
>> Not exactly. XFS tree quota actually replaces gid one. My proposal is
>> to add the 3rd id.
> Yeah, OK, but it's quite similar :)
>
>> > Also by 2-level, you mean it won't be possible to nest such subtrees?
>>
>> As I see it - nesting can be done on top of it. I mean - once we have
>> a tree id of an inode and if we say "id A is a sub-id of id B" we're done.
> But for implementation, it's kind of important whether there is going
> to be just one "tree" limitation for each inode, or an arbitrary number of
> them...
>
>> > I.e. have a quota on directories a/, b/, a/b, a/c?
>> >
I've posted the fs assumptions in my reply to Andreas.
>> >> The main usage of this is containers. When two or more of them are
>> >> located on one disk their roots will be marked with a unique tree id
>> >> and thus the disk consumption of each container will be limited. While
>> >> achieving this goal having an id of what tree an inode belongs to is
>> >> a key requirement.
>> >>
>> >> So first we would like to ask to reserve a place on ext3 and ext4 inodes
>> >> for that ID.
>> > Do you really need to store tree ID on disk? I'd think that it should
>> > be enough to keep some id / pointer in memory and initialize it when we
>> > load inode into memory (from an id / pointer of parent directory). Then
>> > it would be enough to store a fact that some directory is a root of
>> > "quota tree" somewhere - either in extended attributes, as a flag in
>> > the inode, or together with quota data.
>> We can't do it inside ext4_nfs_get_inode unfortunately :(
Also we will have problems with orphan list cleanup on unclean umount.
> Right, that's nasty. OK, but as Andreas suggested, extended attributes
> are more flexible for this - most notably every fs supporting them would
> be able to support your tree quota extension.
>
> Honza

2009-11-18 17:43:11

by Dmitry Monakhov

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

Dmitry Monakhov <[email protected]> writes:

> Andreas Dilger <[email protected]> writes:
>
>> On 2009-11-17, at 06:04, Pavel Emelyanov wrote:
>>> We have a proposal to implement a 2-level disk quota on ext3 and ext4.
>>>
>>> In two words - the aim is to have directories on ext3/4 partitions
>>> which are limited in their disk usage and number of inodes. Further
>>> the plan is to allow configuring uid and gid quotas within them.
>>>
>>> The main usage of this is containers. When two or more of them are
>>> located on one disk their roots will be marked with a unique tree id
>>> and thus the disk consumption of each container will be limited. While
>>> achieving this goal having an id of what tree an inode belongs to is
>>> a key requirement.
>>
>> How do you handle files with multiple links, if they are located in
>> different trees? The inode would need to have multiple tree ids.
> The short answer is "no": an inode cannot belong to multiple trees.
> Containers have some non-obvious specifics.
> Each container is isolated from the others as much as possible.
> A container has its own root tree. This tree is exported inside the
> CT in one of several possible ways (name-space, virtual-stack-fs, chroot).
>
> So a container's root is an independent tree or several trees.
> Usually they are organized as follows: /ct_root/CT_${ID}/${tree_content}
> There are many reasons to keep these trees separate from one another:
> - inode attr:
> If an inode has links in trees A and B, and A's user calls chown() on
> this inode, then B's owner will be surprised.
> The only way to overcome this is to virtualize inode attributes
> (for each tree), which is madness IMHO.
> - checkpoint/restore/online-backup:
> This is like suspend/resume for a VM, but in this case only the
> container's processes are stopped (frozen) for some time. After the CT's
> processes are stopped we may back up the CT's tree without freezing
> the FS as a whole.
> As I already said, there are many ways to accomplish this task, but each
> has strong disadvantages:
> Virtual block devices (qemu-like): problems with consistency and performance.
> ext3/4 + stack-fs (unionfs/vzfs): bad failure resistance. It is
> impossible to support a journalled quota file at the stack-fs level.
> XFS with proj quota: lack of quota file journalling. XFS itself
> (please don't blame me, but I'm really not a huge XFS fan).
>
> So the only way to implement journalled quota for containers is to
> implement it at the native fs level.
>
> "Containers directory tree-id" assumptions:
> (1) Tree id is embedded inside the inode
> (2) Tree id is inherited from the parent dir
> (3) An inode cannot belong to different directory trees
>
> The default directory tree (with id == 0) has a special meaning.
> A directory which belongs to the default tree may contain roots of
> other trees. The default tree is used for subtree manipulation.
>
> ->rename restriction:
> if (S_ISDIR(old_inode->i_mode)) {
>     if ((new_dir->i_tree_id == 0) || /* move to default tree */
>         (new_dir->i_tree_id == old_inode->i_tree_id)) /* same tree */
>         goto good;
>     return -EXDEV;
> } else {
>     /* If an entry has more than one link then it is a bad idea to allow
>        renaming it to a different tree (even the default tree),
>        because this results in a violation of rule (3). */
>     if ((old_inode->i_nlink > 1) &&
>         (new_dir->i_tree_id != old_inode->i_tree_id))
>         return -EXDEV;
> }
> ->link restriction: /* Links may belong to only one tree */
> if (new_dir->i_tree_id != old_inode->i_tree_id)
>     return -EXDEV;
>
>>
>> You can instead just store this data in an xattr (which will normally
>> be stored in the inode, so no performance impact), and then you are
>> free to store multiple values per inode.
> Yes, an xattr is possible, but struct ext4_xattr_entry is quite big, plus
> space for attr_name ..., and we only want 4 bytes.
On the other hand, it may be too expensive to reserve the last 4
bytes in EXT4_GOOD_OLD_INODE. At the same time, storing tree_id as an xattr
results in wasted space. But in fact the new inode has room for space
reservation. We may store it the way it is done for the i_version_hi field:
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -494,6 +494,7 @@ struct ext4_inode {
__le32 i_crtime; /* File Creation time */
__le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
__le32 i_version_hi; /* high 32 bits for 64-bit version */
+ __le32 i_disk_tree_id; /* directory tree quota id */
};

struct move_extent {
@@ -1112,6 +1113,7 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
#define EXT4_FEATURE_INCOMPAT_64BIT 0x0080
#define EXT4_FEATURE_INCOMPAT_MMP 0x0100
#define EXT4_FEATURE_INCOMPAT_FLEX_BG 0x0200
+#define EXT4_FEATURE_INCOMPAT_TREE_ID 0x0400 /* directory tree id */

#define EXT4_FEATURE_COMPAT_SUPP EXT2_FEATURE_COMPAT_EXT_ATTR
#define EXT4_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \
@@ -1119,7 +1121,8 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
EXT4_FEATURE_INCOMPAT_META_BG| \
EXT4_FEATURE_INCOMPAT_EXTENTS| \
EXT4_FEATURE_INCOMPAT_64BIT| \
- EXT4_FEATURE_INCOMPAT_FLEX_BG)
+ EXT4_FEATURE_INCOMPAT_FLEX_BG| \
+ EXT4_FEATURE_INCOMPAT_TREE_ID)
#define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1534,6 +1534,15 @@ set_qf_format:
set_opt(sbi->s_mount_opt, I_VERSION);
sb->s_flags |= MS_I_VERSION;
break;
+ case Opt_tree_id:
+ if (!(EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_TREE_ID) &&
+ EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
+ EXT4_FITS_IN_INODE(raw_inode, ei, i_disk_tree_id))) {
+ ext4_msg(sb, KERN_ERR, "tree_id is not supported");
+ return 0;
+ }
+ set_opt(sbi->s_mount_opt, TREE_ID);
+ break;
case Opt_nodelalloc:
clear_opt(sbi->s_mount_opt, DELALLOC);
break;
-=-=-=-
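
Declaring the id as an INCOMPAT feature means that an implementation which
does not list the bit in its supported mask refuses to mount the
filesystem. A minimal sketch of that mask check follows; the
FEATURE_INCOMPAT_* names and incompat_ok() are illustrative stand-ins
rather than kernel code, with the bit values copied from the patch above.

```c
/* Bit values as they appear in the patch; the names are shortened stand-ins. */
#define FEATURE_INCOMPAT_FLEX_BG 0x0200u
#define FEATURE_INCOMPAT_TREE_ID 0x0400u

/*
 * Returns 1 if every incompat bit set on disk is in the supported mask,
 * 0 if at least one unknown bit is set (the mount must then be refused).
 */
static int incompat_ok(unsigned int on_disk, unsigned int supported)
{
    return (on_disk & ~supported) == 0;
}
```

A filesystem with the TREE_ID bit set would therefore mount only where
the supported mask has been extended as in the hunk above.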
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.

2009-11-19 06:34:05

by Andreas Dilger

Subject: Re: [PATCH] A request to reserve a "tree id" field on ext[34] inodes

On 2009-11-18, at 09:43, Dmitry Monakhov wrote:
> On the other hand, it may be too expensive to reserve the last 4
> bytes in EXT4_GOOD_OLD_INODE. At the same time, storing tree_id as an xattr
> results in wasted space.

Since the xattr is stored inside the inode, and you are accessing it
from the kernel, it uses only 24 bytes of space for your tree
ID (20 bytes ext4_xattr_entry, including 3-byte name, 4 bytes data).
It also has virtually no performance overhead because it is kept in
the inode itself.
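
The 24-byte figure can be reproduced with a short calculation. The sketch
below mirrors the fixed 16-byte part of the on-disk xattr entry and rounds
the name and the value up to 4 bytes each; the names struct
xattr_entry_hdr, xattr_entry_len(), xattr_value_len() and xattr_total()
are local to this example. It counts only the entry and its value, as the
message does.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed part of an on-disk xattr entry; the name bytes follow it. */
struct xattr_entry_hdr {
    uint8_t  e_name_len;     /* length of the name */
    uint8_t  e_name_index;   /* attribute name index */
    uint16_t e_value_offs;   /* offset of the value */
    uint32_t e_value_block;  /* block holding the value (unused) */
    uint32_t e_value_size;   /* size of the value */
    uint32_t e_hash;         /* hash of name and value */
};

#define XATTR_PAD   4u
#define XATTR_ROUND (XATTR_PAD - 1u)

/* Header plus name, rounded up to a multiple of 4 bytes. */
static size_t xattr_entry_len(size_t name_len)
{
    return (name_len + XATTR_ROUND + sizeof(struct xattr_entry_hdr))
           & ~(size_t)XATTR_ROUND;
}

/* Value, rounded up to a multiple of 4 bytes. */
static size_t xattr_value_len(size_t value_size)
{
    return (value_size + XATTR_ROUND) & ~(size_t)XATTR_ROUND;
}

/* Total space taken by one attribute: entry plus value. */
static size_t xattr_total(size_t name_len, size_t value_size)
{
    return xattr_entry_len(name_len) + xattr_value_len(value_size);
}
```

For a 3-byte name and a 4-byte value, xattr_entry_len(3) is 20 and
xattr_total(3, 4) is 24, matching the numbers quoted above.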

If you consider that tree_id is not likely to be commonly
used, then it would be "wasting" 4 bytes of space in everyone else's
inodes to reserve this field in the inode (whether in the old inode or
the larger ext4 inode).

> But in fact the new inode has room for space
> reservation. We may store it the way it is done for the i_version_hi field:
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -494,6 +494,7 @@ struct ext4_inode {
> __le32 i_crtime; /* File Creation time */
> __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
> __le32 i_version_hi; /* high 32 bits for 64-bit version */
> + __le32 i_disk_tree_id; /* directory tree quota id */
> };


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.