2011-11-15 14:22:46

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ceph and ext4

On Tue, Nov 15, 2011 at 09:29:28AM +0100, Christian Brunner wrote:
>
> I'd love to have a working filesystem for ceph, but the xattr issue
> with ext4 really scares me. That's why I'm still wrestling with btrfs.

FYI, there are patches to extend ext4 to store xattr's in a separate
inode. They were submitted by the Whamcloud (formerly Oracle, Sun,
Clustrefs -- the Lustre folks).

Unfortunately I haven't had the personal time and review bandwidth to
go over the patches, and they haven't been rebased in a while. I'm
not opposed to them; they just need some time and loving attention
from folks with ext4 expertise to accelerate the process of getting
the patches reviewed, and in mainline.

- Ted


2011-11-15 16:43:25

by Andreas Dilger

[permalink] [raw]
Subject: Re: ceph and ext4

Coincidentally, we have someone working in those patches again. The main obstacle for accepting the previous patch as-is was that Ted wanted to add support for "medium-sized" xattrs that are addressed as a string of blocks, instead of via an inode.

This will allow xattrs up to 64kB in size (in total) to be stored as efficiently as an external xattr block.

Cheers, Andreas

On 2011-11-15, at 6:22, Ted Ts'o <[email protected]> wrote:

> On Tue, Nov 15, 2011 at 09:29:28AM +0100, Christian Brunner wrote:
>>
>> I'd love to have a working filesystem for ceph, but the xattr issue
>> with ext4 really scares me. That's why I'm still wrestling with btrfs.
>
> FYI, there are patches to extend ext4 to store xattr's in a separate
> inode. They were submitted by the Whamcloud (formerly Oracle, Sun,
> Clustrefs -- the Lustre folks).
>
> Unfortunately I haven't had the personal time and review bandwidth to
> go over the patches, and they haven't been rebased in a while. I'm
> not opposed to them; they just need some time and loving attention
> from folks with ext4 expertise to accelerate the process of getting
> the patches reviewed, and in mainline.
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-11-15 18:43:58

by Yehuda Sadeh Weinraub

[permalink] [raw]
Subject: Re: ceph and ext4

On Tue, Nov 15, 2011 at 8:43 AM, Andreas Dilger <[email protected]> wrote:
> Coincidentally, we have someone working in those patches again. The main obstacle for accepting the previous patch as-is was that Ted wanted to add support for "medium-sized" xattrs that are addressed as a string of blocks, instead of via an inode.
>
> This will allow xattrs up to 64kB in size (in total) to be stored as efficiently as an external xattr block.
>

Ted, is having that is a hard requirement? Can this
functionality/optimization be added later so that we can leverage ext4
in the mean time, even if not in the most optimal way?

Thanks,
Yehuda

2011-11-21 12:22:39

by Yu Jian

[permalink] [raw]
Subject: Increase xattr space by allocating contiguous xattr blocks

Hello Ted,

I'm working on the patches for large xattrs support again based on the
previous patch made by Kalpak Shah. There are some small issues in the
previous patch, I'd fix them and rebase the patch on the latest ext4 codes.

Per the thread of
http://kerneltrap.org/mailarchive/linux-ext4/2009/2/9/4931244, there are
two features need to be implemented:
1) store large xattr value (with value_size > blocksize) in an external
xattr inode
2) increase xattr space by allocating a single chunk of up to 64kB (in
size) contiguous xattr blocks

The first one has been implemented in the previous patch, and I'm
working on the second one, which would handle "medium-sized" xattrs more
efficiently.

Now, I've the same question as that in the above thread:
In xattr.{h,c}, all of the macros and functions assume the xattr space
is contiguous with entries growing downwards and values growing upwards
(aligned to the end of the space). Especially, the create, replace,
remove and shift operations of xattrs are all performed inside a
contiguous buffer. This is no problem with in-inode xattr space and
single external xattr block which is associated with one block buffer.
But for multiple xattr blocks, since the data of them would be read into
different block buffers, which are not contiguous, most of the existing
macros and functions need to be changed. Is this way acceptable?

In order to make most of the codes remain as-is, we could allocate a
contiguous large buffer (up to 64kB in size) to handle all of the data.
However, we have to memcpy the data from block buffers to the large
buffer, and after the data are changed, we need memcpy the data back to
block buffers to make the data written into the block device. Is this
way reasonable?

Could you please give me some suggestions on how to solve this issue?

Thanks!
--
Best regards,
Yu Jian

2011-11-21 15:08:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Increase xattr space by allocating contiguous xattr blocks


On Nov 21, 2011, at 7:22 AM, Yu Jian wrote:

> Now, I've the same question as that in the above thread:
> In xattr.{h,c}, all of the macros and functions assume the xattr space is contiguous with entries growing downwards and values growing upwards (aligned to the end of the space). Especially, the create, replace, remove and shift operations of xattrs are all performed inside a contiguous buffer. This is no problem with in-inode xattr space and single external xattr block which is associated with one block buffer. But for multiple xattr blocks, since the data of them would be read into different block buffers, which are not contiguous, most of the existing macros and functions need to be changed. Is this way acceptable?

It depends on how cleanly you can implement it, and if you can create some better xfstests to exercise xattr operations --- especially if deleting xattrs will cause existing xattrs to get moved around for better packing, we need to be absolutely sure that the code is completely reliable and doesn't end up corrupting some xattr other than the one that is being inserted, deleted, or replaced.

Currently, one of the primary xattr tests that is in xfstests (#62) doesn't work on ext4 at all, since it assumes that files are returned by readdir in file creation order for newly created directory. So the lack of test coverage is something that would have to be addressed if we want to do major surgery to the xattr code. I'd suggest creating a new series of test from scratch, since I don't believe test #62 can be easily reworked so that it will work under ext4.

> In order to make most of the codes remain as-is, we could allocate a contiguous large buffer (up to 64kB in size) to handle all of the data. However, we have to memcpy the data from block buffers to the large buffer, and after the data are changed, we need memcpy the data back to block buffers to make the data written into the block device. Is this way reasonable?

This is no doubt the simpler way to go. The downsides of doing this are pretty obvious: overhead to do the copy, the extra memory pressure, the need to do the memory allocation (vmalloc is slow, since it requires messing with page tables; if you need to count on contiguous free pages, you may end up stalling while you wait for the pressure on the mm defrag routines). Also, if there are people who want to do large amounts of xattr operations on PCIe attached flash, the extra overhead of doing the copy will definitely show up on the benchmarks.

The first is approach is no doubt he better one, but it at the same time it would be tricker to implement. If it would be totally up to me, I'd suggest the first approach, but it would need to be approached with care (and a lot of testing).

Regards,

-- Ted


2011-11-22 03:32:30

by Yu Jian

[permalink] [raw]
Subject: Re: Increase xattr space by allocating contiguous xattr blocks

On 11/21/11 11:08 PM, Theodore Tso wrote:
>
> On Nov 21, 2011, at 7:22 AM, Yu Jian wrote:
>
>> Now, I've the same question as that in the above thread:
>> In xattr.{h,c}, all of the macros and functions assume the xattr space is contiguous with entries growing downwards and values growing upwards (aligned to the end of the space). Especially, the create, replace, remove and shift operations of xattrs are all performed inside a contiguous buffer. This is no problem with in-inode xattr space and single external xattr block which is associated with one block buffer. But for multiple xattr blocks, since the data of them would be read into different block buffers, which are not contiguous, most of the existing macros and functions need to be changed. Is this way acceptable?
>
> It depends on how cleanly you can implement it, and if you can create some better xfstests to exercise xattr operations --- especially if deleting xattrs will cause existing xattrs to get moved around for better packing, we need to be absolutely sure that the code is completely reliable and doesn't end up corrupting some xattr other than the one that is being inserted, deleted, or replaced.
>
> Currently, one of the primary xattr tests that is in xfstests (#62) doesn't work on ext4 at all, since it assumes that files are returned by readdir in file creation order for newly created directory. So the lack of test coverage is something that would have to be addressed if we want to do major surgery to the xattr code. I'd suggest creating a new series of test from scratch, since I don't believe test #62 can be easily reworked so that it will work under ext4.
>
>> In order to make most of the codes remain as-is, we could allocate a contiguous large buffer (up to 64kB in size) to handle all of the data. However, we have to memcpy the data from block buffers to the large buffer, and after the data are changed, we need memcpy the data back to block buffers to make the data written into the block device. Is this way reasonable?
>
> This is no doubt the simpler way to go. The downsides of doing this are pretty obvious: overhead to do the copy, the extra memory pressure, the need to do the memory allocation (vmalloc is slow, since it requires messing with page tables; if you need to count on contiguous free pages, you may end up stalling while you wait for the pressure on the mm defrag routines). Also, if there are people who want to do large amounts of xattr operations on PCIe attached flash, the extra overhead of doing the copy will definitely show up on the benchmarks.
>
> The first is approach is no doubt he better one, but it at the same time it would be tricker to implement. If it would be totally up to me, I'd suggest the first approach, but it would need to be approached with care (and a lot of testing).

Thanks Ted for the suggestions. I'd use the first approach.
>
> Regards,
>
> -- Ted
>
>
--
Best regards,
Yu Jian

2011-11-22 04:53:56

by Eric Sandeen

[permalink] [raw]
Subject: Re: Increase xattr space by allocating contiguous xattr blocks

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/21/11 9:08 AM, Theodore Tso wrote:
> Currently, one of the primary xattr tests that is in xfstests (#62)
> doesn't work on ext4 at all, since it assumes that files are returned
> by readdir in file creation order for newly created directory. So
> the lack of test coverage is something that would have to be
> addressed if we want to do major surgery to the xattr code. I'd
> suggest creating a new series of test from scratch, since I don't
> believe test #62 can be easily reworked so that it will work under
> ext4.

Are you sure?

# grep supported 062
_supported_fs generic

Hm, but it doesn't actually work does it. :/

Maybe we could do find | sort | xargs ... but I guess it's explicitly supposed to test the getfattr recursive code.

- -Eric
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJOyyrdAAoJECCuFpLhPd7gjqwP/2ZMeQedGBvsmPLvxlD/oej+
dZ/U8DSV1kKo7A+2Su7LSB04LuuxrQ97AJBsVTfPyqxB8PFiQ+YQMvkqWwO4xEPO
abhr5Y0e9s+L/aOQ40UWQQE+X4LLrHOjakv0b4oXJgor7piuCRMhLiT1lejzml72
Vg6bYNoQ0uDeNZSJyjRscexg14/uJNz052gQ6QDz7jKZpYPbD5EJr0y63i9MadbM
H8do7kvQOzuIg4UtPOlt9ce52pstihsCL4rGrKZNghDFbmqnSN2SLVx74QyphXYc
T1QqxqxYIiuM2QF6wiZP7mXvk7w1OOyBczOxXGWW7WBrlfg+eD2uW93yiCxgpIPD
1Ljbo+dbXRlVGTEb/k9nrkn/KQjU/MqNkfImGVtjZLK3hEOcbWuh9ojEGoSfKnfx
LQ1V6Jqb4VcFxP55Nkrl8ROCp4FahorMWFVRl3hh/lSBZP7h3ogtzE9GCBQfygRP
n0ZPBDOJ2H87L1q5NGnCqkqtwY05dTuVQZD6xdlV8Z/eogOSdVTu4/Icb4k8a6cK
H98X7u+1iVy1meDqtqrvOOOgq0jss3ZvVl7vBToo7OsInhlbDHHXY0fbxBAmfguo
1KP5cMl069KPY7WeydXGjynh2nrYh1F7BGCWp0a93REcg7NTl6E7XONuT4E0ohsO
oPX9piyNcZzgYHDOqfon
=HL4p
-----END PGP SIGNATURE-----

2011-11-23 17:09:42

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH] xfstests: Sort recursive getfattr output in 062

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Test 062 was made "generic" a while back, but it fails on any filesystem
which returns getfattr -R results (aka readdir results) in something
other than inode-order.

With a little awk-fu we can sort the records from getfattr -R so that
the output is the same for xfs as well as ext4, etc.

Also filter out lost+found which extN creates at mkfs time, but
some other filesystems do not.

Signed-off-by: Eric Sandeen <[email protected]>
- ---

diff --git a/062 b/062
index f666e1b..7005c4e 100755
- --- a/062
+++ b/062
@@ -46,6 +46,13 @@ _cleanup()
}
trap "_cleanup; exit \$status" 0 1 2 3 15

+# getfattr -R returns info in readdir order which varies from fs to fs.
+# This sorts the output by filename
+_sort_getfattr_output()
+{
+ awk '{a[FNR]=$0}END{n = asort(a); for(i=1; i <= n; i++) print a[i],"\n"}' RS=''
+}
+
getfattr()
{
$GETFATTR_PROG --absolute-names -dh $@ 2>&1 | _filter_scratch
@@ -67,7 +74,7 @@ _create_test_bed()
mknod $SCRATCH_MNT/dev/c c 0 0
mknod $SCRATCH_MNT/dev/p p
# sanity check
- - find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch
+ find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch | grep -v "lost+found"
}

# real QA test starts here
@@ -160,18 +167,18 @@ _extend_test_bed()
# whack a symlink in the middle, just to be difficult
ln -s $SCRATCH_MNT/here/up $SCRATCH_MNT/descend/and
# dump out our new starting point
- - find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch
+ find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch | grep -v "lost+found"
}

_extend_test_bed

echo
echo "*** directory descent with us following symlinks"
- -getfattr -h -L -R -m '.' -e hex $SCRATCH_MNT
+getfattr -h -L -R -m '.' -e hex $SCRATCH_MNT | _sort_getfattr_output

echo
echo "*** directory descent without following symlinks"
- -getfattr -h -P -R -m '.' -e hex $SCRATCH_MNT
+getfattr -h -P -R -m '.' -e hex $SCRATCH_MNT | _sort_getfattr_output


#
diff --git a/062.out b/062.out
index 699254a..7d05c85 100644
- --- a/062.out
+++ b/062.out
@@ -508,115 +508,115 @@ SCRATCH_MNT/lnk
SCRATCH_MNT/reg

*** directory descent with us following symlinks
- -# file: SCRATCH_MNT/reg
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
+# file: SCRATCH_MNT/descend
+user.1=0x3233
+user.x=0x797a

- -# file: SCRATCH_MNT/dir
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
+# file: SCRATCH_MNT/descend/and/ascend
+trusted.9=0x3837
+trusted.a=0x6263

- -# file: SCRATCH_MNT/lnk
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
+# file: SCRATCH_MNT/descend/down
+user.1=0x3233
+user.x=0x797a
+
+# file: SCRATCH_MNT/descend/down/here
+user.1=0x3233
+user.x=0x797a

# file: SCRATCH_MNT/dev/b
trusted.name=0xbabe
- -trusted.name3=0xdeface
+trusted.name3=0xdeface

# file: SCRATCH_MNT/dev/c
trusted.name=0xbabe
- -trusted.name3=0xdeface
+trusted.name3=0xdeface

# file: SCRATCH_MNT/dev/p
trusted.name=0xbabe
+trusted.name3=0xdeface
+
+# file: SCRATCH_MNT/dir
+trusted.name=0xbabe
trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface

# file: SCRATCH_MNT/here
trusted.9=0x3837
- -trusted.a=0x6263
+trusted.a=0x6263

# file: SCRATCH_MNT/here/up
trusted.9=0x3837
- -trusted.a=0x6263
+trusted.a=0x6263

# file: SCRATCH_MNT/here/up/ascend
trusted.9=0x3837
- -trusted.a=0x6263
+trusted.a=0x6263

+# file: SCRATCH_MNT/lnk
+trusted.name=0xbabe
+trusted.name3=0xdeface
+
+# file: SCRATCH_MNT/reg
+trusted.name=0xbabe
+trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface
+
+
+*** directory descent without following symlinks
# file: SCRATCH_MNT/descend
user.1=0x3233
- -user.x=0x797a
+user.x=0x797a

# file: SCRATCH_MNT/descend/down
user.1=0x3233
- -user.x=0x797a
+user.x=0x797a

# file: SCRATCH_MNT/descend/down/here
user.1=0x3233
- -user.x=0x797a
- -
- -# file: SCRATCH_MNT/descend/and/ascend
- -trusted.9=0x3837
- -trusted.a=0x6263
- -
- -
- -*** directory descent without following symlinks
- -# file: SCRATCH_MNT/reg
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
- -
- -# file: SCRATCH_MNT/dir
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
- -
- -# file: SCRATCH_MNT/lnk
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
+user.x=0x797a

# file: SCRATCH_MNT/dev/b
trusted.name=0xbabe
- -trusted.name3=0xdeface
+trusted.name3=0xdeface

# file: SCRATCH_MNT/dev/c
trusted.name=0xbabe
- -trusted.name3=0xdeface
+trusted.name3=0xdeface

# file: SCRATCH_MNT/dev/p
trusted.name=0xbabe
+trusted.name3=0xdeface
+
+# file: SCRATCH_MNT/dir
+trusted.name=0xbabe
trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface

# file: SCRATCH_MNT/here
trusted.9=0x3837
- -trusted.a=0x6263
+trusted.a=0x6263

# file: SCRATCH_MNT/here/up
trusted.9=0x3837
- -trusted.a=0x6263
+trusted.a=0x6263

# file: SCRATCH_MNT/here/up/ascend
trusted.9=0x3837
- -trusted.a=0x6263
+trusted.a=0x6263

- -# file: SCRATCH_MNT/descend
- -user.1=0x3233
- -user.x=0x797a
- -
- -# file: SCRATCH_MNT/descend/down
- -user.1=0x3233
- -user.x=0x797a
+# file: SCRATCH_MNT/lnk
+trusted.name=0xbabe
+trusted.name3=0xdeface

- -# file: SCRATCH_MNT/descend/down/here
- -user.1=0x3233
- -user.x=0x797a
+# file: SCRATCH_MNT/reg
+trusted.name=0xbabe
+trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface





-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJOzSjQAAoJECCuFpLhPd7gsJIP/j2BKTvft/QjnLxs+gJT/f0S
H7L9r/izHcQ5Z/xk9t4unpOr35EsXOuwMzWN8OrB3fCx0J7gkIj7s7dNFrou8K8w
bkVW6lzy5VUOHPzkqFSNLHSf2YLrrXNfPBLgOQl8A4lkNzY4gjy74uw05bLL4z3y
u+gmjwuOMRTwGe3N9l/q6VR7UybHRyYBE62Ee2SPBa59FFWdicqzQBGVj0OOAdQg
asTekASb8fXv/A4GNMFyNpZtA65ov8puISeGjV/A9Dhrx843qn1IFLA3UfcWEPEf
qF5MDkg0ZuDDzN8YOiZM5S6mY7KAGSiKAHlF9nNZistCMJy4UDWh2lGLCIDEoO8l
oRpbEOQy8tq7v7ppgp4kmh+8aIj0bletjZbgumPiu8KrRNCyXlBpGYGsBIPH+wgn
P0i7EKILkF7gRM6Gs4U7Ek88mLcDSrsRGjI7JWQgOufUoo9noZM4pBlfw+ngx3DC
NlhxisqyWcwDAVV1MBNvoSNGB5iLNCTgb/Ppgb7SdwNJ/heX9EAgKTsTnyENjtYe
ILcuWf0CLYRPG9gun3JIXS/IibJY4Eqi3E9jo+Et5q6eOGTFuD0hjgERULUzH9d2
wwWFq7KPtdJJbWK0mANn2HK0GQH79gskmWsFS8Mxl7JCSMfPOFdbHNSbqdzNKXJS
+OIS7v24/L/pNQNtuqEt
=hW9Q
-----END PGP SIGNATURE-----

2011-11-28 11:03:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] xfstests: Sort recursive getfattr output in 062

On Wed, Nov 23, 2011 at 11:09:37AM -0600, Eric Sandeen wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Test 062 was made "generic" a while back, but it fails on any filesystem
> which returns getfattr -R results (aka readdir results) in something
> other than inode-order.
>
> With a little awk-fu we can sort the records from getfattr -R so that
> the output is the same for xfs as well as ext4, etc.
>
> Also filter out lost+found which extN creates at mkfs time, but
> some other filesystems do not.

Looks fine to me, except that I'd put the sorting helper into
common.attr.


Reviewed-by: Christoph Hellwig <[email protected]>

2011-12-08 22:59:29

by Christian Brunner

[permalink] [raw]
Subject: Re: ceph and ext4

2011/11/15 Andreas Dilger <[email protected]>:
> Coincidentally, we have someone working in those patches again. The main obstacle for accepting the previous patch as-is was that Ted wanted to add support for "medium-sized" xattrs that are addressed as a string of blocks, instead of via an inode.
>
> This will allow xattrs up to 64kB in size (in total) to be stored as efficiently as an external xattr block.
>

Did you make progress with this. I'm still having serious trouble with
btrfs and would like to try these.

Thanks,
Christian

2011-12-09 22:24:06

by Andreas Dilger

[permalink] [raw]
Subject: Re: ceph and ext4

On 2011-12-08, at 3:59 PM, Christian Brunner wrote:
> 2011/11/15 Andreas Dilger <[email protected]>:
>> Coincidentally, we have someone working in those patches again. The main obstacle for accepting the previous patch as-is was that Ted wanted to add support for "medium-sized" xattrs that are addressed as a string of blocks, instead of via an inode.
>
> Did you make progress with this. I'm still having serious trouble with
> btrfs and would like to try these.

The latest patches are available at http://review.whamcloud.com/1708, but are based on the RHEL6.1 2.6.32 kernel. The work to implement "medium-sized" xattrs was more complex than anticipated, and is not finished yet.

The use of external inode xattrs is working, which allows xattr sizes up to 64kB. The 64kB limit is imposed by the VFS and could potentially be increased.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.




2012-01-25 21:38:55

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH V2] xfstests: Sort recursive getfattr output in 062

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Test 062 was made "generic" a while back, but it fails on any filesystem
which returns getfattr -R results (aka readdir results) in something
other than inode-order.

With a little awk-fu we can sort the records from getfattr -R so that
the output is the same for xfs as well as ext4, etc.

Also filter out lost+found which extN creates at mkfs time, but
some other filesystems do not.

Signed-off-by: Eric Sandeen <[email protected]>
- ---

V2: move sorting function into common.attr

diff --git a/062 b/062
index f666e1b..9800e33 100755
- --- a/062
+++ b/062
@@ -67,7 +67,7 @@ _create_test_bed()
mknod $SCRATCH_MNT/dev/c c 0 0
mknod $SCRATCH_MNT/dev/p p
# sanity check
- - find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch
+ find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch | grep -v "lost+found"
}

# real QA test starts here
@@ -160,18 +160,18 @@ _extend_test_bed()
# whack a symlink in the middle, just to be difficult
ln -s $SCRATCH_MNT/here/up $SCRATCH_MNT/descend/and
# dump out our new starting point
- - find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch
+ find $SCRATCH_MNT | LC_COLLATE=POSIX sort | _filter_scratch | grep -v "lost+found"
}

_extend_test_bed

echo
echo "*** directory descent with us following symlinks"
- -getfattr -h -L -R -m '.' -e hex $SCRATCH_MNT
+getfattr -h -L -R -m '.' -e hex $SCRATCH_MNT | _sort_getfattr_output

echo
echo "*** directory descent without following symlinks"
- -getfattr -h -P -R -m '.' -e hex $SCRATCH_MNT
+getfattr -h -P -R -m '.' -e hex $SCRATCH_MNT | _sort_getfattr_output


#
diff --git a/062.out b/062.out
index 699254a..8cc3c65 100644
- --- a/062.out
+++ b/062.out
@@ -508,21 +508,21 @@ SCRATCH_MNT/lnk
SCRATCH_MNT/reg

*** directory descent with us following symlinks
- -# file: SCRATCH_MNT/reg
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
+# file: SCRATCH_MNT/descend
+user.1=0x3233
+user.x=0x797a

- -# file: SCRATCH_MNT/dir
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
+# file: SCRATCH_MNT/descend/and/ascend
+trusted.9=0x3837
+trusted.a=0x6263

- -# file: SCRATCH_MNT/lnk
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
+# file: SCRATCH_MNT/descend/down
+user.1=0x3233
+user.x=0x797a
+
+# file: SCRATCH_MNT/descend/down/here
+user.1=0x3233
+user.x=0x797a

# file: SCRATCH_MNT/dev/b
trusted.name=0xbabe
@@ -536,6 +536,12 @@ trusted.name3=0xdeface
trusted.name=0xbabe
trusted.name3=0xdeface

+# file: SCRATCH_MNT/dir
+trusted.name=0xbabe
+trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface
+
# file: SCRATCH_MNT/here
trusted.9=0x3837
trusted.a=0x6263
@@ -548,6 +554,18 @@ trusted.a=0x6263
trusted.9=0x3837
trusted.a=0x6263

+# file: SCRATCH_MNT/lnk
+trusted.name=0xbabe
+trusted.name3=0xdeface
+
+# file: SCRATCH_MNT/reg
+trusted.name=0xbabe
+trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface
+
+
+*** directory descent without following symlinks
# file: SCRATCH_MNT/descend
user.1=0x3233
user.x=0x797a
@@ -560,28 +578,6 @@ user.x=0x797a
user.1=0x3233
user.x=0x797a

- -# file: SCRATCH_MNT/descend/and/ascend
- -trusted.9=0x3837
- -trusted.a=0x6263
- -
- -
- -*** directory descent without following symlinks
- -# file: SCRATCH_MNT/reg
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
- -
- -# file: SCRATCH_MNT/dir
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -user.name=0xbabe
- -user.name3=0xdeface
- -
- -# file: SCRATCH_MNT/lnk
- -trusted.name=0xbabe
- -trusted.name3=0xdeface
- -
# file: SCRATCH_MNT/dev/b
trusted.name=0xbabe
trusted.name3=0xdeface
@@ -594,6 +590,12 @@ trusted.name3=0xdeface
trusted.name=0xbabe
trusted.name3=0xdeface

+# file: SCRATCH_MNT/dir
+trusted.name=0xbabe
+trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface
+
# file: SCRATCH_MNT/here
trusted.9=0x3837
trusted.a=0x6263
@@ -606,17 +608,15 @@ trusted.a=0x6263
trusted.9=0x3837
trusted.a=0x6263

- -# file: SCRATCH_MNT/descend
- -user.1=0x3233
- -user.x=0x797a
- -
- -# file: SCRATCH_MNT/descend/down
- -user.1=0x3233
- -user.x=0x797a
+# file: SCRATCH_MNT/lnk
+trusted.name=0xbabe
+trusted.name3=0xdeface

- -# file: SCRATCH_MNT/descend/down/here
- -user.1=0x3233
- -user.x=0x797a
+# file: SCRATCH_MNT/reg
+trusted.name=0xbabe
+trusted.name3=0xdeface
+user.name=0xbabe
+user.name3=0xdeface



diff --git a/common.attr b/common.attr
index 3e2ba85..b49d990 100644
- --- a/common.attr
+++ b/common.attr
@@ -176,6 +176,13 @@ _require_attrs()
rm -f $TEST_DIR/syscalltest.out
}

+# getfattr -R returns info in readdir order which varies from fs to fs.
+# This sorts the output by filename
+_sort_getfattr_output()
+{
+ awk '{a[FNR]=$0}END{n = asort(a); for(i=1; i <= n; i++) print a[i]"\n"}' RS=''
+}
+
# set maximum total attr space based on fs type
if [ "$FSTYP" == "xfs" -o "$FSTYP" == "udf" ]; then
MAX_ATTRS=1000


-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJPIHZtAAoJECCuFpLhPd7g2yMP/1VU2RhD8gdfpJIQKGvZjurP
XGmv2tLY/3eDWTwbpm3E5m3EmXIAZpq+ahUYT9C+I8afw0oLv4zggAQ3M6P20JlM
dM/M7imNHYfafIMr7wIw0+Iv0pt/tpDv3cm2kxJDJq+yW88Y+tRG/ifU8uSjREyX
WBSqKjo3mJ0CJ03Chh9xWDK7AV4reYW3cxlCNMMzUrRPho7hWIrZN58GhDO4esqa
x11eFp4qk6I5ApExfpa5PJIGTT7xepCsQsDSaf2HNKzSEg3c5fzV43nHUFl+qRNH
FQ0CPx10VMhxtr3RZI7laN01yHk5qWjMFB9ufldot6Xv/CFwU++zBVH21IwKAgvR
D/tihuxUf8Sllpa0+DuMTuobg2cZuvlyshkWeU4YGAbyFfxirPNVjqEcnfnIMnoA
+C3Sc/8DTd70C8WG9M5ON9sMmGtJr/s93OAZMwiLHGvBC1GZf51BvAI+l7ln3Ug3
jws77r4x1P01da6ciUsESNibCrilCiw3Q2XcYJU/VfbxpFZ0ro0ib3wA4O654vqN
V38zh7BOipn7ukWIn8KRc09d9X/PqsqeFNVaLJA+tMsxH021J8uluG4nTmn3a+jM
t79YFK0JccMk8t0ShCTUjhnhHXwdrVVs8MCCghc9QoYg2/HEQrcDPKft69hM9UTL
OrK3+QKMf3JATAYdgXSg
=w2HJ
-----END PGP SIGNATURE-----

2012-01-27 10:53:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V2] xfstests: Sort recursive getfattr output in 062

On Wed, Jan 25, 2012 at 03:38:54PM -0600, Eric Sandeen wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Test 062 was made "generic" a while back, but it fails on any filesystem
> which returns getfattr -R results (aka readdir results) in something
> other than inode-order.
>
> With a little awk-fu we can sort the records from getfattr -R so that
> the output is the same for xfs as well as ext4, etc.
>
> Also filter out lost+found which extN creates at mkfs time, but
> some other filesystems do not.
>
> Signed-off-by: Eric Sandeen <[email protected]>
> - ---
>
> V2: move sorting function into common.attr

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2012-02-12 04:05:56

by Yehuda Sadeh Weinraub

[permalink] [raw]
Subject: Re: ceph and ext4

On Fri, Dec 9, 2011 at 2:24 PM, Andreas Dilger <[email protected]> wrote:
>
> On 2011-12-08, at 3:59 PM, Christian Brunner wrote:
> > 2011/11/15 Andreas Dilger <[email protected]>:
> >> Coincidentally, we have someone working in those patches again. The main obstacle for accepting the previous patch as-is was that Ted wanted to add support for "medium-sized" xattrs that are addressed as a string of blocks, instead of via an inode.
> >
> > Did you make progress with this. I'm still having serious trouble with
> > btrfs and would like to try these.
>
> The latest patches are available at http://review.whamcloud.com/1708, but are based on the RHEL6.1 2.6.32 kernel. ?The work to implement "medium-sized" xattrs was more complex than anticipated, and is not finished yet.
>
> The use of external inode xattrs is working, which allows xattr sizes up to 64kB. ?The 64kB limit is imposed by the VFS and could potentially be increased.
>

I was able to get those compile on a recent kernel. One issue that I
see is that it will only use a separate inode for the xattr if the
xattr is big enough. However, it may be that we ran out of enough
space to set a smaller xattr, and in that case we would fail setting
it even though we'd be able to set a larger xattr.
Another related issue, is that the number of xattrs that can be used
is still limited (by what can be indexed in a single block?).

I think that the first issue is substantial, as it exposes unexpected
behavior (getting ENOSPC for small sized attrs, while bigger attrs
succeed). For the second issue, it seems that when only using small
xattrs it doesn't change anything from the old behavior.


Yehuda

2012-02-12 17:18:46

by Andreas Dilger

[permalink] [raw]
Subject: Re: ceph and ext4

On 2012-02-11, at 9:05 PM, Yehuda Sadeh Weinraub wrote:
> On Fri, Dec 9, 2011 at 2:24 PM, Andreas Dilger <[email protected]> wrote:
>>
>> On 2011-12-08, at 3:59 PM, Christian Brunner wrote:
>>> 2011/11/15 Andreas Dilger <[email protected]>:
>>>> Coincidentally, we have someone working in those patches again. The main obstacle for accepting the previous patch as-is was that Ted wanted to add support for "medium-sized" xattrs that are addressed as a string of blocks, instead of via an inode.
>>>
>>> Did you make progress with this. I'm still having serious trouble with
>>> btrfs and would like to try these.
>>
>> The latest patches are available at http://review.whamcloud.com/1708, but are based on the RHEL6.1 2.6.32 kernel. The work to implement "medium-sized" xattrs was more complex than anticipated, and is not finished yet.
>>
>> The use of external inode xattrs is working, which allows xattr sizes up to 64kB. The 64kB limit is imposed by the VFS and could potentially be increased.
>
> I was able to get those compile on a recent kernel.

It would be useful if you could send me the updated patch, if there were
any significant changes.

> One issue that I
> see is that it will only use a separate inode for the xattr if the
> xattr is big enough. However, it may be that we ran out of enough
> space to set a smaller xattr, and in that case we would fail setting
> it even though we'd be able to set a larger xattr.
> Another related issue, is that the number of xattrs that can be used
> is still limited (by what can be indexed in a single block?).

That is true with the current patch. When the "medium sized" xattrs are
finished, then there will be 64kB of space for xattrs in the external
block(s), which could be both xattr headers and data.

> I think that the first issue is substantial, as it exposes unexpected
> behavior (getting ENOSPC for small sized attrs, while bigger attrs
> succeed). For the second issue, it seems that when only using small
> xattrs it doesn't change anything from the old behavior.



Cheers, Andreas
--
Andreas Dilger Whamcloud, Inc.
Principal Lustre Engineer http://www.whamcloud.com/





2012-02-12 19:50:23

by Yehuda Sadeh Weinraub

[permalink] [raw]
Subject: Re: ceph and ext4

On Sun, Feb 12, 2012 at 9:18 AM, Andreas Dilger <[email protected]> wrote:
> On 2012-02-11, at 9:05 PM, Yehuda Sadeh Weinraub wrote:
>> On Fri, Dec 9, 2011 at 2:24 PM, Andreas Dilger <[email protected]> wrote:
>>>
>>> On 2011-12-08, at 3:59 PM, Christian Brunner wrote:
>>>> 2011/11/15 Andreas Dilger <[email protected]>:
>>>>> Coincidentally, we have someone working in those patches again. The main obstacle for accepting the previous patch as-is was that Ted wanted to add support for "medium-sized" xattrs that are addressed as a string of blocks, instead of via an inode.
>>>>
>>>> Did you make progress with this. I'm still having serious trouble with
>>>> btrfs and would like to try these.
>>>
>>> The latest patches are available at http://review.whamcloud.com/1708, but are based on the RHEL6.1 2.6.32 kernel. ?The work to implement "medium-sized" xattrs was more complex than anticipated, and is not finished yet.
>>>
>>> The use of external inode xattrs is working, which allows xattr sizes up to 64kB. ?The 64kB limit is imposed by the VFS and could potentially be increased.
>>
>> I was able to get those compile on a recent kernel.
>
> It would be useful if you could send me the updated patch, if there were
> any significant changes.

I pushed it to the the ext4-large-xattr branch, here:

https://github.com/NewDreamNetwork/ceph-client.git

>
>> One issue that I
>> see is that it will only use a separate inode for the xattr if the
>> xattr is big enough. However, it may be that we ran out of enough
>> space to set a smaller xattr, and in that case we would fail setting
>> it even though we'd be able to set a larger xattr.
>> Another related issue, is that the number of xattrs that can be used
>> is still limited (by what can be indexed in a single block?).
>
> That is true with the current patch. ?When the "medium sized" xattrs are
> finished, then there will be 64kB of space for xattrs in the external
> block(s), which could be both xattr headers and data.

Is there any progress with this feature? Is there any public
repository where we can take a look at that?

Thanks,
Yehuda