2004-10-21 18:42:54

by Jim Houston

[permalink] [raw]
Subject: [PATCH] Re: idr in Samba4

On Thu, 2004-10-21 at 00:54, [email protected] wrote:

> Apart from converting idr to use our pool allocator, and some other
> minor user-space tweaks, the only significant change I've made is to
> add a idr_find() call at the top of idr_remove() to catch possible
> errors where idr_remove() is called multiple times. Obviously this is
> programmer error if it happens, but I didn't like the default
> behaviour (I saw corruption in the tree without this check).


Hi Tridge, Andrew,

Tridge, thanks for your note. I'm glad to hear you are using idr.c.

I agree with your concerns about idr_remove(). It really should
fail gracefully and warn if the id being removed is not valid.

The attached patch against linux-2.6.9 should do the job without
additional overhead. Andrew, I hope you will add this patch to
your tree.

With the existing code, removing an id which was not allocated
could remove a valid id which shares the same lowest layer of the
radix tree.

I ran a kernel with this patch but have not done any tests to force
a failure.

Jim Houston - Concurrent Computer Corp.


--- linux-2.6.9/lib/idr.c.orig 2004-10-21 12:57:24.547106092 -0400
+++ linux-2.6.9/lib/idr.c 2004-10-21 13:09:28.984974796 -0400
@@ -277,24 +277,31 @@
 }
 EXPORT_SYMBOL(idr_get_new);
 
+static void idr_remove_warning(int id)
+{
+	printk("idr_remove called for id=%d which is not allocated.\n", id);
+	dump_stack();
+}
+
 static void sub_remove(struct idr *idp, int shift, int id)
 {
 	struct idr_layer *p = idp->top;
 	struct idr_layer **pa[MAX_LEVEL];
 	struct idr_layer ***paa = &pa[0];
+	int n;
 
 	*paa = NULL;
 	*++paa = &idp->top;
 
 	while ((shift > 0) && p) {
-		int n = (id >> shift) & IDR_MASK;
+		n = (id >> shift) & IDR_MASK;
 		__clear_bit(n, &p->bitmap);
 		*++paa = &p->ary[n];
 		p = p->ary[n];
 		shift -= IDR_BITS;
 	}
-	if (likely(p != NULL)){
-		int n = id & IDR_MASK;
+	n = id & IDR_MASK;
+	if (likely(p != NULL && test_bit(n, &p->bitmap))){
 		__clear_bit(n, &p->bitmap);
 		p->ary[n] = NULL;
 		while(*paa && ! --((**paa)->count)){
@@ -303,6 +310,8 @@
 		}
 		if ( ! *paa )
 			idp->layers = 0;
+	} else {
+		idr_remove_warning(id);
 	}
 }






2004-10-22 06:26:00

by tridge

[permalink] [raw]
Subject: Re: [PATCH] Re: idr in Samba4

Jim,

> The attached patch against linux-2.6.9 should do the job without
> additional overhead. Andrew, I hope you will add this patch to
> your tree.

Thanks, that looks good, and it now passes my randomized testsuite.

If you are interested, my test code is at:

http://samba.org/ftp/unpacked/junkcode/idtree/

Note that I made idr_remove() and sub_remove() return an int for
success/failure, as that was more useful for my code, and it also
means we skip the layer free logic on remove failure (not that it does
any harm, just seems a bit of a loose end).
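As a rough user-space illustration only (this is neither the kernel patch above nor Tridge's idtree code), the combined idea - test the bit before clearing it, and report success or failure as an int - looks something like this:

#include <stdio.h>

#define IDR_BITS 5
#define IDR_MASK ((1 << IDR_BITS) - 1)

static unsigned long bitmap;		/* stands in for one idr_layer bitmap */

/* return 0 on success, -1 if the id was never allocated */
static int demo_remove(int id)
{
	int n = id & IDR_MASK;

	if (!(bitmap & (1UL << n))) {	/* the new test_bit()-style check */
		fprintf(stderr, "id=%d is not allocated\n", id);
		return -1;
	}
	bitmap &= ~(1UL << n);		/* __clear_bit() equivalent */
	return 0;
}

int main(void)
{
	bitmap |= 1UL << 3;	/* pretend id 3 was allocated */
	demo_remove(3);		/* ok */
	demo_remove(3);		/* second remove: warns instead of corrupting */
	return 0;
}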

Cheers, Tridge

2004-11-19 07:40:48

by tridge

[permalink] [raw]
Subject: performance of filesystem xattrs with Samba4

I've been developing the posix backend for Samba4 over the last few
months. It has now reached the stage where it is passing most of the
test suites, so it's time to start some performance testing.

The biggest change from the kernel's point of view is that Samba4 makes
extensive use of filesystem xattrs. Almost every file will have a
user.DosAttrib xattr containing file attributes and additional
timestamp fields. A lot of files will also have a system.NTACL
attribute containing an NT ACL, and many files will have a
user.DosStreams xattr for NT alternate data streams. Some rare files
will have a user.DosEAs xattr for DOS extended attribute
support. Files with streams will also have separate xattrs for each NT
stream.
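For anyone unfamiliar with the xattr syscalls, here is a minimal sketch of how a user-space server attaches this kind of metadata. The attribute name is from the text above; the payload is a zeroed placeholder, not Samba4's real format (see the xattr.idl reference later in this mail for that), and the path is hypothetical:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(void)
{
	unsigned char blob[44];		/* ~44 bytes, as in the benchmark below */
	memset(blob, 0, sizeof(blob));

	/* "testfile" is a placeholder on a filesystem with user xattrs enabled */
	if (setxattr("testfile", "user.DosAttrib", blob, sizeof(blob), 0) < 0)
		perror("setxattr");

	ssize_t len = getxattr("testfile", "user.DosAttrib", blob, sizeof(blob));
	if (len < 0)
		perror("getxattr");
	else
		printf("user.DosAttrib is %zd bytes\n", len);
	return 0;
}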

I started some simple benchmarking today using the BENCH-NBENCH
smbtorture benchmark, with 10 simulated clients and loopback
networking on a dual Xeon server with 2G ram and a 50G scsi partition.
I used a 2.6.10-rc2 kernel. This benchmark only involves a
user.DosAttrib xattr of size 44 on every file (that will be the most
common situation in production use).

ext2 68 MB/sec
ext2+xattr 64 MB/sec

ext3 67 MB/sec
ext3+xattr 58 MB/sec

xfs 62 MB/sec
xfs+xattr 40 MB/sec
xfs+2Kinode 63 MB/sec
xfs+xattr+2Kinode 58 MB/sec

tmpfs 69 MB/sec
tmpfs+xattr ?? MB/sec (failed)

jfs 36 MB/sec
jfs+xattr 29 MB/sec

reiser 58 MB/sec
reiser+xattr 44 MB/sec
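For reference, the filesystem-side settings behind the labels above are presumably along these lines (a guess at the invocations rather than a transcript; device names are placeholders). As noted further down, the "+xattr" runs differ only in whether Samba4 was told to use the xattrs:

    mkfs.xfs -i size=2048 /dev/sdXN       # the xfs "+2Kinode" variants
    mount -o user_xattr /dev/sdXN /mnt    # user xattrs enabled on ext2/ext3/reiser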

To get the ext2/ext3 results I needed to add "return NULL;" at the
start of ext3_xattr_cache_find() to avoid a bug in the xattr sharing
code that causes an oops (I've reported the oops separately).

The tmpfs+xattr failure above is because tmpfs didn't seem to allow
user xattrs, despite having CONFIG_TMPFS_XATTR=y.

I'm very impressed that ext3 has improved so much since I last did
Samba benchmarks. It used to always be the slowest in my tests, but
now it is the fastest journaled filesystem for Samba4, almost matching
tmpfs.

The XFS results with default options are rather disappointing, as XFS
has usually been a good performer for Samba workloads. Increasing the
inode size to 2k brought it back to a more reasonable level.

The high cost of xattr support is a bit of a problem. In the above,
xattrs were enabled in the filesystems for all runs, the difference
being whether I told Samba4 to use them or not. I hope we can reduce
the cost of xattrs as otherwise Samba4 is going to be seriously
disadvantaged when full windows compatibility is needed. I'm guessing
that nearly all Samba installs will be using xattrs by this time next
year, as we can't do basic security features like WinXP security zones
without them, so making them perform well will be important.

To make it easier to benchmark with xattrs, I'm planning on doing a
new version of dbench with optional xattr support. That will allow
others to play with xattr performance for the above workload without
having to delve into the esoteric world of Samba4 development.

Apart from the 2k inode with XFS I haven't tried any filesystem tuning
options. I'll probably wait till I have xattr support in dbench for
that, to make large numbers of runs with different options easier.

If anyone wants to see in detail what we are sticking in these xattrs,
then look at
http://samba.org/ftp/unpacked/samba4/source/librpc/idl/xattr.idl
for an IDL specification of the xattr format we are using.

Soon we'll be starting to integrate the xattr support with an LSM
module, to allow the kernel to interpret the NT ACLs directly (avoiding
races), to make things a little more efficient (using an xattr cache
holding unpacked ACLs), and to allow for the possibility of non-Samba
file access obeying the NT ACLs.

Cheers, Tridge

2004-11-19 08:08:16

by James Morris

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

On Fri, 19 Nov 2004 [email protected] wrote:

> The tmpfs+xattr failure above is because tmpfs didn't seem to allow
> user xattrs, despite having CONFIG_TMPFS_XATTR=y.

tmpfs does not have a 'user' xattr handler. xattr support was added to
tmpfs only to provide a 'security' xattr handler which calls out to LSM
modules such as SELinux.


- James
--
James Morris
<[email protected]>


2004-11-19 10:17:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

On Nov 19, 2004 18:38 +1100, [email protected] wrote:
> I started some simple benchmarking today using the BENCH-NBENCH
> smbtorture benchmark, with 10 simulated clients and loopback
> networking on a dual Xeon server with 2G ram and a 50G scsi partition.
> I used a 2.6.10-rc2 kernel. This benchmark only involves a
> user.DosAttrib xattr of size 44 on every file (that will be the most
> common situation in production use).
>
> ext3 67 MB/sec
> ext3+xattr 58 MB/sec
>
> xfs 62 MB/sec
> xfs+xattr 40 MB/sec
> xfs+2Kinode 63 MB/sec
> xfs+xattr+2Kinode 58 MB/sec

Also, we (CFS) have developed patches for ext3 + e2fsprogs to support
"fast" EAs stored in larger inodes on disk, and this can improve
performance dramatically in the case where you are accessing a large
number of inodes with EAs. Otherwise you are storing the
EAs in an external block which requires another seek + read to access,
while the large inode EA is already in cache after you read the inode.
Also, the fact that you have to read a 4kB EA block into memory for
(in our case) a relatively small amount of data really kills the cache.

You can select inode sizes from 128..4096 in power-of-two sizes.
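With the patched e2fsprogs mentioned above, formatting with a larger inode presumably looks something like this (a guess at the invocation; the device name is a placeholder):

    mke2fs -j -I 256 /dev/sdXN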


This patch also provides the infrastructure on disk for storing e.g.
nsecond and create timestamps in the ext3 large inodes, but the actual
implementation to save/load these isn't there yet. If that were
available, would you use it instead of explicitly storing the NTTIME in
an EA? I believe the 2.6 stat interface will support nsecond timestamps,
but I don't think there is any API to get the create time to userspace
though we could hook this up to a pseudo EA. The benefit of storing
these common fields in the inode instead of EAs is less overhead.

> To get the ext2/ext3 results I needed to add "return NULL;" at the
> start of ext3_xattr_cache_find() to avoid a bug in the xattr sharing
> code that causes a oops (I've reported the oops separately).

I would just configure out the xattr sharing code entirely since it will
likely do nothing but increase overhead if any of the EAs on an inode
are unique (this is the most common case, except for POSIX-ACL-only setups).

I've attached this patch here. I believe all of the ext3 developers
agree it should go into the kernel, just nobody has made a push to do
so. If this helps your performance (or even if not ;-) we'd be happy
to get it into the kernel proper. The e2fsprogs support for same can be
found at http://cvs.lustre.org:5000/ though it is mixed in with a lot of
other changesets you probably don't care about. Relevant ones are
1.1347.1.2 (Apr 23, 2004) and 1.1421 (Sept 03, 2004).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/



2004-11-19 11:44:44

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Andreas,

> Also, we (CFS) have developed patches for ext3 + e2fsprogs to support
> "fast" EAs stored in larger inodes on disk, and this can improve
> performance dramatically in the case where you are accessing a large
> number of inodes with EAs just.

yep, that could help a lot. I imagine it will provide a similar
benefit to the option to expand the inode size in XFS, which certainly
made a huge difference.

> This patch also provides the infrastructure on disk for storing e.g.
> nsecond and create timestamps in the ext3 large inodes, but the actual
> implementation to save/load these isn't there yet. If that were
> available, would you use it instead of explicitly storing the NTTIME in
> an EA?

certainly!

For Samba4 we need 4 timestamps (create/change/write/access),
preferably all with 100ns resolution or better. All 4 timestamps need
to be settable (unlike st_ctime in posix).

The strategy I've adopted is this:

- use st_atime and st_mtime for the access and write time fields,
with nanosecond resolution if available, otherwise with 1 second
resolution. It's just too expensive to update an EA on every
read/write, so I didn't put these in the DosAttrib EA.

- store create_time and change_time in the user.DosAttrib xattr, as
64 bit 100ns resolution times (same format as NT uses and Samba
uses internally). I store change_time there as its definition is a
little different from the posix ctime field (plus it's settable).
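For readers who haven't met the NT time format: it counts 100ns ticks since 1601-01-01. A minimal sketch of the conversion (Samba has its own helpers for this; the code below is just the standard arithmetic, shown for illustration):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* seconds between 1601-01-01 and the Unix epoch */
#define SECS_1601_TO_1970 11644473600ULL

static uint64_t unix_to_nttime(time_t sec, long nsec)
{
	return ((uint64_t)sec + SECS_1601_TO_1970) * 10000000ULL + nsec / 100;
}

int main(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_REALTIME, &ts);
	printf("now as a 64-bit 100ns NT time: %llu\n",
	       (unsigned long long)unix_to_nttime(ts.tv_sec, ts.tv_nsec));
	return 0;
}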

If we had a settable create_time field in the inode then I'd certainly
want to use it in Samba4. A non-settable one wouldn't be nearly as
useful. Some win32 applications care about being able to set all the
time fields (such as excel 2003).

This wouldn't allow us to get rid of the user.DosAttrib xattr
completely though, as we stick a bunch of other stuff in there and
will be expanding it soon to help with the case-insensitive speed
problem.

> I believe the 2.6 stat interface will support nsecond timestamps,

yep, we are already using st.st_atim.tv_nsec when configure detects
it. It's very useful, but the fact that ext3 doesn't store this on
disk leads to potential problems when timestamps regress if inodes are
ejected from the cache under memory pressure. That needs fixing.

> but I don't think there is any API to get the create time to userspace
> though we could hook this up to a pseudo EA. The benefit of storing
> these common fields in the inode instead of EAs is less overhead.

I think it would make more sense to have a new variant of utime() for
setting all available timestamps, and expose all timestamps in stat. A
separate API for create time seems a bit hackish.

> I would just configure out the xattr sharing code entirely since it will
> likely do nothing but increase overhead if any of the EAs on an inode
> are unique (this is the most common case, except for POSIX-ACL-only setups).

I didn't know it was configurable. I can't see any CONFIG option for
it - is there some trick I've missed?

> I've attached this patch here.

I'll give it a go and let you know how it changes the NBENCH results.

Cheers, Tridge

2004-11-19 12:07:27

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

On Fri, 2004-11-19 at 18:38 +1100, [email protected] wrote:
> I've been developing the posix backend for Samba4 over the last few
> months. It has now reached the stage where it is passing most of the
> test suites, so its time to start some performance testing.
>
> The biggest change from the kernels point of view is that Samba4 makes
> extensive use of filesystem xattrs. Almost every file with have a
> user.DosAttrib xattr containing file attributes and additional
> timestamp fields. A lot of files will also have a system.NTACL
> attribute containing a NT ACL, and many files will have a
> user.DosStreams xattr for NT alternate data streams. Some rare files
> will have a user.DosEAs xattr for DOS extended attribute
> support. Files with streams will also have separate xattrs for each NT
> stream.
[snip]
> Soon we'll be starting to integrate the xattr support with a LSM
> module, to allow the kernel to interpret the NT ACLs directly to avoid
> races, make things a little more efficient (using a xattr cache
> holding unpacked ACLs), and allowing for the possibility of non-Samba
> file access to obey the NT ACLs.

Note that NTFS supports all those things natively on the file system,
so it may be worth keeping in mind when designing your APIs. It would
be nice if one day, when ntfs write support is finished and Samba is
running on an NTFS partition on Linux, Samba could access all those
things directly from NTFS. I guess a good way would be if your
interface is sufficiently abstracted so that it can use xattrs as a
backend or a native backend which NTFS could provide for you or Samba
could provide for NTFS. For example, NTFS stores the 4 different times
in NT format in each inode (base Mft record), so you would not have to
take an xattr performance hit there.

Anyway, just thought I would mention this, I am not expecting you to do
anything about it, especially since full NTFS read-write support is
still a long way away...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/, http://www-stu.christs.cam.ac.uk/~aia21/

2004-11-19 12:48:01

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Anton,

> Note, that NTFS supports all those things natively on the file system,
> so it may be worth keeping in mind when designing your APIs. It would
> be nice if one day when ntfs write support is finished, when running
> Samba on an NTFS partition on Linux, Samba can directly access all those
> things directly from NTFS.

yes, I have certainly thought about this, and at the core of Samba4 is
a "ntvfs" layer that allows for backends that can take full advantage
of whatever the filesystem can offer. The ntvfs/posix/ code in Samba4
is quite small (currently 7k lines of code) and I'm hoping that more
specialised backends will be written that talk to other types of
filesystems.

To get things started I've also written a "cifs" backend for Samba4,
that uses another CIFS file server as a storage backend, turning
Samba4 into a proxy server. That backend uses the full capabilities of
the ntvfs layer, and implements nearly all of the detailed stuff that
a NTFS can do.

> I guess a good way would be if your interface is sufficiently
> abstracted so that it can use xattrs as a backend or a native
> backend which NTFS could provide for you or Samba could provide for
> NTFS. For example NTFS stores the 4 different times in NT format
> in each inode (base Mft record) so you would not have to take an
> xattr performance hit there.

The big question is what sort of API would you envisage between user
space and this filesystem? Are you imagining that Samba mmap the raw
disk and use a libntfs library? That would be possible, but would lose
one of the big advantages of Samba, which is that the filesystem is
available to both posix and windows apps.

Or are you thinking that we add a new syscall interface, a bit like
the IRP stuff in the NT IFS? I imagine there would be quite a bit of
resistance to that in the Linux kernel community :-)

Realistically, I think that in the vast majority of cases Samba is
going to be running on top of "mostly posix" filesystems for the
foreseeable future, unless you manage to do something pretty magical
with the ntfs code. But if you do manage to get ntfs in Linux to the
stage where it's a viable alternative then I'd be delighted to help
write the Samba4 backend to match.

Cheers, Tridge

2004-11-19 14:14:40

by Anton Altaparmakov

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hi Tridge,

On Fri, 19 Nov 2004 [email protected] wrote:
> > Note, that NTFS supports all those things natively on the file system,
> > so it may be worth keeping in mind when designing your APIs. It would
> > be nice if one day when ntfs write support is finished, when running
> > Samba on an NTFS partition on Linux, Samba can directly access all those
> > things directly from NTFS.
>
> yes, I have certainly thought about this, and at the core of Samba4 is
> a "ntvfs" layer that allows for backends that can take full advantage
> of whatever the filesystem can offer. The ntvfs/posix/ code in Samba4
> is quite small (currently 7k lines of code) and I'm hoping that more
> specialised backends will be written that talk to other types of
> filesystems.

Sounds great!

> To get things started I've also written a "cifs" backend for Samba4,
> that uses another CIFS file server as a storage backend, turning
> Samba4 into a proxy server. That backend uses the full capabilities of
> the ntvfs layer, and implements nearly all of the detailed stuff that
> a NTFS can do.
>
> > I guess a good way would be if your interface is sufficiently
> > abstracted so that it can use xattrs as a backend or a native
> > backend which NTFS could provide for you or Samba could provide for
> > NTFS. For example NTFS stores the 4 different times in NT format
> > in each inode (base Mft record) so you would not have to take an
> > xattr performance hit there.
>
> The big question is what sort of API would you envisage between user
> space and this filesystem? Are you imagining that Samba mmap the raw
> disk and use a libntfs library? That would be possible, but would lose
> one of the big advantages of Samba, which is that the filesystem is
> available to both posix and windows apps.
>
> Or are you thinking that we add a new syscall interface to, a bit like
> the IRP stuff in the NT IFS? I imagine there would be quite a bit of
> resistance to that in the Linux kernel community :-)
>
> Realistically, I think that in the vast majority of cases Samba is
> going to be running on top of "mostly posix" filesystems for the
> forseeable future, unless you manage to do something pretty magical
> with the ntfs code. But if you do manage to get ntfs in Linux to the
> stage where its a viable alternative then I'd be delighted to help
> write the Samba4 backend to match.

I don't know. I have been mulling over in my head for quite a while what
to do about an interface for "advanced ntfs features" but so far I have
always pushed this to the back of my mind. After all, there is no point in
providing advanced features considering we don't even provide full
read-write access yet. I just thought I would mention NTFS when I saw
your post.

But to answer your question I definitely would envisage an interface to
the kernel driver rather than to libntfs. It is 'just' a matter of
deciding how that would look...

Partially we will see what happens with Reiser4 as it faces the same or at
least very similar interface problems. Maybe we need a sys_ntfs() or
maybe we need to hitchhike the ioctl() interface or maybe the VFS can
start providing all required functionality in some to-be-determined
manner that we can use...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2004-11-19 15:34:52

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Is this an fsync intensive benchmark? If no, could you try with
reiser4? If yes, you might as well wait for us to optimize fsync first
in reiser4.

Hans

2004-11-19 16:01:09

by Jan Engelhardt

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

>Is this an fsync intensive benchmark? If no, could you try with
>reiser4? If yes, you might as well wait for us to optimize fsync first
>in reiser4.

Do I sense an attempt to get more users from non-reiser*fs to reiser4? ;-)


Jan Engelhardt
--
Gesellschaft für Wissenschaftliche Datenverarbeitung
Am Fassberg, 37077 Göttingen, http://www.gwdg.de

2004-11-19 22:07:18

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

> Is this an fsync intensive benchmark? If no, could you try with
> reiser4? If yes, you might as well wait for us to optimize fsync first
> in reiser4.

In the configuration I was running there are no fsync calls.

I'll have a go with reiser4 soon and let you know how it goes. I'm
also working on a new version of dbench that will better simulate the
filesystem access patterns of Samba4.

Cheers, Tridge

2004-11-19 22:30:44

by Andreas Dilger

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

On Nov 19, 2004 22:43 +1100, [email protected] wrote:
> > This patch also provides the infrastructure on disk for storing e.g.
> > nsecond and create timestamps in the ext3 large inodes, but the actual
> > implementation to save/load these isn't there yet. If that were
> > available, would you use it instead of explicitly storing the NTTIME in
> > an EA?
>
> certainly!

I can describe the "infrastructure" here, but there needs to be some
smallish coding work in order to get this working and I don't really
have much time to do that myself. The basic premise is that for ext3
filesystems formatted with large on-disk inodes we have reserved the first
word in the extra space to describe the "extra size" of the fixed fields
in struct ext3_inode stored in each inode on disk. This allows us to
add permanent fields to the end of struct ext3_inode, and any remaining
space is used for the fast EAs before falling back to an external block.

This space was always intended to store extra timestamp fields ala:

struct ext3_inode {
	:
	:
	} osd2;
	__u16	i_extra_isize;
	__u16	i_pad1;
	__u32	i_ctime_hilow;	/* do we need nsecond atimes? */
	__u32	i_mtime_hilow;
	__u32	i_crtime;
	__u32	i_crtime_hilow;
};

Since the i_[mac]time fields are in seconds, I would like to store:

_hilow = nseconds >> 6 | (([mac]time64 >> 32) << 26)

[mac]time64 = [mac]time | (__u64)((_hilow & 0xfc000000) << 6);
nseconds = _hilow << 6;

so we get about 60ns resolution but also increase our dynamic range
by a factor of 64 (year 8704 problem here we come ;-). Since crtime
is new we _could_ store it in the 100ns 64-bit format that NT uses.
Consistency is good on the one hand and we only need to do shift and OR,
while with straight 100ns times we also get a 6x larger dynamic range
(y58000) but also have to do a 64-bit divide by 10^7 for each access.
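Spelling the above out with the masking made explicit (a sketch of the proposed encoding, not existing kernel code): the low 26 bits of _hilow carry the nanoseconds at 2^6 ns granularity, and the top 6 bits extend the 32-bit seconds field.

#include <stdint.h>
#include <stdio.h>

static uint32_t pack_hilow(uint64_t sec64, uint32_t nsec)
{
	/* low 26 bits: nsec at ~64ns resolution; top 6 bits: seconds bits 32..37 */
	return (nsec >> 6) | (uint32_t)((sec64 >> 32) << 26);
}

static void unpack_hilow(uint32_t sec32, uint32_t hilow,
			 uint64_t *sec64, uint32_t *nsec)
{
	*sec64 = sec32 | ((uint64_t)(hilow & 0xfc000000) << 6);
	*nsec  = (hilow & 0x03ffffff) << 6;
}

int main(void)
{
	uint64_t sec;
	uint32_t ns;
	uint32_t h = pack_hilow(0x123456789ULL, 123456789);

	unpack_hilow((uint32_t)0x123456789ULL, h, &sec, &ns);
	printf("sec=0x%llx nsec=%u\n", (unsigned long long)sec, ns);
	return 0;
}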


As we read an inode from disk we check i_extra_isize to determine
which fields, if any, are valid and when writing the inode we fill in
the fields and update i_extra_isize (taking care to push any existing
EAs out a bit, though that should be a rare case). This avoids the EA
speed/size overhead to parse/read/write these fields, and allows us to
add new "fixed" fields into the large inode as necessary.

We don't touch any fields that we don't understand (normal ext3 compat
flags will tell us if there are incompatible features there).


So, in summary, the "i_extra_isize" handling is already there for inodes
(currently always set to '4') but we don't do anything with that space
yet.
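A hedged sketch of that convention follows; the struct and field names here are illustrative stand-ins, not the real struct ext3_inode layout. The point is that a field past the classic 128-byte inode is only trusted if i_extra_isize says the on-disk inode actually extends that far.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define GOOD_OLD_INODE_SIZE 128		/* classic ext2/ext3 inode size */

struct big_inode {			/* illustrative stand-in */
	uint8_t  classic[GOOD_OLD_INODE_SIZE];
	uint16_t i_extra_isize;		/* bytes of extra fixed fields in use */
	uint16_t i_pad1;
	uint32_t i_crtime;		/* one of the proposed extra fields */
};

/* does the on-disk inode cover a field ending at absolute offset 'end'? */
static int field_present(const struct big_inode *raw, size_t end)
{
	return raw->i_extra_isize >= end - GOOD_OLD_INODE_SIZE;
}

static uint32_t read_crtime(const struct big_inode *raw)
{
	size_t end = offsetof(struct big_inode, i_crtime) + sizeof(uint32_t);

	if (field_present(raw, end))
		return raw->i_crtime;
	return 0;			/* field not present on this inode */
}

int main(void)
{
	struct big_inode raw = { .i_extra_isize = 8, .i_crtime = 12345 };

	printf("crtime=%u\n", read_crtime(&raw));	/* prints 12345 */

	raw.i_extra_isize = 4;				/* too small: field absent */
	printf("crtime=%u\n", read_crtime(&raw));	/* prints 0 */
	return 0;
}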

> - store create_time and change_time in the user.DosAttrib xattr, as
> 64 bit 100ns resolution times (same format as NT uses and Samba
> uses internally). I store change_time there as its definition is a
> little different from the posix ctime field (plus its settable).
>
> If we had a settable create_time field in the inode then I'd certainly
> want to use it in Samba4. A non-settable one wouldn't be nearly as
> useful. Some win32 applications care about being able to set all the
> time fields (such as excel 2003).

Hmm, seems kind of counter-productive to allow a crtime that is settable...

> I think it would make more sense to have a new varient of utime() for
> setting all available timestamps, and expose all timestamps in stat. A
> separate API for create time seems a bit hackish.

By all means go for it ;-). I'm not particularly fond of the proposed
pseudo-EA interface. You are probably more likely than anyone to get
support for it.

> > I would just configure out the xattr sharing code entirely since it will
> > likely do nothing but increase overhead if any of the EAs on an inode
> > are unique (this is the most common case, except for POSIX-ACL-only setups).
>
> I didn't know it was configurable. I can't see any CONFIG option for
> it - is there some trick I've missed?

It's CONFIG_FS_MBCACHE and/or CONFIG_EXT[23]_FS_XATTR_SHARING in the
original 2.4 xattr patches, not sure if they've disappeared in 2.6 kernels.

Hmm, seems that the CONFIG_FS_MBCACHE option doesn't allow you to turn it
off completely, which is a shame since both are completely useless for any
EAs which are different for each inode and just introduce overhead. The
CONFIG_EXT[23]_FS_XATTR_SHARING options don't exist at all anymore.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/



2004-11-19 23:06:50

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

I did some testing with reiser4 from 2.6.10-rc2-mm2. As far as I can
tell it doesn't seem to support the xattr calls (fsetxattr, fgetxattr
etc). Is that right, or did I miss a patch somewhere? The code seems
to set the xattr methods to NULL and has the prototypes #if'd out.

The result without xattr support was 52 MB/sec, which is a bit slower
than the reiser3 I tested in 2.6.10-rc2. For easy comparison, here are
the non-xattr results for the various filesystems I've tested:

tmpfs 69 MB/sec
ext2 68 MB/sec
ext3 67 MB/sec
xfs+2Kinode 63 MB/sec
xfs 62 MB/sec
reiser 58 MB/sec
reiser4 52 MB/sec (on a -mm2 kernel)
jfs 36 MB/sec

I used default options for mkreiser4, and default mount options. Can
you suggest some options to try or would you prefer to wait till I've
done the new dbench so you can try this more easily yourself? (you can
of course try installing Samba4 to test now, but it's a fast moving
target and involves a lot more than just filesystem calls).

To make sure the problem wasn't some of the other patches in -mm2, I
reran the ext3 results on -mm2, and was surprised to find quite a
large improvement! ext3 got 73 MB/sec without xattr support. It oopsed
when I enabled xattr (I'm working with sct on fixing those oopses).

Once the oopses are fixed I'll rerun all the various filesystems with
-mm2 and see if it only improves ext3 or if it improves all of them.

Would anyone care to hazard a guess as to what aspect of -mm2 is
gaining us 10% in overall Samba4 performance?

Cheers, Tridge

2004-11-20 00:27:04

by Andrew Morton

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

[email protected] wrote:
>
> Would anyone care to hazard a guess as to what aspect of -mm2 is
> gaining us 10% in overall Samba4 performance?

Is it reproducible with your tricked-up dbench?

If so, please send me a machine description and the relevant command line
and I'll do a bsearch.

2004-11-20 04:45:03

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

can you describe qualitatively what your test does? You didn't answer
whether it does fsyncs, etc.

It might be worth testing it with the extents only mount option for reiser4.

Hans


2004-11-20 04:56:54

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

[email protected] wrote:

>Hans,
>
> > Is this an fsync intensive benchmark? If no, could you try with
> > reiser4? If yes, you might as well wait for us to optimize fsync first
> > in reiser4.
>
>In the configuration I was running there are no fsync calls.
>
>I'll have a go with reiser4 soon and let you know how it goes. I'm
>also working on a new version of dbench that will better simulate the
>filesystem access patterns of Samba4.
>
>
If you can describe what those are, it would do me a lot of good in
regards to my understanding what it means about an fs to get a certain
result on the benchmark, and what needs to be better optimized.

Cheers,

Hans


2004-11-20 06:48:29

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

> can you describe qualitatively what your test does?

The access patterns are very similar to dbench, which I believe you
are already familiar with. Let me know if you'd like an explanation of
dbench.

For the test I ran, the basic load file is almost the same as dbench,
but the interpretation of the load file is a little bit different.

For example, when the load file says "open a file", Samba4 needs to
first stat() the file, and if xattrs are being used then it needs to
do a fgetxattr() to grab the extended DOS attributes. Additionally, if
the open has the effect of changing any of those attributes then
Samba4 needs to use fsetxattr() to write back the extended attributes,
and sometimes fchmod() and utime() as well depending on the open
parameters.

When dbench interprets one of these load files it would just call
open(), skipping all the extra system calls.

The full load file I used is at:

http://samba.org/ftp/tridge/dbench/client_enterprise.txt

and is based on a capture of an "Enterprise Disk Mix" NetBench run,
captured using the "nbench" load capturing proxy module in Samba4,
using a Win2003 server backend and WinXP client.

The working set size is approximately 20 MByte per client, and I was
testing with 10 simulated clients. That means it's very much an "in
memory" test, as the machine has 2G of ram.

> You didn't answer whether it does fsyncs, etc.

I think I did mention that the test does no fsync calls in the
configuration I used. The reason I qualify the answer is that the load
file actually contains approximately 1% Flush calls, but in its
default configuration these are noops for Samba4. This is due to the
confusion in Win32 between a "flush" operation and a "fsync"
operation. Microsoft programmers use "flush" like a unix programmer
would use fflush() on stdio, which is a noop for Samba. You can also
configure Samba to treat flush as a "fsync", which is quite a
different operation.
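In posix terms the distinction drawn above is roughly the following (an illustrative sketch, with a placeholder filename): fflush() only moves stdio's user-space buffer into the kernel page cache, so a server like Samba that doesn't buffer writes in user space has nothing to do for it, while fsync() forces the data out to stable storage.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *fp = fopen("testfile", "w");	/* placeholder path */
	if (!fp)
		return 1;

	fputs("hello\n", fp);
	fflush(fp);		/* user-space buffer -> kernel page cache (cheap) */
	fsync(fileno(fp));	/* kernel page cache -> disk (expensive) */
	fclose(fp);
	return 0;
}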

The operation mix is as follows, listed with the approximate posix
equivalent operation.

(27%) ReadX (==pread)
(17%) NTCreateX (==open)
(16%) QUERY_PATH_INFORMATION (==stat)
(13%) Close (==close)
(9%) WriteX (==pwrite)
(6%) FIND_FIRST (==opendir/readdir/closedir)
(3%) Unlink (==unlink)
(3%) QUERY_FS_INFORMATION (==statfs)
(3%) QUERY_FILE_INFORMATION (==fstat)
(1%) SET_FILE_INFORMATION (==fchmod/utime)
(1%) Flush (==noop)
(1%) Rename (==rename)
(0%) UnlockX (==fcntl unlock)
(0%) LockX (==fcntl lock)

but the above can be a little misleading, as (for example) NTCreateX
is a very complex call, and can be used to create directories, create
files, open files or even delete files or directories (using the
delete on close semantics).

> It might be worth testing it with the extents only mount option for
> reiser4.

My apologies if I have just missed it, but I can't see an option that
looks like "extents only" in either reiser4_parse_options() or in
Documentation/filesystems/reiser4.txt in 2.6.10-rc2-mm2. Can you let
me know the exact option name?

Cheers, Tridge

2004-11-20 11:00:32

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Anton,

> But to answer your question I definitely would envisage an interface to
> the kernel driver rather than to libntfs. It is 'just' a matter of
> deciding how that would look...

How about prototyping the API in user space, using a "mmap the block
device" based filesystem library?

You might also like to take a peek at
http://samba.org/ftp/unpacked/samba4/source/include/smb_interfaces.h
and
http://samba.org/ftp/unpacked/samba4/source/ntvfs/ntvfs.h

those two files define the NTFS-like interfaces in Samba4. The
interface has proved to be quite flexible.

> Partially we will see what happens with Reiser4 as it faces the same or at
> least very simillar interface problems.

yep, I'm looking forward to experimenting with the "file is a
directory" stuff in reiser4 to see how well it can be made to match
what is needed for Samba4.

Cheers, Tridge

2004-11-20 16:13:20

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

[email protected] wrote:

>Hans,
>
> > can you describe qualitatively what your test does?
>
>The access patterns are very similar to dbench, which I believe you
>are already familiar with. Let me know if you'd like an explanation of
>dbench.
>
>
Actually, I would, because I have never read its code, and have been at
a loss for years to understand its meaning as a result of that.

>For the test I ran, the basic load file is almost the same as dbench,
>but the interpretation of the load file is a little bit different.
>
>For example, when the load file says "open a file", Samba4 needs to
>first stat() the file, and if xattrs are being used then it needs to
>do a fgetattr() to grab the extended DOS attributes. Additionally, if
>the open has the effect of changing any of those attributes then
>Samba4 needs to use fsetxattr() to write back the extended attributes,
>and sometimes fchmod() and utime() as well depending on the open
>parameters.
>
>When dbench interprets one of these load files it would just call
>open(), skipping all the extra system calls.
>
>The full load file I used is at:
>
> http://samba.org/ftp/tridge/dbench/client_enterprise.txt
>
>and is based on a capture of a "Enterprise Disk Mix" Netbench run,
>captured using the "nbench" load capturing proxy module in Samba4,
>using a Win2003 server backend and WinXP client.
>
>The working set size is approximately 20 MByte per client, and I was
>testing with 10 simulated clients. That means its very much a "in
>memory" test, as the machine has 2G of ram.
>
>
Ah, that explains a lot. For that kind of workload, the simpler the fs
the better, because really all you are doing is adding overhead to
copy_to_user and copy_from_user. All of reiser4's advanced features
will add little or no value if you are staying in ram.

> > You didn't answer whether it does fsyncs, etc.
>
>I think I did mention that the test does no fsync calls in the
>configuration I used. The reason I qualify the answer is that the load
>file actually contains approximately 1% Flush calls, but in its
>default configuration these are noops for Samba4. This is due to the
>confusion in Win32 between a "flush" operation and a "fsync"
>operation. Microsoft programmers use "flush" like a unix programmer
>would use fflush() on stdio, which is a noop for Samba. You can also
>configure Samba to treat flush as a "fsync", which is quite a
>different operation.
>
>The operation mix is as follows, listed with the approximate posix
>equivalent operation.
>
>(27%) ReadX (==pread)
>(17%) NTCreateX (==open)
>(16%) QUERY_PATH_INFORMATION (==stat)
>(13%) Close (==close)
>(9%) WriteX (==pwrite)
>(6%) FIND_FIRST (==opendir/readdir/closedir)
>(3%) Unlink (==unlink)
>(3%) QUERY_FS_INFORMATION (==statfs)
>(3%) QUERY_FILE_INFORMATION (==fstat)
>(1%) SET_FILE_INFORMATION (==fchmod/utime)
>(1%) Flush (==noop)
>(1%) Rename (==rename)
>(0%) UnlockX (==fcntl unlock)
>(0%) LockX (==fcntl lock)
>
>but the above can be a little misleading, as (for example) NTCreateX
>is a very complex call, and can be used to create directories, create
>files, open files or even delete files or directories (using the
>delete on close semantics).
>
> > It might be worth testing it with the extents only mount option for
> > reiser4.
>
>My apologies if I have just missed it, but I can't see an option that
>looks like "extents only" in either reiser4_parse_options() or in
>Documentation/filesystems/reiser4.txt in 2.6.10-rc2-mm2. Can you let
>me know the exact option name?
>
>Cheers, Tridge
>
>
>
>
mkfs.reiser4 -o extent=extent40

2004-11-20 16:22:55

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

[email protected] wrote:

>
>yep, I'm looking forward to experimenting with the "file is a
>directory" stuff in reiser4 to see how well it can be made to match
>what is needed for Samba4.
>
>
>
There are still bugs with it that have us turning it off for now, but I
think we will fix those in the next year.

2004-11-20 23:56:05

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

> mkfs.reiser4 -o extent=extent40

This lowered the performance by a small amount (from 52 MB/sec to 50
MB/sec).

It also revealed a bug. I have been doing my tests on a cleanly
formatted filesystem each time, but this time I re-ran the test a few
times in a row to determine just how consistent the results are. The
results I got were:

mkfs.reiser4 -o extent=extent40 50 MB/sec
48
43
41
37 (stuck)

the "stuck" result meant that smbd locked into a permanent D state at
the end of the fifth run. Unfortunately ps showed the wait-channel as
'-' so I don't have any more information about the bug. I needed to
power cycle the machine to recover.

To check if this is reproducible I tried it again and got the following:

reboot, mkfs again 50 MB/sec
48
44
42
40
(failed)

the "failed" on the sixth run was smbd stuck in D state again, this
time before the run completed so I didn't get a performance number.

I should note that the test completely wipes the directory tree
between runs, and the server processes restart, so the only way there
can be any state remaining that explains the slowdown between runs is
a filesystem bug. Do you think reiser4 could be "leaking" some on-disk
structures?

To determine if this problem is specific to the extent=extent40
option, I ran the same series of tests against reiser4 without the
extent option:

reboot, mkfs.reiser4 without options 52 MB/sec
52
45
41
(failed)

The failure on the fifth run showed the same symptoms as above.

To determine if the bug is specific to reiser4, I then ran the same
series of tests against ext3, using the same kernel:

reboot, mke2fs -j 70 MB/sec
70
69
70
71
70

So it looks like the gradual slowdown and eventual lockup is specific
to reiser4. What can I do to help you track this down? Would you like
me to write a "howto" for running this test, or would you prefer to
wait till I have an emulation of the test in dbench?

To give you an idea of the scales involved, each run lasts 100
seconds, and does approximately 1 million filesystem operations (the
exact number of operations completed in the 100 seconds is roughly
proportional to the performance result).

> Ah, that explains a lot. For that kind of workload, the simpler the fs
> the better, because really all you are doing is adding overhead to
> copy_to_user and copy_from_user. All of reiser4's advanced features
> will add little or no value if you are staying in ram.

I'll do some runs with larger numbers of simulated clients and send
you those results shortly. Do you think a working set size of about
double the total machine memory would be a good size to start showing
the reiser4 features?

Cheers, Tridge

2004-11-20 23:56:05

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

> There are still bugs with it that have us turning it off for now, but I
> think we will fix those in the next year.

Do you plan to add user xattr support before then?

The reason I ask is that without either xattr support or named streams
Samba4 has no way to store the additional file meta data it needs.

Maybe xattr support could be a reiser4 plugin?

Cheers, Tridge

2004-11-21 00:26:42

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

A bit more information about the slowdown between runs (and eventual
lockup) with reiser4 that I reported in my last email.

I found that a umount/mount between runs solved the problem, leading
to a fairly consistent result and no lockup. I also found that running
a simple /bin/sync between runs solved the problem.

This implies to me that it is some in-memory structure that is the
culprit. I can't see anything obvious in /proc/slabinfo, but it's been
a while since I've done any serious kernel development so maybe I just
don't know what to look for.

I also tried enabling the "strict sync" option in Samba4. This makes
the 1% flush operations in the load file map to fsync() instead of a
noop. This caused reiser4 to lockup almost immediately, with the same
symptoms as the previous lockups I reported (all smbd processes stuck
in D state). No oops messages or anything unusual in dmesg.

Cheers, Tridge

2004-11-21 02:38:13

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

New benchmarks seem to be especially good at finding bugs.

vs, please find the bug and fix it.

Hans


2004-11-21 02:41:20

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Thanks much tridge. vs, please respond in detail.

2004-11-21 02:13:18

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Andrew,

> Is it reproducible with your tricked-up dbench?
>
> If so, please send me a machine description and the relevant command line
> and I'll do a bsearch.

I should explain a little more ....

The current dbench is showing way too much variance on this test to be
really useful. Here are the numbers for 5 runs of dbench 10 on
2.6.10-rc2 and 2.6.10-rc2-mm2:

2.6.10-rc2 325
320
364
360
347


-mm2 347
371
411
322
384

I've solved this variance problem in NBENCH by making the runs fixed
time rather than fixed number of operations, and adding a warmup
phase. I need to do the same to dbench in order to get sane numbers
out that would be at all useful for a binary patch search.

The current dbench worked OK when computers were slower, but now it is
completing its runs so fast that the noise is just silly.

Cheers, Tridge

2004-11-21 03:20:50

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

> Would you be willing to do some variation on it that scaled itself to
> the size of the machine, and generated disk load rather than fitting in ram?

You can do that now by varying the number of simulated clients, or by
varying the load file.

> I hope you understand my reluctance to optimize for tests that fit into
> ram.....

to some extent, yes, but "in memory" tests are actually pretty
important for file serving.

In a typical large office environment with one or two thousand users
you will only have between 20 and 100 of those users really actively
using the file server at any one time. The others are taking a nap, in
meetings or staring out the window. Or maybe (being generous), they
are all working furiously with cached data. I haven't actually gone
into the cubes to check - I just see the server side stats.

Of those that are active, they rarely have a working set size of over
100MB, and usually much less, so it is not uncommon for the whole
workload over a period of 5 minutes to fit in memory on typical file
servers. This is especially so on the modern big file servers that
might have 16G of ram or more, with modern clients that do aggressive
lease based caching.

There are exceptions of course. Big print shops, rendering farms and
high performance computing sites are all examples of sites that have
active working sets much larger than typical system memory.

The point is that you need to test a wide range of working set
sizes. You also might like to notice that in the published commercial
NetBench runs paid for by the big players (like Microsoft, NetApp, EMC
etc), you tend to find that the graph only extends to a number of
clients equal to the total machine memory divided by 25MB. That is
perhaps not a coincidence given that the working set size per client
of NetBench is about 22MB. The people who pay for the benchmarks want
their customers to see a graph that doesn't have a big cliff at the
right hand side.

Also, with journaled filesystems running in-memory benchmarks isn't as
silly as it first seems, as there are in fact big differences between
how the filesystems cope. It isn't just a memory bandwidth
test. Windows clients do huge numbers of meta-data operations, and
nearly all of those cause journal writes which hit the metal.

So while I sympathise with you wanting reiser4 to be tuned for "big"
storage, please remember that a good proportion of the installs are
likely to be running "in-memory" workloads.

Cheers, Tridge

2004-11-21 02:48:31

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Would you be willing to do some variation on it that scaled itself to
the size of the machine, and generated disk load rather than fitting in ram?

I hope you understand my reluctance to optimize for tests that fit into
ram.....


Thanks,

Hans


2004-11-21 01:56:58

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Andrew,

> Is it reproducible with your tricked-up dbench?

The xattr enabled dbench I did for Stephen was just a quick hack to
demonstrate the oops in ext3. I'll do a more complete version over the
next few days.

Cheers, Tridge

2004-11-21 01:55:23

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hans,

> Actually, I would, because I have never read its code, and have been at
> a loss for years to understand its meaning as a result of that.

ok, first a bit of history.

In 1992 Ziff-Davis developed a benchmark called "NetBench" for
benchmarking file serving in PC client environments. NetBench was
freely downloadable, but without source code.

Over the years Netbench became the main benchmark used in the Windows
file serving world. In the Network Attached Storage market, good
NetBench numbers are absolutely essential, and companies tend to put a
lot of effort into building large "NetBench labs" for testing NetBench
performance. A couple of companies I have worked at have had people
working almost full time on running netbench results with various
configurations.

NetBench is quite different from Bonnie and other similar benchmarks,
as it is based on "replay of captured load". The load files for
NetBench come from common real-world scenarios where PC clients run
popular applications like MS Word, Excel, Corel Draw, MS Access,
Paradox, MS PowerPoint etc while storing their files on a remote PC
file server.

The usual output of Netbench is an Excel spreadsheet showing fairly
detailed performance numbers for different numbers of clients, plus
min, max and standard deviation numbers for the response time of each
type of operation.

NetBench came to prominence in the Linux world when Microsoft paid a
company called MindCraft to run some benchmarks comparing Windows file
server performance to Samba on Linux. It was initially difficult for
the Linux community to respond to this as we had no easy access to a
NetBench lab, and setting one up could easily be a million-dollar
effort.

To fix this, I wrote a suite of three benchmark tools, called
"nbench", "dbench" and "tbench". These tools were designed to provide
a fairly close emulation of NetBench, and to be extremely simple to
use (much simpler than NetBench). I also wanted them to be able to be
run on the typical hardware available to many home Linux
developers. They don't give output that is nearly as detailed as
NetBench, but when combined with common profiling tools this usually
isn't a problem.

The three tools are:

- nbench. This completely emulates a NetBench run. The current
versions produce almost identical sets of CIFS network packets to
a run of NetBench on WinXP. You need to have a CIFS file server
(like Samba) installed to run nbench.

- dbench. This emulates just the file system calls that a Samba
server would have to perform in order to complete a NetBench
run. It doesn't need Samba installed.

- tbench. This emulates just the TCP traffic that a Samba server
would have to send/receive in order to complete a NetBench run. It
doesn't need Samba installed.

Over the years I have improved these tools to give better and better
emulation of NetBench. Unfortunately this means that you can't
meaningfully compare results between versions.

All 3 tools use a load file to tell them what operations to
perform. This load file is written in terms of CIFS file sharing
operations, which are then interpreted by the benchmark tools into
either CIFS requests, filesystem requests or TCP traffic.

There are a number of ways to generate these load files. You can write
one yourself (good for measuring just write speed for example), or you
can capture a load file from any CIFS network activity, either by
post-processing a tcpdump or by using a Samba proxy module. The load files I
provide come from capturing real NetBench runs.

Note that in all of the above I never claimed that these tools are
"good" benchmarks. I merely try to make them produce results that
closely predict the results of real NetBench runs. Whether NetBench is
actually a "good" benchmark is another topic entirely.

Finally, I should note that Spec is considering adding CIFS
benchmarking to their suite of benchmarks. Interestingly, they are
looking at using something based on my nbench tool, or something close
to it, so eventually nbench might become the more "official"
benchmark. That would certainly be an interesting turn of events :)

Cheers, Tridge

2004-11-21 06:11:33

by Hans Reiser

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

[email protected] wrote:

>
>So while I sympathise with you wanting reiser4 to be tuned for "big"
>storage, please remember that a good proportion of the installs are
>likely to be running "in-memory" workloads.
>
>
I agree that in-memory workloads are important, and that is why we
compress on flush rather than compressing on write for our compression
plugin, and it is why we should spend some time optimizing reiser4 to
make its code paths more lightweight for the in-memory case. At the
same time, I think that the workloads where the filesystem matters the
most are the ones that access the disk. With computers, a large
percentage of the time that people notice themselves waiting, it is the
disk drive they are waiting on.

Sigh, there are so many things we should optimize for, and it will be
years before we have hit all the important ones.

2004-11-21 23:24:00

by Nathan Scott

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Hi Andrew,

On Fri, Nov 19, 2004 at 06:38:40PM +1100, [email protected] wrote:
> ...
> The biggest change from the kernels point of view is that Samba4 makes
> extensive use of filesystem xattrs. Almost every file will have a
> ...
> I started some simple benchmarking today using the BENCH-NBENCH
> smbtorture benchmark, with 10 simulated clients and loopback
> networking on a dual Xeon server with 2G ram and a 50G scsi partition.
> I used a 2.6.10-rc2 kernel. This benchmark only involves a
> user.DosAttrib xattr of size 44 on every file (that will be the most
> common situation in production use).
> ...
> xfs 62 MB/sec
> xfs+xattr 40 MB/sec
> xfs+2Kinode 63 MB/sec
> xfs+xattr+2Kinode 58 MB/sec
> ...
> The XFS results with default options are rather disappointing, as XFS
> has usually been a good performer for Samba workloads. Increasing the
> inode size to 2k brought it back to a more reasonable level.

Interesting. There's been on-and-off discussion for some time
as to whether the default mkfs parameters should be changed;
this will add more fuel to that debate, I expect.

I'm curious why you went to 2K inodes instead of 512 - I guess
because that's the largest inode size with a 4K blocksize? If
the defaults were changed, I expect it would be to switch over
to 512 byte inodes - do you have numbers for that?

> To make it easier to benchmark with xattrs, I'm planning on doing a
> new version of dbench with optional xattr support. That will allow
> others to play with xattr performance for the above workload without

Ah great, thanks, I'll be keen to try that when it's available.

cheers.

--
Nathan

2004-11-21 23:44:54

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Nathan,

> I'm curious why you went to 2K inodes instead of 512 - I guess
> because that's the largest inode size with a 4K blocksize? If
> the defaults were changed, I expect it would be to switch over
> to 512 byte inodes - do you have numbers for that?

It was a fairly arbitrary choice. For the test I was running the
xattrs were small (44 bytes), so 512 would have been fine, but some
other tests I run use larger xattrs (for NT ACLs, streams, DOS EAs
etc).
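
(As an aside, if you want to see exactly what Samba4 stores, something
like "getfattr -d -m - -e hex FILE" on a file in the share should dump
all the xattr names and values; that assumes an attr package recent
enough to have the -m and -e options.)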

> Ah great, thanks, I'll be keen to try that when it's available.

It's now released. You can grab it at:

http://samba.org/ftp/tridge/dbench/dbench-3.0.tar.gz

It should produce much more consistent results than previous versions
of dbench, plus it has a -x option to enable xattr support. Other
changes include:

- the runs are now time limited, rather than being a fixed number of
operations. This gives much more consistent results, especially for
fast machines.

- I've changed the mapping of the filesystem operations to be much
closer to what Samba4 does, including the directory scans for case
insensitivity, the stat() calls in name resolution and things like
statfs() calls. The modelling could still be improved, but it's
much better than it was.

- the load file is now compatible with the smbtorture NBENCH test
again (the two diverged a while back).

- the default load file has been updated to be based on NetBench
7.0.3, running an enterprise disk mix.

- the warmup/execute/cleanup phases are now better separated
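
To try the xattr mode, an invocation along these lines should work
(this assumes the -t, -D and -c options behave as described in the
README; client.txt stands for the loadfile shipped in the tarball, and
the run time and mount point are just examples):

  dbench -x -t 600 -c client.txt -D /mnt/test 20

That runs 20 simulated clients for 600 seconds against /mnt/test with
xattr support enabled.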

Cheers, Tridge

2004-11-21 23:54:31

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Andrew,

> Is it reproducible with your tricked-up dbench?
>
> If so, please send me a machine description and the relevant command line
> and I'll do a bsearch.

The new dbench is finished (see my reply to Nathan for details).

I've done some initial runs comparing 2.6.10-rc2 and 2.6.10-rc2-mm2
and I am not seeing the performance gain with mm2 that I reported
earlier. I don't yet know if this is because I screwed up previously,
or there is some other factor that I haven't taken account of.

I'm now doing a larger set of runs comparing the two kernels with a
range of filesystems and much longer run times, plus more repeats per
run. I'm also using a script that reformats the filesystem before each
run in case that was a factor (as it was for reiser4). I'll get you
the results later today.
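
To be concrete, the reformat-before-each-run wrapper is basically a
loop like this sketch (device and mount point are placeholders; the
actual script, with the per-filesystem mkfs options, is what I'll post
along with the results):

  for nclients in 10 20 30 40 50 60; do
      umount /mnt/test 2>/dev/null
      mkfs.xfs -f /dev/sdb1                   # or mke2fs/mkreiserfs as appropriate
      mount /dev/sdb1 /mnt/test
      dbench -t 600 -D /mnt/test $nclients    # add -x for the xattr runs
  done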

Cheers, Tridge

2004-11-22 13:05:51

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

I've put up graphs of the first set of dbench3 results for various
filesystems at:

http://samba.org/~tridge/xattr_results/

All the tests were run on a 2.6.10-rc2 kernel with the patch from
Andreas to add support to ext3 for large inodes. I needed to tweak the
patch for 2.6.10-rc2, but not by much. Full details on the setup are
in the README, and the scripts for reproducing the results yourself
(and the graphs) are in the same directory.

The results show that the ext3 large inode patch is extremely
worthwhile. Using a 256 byte inode on ext3 gained a factor of up to 7x
in performance, and only lost a very small amount when xattrs were not
used. It took ext3 from a very mediocre performance to being the clear
winner among current Linux journaled filesystems for performance when
xattrs are used. Eventually I think that larger inodes should become
the default.

Similarly on xfs, using the large inode option (512 bytes this time)
made a huge difference, gaining a factor of 6x in the best case. If
all versions of the xfs code can handle large inodes then I think it
would be good to change the default, especially as it seems to have
almost no cost when xattrs are not used.
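
For anyone wanting to reproduce the large-inode cases, mkfs options
along these lines should do it (the device name is a placeholder, the
exact commands are in the scripts alongside the results, and the ext3
case assumes an e2fsprogs new enough to accept -I plus the patch from
Andreas to actually use the extra in-inode space for EAs):

  mke2fs -j -I 256 /dev/sdb1            # ext3 with 256 byte inodes
  mkfs.xfs -f -i size=512 /dev/sdb1     # xfs with 512 byte inodes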

Without xattrs reiser3 did extremely well under heavier load, where it
is less of an in-memory test, just as Hans thought it
would. Unfortunately I wasn't able to try reiser4 in these runs due to
the lockups I reported earlier, but I look forward to trying it once
those are fixed.

Reiser3 was also the best "out of the box" journaled filesystem with
xattrs, but it was easily beaten by xfs and ext3 once large inodes
were enabled in those.

jfs wins the award for consistency. As I watched the results develop I
was tempted to just disable the jfs tests as it was so slow, but
eventually it overtook xfs at very large loads. Maybe if I run large
enough loads it will be the overall winner :)

The massive gap between ext2 and the other filesystems really shows
clearly how much we are paying for journaling. I haven't tried any
journal on external device or journal on nvram card tricks yet, but it
looks like those will be worth pursuing.

I'll leave the test script running overnight generating some more
results for even higher loads. I'll update the graphs in the morning.

Cheers, Tridge

2004-11-23 09:40:45

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Andrew,

> > Would anyone care to hazard a guess as to what aspect of -mm2 is
> > gaining us 10% in overall Samba4 performance?
>
> Is it reproducible with your tricked-up dbench?
>
> If so, please send me a machine description and the relevant command line
> and I'll do a bsearch.

Sorry for the delay in getting back to you on this. The full set of
runs for the data I posted last night took 12 hours to produce, so the
machine was a bit busy.

I've now confirmed that the new dbench does indeed show a significant
improvement in 2.6.10-rc2-mm2 as compared to
2.6.10-rc2. Interestingly, the improvement seems to be only in ext3,
which confused me for a while. The difference is also much more
dramatic (as a percentage) when xattrs are enabled in the test.

Here are the results for dbench3 runs with varying numbers of clients,
and with rc2 and rc2-mm2 for ext3. First the non-xattr results:

clients    -rc2   rc2-mm2
--------------------------
     10     362       376
     20     328       357
     30     249       270
     40     169       199
     50     128       155
     60     107       143

Now the xattr results (using the -x option to dbench):

clients    -rc2   rc2-mm2
--------------------------
     10      58       125
     20      44        64
     30      43        54
     40      42        52
     50      49        49
     60      40        47

I don't know why there was no improvement at size 50.

For comparison, there is very little difference for xfs (or the other
filesystems I tested, which were jfs, reiser and ext2). Here are the
non-xattr xfs results:

clients    -rc2   rc2-mm2
--------------------------
     10     365       368
     20     324       328
     30     254       257
     40     194       212
     50     128       139
     60      58        59

The script I used to run dbench is at
http://samba.org/~tridge/xattr_results/
The details on the machine config are there too.

For your bsearch, it's probably best to choose one of the clearest
and least noisy results (like the xattr result for size 20) and just
run the search for that one. That will take a bit under 5 minutes per
test if you use the same runtime I did. You could do it quicker, but
you risk getting more noise in the results.

Cheers, Tridge

2004-11-23 22:36:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

On Nov 23, 2004 20:37 +1100, [email protected] wrote:
> > > Would anyone care to hazard a guess as to what aspect of -mm2 is
> > > gaining us 10% in overall Samba4 performance?
> >
> > Is it reproducible with your tricked-up dbench?
> >
> > If so, please send me a machine description and the relevant command line
> > and I'll do a bsearch.
>
> I've now confirmed that the new dbench does indeed show a significant
> improvement in 2.6.10-rc2-mm2 as compared to 2.6.10-rc2. Interestingly,
> the improvement seems to be only in ext3, which confused me for a while.

Might it be the reservation patches?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/


2004-11-23 22:36:09

by Andreas Dilger

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

On Nov 23, 2004 00:02 +1100, [email protected] wrote:
> I've put up graphs of the first set of dbench3 results for various
> filesystems at:
>
> http://samba.org/~tridge/xattr_results/
>
> The results show that the ext3 large inode patch is extremely
> worthwhile. Using a 256 byte inode on ext3 gained a factor of up to 7x
> in performance, and only lost a very small amount when xattrs were not
> used. It took ext3 from a very mediocre performance to being the clear
> winner among current Linux journaled filesystems for performance when
> xattrs are used. Eventually I think that larger inodes should become
> the default.

For Lustre we tune the inode size at format time to allow the storing
of the "default" EA data within the larger inode. Is this the case with
samba and 256-byte inodes (i.e. is your EA data all going to fit within
the extra 124 bytes of space for storing EAs)? If you have to put any
of the commonly-used EA data into an external block the benefits are lost.

> The massive gap between ext2 and the other filesystems really shows
> clearly how much we are paying for journaling. I haven't tried any
> journal on external device or journal on nvram card tricks yet, but it
> looks like those will be worth pursuing.

One of the other things we do for Lustre right away is create the ext3
filesystem with larger journal sizes so that for the many-client cases
we do not get synchronous journal flushing if there are lots of active
threads. This can make a huge difference in overall performance at
high loads. Use "mke2fs -J size=400 ..." to create a 400MB journal
(assuming you have at least that much RAM and a large enough block
device, at least 4x the journal size just from a "don't waste space"
point of view).

One factor is that you don't necessarily need to write so much data at one
time; another is that ext3 needs to reserve journal space for the worst-case
usage, so you get 40-100 threads allocating "worst case" space, "filling"
the journal (causing new operations to block), and finally completing with
only a small fraction of those reserved journal blocks actually used.

Having an external journal device also generally gives you a large
journal (by default it is the full size of the block device specified)
so sometimes the effects of the large journal are confused with the
fact that it is external. I haven't seen any perf numbers recently on
what kind of effect having an external journal has. I highly doubt that
NVRAM cards are any better than a dedicated disk for the journal, since
journal IO is write-only (except during recovery) and virtually seek-free.
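
If you do want to try an external journal, the usual recipe is
something like this (device names are placeholders, and the journal
device needs to be created with the same blocksize as the filesystem):

  mke2fs -O journal_dev -b 4096 /dev/sdc1
  mke2fs -j -J device=/dev/sdc1 -b 4096 /dev/sdb1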

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/


2004-11-24 08:15:34

by tridge

[permalink] [raw]
Subject: Re: performance of filesystem xattrs with Samba4

Andrew,

You can call off your bsearch - I found the culprit.

For the 2.6.10-rc2 tests I was running with the patch from Andreas
that added large ext3 inode support (in order to also test the
ext3-256 case). For the -mm2 test I wasn't.

This patch was supposed to have no effect if large inodes were not
set up at mkfs time. Unfortunately it does have an effect, as it also
removes the in-place xattr modification logic from
ext3_xattr_set_handle(), so every xattr set becomes the same as a
delete+create pair. In plain -rc2 and in -mm2 an xattr set of the same
size will be done in-place. As every xattr set is of the same size in
dbench3 this made a huge difference.
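
To put it another way, what the benchmark does to each file is
essentially the userspace equivalent of the following (the file name is
just a placeholder and the values are shortened for illustration; the
real attribute is 44 bytes):

  setfattr -n user.DosAttrib -v 0x0000002a somefile
  setfattr -n user.DosAttrib -v 0x0000002b somefile

i.e. an existing xattr is repeatedly replaced with a new value of the
same size, which the unpatched code handles in-place but the patched
code turns into a delete followed by a create.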

Sorry for the false alarm.

Cheers, Tridge