I've been running bonnie++ filesystem tests on an IBM x335 server
recently. This box uses the MPT RAID controller, but I've disabled the
RAID and am addressing the disks individually. I'm getting wildly
different results between 2.4.20-20.9 (Red Hat mod), 2.4.22 (stock), and
2.6.0-test9.
The full results are here: http://groove.jpj.net/x335-test.html
The base distro is Red Hat 9; there are no extraneous daemons running or
modules loaded. I'm using a dedicated drive as the scratch directory.
I'm looking for some insight as to why I'm seeing such a disparity in
performance.
The server has dual 3.06GHz P4 CPUs, 1.5GB RAM, and two 36GB Ultra320 disks.
bonnie++ is run as
bonnie++ -d /test -s 3g -m x335-`uname -r` -n 200 -x 2 -u root -q
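[For reference: -d gives the scratch directory, -s 3g the size of the test
file, -m the machine label reported in the results, -n 200 the number of
files (in units of 1024) for the file-creation tests, -x 2 the number of
runs, -u the user to run as, and -q quiet output.]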
Thanks
-Paul
On Tue, 4 Nov 2003, Paul Venezia wrote:
>
> I've been running bonnie++ filesystem tests on an IBM x335 server
> recently. This box uses the MPT RAID controller, but I've disabled the
> RAID and am addressing the disks individually. I'm getting wildly
> different results between 2.4.20-20.9 (Red Hat mod), 2.4.22 (stock), and
> 2.6.0-test9.
Interesting. The 2.4.22 sequential "per char" results are totally out of
line with anything else.
The thing is, the overhead for the per-char stuff really should be almost
all in user space unless I'm mistaken. It's just using getc/putc, no?
Which makes me suspect that either the libc does something different
depending on kernel version, _or_ 2.4.22 returns a different st_blksize,
causing stdio to use a different buffer size.
Have you tried stracing the "per char" parts of the benchmark to see what
the system call patterns are? That should show both effects.
Linus
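[Illustrative aside, not from the original mail: st_blksize is what stdio
consults when sizing a stream's buffer, so a quick way to compare what the
kernels report is a one-off fstat() check like this sketch.]

#include <stdio.h>
#include <sys/stat.h>

/* Print the st_blksize the kernel reports for stdout; stdio uses this
   value to pick its default buffer size. */
int main(void)
{
	struct stat st;

	if (fstat(fileno(stdout), &st) != 0) {
		perror("fstat");
		return 1;
	}
	printf("st_blksize = %ld\n", (long)st.st_blksize);
	return 0;
}

[Run with stdout redirected to a file on the scratch disk under each kernel
to see whether the reported block size actually differs.]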
On Tue, Nov 04, 2003 at 11:36:55AM -0800, Linus Torvalds wrote:
> > I've been running bonnie++ filesystem tests on an IBM x335 server
> > recently. This box uses the MPT RAID controller, but I've disabled the
> > RAID and am addressing the disks individually. I'm getting wildly
> > different results between 2.4.20-20.9 (Red Hat mod), 2.4.22 (stock), and
> > 2.6.0-test9.
>
> Interesting. The 2.4.22 sequential "per char" results are totally out of
> line with anything else.
>
> The thing is, the overhead for the per-char stuff really should be almost
> all in user space unless I'm mistaken. It's just using getc/putc, no?
Unless bonnie++ is using the _unlocked() variants, it might be an issue of
the mutex overhead from NPTL v. LinuxThreads. Red Hat 9 has its share
of NPTL bugs.
It is probably worth rerunning the tests with LD_ASSUME_KERNEL=2.4.1 on
the Red Hat kernel.
Regards,
Bill Rugolsky
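[Illustrative aside, not from the original mail: a direct way to test
Bill's hypothesis is to time the same loop with putc() and with the
lock-free putc_unlocked(); if the NPTL/LinuxThreads gap disappears in the
_unlocked version, stdio locking is the culprit. LD_ASSUME_KERNEL=2.4.1
works because it makes the dynamic linker pick the LinuxThreads libraries
instead of the NPTL ones.]

#include <stdio.h>

#define NR (10*1000*1000)

int main(void)
{
	int i;

	/* Swap in putc('0', stdout) here to measure the locked variant. */
	for (i = 0; i < NR; i++)
		putc_unlocked('0', stdout);
	return 0;
}

[Time both variants with and without LD_ASSUME_KERNEL=2.4.1, e.g.
"time ./a.out > /dev/null".]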
On Tue, 4 Nov 2003, Bill Rugolsky Jr. wrote:
>
> Unless bonnie++ is using the _unlocked() variants, it might be an issue of
> the mutex overhead from NPTL v. LinuxThreads. Red Hat 9 has its share
> of NPTL bugs.
Hmm.. That would easily explain the differences, since NPTL will trigger
both on 2.6.0 and the RH-2.4 kernel, but not on the standard 2.4.22
kernel.
But there really should be zero contention on the stdio data structures,
so the locking would have to be _seriously_ broken to make that kind of
difference (not necessarily buggy, but seriously badly implemented).
A non-contended lock should be at most one locked instruction if well
done, both on LinuxThreads and NPTL.
> It is probably worth rerunning the tests with LD_ASSUME_KERNEL=2.4.1 on
> the Red Hat kernel.
That would be interesting.
Linus
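[Sketch of what "at most one locked instruction" means on x86, with
illustrative names, not from any of the mails: the uncontended acquire is
a single lock cmpxchg, and only the contended path needs to fall back to
something slower such as a futex wait.]

/* Fast-path acquire: one "lock cmpxchgl". Returns nonzero on success;
   a real implementation falls back to a slow path when this fails. */
static inline int try_lock(volatile int *lockword)
{
	int expected = 0;

	__asm__ __volatile__("lock; cmpxchgl %2, %1"
			     : "+a" (expected), "+m" (*lockword)
			     : "r" (1)
			     : "memory");
	return expected == 0;	/* lock word was 0, so we now own it */
}

int main(void)
{
	static volatile int lock;

	return try_lock(&lock) ? 0 : 1;	/* uncontended: always succeeds */
}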
On Tue, 2003-11-04 at 15:30, Linus Torvalds wrote:
> On Tue, 4 Nov 2003, Bill Rugolsky Jr. wrote:
> >
> > Unless bonnie++ is using the _unlocked() variants, it might be an issue of
> > the mutex overhead from NPTL v. LinuxThreads. Red Hat 9 has its share
> > of NPTL bugs.
>
> Hmm.. That would easily explain the differences, since NPTL will trigger
> both on 2.6.0 and the RH-2.4 kernel, but not on the standard 2.4.22
> kernel.
>
> But there really should be zero contention on the stdio data structures,
> so the locking would have to be _seriously_ broken to make that kind of
> difference (not necessarily buggy, but seriously badly implemented).
>
> A non-contended lock should be at most one locked instruction if well
> done, both on LinuxThreads and NPTL.
Good point...
A truncated strace under 2.4.22 is here:
http://groove.jpj.net/bonnie-strace-trunc
It's incomplete, but shows the putc calls.
> > It is probably worth rerunning the tests with LD_ASSUME_KERNEL=2.4.1 on
> > the Red Hat kernel.
>
> That would be interesting.
Tests are running now. Updates as events warrant.
-Paul
On Tue, Nov 04, 2003 at 04:07:43PM -0500, Paul Venezia wrote:
> Tests are running now. Updates as events warrant.
Well, I'm too lazy to wait for a long test, but with a mere
100MB file, on 1GHz P3:
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
NPTL          100M  7735  99 127068 98 63048  84  7890  98 +++++ +++ +++++ +++
LinuxThreads  100M 11000  99 127928 97 59075  84 11290  98 +++++ +++ +++++ +++
So something is amiss.
Regards,
Bill Rugolsky
On Tue, Nov 04, 2003 at 12:30:23PM -0800, Linus Torvalds wrote:
> But there really should be zero contention on the stdio data structures,
> so the locking would have to be _seriously_ broken to make that kind of
> difference (not necessarily buggy, but seriously badly implemented).
>
> A non-contended lock should be at most one locked instruction if well
> done, both on LinuxThreads and NPTL.
The results that I just posted are also for Red Hat 9, kernel 2.4.20-20.9.
rugolsky@ti31: getconf GNU_LIBPTHREAD_VERSION
NPTL 0.34
Ulrich's release notes for nptl-0.57 say:
The changes are numerous and most of them were made by Jakub:
...
~ better stdio locking
I don't have my laptop running Fedora handy, but that's the next thing
to test.
Regards,
Bill Rugolsky
On Tue, 4 Nov 2003, Bill Rugolsky Jr. wrote:
>
> Well, I'm too lazy to wait for a long test, but with a mere
> 100MB file, on 1GHz P3:
>
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> NPTL 100M 7735 99 127068 98 63048 84 7890 98 +++++ +++ +++++ +++
> LinuxThreads 100M 11000 99 127928 97 59075 84 11290 98 +++++ +++ +++++ +++
>
> So something is amiss.
Ok, so NPTL locking (even in the absence of any threads and thus any
contention) seems to be noticeably higher-overhead than the old
LinuxThreads.
90% of the overhead of a putc()/getc() implementation these days is likely
just locking, so this implies that NPTL locking is about twice as
expensive as the old LinuxThreads one.
Don't ask me why. But I'm cc'ing Uli, who can probably tell us. Maybe the
RH-9 libraries are just not very good, and LinuxThreads has had a lot
longer to optimize its lock behaviour.
Linus
Linus Torvalds wrote:
> Don't ask me why. But I'm cc'ing Uli, who can probably tell us. Maybe the
> RH-9 libraries are just not very good, and LinuxThreads has had a lot
> longer to optimize their lock behaviour..
I don't see any version numbers mentioned. If you want to benchmark
NPTL use the recent code, e.g., from Fedora Core 1 or RHEL3. Nothing
else makes any sense since there have been countless changes since the
early releases.
--
Ulrich Drepper, Red Hat
On Tue, Nov 04, 2003 at 01:40:51PM -0800, Linus Torvalds wrote:
> On Tue, 4 Nov 2003, Bill Rugolsky Jr. wrote:
> >
> > Well, I'm too lazy to wait for a long test, but with a mere
> > 100MB file, on 1GHz P3:
> >
> > Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
> > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> > NPTL 100M 7735 99 127068 98 63048 84 7890 98 +++++ +++ +++++ +++
> > LinuxThreads 100M 11000 99 127928 97 59075 84 11290 98 +++++ +++ +++++ +++
> >
> > So something is amiss.
>
> Ok, so NPTL locking (even in the absence of any threads and thus any
> contention) seems to be noticeably higher-overhead than the old
> LinuxThreads.
>
> 90% of the overhead of a putc()/getc() implementation these days is likely
> just locking, so this implies that NPTL locking is about twice as
> expensive as the old LinuxThreads one.
On Fedora 0.95 (Pentium M 1.6GHz, 2.4.22-1.2115.nptl, glibc-2.3.2-10, NPTL 0.60),
I get:
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
NPTL          100M 13070 100 +++++ +++ 14141   4 13099 100 +++++ +++ +++++ +++
LinuxThreads  100M 25957 100 +++++ +++ 20037   5 26777  99 +++++ +++ +++++ +++
Ugh, still there.
Bill Rugolsky
On Tue, Nov 04, 2003 at 05:19:04PM -0500, Bill Rugolsky Jr. wrote:
> On Fedora 0.95, Pentium M 1.6GHz, 2.4.22-1.2115.nptl, glibc-2.3.2-10, (NPTL 0.60),
> I get:
>
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> NPTL 100M 13070 100 +++++ +++ 14141 4 13099 100 +++++ +++ +++++ +++
> LinuxThreads 100M 25957 100 +++++ +++ 20037 5 26777 99 +++++ +++ +++++ +++
Eek, that's glibc-2.3.2-101, not -10 as I wrote above.
- Bill Rugolsky
On Tue, 4 Nov 2003, Ulrich Drepper wrote:
>
> I don't see any version numbers mentioned. If you want to benchmark
> NPTL use the recent code, e.g., from Fedora Core 1 or RHEL3. Nothing
> else makes any sense since there have been countless changes since the
> early releases.
This is actually _really_ trivial to see with a simple test program.
This is Fedora Core test3:
#include <stdio.h>

/* Change this to match your CPU */
#define NR (10*1000*1000)

int main(int argc, char **argv)
{
	int i;

	for (i = 0; i < NR; i++)
		putchar(0);
	return 0;
}
and then just time it.
I get:
torvalds@home:~> time ./a.out > /dev/null
real 0m1.305s
user 0m1.283s
sys 0m0.004s
and
torvalds@home:~> time LD_ASSUME_KERNEL=2.4.1 ./a.out > /dev/null
real 0m0.321s
user 0m0.318s
sys 0m0.003s
ie a factor of _four_ difference in the speed of "putchar()".
Interestingly, if I compile the program statically, I don't see this
effect, and it's noticeably faster still:
torvalds@home:~> gcc -O2 -static test.c
torvalds@home:~> time ./a.out > /dev/null
real 0m0.193s
user 0m0.191s
sys 0m0.002s
torvalds@home:~> time LD_ASSUME_KERNEL=2.4.1 ./a.out > /dev/null
real 0m0.194s
user 0m0.190s
sys 0m0.004s
Is the TLS stuff done through an extra dynamically loaded indirection or
something?
Linus
Linus Torvalds wrote:
> Is the TLS stuff done through an extra dynamically loaded indirection or
> something?
This has nothing to do with TLS. The code currently in use ended up
going through the general libpthread locking code. This was, I think, the
result of one of the last changes in the locking code, where the libc side
wasn't updated correctly. I've fixed this, and this is what I see:
drepper@ht 20031104-2$ time ./u > /dev/null
real 0m1.272s
user 0m1.270s
sys 0m0.000s
drepper@ht 20031104-2$ time LD_ASSUME_KERNEL=2.4.1 ./u > /dev/null
real 0m0.316s
user 0m0.320s
sys 0m0.000s
drepper@ht 20031104-2$ time LD_LIBRARY_PATH=. ./u > /dev/null
real 0m0.207s
user 0m0.210s
sys 0m0.000s
The first is the old nptl code, the second LinuxThreads, the third the
current nptl code.
--
Ulrich Drepper, Red Hat
On Tue, 4 Nov 2003, Ulrich Drepper wrote:
>
> This was, I think, the result of one of the last changes in the locking
> code where the libc side wasn't updated correctly. I've done this and
> this is what I see:
Goodie.
> drepper@ht 20031104-2$ time ./u > /dev/null
> real 0m1.272s
> user 0m1.270s
> sys 0m0.000s
>
> drepper@ht 20031104-2$ time LD_ASSUME_KERNEL=2.4.1 ./u > /dev/null
> real 0m0.316s
> user 0m0.320s
> sys 0m0.000s
>
> drepper@ht 20031104-2$ time LD_LIBRARY_PATH=. ./u > /dev/null
> real 0m0.207s
> user 0m0.210s
> sys 0m0.000s
>
> The first is the old nptl code, the second LinuxThreads, the third the
> current nptl code.
Now _that_ looks a hell of a lot better. Thanks.
Linus
On Tue, Nov 04, 2003 at 03:48:28PM -0800, Ulrich Drepper wrote:
>
> The first is the old nptl code, the second LinuxThreads, the third the
> current nptl code.
By current, do you mean what is in Fedora, or your personal development copy?
Thanks,
Jim
On Tue, Nov 04, 2003 at 05:19:04PM -0500, Bill Rugolsky Jr. wrote:
> On Fedora 0.95, Pentium M 1.6GHz, 2.4.22-1.2115.nptl, glibc-2.3.2-10, (NPTL 0.60),
> I get:
>
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random-
> -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> NPTL 100M 13070 100 +++++ +++ 14141 4 13099 100 +++++ +++ +++++ +++
> LinuxThreads 100M 25957 100 +++++ +++ 20037 5 26777 99 +++++ +++ +++++ +++
>
> Ugh, still there.
BTW, there are 3 different cases where locking might be different in glibc:
when -lpthread is not linked in, when -lpthread is linked in but
pthread_create hasn't been called yet, and when the first pthread_create
has already been called.
Could you post numbers for all these cases? I.e., run the benchmark, then
link it against -lpthread as well and rerun it, and finally link it
against -lpthread and add:
static void * tf (void *a) { return NULL; }
...
pthread_t pt;
pthread_create (&pt, NULL, tf, 0);
pthread_join (pt, NULL);
...
to the benchmark's main() (in each case for both NPTL and LinuxThreads)?
Jakub
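[A self-contained version of Jakub's third case, assembled from his
fragments and Linus' test program above; illustrative, not a literal patch
to bonnie++. For cases 1 and 2, comment out the pthread lines and (for
case 1) drop -lpthread. Build: gcc -O2 test.c -lpthread]

#include <stdio.h>
#include <pthread.h>

#define NR (10*1000*1000)

static void *tf(void *a) { return NULL; }

int main(void)
{
	pthread_t pt;
	int i;

	/* Case 3: create (and join) one thread so libc switches to its
	   multi-threaded locking before the timed loop runs. */
	pthread_create(&pt, NULL, tf, NULL);
	pthread_join(pt, NULL);

	for (i = 0; i < NR; i++)
		putchar(0);
	return 0;
}

[Timing this under both NPTL and LD_ASSUME_KERNEL=2.4.1 covers all six
combinations Jakub asks for.]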
On Tue, Nov 04, 2003 at 07:58:16PM -0500, [email protected] wrote:
> On Tue, Nov 04, 2003 at 03:48:28PM -0800, Ulrich Drepper wrote:
> >
> > The first is the old nptl code, the second LinuxThreads, the third the
> > current nptl code.
>
> By current, do you mean what is in Fedora, or you personal development copy?
Ulrich meant glibc CVS HEAD.
For some reason, stdio locking was not using the jump-around-lock-prefix
variant of locking:
__asm __volatile ("cmpl $0, %%gs:%P6\n\t" \
		  "je,pt 0f\n\t" \
		  "lock\n" \
		  "0:\tcmpxchgl %1, %2\n\t" \
		  "jnz _L_mutex_lock_%=\n\t" \
		  ".subsection 1\n\t" ...
but the variant without the first 2 insns, so there were 2 instructions
with a lock prefix in putc and similar functions even when only one thread
was running.
Jakub
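[To see why the jump-around matters, an illustrative micro-benchmark of a
plain vs. a lock-prefixed instruction; names and numbers are ours, not
from the original mail. On a P4 of that era a locked instruction costs on
the order of a hundred cycles, so paying for two of them per putc() when
only one thread is running is substantial.]

#include <stdio.h>
#include <time.h>

#define NR (100*1000*1000)

static volatile int word;

int main(void)
{
	clock_t t0, t1, t2;
	int i;

	t0 = clock();
	for (i = 0; i < NR; i++)
		__asm__ __volatile__("incl %0" : "+m" (word));
	t1 = clock();
	for (i = 0; i < NR; i++)
		__asm__ __volatile__("lock; incl %0" : "+m" (word));
	t2 = clock();

	printf("plain:  %.2fs\nlocked: %.2fs\n",
	       (double)(t1 - t0) / CLOCKS_PER_SEC,
	       (double)(t2 - t1) / CLOCKS_PER_SEC);
	return 0;
}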