2002-07-26 17:43:32

by David Mosberger

[permalink] [raw]
Subject: performance experiment

Below is a patch that implements an alternate version of the core-loop
of do_select(). I'm interested in hearing how the two versions
(original and new) compare on various architectures. The new loop
happens to perform better on ia64 and I suspect the same will be true
for most RISC platforms. It wouldn't surprise me if the new loop
performed better even on some instances of x86. I suspect on older
x86s (e.g., 80486), the old loop does better. If someone is running
Linux on a Transmeta Crusoe chip, I'd be *very* interested in hearing
how the two loops perform there.

Here is what I'm proposing to do: if a couple of people are willing to
try out the patch below, I'll collect the results and post a summary.
To make sure we're comparing apples to apples, I'd like to suggest to
run LMbench 2 with and without the patch below. Then send me the
select results from the raw results file. For example, you would run
lmbench like so:

$ make rerun

Then look at the results file, which is stored in
results/CONFIG/HOSTNAME.N. For example, on a Pentium III machine
called "adler", the results of the first run would be stored in

results/i686-pc-linux-gnu/adler.0

I'd prefer if you sent me the complete result files, but if you don't
want to do that, it should be good enough to mail me the first and
second line of the file, all the lines starting with "Select", and a
description of the machine you were testing (CPU type, clock speed,
chipset, memory size, and compiler version would be ideal). For the
above example, the lines of interest would be:

[lmbench2.0 results for Linux adler.hpl.hp.com 2.4.8 #1 Wed Aug 15 12:20:46 PDT 2001 i686 unknown]
[LMBENCH_VER: Beta2.3 20010502085147]
Select on 10 fd's: 1.8509 microseconds
Select on 100 fd's: 8.2669 microseconds
Select on 250 fd's: 18.9724 microseconds
Select on 500 fd's: 36.6954 microseconds
Select on 10 tcp fd's: 2.5548 microseconds
Select on 100 tcp fd's: 15.4286 microseconds
Select on 250 tcp fd's: 33.2831 microseconds
Select on 500 tcp fd's: 65.4353 microseconds

IMPORTANT: please collect these results at least for a UP kernel. If
you can supply the results for an SMP kernel (even if it's run only on
a single-processor machine), that'd be icing on the cake.

Thanks & happy hacking,

--david

PS: I believe the alternate implementation is semantically identical
to the old one. But don't sue me if the kernel crashes and burns
after applying the patch.

--- linux-2.4.18/fs/select.c Mon Sep 24 15:08:21 2001
+++ lia64-2.4/fs/select.c Fri Jul 26 10:28:35 2002
@@ -164,7 +164,7 @@
int do_select(int n, fd_set_bits *fds, long *timeout)
{
poll_table table, *wait;
- int retval, i, off;
+ long retval, i, off;
long __timeout = *timeout;

read_lock(&current->files->file_lock);
@@ -182,6 +182,7 @@
retval = 0;
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
+#if 0
for (i = 0 ; i < n; i++) {
unsigned long bit = BIT(i);
unsigned long mask;
@@ -214,6 +215,55 @@
wait = NULL;
}
}
+#else
+ unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
+
+ inp = fds->in; outp = fds->out; exp = fds->ex;
+ rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
+
+ for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
+ unsigned long in, out, ex, all_bits, bit = 1, mask, j;
+ unsigned long res_in = 0, res_out = 0, res_ex = 0;
+ struct file_operations *f_op;
+ struct file *file = NULL;
+
+ in = *inp++; out = *outp++; ex = *exp++;
+ all_bits = in | out | ex;
+ if (all_bits == 0)
+ continue;
+
+ for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
+ if (i >= n)
+ break;
+ if (!(bit & all_bits))
+ continue;
+ file = fget(i);
+ if (file)
+ f_op = file->f_op;
+ mask = DEFAULT_POLLMASK;
+ if (file) {
+ if (f_op && f_op->poll)
+ mask = (*f_op->poll)(file, retval ? NULL : wait);
+ fput(file);
+ if ((mask & POLLIN_SET) && (in & bit)) {
+ res_in |= bit;
+ retval++;
+ }
+ if ((mask & POLLOUT_SET) && (out & bit)) {
+ res_out |= bit;
+ retval++;
+ }
+ if ((mask & POLLEX_SET) && (ex & bit)) {
+ res_ex |= bit;
+ retval++;
+ }
+ }
+ }
+ if (res_in) *rinp = res_in;
+ if (res_out) *routp = res_out;
+ if (res_ex) *rexp = res_ex;
+ }
+#endif
wait = NULL;
if (retval || !__timeout || signal_pending(current))
break;


2002-07-26 18:39:18

by David Mosberger

[permalink] [raw]
Subject: Re: performance experiment

>>>>> On Fri, 26 Jul 2002 11:09:31 -0700 (PDT), Davide Libenzi <[email protected]> said:

Davide> On Fri, 26 Jul 2002, David Mosberger wrote:
>> Below is a patch that implements an alternate version of the core-loop
>> of do_select(). I'm interested in hearing how the two versions
>> (original and new) compare on various architectures. The new loop
>> happens to perform better on ia64 and I suspect the same will be true
>> for most RISC platforms. It wouldn't surprise me if the new loop
>> performed better even on some instances of x86. I suspect on older
>> x86s (e.g., 80486), the old loop does better. If someone is running
>> Linux on a Transmeta Crusoe chip, I'd be *very* interested in hearing
>> how the two loops perform there.
>>
>> Here is what I'm proposing to do: if a couple of people are willing to
>> try out the patch below, I'll collect the results and post a summary.
>> To make sure we're comparing apples to apples, I'd like to suggest to
>> run LMbench 2 with and without the patch below. Then send me the
>> select results from the raw results file. For example, you would run
>> lmbench like so:
>>
>> $ make rerun
>>
>> Then look at the results file, which is stored in
>> results/CONFIG/HOSTNAME.N. For example, on a Pentium III machine
>> called "adler", the results of the first run would be stored in
>>
>> results/i686-pc-linux-gnu/adler.0
>>
>> I'd prefer if you sent me the complete result files, but if you don't
>> want to do that, it should be good enough to mail me the first and
>> second line of the file, all the lines starting with "Select", and a
>> description of the machine you were testing (CPU type, clock speed,
>> chipset, memory size, and compiler version would be ideal). For the
>> above example, the lines of interest would be:

Davide> i posted a 95% matching patch about one year ago but it fell
Davide> inside the Alan drop basket :-) basically

Yes, there is nothing deep in the patch. If it wasn't for
register-starved architectures such as x86, it would be the obviously
correct thing to do. Actually, it's a lot easier to convert register
accesses into memory accesses than vice versa, so in principle, the
new loop should do better even on x86 (this reasoning is what triggers
my interest in how Crusoe fares).

Of course, principles get often hit by that nasty thing called reality
(such as the reality of a poor compiler), so I do think some
experimentation is in place. (I should probably also say that my
primary aim is not to speed up select(); if that's an outcome of this
work, fine, but it I'm really more interest in how the two loops
perform across different architectures and implementations).

Davide> Alan requested perf numbers that i did not have time to
Davide> supply. glad you did them ...

Well, I have _not_ done them. I really do need help from others here,
as I have access only to so many types of machines.

Thanks,

--david

2002-07-26 18:04:14

by Davide Libenzi

[permalink] [raw]
Subject: Re: performance experiment

On Fri, 26 Jul 2002, David Mosberger wrote:

> Below is a patch that implements an alternate version of the core-loop
> of do_select(). I'm interested in hearing how the two versions
> (original and new) compare on various architectures. The new loop
> happens to perform better on ia64 and I suspect the same will be true
> for most RISC platforms. It wouldn't surprise me if the new loop
> performed better even on some instances of x86. I suspect on older
> x86s (e.g., 80486), the old loop does better. If someone is running
> Linux on a Transmeta Crusoe chip, I'd be *very* interested in hearing
> how the two loops perform there.
>
> Here is what I'm proposing to do: if a couple of people are willing to
> try out the patch below, I'll collect the results and post a summary.
> To make sure we're comparing apples to apples, I'd like to suggest to
> run LMbench 2 with and without the patch below. Then send me the
> select results from the raw results file. For example, you would run
> lmbench like so:
>
> $ make rerun
>
> Then look at the results file, which is stored in
> results/CONFIG/HOSTNAME.N. For example, on a Pentium III machine
> called "adler", the results of the first run would be stored in
>
> results/i686-pc-linux-gnu/adler.0
>
> I'd prefer if you sent me the complete result files, but if you don't
> want to do that, it should be good enough to mail me the first and
> second line of the file, all the lines starting with "Select", and a
> description of the machine you were testing (CPU type, clock speed,
> chipset, memory size, and compiler version would be ideal). For the
> above example, the lines of interest would be:

i posted a 95% matching patch about one year ago but it fell inside the
Alan drop basket :-) basically Alan requested perf numbers that i did not
have time to supply. glad you did them ...



- Davide


2002-07-26 20:44:06

by Davide Libenzi

[permalink] [raw]
Subject: Re: performance experiment

On Fri, 26 Jul 2002, David Mosberger wrote:

> Davide> i posted a 95% matching patch about one year ago but it fell
> Davide> inside the Alan drop basket :-) basically
>
> Yes, there is nothing deep in the patch. If it wasn't for
> register-starved architectures such as x86, it would be the obviously
> correct thing to do. Actually, it's a lot easier to convert register
> accesses into memory accesses than vice versa, so in principle, the
> new loop should do better even on x86 (this reasoning is what triggers
> my interest in how Crusoe fares).

IMHO the patch makes sense, it reduces memory loads and it "helps" the
compiler to correctly allocate registers.



- Davide


2002-07-27 01:26:32

by David Mosberger

[permalink] [raw]
Subject: Re: performance experiment

>>>>> On Fri, 26 Jul 2002 13:49:27 -0700 (PDT), Davide Libenzi <[email protected]> said:

Davide> IMHO the patch makes sense, it reduces memory loads and it
Davide> "helps" the compiler to correctly allocate registers.

I just realized that a gcc3 feature snuck into the source-code by
accident. Below is a fixed patch which should work with gcc2 as well.

--david

--- linux-2.4.18/fs/select.c Mon Sep 24 15:08:21 2001
+++ lia64-2.4/fs/select.c Fri Jul 26 18:26:38 2002
@@ -164,7 +164,7 @@
int do_select(int n, fd_set_bits *fds, long *timeout)
{
poll_table table, *wait;
- int retval, i, off;
+ long retval, i, off;
long __timeout = *timeout;

read_lock(&current->files->file_lock);
@@ -181,6 +181,7 @@
wait = NULL;
retval = 0;
for (;;) {
+#if 0
set_current_state(TASK_INTERRUPTIBLE);
for (i = 0 ; i < n; i++) {
unsigned long bit = BIT(i);
@@ -214,6 +215,57 @@
wait = NULL;
}
}
+#else
+ unsigned long *rinp, *routp, *rexp, *inp, *outp, *exp;
+
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ inp = fds->in; outp = fds->out; exp = fds->ex;
+ rinp = fds->res_in; routp = fds->res_out; rexp = fds->res_ex;
+
+ for (i = 0; i < n; ++rinp, ++routp, ++rexp) {
+ unsigned long in, out, ex, all_bits, bit = 1, mask, j;
+ unsigned long res_in = 0, res_out = 0, res_ex = 0;
+ struct file_operations *f_op;
+ struct file *file = NULL;
+
+ in = *inp++; out = *outp++; ex = *exp++;
+ all_bits = in | out | ex;
+ if (all_bits == 0)
+ continue;
+
+ for (j = 0; j < __NFDBITS; ++j, ++i, bit <<= 1) {
+ if (i >= n)
+ break;
+ if (!(bit & all_bits))
+ continue;
+ file = fget(i);
+ if (file)
+ f_op = file->f_op;
+ mask = DEFAULT_POLLMASK;
+ if (file) {
+ if (f_op && f_op->poll)
+ mask = (*f_op->poll)(file, retval ? NULL : wait);
+ fput(file);
+ if ((mask & POLLIN_SET) && (in & bit)) {
+ res_in |= bit;
+ retval++;
+ }
+ if ((mask & POLLOUT_SET) && (out & bit)) {
+ res_out |= bit;
+ retval++;
+ }
+ if ((mask & POLLEX_SET) && (ex & bit)) {
+ res_ex |= bit;
+ retval++;
+ }
+ }
+ }
+ if (res_in) *rinp = res_in;
+ if (res_out) *routp = res_out;
+ if (res_ex) *rexp = res_ex;
+ }
+#endif
wait = NULL;
if (retval || !__timeout || signal_pending(current))
break;