2002-06-20 05:58:21

by Andrea Arcangeli

Subject: 2.4.19pre10aa3

This fixes the o1 sched bits (so in turn breaks alpha :-/ any patch
fixing alpha is welcome of course). Also not yet sure if DaveM is ok
with the removal of prepare_to_switch, his last comment on that is
negative as far as I could see. However this was needed for x86, and it
appears it could make a difference for the uml hang too. I rejected
the sched_yield changes that looked dubious, see below for details.
Robert, you may want to apply this o1 stuff to your o1 sched patch,
it's all incremental to it against mainline.

Also merges some stuff from 19pre10jam2, not all the same, in particular
irq-balance is quite different, previous algorithm looked not really
good while auditing it, benchmarks will tell, any feedback on this in
particular would be welcome. Have a look at xosview to see the
difference.

Then drops the inetpeer cache on Andi's suggestion (I pretty much
agree with his arguments about NAT etc., and personally in particular
I'm glad the collision-prone write_seq ^ jiffies is gone).

Some elevator-fix as well, maybe I overlooked something but the -=count
in the cleanup callbacks looked completely broken, see the comment in
the patch too.

The rest should be quite straightforward, all details below.

URL:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre10aa3.gz
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre10aa3/

Only in 2.4.19pre10aa2: 00_block-highmem-all-18b-12.gz
Only in 2.4.19pre10aa3: 00_block-highmem-all-19-1.gz

Latest version from Jens + enable highio on serverworks.

Only in 2.4.19pre10aa3: 00_drop-inetpeer-cache-1.gz

Drop inetpeer cache, we cannot make per-IP assumption
or we'll break with NAT. This avoids AVL complexity
in handling tcp connections. From Andi Kleen.

Further comments from me: previous afinet.id could also collide during
the connection because of the xor between jiffies and the initial
write_seq. Still it can collide if it's too fast but it's very unlikely
now. I also changed the GRAB_TIME from HZ/50 to HZ, so we don't waste
id space, 32 packets per second looks saner (and in a 1 second
timeframe the global id is not likely to overflow).
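
For illustration only (not part of the patch): a minimal sketch of why an xor-based seed is collision-prone, assuming the old id was derived as write_seq ^ jiffies truncated to the 16-bit IP id. The helper name toy_seed_id is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper, not kernel code: seed a 16-bit id by xoring the
 * initial write_seq with the current jiffies value. */
static uint16_t toy_seed_id(uint32_t write_seq, uint32_t jiffies)
{
	return (uint16_t)(write_seq ^ jiffies);
}
```

Any two (write_seq, jiffies) pairs that differ in the same bit positions give the same seed, for example (0x1234, 0x0004) and (0x1230, 0x0000).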

Only in 2.4.19pre10aa3: 00_elevator-fixes-1

Merged the "merge_only" logic from Andrew's read-latency patch.
Also noticed some apparently buggy stuff in the elevator, so corrected.
There's no point in doing "sequence--" when we insert new requests and
"sequence -= count" when we merge (on top of the --!). If anything
it should be the opposite, merges are much lighter than new requests
for the disk (let's ignore when we cannot merge because of
max_sectors). So let's only rely on the --.
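
For illustration only: a toy model of the accounting described above, not the kernel's elevator code. The struct and function names are hypothetical; "sequence" stands for the budget of how many more times a waiting request may be passed.

```c
/* Toy model, not the kernel's elevator: each waiting request carries a
 * "sequence" budget that shrinks whenever other I/O is served ahead of it. */
struct toy_request {
	int sequence;
};

/* Old accounting as described above: every pass costs 1, and a merge
 * additionally costs the merged sector count on top of that. */
static void toy_account_old(struct toy_request *rq, int is_merge, int count)
{
	rq->sequence--;
	if (is_merge)
		rq->sequence -= count;
}

/* Accounting after the fix: a new request and a merge both cost exactly 1. */
static void toy_account_new(struct toy_request *rq)
{
	rq->sequence--;
}
```

With a starting budget of 100, one 8-sector merge leaves 91 under the old accounting and 99 under the new one, so merges no longer exhaust the budget faster than new requests do.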

Only in 2.4.19pre10aa3: 00_ext3-0.9.18.gz

Merged ext3 updates from 2.4.19pre10jam2.

Only in 2.4.19pre10aa3: 00_nfs-dcache-parent-optimize-1

Trond's parent lookup optimization.

Only in 2.4.19pre10aa3: 00_spinlock-no-egcs-1

Keep the spinlock structure the same across 2.95 and 3.x
(I got a bugreport due to a kernel miscompiled with egcs
some weeks ago, so I don't care much about the buggy egcs
anymore, stable is 2.95, and bleeding edge is 3.1.1).

Only in 2.4.19pre10aa2: 03_sched-pipe-bandwidth-1

Patch obsoleted by 20_o1-sched-updates-A4-1, but the internals
are exactly the same, so the feature from Mike is still retained.

Also note I'm not using the _sync version in the read() path,
only in the write() path, I explained why in some email to l-k
(if you cannot find it, ask). Due to a mistake the previous
sched-pipe-bandwidth-1 had the _sync in the read instead of the
write, but conceptually I prefer it only in the write, even if
I doubt any difference could be measured in real life (either one of
the two places or both should be almost the same in practice, it's
more a microfeature).

Only in 2.4.19pre10aa3: 07_cpqarray-sard-1
Only in 2.4.19pre10aa3: 07_cpqfc-compile-1

cpq fc minor updates.

Only in 2.4.19pre10aa3: 07_e100-1.8.38.gz
Only in 2.4.19pre10aa3: 08_e100-includes-1
Only in 2.4.19pre10aa3: 09_e100-compilehack-1

Merged e100 GPL driver from Intel (also make it link
into the kernel).

Only in 2.4.19pre10aa3: 07_qlogicfc-1.gz
Only in 2.4.19pre10aa3: 08_qlogicfc-template-aa-1
Only in 2.4.19pre10aa3: 09_qlogic-link-1

Latest qlogic FC driver for the qla2x00 devices. Ported
to -aa enabled highio and vary_io, and fixed so that
it can be linked into the kernel too.

Only in 2.4.19pre10aa3: 20_o1-sched-updates-A4-1

O1 scheduler fixes from Ingo, includes the 03_sched-pipe-bandwidth-1
feature from Mike.

Only in 2.4.19pre10aa3: 21_o1-A4-aa-1

Some more incremental O1 cleanup on top of Ingo's changes (note:
not all of them, in particular rejected the sched_yield changes
that wrongly wait for the timeslice to expire before giving the
cpu to the next task in the runqueue). Also rejected the rq_lock
change, see the comment in the patch, -preempt is a no-way for 2.4,
so we can be a bit more efficient thanks to it.

Only in 2.4.19pre10aa3: 30_irq-balance-12

Merged irq balance from 2.4.19pre10jam2. The algorithm here is very
different from the original one, which seems wasteful to me. This new
one should do much better, benchmarks will tell. Also the
implementation is more optimized, and buggy places like idle_timeout
are also corrected. Also have a look at the difference in xosview.

Only in 2.4.19pre10aa3: 50_uml-patch-2.4.18-34.gz

Updated to -34 uml revision from Jeff.

Only in 2.4.19pre10aa3: 90_buddyinfo-1

Statistics about the buddy allocator available from /proc/buddyinfo.

Only in 2.4.19pre10aa3: 94_discontigmem-meminfo-1

Provides per-node mem info stats in /proc/meminfo. Note: unlike
the original patch this is based on DISCONTIGMEM and not on NUMA.
This feature only depends on discontigmem, so there's no reason
to enable it only with numa selected (and the cost of it is not
significant).

Only in 2.4.19pre10aa3: 94_numaq-tsc-1

Disable tsc based gettimeofday on numaq. This is also different
from the original patch, please check, thanks.

Andrea


2002-06-20 06:10:23

by David Miller

Subject: Re: 2.4.19pre10aa3

From: Andrea Arcangeli <[email protected]>
Date: Thu, 20 Jun 2002 07:59:33 +0200

Also not yet sure if DaveM is ok with the removal of
prepare_to_switch, his last comment on that is negative as far I
could see.

Ingo's stuff is perfectly fine, it was a brain fart
wrt. prepare_to_switch.

2002-06-20 06:29:38

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

On Wed, Jun 19, 2002 at 11:04:54PM -0700, David S. Miller wrote:
> From: Andrea Arcangeli <[email protected]>
> Date: Thu, 20 Jun 2002 07:59:33 +0200
>
> Also not yet sure if DaveM is ok with the removal of
> prepare_to_switch, his last comment on that is negative as far I
> could see.
>
> Ingo's stuff is perfectly fine, it was a brain fart
> wrt. prepare_to_switch.

Ok, thanks for the info. prepare_arch_switch looked in the same place,
but just in case. :)

Andrea

2002-06-20 11:45:06

by Andrey Nekrasov

Subject: Re: 2.4.19pre10aa3

Hello Andrea Arcangeli,


Kernel 2.4.19pre10aa3 + hidden_arp (from LVS)



...
Intel(R) PRO/100 Fast Ethernet Adapter - Loadable driver, ver 1.8.38
Copyright (c) 2002 Intel Corporation

eth0: Intel(R) 8255x-based Ethernet Adapter
Mem:0xfb101000 IRQ:18 Speed:100 Mbps Dx:Full
Hardware receive checksums enabled
cpu cycle saver enabled
...

...
eth0 e100_wait_exec_cmd: Wait failed.
hw tcp v4 csum failed
hw tcp v4 csum failed
hw tcp v4 csum failed
hw tcp v4 csum failed
hw tcp v4 csum failed
hw tcp v4 csum failed
hw tcp v4 csum failed
...


What causes this message? Is it something serious, and can it cause problems?


bye.
--

2002-06-20 13:05:29

by J.A. Magallon

Subject: Re: 2.4.19pre10aa3


On 2002.06.20 Andrea Arcangeli wrote:
>
>Only in 2.4.19pre10aa3: 07_e100-1.8.38.gz
>Only in 2.4.19pre10aa3: 08_e100-includes-1
>Only in 2.4.19pre10aa3: 09_e100-compilehack-1
>
> Merged e100 GPL driver from Intel (also make it link
> into the kernel).
>

???

Current driver is 2.0.30...
And wouldn't it have been easier to get it from 2.5? There you already
have good Makefiles, instead of hacking those from Intel, which I suppose
are prepared for building separately from the kernel tree.

Or just take it from jam2...I have been using both e100 and e1000
in the same cluster and no problem with them.

Btw, would you mind merging e1000 as well...? ;).

By.

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.4mdk)

2002-06-20 13:18:38

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

On Thu, Jun 20, 2002 at 03:44:59PM +0400, Andrey Nekrasov wrote:
> Hello Andrea Arcangeli,
>
>
> Kernel 2.4.19pre10aa3 + hidden_arp (from LVS)
>
>
>
> ...
> Intel(R) PRO/100 Fast Ethernet Adapter - Loadable driver, ver 1.8.38
> Copyright (c) 2002 Intel Corporation
>
> eth0: Intel(R) 8255x-based Ethernet Adapter
> Mem:0xfb101000 IRQ:18 Speed:100 Mbps Dx:Full
> Hardware receive checksums enabled
> cpu cycle saver enabled
> ...
>
> ...
> eth0 e100_wait_exec_cmd: Wait failed.
> hw tcp v4 csum failed
> hw tcp v4 csum failed
> hw tcp v4 csum failed
> hw tcp v4 csum failed
> hw tcp v4 csum failed
> hw tcp v4 csum failed
> hw tcp v4 csum failed
> ...
>
>
> Than this message is caused? It something serious also can be problems?

probably a driver issue with hw checksum. btw, interestingly the stack
computes the cksum by hand for every incoming tcp packet (unless
ip_summed is set by the driver to CHECKSUM_UNNECESSARY), this is how it
noticed the hw checksum was wrong. I guess it's either an e100 driver
issue, or a hardware issue. Maybe it's setting DF_CSUM_OFFLOAD for your
card even though your hardware doesn't support that feature, or maybe
there's something wrong in the logic that sets the skb->csum.
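
For illustration only: a minimal userspace sketch of the software fallback described above. The one's-complement sum follows RFC 1071. The toy_ names and enum values are hypothetical stand-ins, not the kernel's definitions, and the TCP pseudo-header is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over a byte buffer (big-endian 16-bit words). */
static uint16_t toy_inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];
	if (len & 1)		/* an odd trailing byte is padded with zero */
		sum += (uint32_t)data[len - 1] << 8;
	while (sum >> 16)	/* fold carries back into the low 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Toy stand-ins for the ip_summed states; the values are illustrative. */
enum toy_summed { TOY_CSUM_NONE, TOY_CSUM_HW, TOY_CSUM_UNNECESSARY };

/* Returns 1 if the packet is accepted. Unless the driver declared the
 * checksum already verified, recompute it in software: a buffer that
 * contains a correct checksum field sums to 0. */
static int toy_csum_ok(enum toy_summed state, const uint8_t *pkt, size_t len)
{
	if (state == TOY_CSUM_UNNECESSARY)
		return 1;
	return toy_inet_checksum(pkt, len) == 0;
}
```

In this sketch a wrong value reported by the hardware is caught only because the software path still recomputes the sum.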

Andrea

2002-06-20 13:31:36

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

On Thu, Jun 20, 2002 at 03:05:11PM +0200, J.A. Magallon wrote:
>
> On 2002.06.20 Andrea Arcangeli wrote:
> >
> >Only in 2.4.19pre10aa3: 07_e100-1.8.38.gz
> >Only in 2.4.19pre10aa3: 08_e100-includes-1
> >Only in 2.4.19pre10aa3: 09_e100-compilehack-1
> >
> > Merged e100 GPL driver from Intel (also make it link
> > into the kernel).
> >
>
> ???
>
> Current driver is 2.0.30...
> And would not have been easier to get it from 2.5 ? You just have
> good Makefiles, instead of hacking those from Intel, that I suppose
> are prepared for building separate from kernel tree.

I was using this version in an environment and I preferred not to change
this variable because I wouldn't have had time to notice if it broke, but
of course I should upgrade soon, thanks for the reminder :). For your tree
you can of course back out these three patches and apply a more recent
version.

> Or just take it from jam2...I have been using both e100 and e1000
> in the same cluster and no problem with them.
>
> Btw, would you mind mergin also e1000...? ;).

I didn't need it in any environment, and I usually try to avoid any
driver update in -aa unless I can test it somehow. However if
there's significant demand for this I can add it as well (in particular
because it's unlikely to raise maintenance problems).

Andrea

2002-06-20 14:06:04

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

it seems in your last jam2 you reintroduced a bug fixed in mainline with
the smptimers patch, that can crash the kernel in the smptimers, see the
run_timer_list_running global variable.

Andrea

2002-06-20 14:08:09

by Ingo Molnar

Subject: Re: 2.4.19pre10aa3


On Thu, 20 Jun 2002, Andrea Arcangeli wrote:

> This fixes the o1 sched bits (so in turn breaks alpha :-/ any patch
> fixing alpha is welcome of course). Also not yet sure if DaveM is ok
> with the removal of prepare_to_switch, his last comment on that is
> negative as far I could see. [...]

his last comment was that it's ok. All prepare_to_switch() functionality
can be put into prepare_arch_switch() just fine.

The -A4 backport sched.c should pretty much work on Alpha out of box, as
long as you have the 3-argument switch_to() that the vanilla kernel has,
and use the same _arch_ defines that x86 does - but add the
prepare_to_switch code to prepare_arch_switch().

> Only in 2.4.19pre10aa3: 21_o1-A4-aa-1
>
> Some more incremental O1 cleanup on top of Ingo's changes (note:
> not all of them, in particular rejected the sched_yield changes
> that wrongly waits the timeslice to expire before giving the
> cpu to the next task in the runqueue). [...]

the gradual decreasing of timeslices is intentional, although in the
simple cases it might not make much of a difference (besides yielding more
before giving up). But if there is some sort of bouncing around between
yielding tasks that happen to be at the lowest priority level already then
we should not be too abrupt about punishing tasks by taking away all their
timeslices.

the 'extra' yielding done does not matter much, since an expired task will
likely wait a lot of time (many milliseconds) before running again - so the
extra yielding is amortized greatly.

> [...] Also the rq_lock rejected,
> see the comment on the patch, -preempt is a no-way for 2.4, so
> we can a bit more efficient thanks to it.

you mean this_rq_lock()? Note that your 21_o1-A4-aa-1 patch does not do
what is listed, those changes appear to be included in
20_o1-sched-updates-A4-1.

wrt. this_rq_lock(), the only difference should be the ordering of cli vs.
the assignment of rq - ie. there should be no efficiency difference.

Ingo

2002-06-20 14:39:23

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

On Thu, Jun 20, 2002 at 04:05:54PM +0200, Ingo Molnar wrote:
>
> On Thu, 20 Jun 2002, Andrea Arcangeli wrote:
>
> > This fixes the o1 sched bits (so in turn breaks alpha :-/ any patch
> > fixing alpha is welcome of course). Also not yet sure if DaveM is ok
> > with the removal of prepare_to_switch, his last comment on that is
> > negative as far I could see. [...]
>
> his last comment was that it's ok. All prepare_to_switch() functionality
> can be put into prepare_arch_switch() just fine.
>
> The -A4 backport sched.c should pretty much work on Alpha out of box, as

I didn't mean it was difficult to fix, just that it's one more breakage
that is unfixed at the moment so alpha won't compile, just in case
somebody wondered why they couldn't compile on alpha.

> long as you have the 3-argument switch_to() that the vanilla kernel has,
> and use the same _arch_ defines that x86 does - but add the
> prepare_to_switch code to prepare_arch_switch().
>
> > Only in 2.4.19pre10aa3: 21_o1-A4-aa-1
> >
> > Some more incremental O1 cleanup on top of Ingo's changes (note:
> > not all of them, in particular rejected the sched_yield changes
> > that wrongly waits the timeslice to expire before giving the
> > cpu to the next task in the runqueue). [...]
>
> the gradual decreasing of timeslices is intentional, although in the
> simple cases it might not make much of a difference (besides yielding more
> before giving up). But if there is some sort of bouncing around between
> yielding tasks that happen to be at the lowest priority level already then
> we should not be too abrupt about punishing tasks by taking away all their
> timeslices.
>
> the 'extra' yielding done does not matter much, since an expired task will
> likely wait alot of time (many milliseconds) before running again - so the
> extra yielding is amortized greatly.

I will think more about this, however the "besides yielding more
before giving up" is the part I didn't like of it, it should give up the
cpu immediately IMHO, no matter if there's only total bouncing, however
while giving up the cpu it also must not lose its timeslice, but that
looked ok before you apparently decreased it in -A4. Again, I may be
wrong, I need more time to check those bits.

> > [...] Also the rq_lock rejected,
> > see the comment on the patch, -preempt is a no-way for 2.4, so
> > we can a bit more efficient thanks to it.
>
> you mean this_rq_lock()? Note that your 21_o1-A4-aa-1 patch does not do
> what is listed, those changes appear to be included in
> 20_o1-sched-updates-A4-1.

correct, this part of the comment really applied to
20_o1-sched-updates-A4-1, not to 21_o1-A4-aa-1, I wrote it there just
because it was still an o1 topic and I forgot to mention it in the
previous point (I should have gone up a few lines before writing it).

> wrt. this_rq_lock(), the only difference should be the ordering of cli vs.
> the assignment of rq - ie. there should be no efficiency difference.

well the microoptimization is to shave a few cycles off the cli
protected sections, I just didn't see the point in growing them.

however I noticed an smp bug in my changes, I was too aggressive
removing the loop in task_rq_lock, not that such bug ever triggered yet
but the rq may change under us while we take the lock if the task is
getting migrated to another cpu.

You may also want to review the irq-balance changes I made while
integrating it.

thanks,

Andrea

2002-06-20 14:43:29

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

On Thu, Jun 20, 2002 at 04:40:33PM +0200, Andrea Arcangeli wrote:
> however I noticed an smp bug in my changes, I was too aggressive
> removing the loop in task_rq_lock, not that such bug ever triggered yet
> but the rq may change under us while we take the lock if the task is
> getting migrated to another cpu.

just for reference, here is the fix:

--- sched/kernel/sched.c.~1~ Thu Jun 20 16:42:41 2002
+++ sched/kernel/sched.c Thu Jun 20 16:43:36 2002
@@ -133,19 +133,13 @@ static inline runqueue_t *task_rq_lock(t
 {
 	struct runqueue *rq;
 
-	/*
-	 * 2.4 cannot be made preemptive or it can trigger preemption bugs all
-	 * over the place (just check the networking per-cpu data), so it's
-	 * pointless to disable irq before reading the current runqueue address.
-	 */
+repeat_lock_task:
 	rq = task_rq(p);
 	spin_lock_irqsave(&rq->lock, *flags);
-	if (unlikely(rq != task_rq(p)))
-		/*
-		 * Bug just in case somebody made the 2.4 kernel non preemptive
-		 * as an experiment on a non production system.
-		 */
-		BUG();
+	if (unlikely(rq != task_rq(p))) {
+		spin_unlock_irqrestore(&rq->lock, *flags);
+		goto repeat_lock_task;
+	}
 	return rq;
 }
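
For illustration only: a userspace sketch of the same lock, recheck, retry pattern, using C11 atomics instead of the kernel's spinlocks. All toy_ names are hypothetical, and the sketch assumes that whoever migrates a task holds the lock of the runqueue it moves the task away from.

```c
#include <stdatomic.h>

/* Toy runqueue with a test-and-set spinlock; not the kernel's runqueue_t. */
struct toy_rq {
	atomic_flag lock;
};

/* Toy task; rq may be changed concurrently by a "migration". */
struct toy_task {
	_Atomic(struct toy_rq *) rq;
};

static void toy_lock(struct toy_rq *rq)
{
	while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void toy_unlock(struct toy_rq *rq)
{
	atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/* Lock the runqueue the task currently belongs to. The task may migrate
 * between reading its rq and acquiring that rq's lock, so recheck after
 * locking and retry on a mismatch. */
static struct toy_rq *toy_task_rq_lock(struct toy_task *p)
{
	struct toy_rq *rq;

	for (;;) {
		rq = atomic_load(&p->rq);
		toy_lock(rq);
		if (rq == atomic_load(&p->rq))
			return rq;	/* still the right queue, lock held */
		toy_unlock(rq);		/* task moved: drop the lock and retry */
	}
}
```

The caller gets back the locked runqueue and releases it with toy_unlock() when done.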


Andrea

2002-06-20 15:39:32

by Heinz Diehl

Subject: Re: 2.4.19pre10aa3

On Thu Jun 20 2002, Andrea Arcangeli wrote:

[....]

Pre10-aa3 fails to compile on my systems:

[....]
gcc -D__KERNEL__ -I/usr/src/linux/include -Wall -Wstrict-prototypes
-Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer
-pipe -mpreferred-stack-boundary=2 -march=k6 -nostdinc -I
/usr/local/gcc2/lib/gcc-lib/i586-pc-linux-gnu/2.95.3/include
-DKBUILD_BASENAME=ioctl -c -o ioctl.o ioctl.c
gcc: Internal compiler error: program cc1 got fatal signal 11
make[3]: *** [ioctl.o] Error 1
make[3]: Leaving directory `/usr/src/linux/fs/ext3'
make[2]: *** [first_rule] Error 2
make[2]: Leaving directory `/usr/src/linux/fs/ext3'
make[1]: *** [_subdir_ext3] Error 2
make[1]: Leaving directory `/usr/src/linux/fs'
make: *** [_dir_fs] Error 2
chiara:/usr/src/linux #

This is not a hardware problem, all other programs and kernels compile
without any problems (and also -aa2 and older -aa kernels).

gcc is "gcc version 2.95.3 20010315 (release)", and it also does not
compile with "Thread model: single, gcc version 3.1". Both compilers
built -aa2 flawlessly.

--
# Heinz Diehl, 68259 Mannheim, Germany

2002-06-20 17:02:31

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

On Thu, Jun 20, 2002 at 03:19:51PM +0200, Andrea Arcangeli wrote:
> On Thu, Jun 20, 2002 at 03:44:59PM +0400, Andrey Nekrasov wrote:
> > Hello Andrea Arcangeli,
> >
> >
> > Kernel 2.4.19pre10aa3 + hidden_arp (from LVS)
> >
> >
> >
> > ...
> > Intel(R) PRO/100 Fast Ethernet Adapter - Loadable driver, ver 1.8.38
> > Copyright (c) 2002 Intel Corporation
> >
> > eth0: Intel(R) 8255x-based Ethernet Adapter
> > Mem:0xfb101000 IRQ:18 Speed:100 Mbps Dx:Full
> > Hardware receive checksums enabled
> > cpu cycle saver enabled
> > ...
> >
> > ...
> > eth0 e100_wait_exec_cmd: Wait failed.
> > hw tcp v4 csum failed
> > hw tcp v4 csum failed
> > hw tcp v4 csum failed
> > hw tcp v4 csum failed
> > hw tcp v4 csum failed
> > hw tcp v4 csum failed
> > hw tcp v4 csum failed
> > ...
> >
> >
> > Than this message is caused? It something serious also can be problems?
>
> probably a driver issue with hw checksum. btw interestingly the stack
> computes the cksum by hand for every tcp incoming packet (unless
> ip_summed is set by the driver to CHECKSUM_UNNECESSARY), this is how it
> noticed the hw checksum was wrong. I guess it's either an e100 driver
> issue, or an hardware issue. Maybe it's setting DF_CSUM_OFFLOAD for your
> card despite your hardware doesn't support that feature, or maybe
> there's something wrong in the logic that sets the skb->csum.

you may want to try again with aa4, it has a newer e100 driver taken
from jam2 that might make a difference here.

Andrea

2002-06-20 22:48:42

by J.A. Magallon

Subject: Re: 2.4.19pre10aa3


On 2002.06.20 Andrea Arcangeli wrote:
>
>Also merges some stuff from 19pre10jam2, not all the same, in particular
>irq-balance is quite different, previous algorithm looked not really
>good while auditing it, benchmarks will tell, any feedback on this in
>particular would be welcome. Have a look at xosview to see the
>difference.
>

Still not tested on the dual xeon, but on a BX with dual PII:
werewolf:~# cat /proc/interrupts
           CPU0       CPU1
  0:      83491      63920    IO-APIC-edge  timer
  1:       2044       1213    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  5:          1          1   IO-APIC-level  bttv
  8:          0          1    IO-APIC-edge  rtc
 10:      52335      24289   IO-APIC-level  aic7xxx, EMU10K1
 11:      69049      50801   IO-APIC-level  eth0, nvidia
 12:      33155      23661    IO-APIC-edge  PS/2 Mouse
 14:          2         14    IO-APIC-edge  ide0
 15:          3         13    IO-APIC-edge  ide1
NMI:          0          0
LOC:     147281     147325
ERR:          0
MIS:         36

Old patch, on the dual P4Xeon box:
werewolf:~> ssh annwn cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:    3302667    3295991    3299383    3299984    IO-APIC-edge  timer
  1:       4813       4680       4796       4846    IO-APIC-edge  keyboard
  2:          0          0          0          0          XT-PIC  cascade
  8:          1          0          0          0    IO-APIC-edge  rtc
 12:      64959      64803      65238      63623    IO-APIC-edge  PS/2 Mouse
 16:      65038      66347      60169      67167   IO-APIC-level  e100
 17:          0          0          0          0   IO-APIC-level  Intel ICH2
 18:     529910     524941     535660     535544   IO-APIC-level  aic7xxx, eth2
 19:      71883      71973      72540      72460   IO-APIC-level  usb-uhci, eth0
 22:    2497022    2491901    2495182    2495298   IO-APIC-level  nvidia
 23:          0          0          0          0   IO-APIC-level  usb-uhci
NMI:          0          0          0          0
LOC:   13198438   13198374   13198440   13198453
ERR:          0
MIS:          0

I think the old one looks much better... ;)

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3a, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.4mdk)

2002-06-21 04:31:37

by Andrea Arcangeli

Subject: Re: 2.4.19pre10aa3

On Fri, Jun 21, 2002 at 12:48:31AM +0200, J.A. Magallon wrote:
>
> On 2002.06.20 Andrea Arcangeli wrote:
> >
> >Also merges some stuff from 19pre10jam2, not all the same, in particular
> >irq-balance is quite different, previous algorithm looked not really
> >good while auditing it, benchmarks will tell, any feedback on this in
> >particular would be welcome. Have a look at xosview to see the
> >difference.
> >
>
> Still not tested on the dual xeon, but on a BX with doal PII:
> werewolf:~# cat /proc/interrupts
> CPU0 CPU1
> 0: 83491 63920 IO-APIC-edge timer
> 1: 2044 1213 IO-APIC-edge keyboard
> 2: 0 0 XT-PIC cascade
> 5: 1 1 IO-APIC-level bttv
> 8: 0 1 IO-APIC-edge rtc
> 10: 52335 24289 IO-APIC-level aic7xxx, EMU10K1
> 11: 69049 50801 IO-APIC-level eth0, nvidia
> 12: 33155 23661 IO-APIC-edge PS/2 Mouse
> 14: 2 14 IO-APIC-edge ide0
> 15: 3 13 IO-APIC-edge ide1
> NMI: 0 0
> LOC: 147281 147325
> ERR: 0
> MIS: 36
>
> Old patch, on the dual P4Xeon box:
> werewolf:~> ssh annwn cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3
> 0: 3302667 3295991 3299383 3299984 IO-APIC-edge timer
> 1: 4813 4680 4796 4846 IO-APIC-edge keyboard
> 2: 0 0 0 0 XT-PIC cascade
> 8: 1 0 0 0 IO-APIC-edge rtc
> 12: 64959 64803 65238 63623 IO-APIC-edge PS/2 Mouse
> 16: 65038 66347 60169 67167 IO-APIC-level e100
> 17: 0 0 0 0 IO-APIC-level Intel ICH2
> 18: 529910 524941 535660 535544 IO-APIC-level aic7xxx, eth2
> 19: 71883 71973 72540 72460 IO-APIC-level usb-uhci, eth0
> 22: 2497022 2491901 2495182 2495298 IO-APIC-level nvidia
> 23: 0 0 0 0 IO-APIC-level usb-uhci
> NMI: 0 0 0 0
> LOC: 13198438 13198374 13198440 13198453
> ERR: 0
> MIS: 0
>
> I think the old one looks much better... ;)

I think you're missing the point: the more irqs you get on the same cpu
the better. You're wasting tons of icache for no good reason. It's an
illusion that seeing the numbers evenly distributed is a good thing.
With the foster the bad thing is that all irqs are delivered to the
same cpu, so the load cannot scale, but for any benchmark (at least
before my patch) you should use irq binding to avoid wasting icache.
Now with my logic maybe we can rely on the random distribution under
load, and on the idle cpu distribution in multiway and in general in
workloads with sometimes idle cpus.
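
For illustration only: irq binding is done by writing a hexadecimal cpu bitmask to /proc/irq/<irq>/smp_affinity. The helper below is a hypothetical sketch that only formats the mask for a single cpu; it assumes cpu numbers below 32 and leaves the actual write to the caller.

```c
#include <stdio.h>

/* Format the hex bitmask that selects a single cpu, in the form accepted
 * by /proc/irq/<irq>/smp_affinity. Assumes cpu < 32. Returns the
 * snprintf() result. */
static int toy_affinity_mask(char *buf, size_t len, unsigned int cpu)
{
	return snprintf(buf, len, "%x", 1u << cpu);
}
```

For cpu 2 the formatted mask is "4", which would then be written to the smp_affinity file of the irq being bound.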

Andrea