NAPI network drivers mask the rx interrupts in their interrupt handler,
and reenable them in dev->poll(). In the worst case, that happens for
every packet. I've tried to measure the overhead of that operation.
The cpu time needed to receive 50k packets/sec:
without NAPI: 53.7 %
with NAPI: 59.9 %
50k packets/sec is the limit for NAPI; at higher packet rates the forced
mitigation kicks in and every interrupt receives more than one packet.
The cpu time was measured by busy-looping in user space; the numbers
should be accurate to within 1 %.
Summary: with my setup, the overhead is around 11 %.
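For reference, the 11 % figure is the increase relative to the non-NAPI
baseline, not the 6.2 point difference in absolute cpu time. A minimal
sketch of that arithmetic in C (the function name is made up for
illustration and is not part of the test apps):

```c
/* Relative increase in cpu time when going from base_pct to with_pct.
 * Both arguments are cpu load in percent of one cpu. */
double relative_overhead(double base_pct, double with_pct)
{
    return (with_pct - base_pct) / base_pct;
}
```

With the numbers above, relative_overhead(53.7, 59.9) comes out at
roughly 0.115, which matches the "around 11 %" in the summary.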
Could someone try to reproduce my results?
Sender:
# sendpkt <target ip> 1 <10..50, to get a good packet rate>
Receiver:
$ loadtest
Please disable any interrupt mitigation features of your nic, otherwise
the mitigation will dramatically change the needed cpu time.
The sender sends ICMP echo reply packets, evenly spaced by
"memset(,,n*512)" between the syscalls.
The cpu load was measured with a user space app that calls
"memset(,,16384)" in a tight loop, and reports the number of loops per
second.
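A rough sketch of what such a background load meter could look like is
below. This is not the actual loadtest program, only an illustration of
the idea described above; the function name, the use of gettimeofday()
and the measuring interval are assumptions.

```c
#include <string.h>
#include <sys/time.h>

#define BUF_SIZE 16384

/* Wall-clock time in seconds. Wall-clock (not process cpu time) is used on
 * purpose: time stolen by interrupts must show up as fewer loops. */
static double now_sec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Call memset() on a 16 KiB buffer in a tight loop for about `seconds`
 * (must be > 0) and return the number of loops per second. */
double measure_loops_per_sec(double seconds)
{
    static unsigned char buf[BUF_SIZE];
    static volatile unsigned char sink;
    unsigned long loops = 0;
    double start = now_sec();
    double end;

    do {
        memset(buf, (int)(loops & 0xff), sizeof(buf));
        sink = buf[loops % BUF_SIZE]; /* keep the memset from being optimized away */
        loops++;
        end = now_sec();
    } while (end - start < seconds);

    return (double)loops / (end - start);
}
```

Run it once on an idle box to get a baseline and again while packets are
arriving; the relative drop in loops per second approximates the cpu
time consumed by the kernel.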
I've used a patched tulip driver; the current NAPI driver contains a
loop that severely slows down the nic under such loads.
The patch and my test apps are at
http://www.q-ag.de/~manfred/loadtest
hardware setup:
Duron 700, VIA KT 133
no IO APIC, i.e. slow 8259 XT PIC.
Accton tulip clone, ADMtek comet.
crossover cable
Sender: Celeron 1.13 GHz, rtl8139
--
Manfred
From: Manfred Spraul <[email protected]>
Date: Tue, 17 Sep 2002 21:53:03 +0200
Receiver:
$ loadtest
This appears to be x86 only, sorry I can't test this out for you as
all my boxes are sparc64.
I was actually eager to try your tests out here.
Do you really need to use x86 instructions to do what you
are doing? There are portable pthread mutexes available.
"David S. Miller" wrote:
>
> From: Manfred Spraul <[email protected]>
> Date: Tue, 17 Sep 2002 21:53:03 +0200
>
> Receiver:
> $ loadtest
>
> This appears to be x86 only, sorry I can't test this out for you as
> all my boxes are sparc64.
>
> I was actually eager to try your tests out here.
>
> Do you really need to use x86 instructions to do what you
> are doing? There are portable pthread mutexes available.
There is a similar background loadtester at
http://www.zip.com.au/~akpm/linux/#zc .
It's fairly fancy - I wrote it for measuring networking
efficiency. It doesn't seem to have any PCisms....
(I measured similar regression using an ancient NAPIfied
3c59x a long time ago).
From: Andrew Morton <[email protected]>
Date: Tue, 17 Sep 2002 14:32:09 -0700
There is a similar background loadtester at
http://www.zip.com.au/~akpm/linux/#zc .
It's fairly fancy - I wrote it for measuring networking
efficiency. It doesn't seem to have any PCisms....
Thanks I'll check it out, but meanwhile I hacked up sparc
specific assembler for manfred's code :-)
(I measured similar regression using an ancient NAPIfied
3c59x a long time ago).
Well, it is due to the same problems manfred saw initially,
namely just a crappy or buggy NAPI driver implementation. :-)
"David S. Miller" wrote:
>
> From: Andrew Morton <[email protected]>
> Date: Tue, 17 Sep 2002 14:32:09 -0700
>
> There is a similar background loadtester at
> http://www.zip.com.au/~akpm/linux/#zc .
>
> It's fairly fancy - I wrote it for measuring networking
> efficiency. It doesn't seem to have any PCisms....
>
> Thanks I'll check it out, but meanwhile I hacked up sparc
> specific assembler for manfred's code :-)
>
> (I measured similar regression using an ancient NAPIfied
> 3c59x a long time ago).
>
> Well, it is due to the same problems manfred saw initially,
> namely just a crappy or buggy NAPI driver implementation. :-)
It was due to additional inl()'s and outl()'s in the driver fastpath.
Testcase was netperf Tx and Rx. Just TCP over 100bT. AFAIK, this overhead
is intrinsic to NAPI. Not to say that its costs outweigh its benefits,
but it's just there.
If someone wants to point me at all the bits and pieces to get a
NAPIfied 3c59x working on 2.5.current I'll retest, and generate
some instruction-level oprofiles.
David S. Miller wrote:
> Any driver should be able to get the NAPI overhead to max out at
> 2 PIOs per packet.
Just to pick nits... my example went from 2 or 3 IOs [depending on the
presence/absence of a work loop] to 6 IOs.
Feel free to re-read my message and point out where an IO can be
eliminated...
Jeff
From: Andrew Morton <[email protected]>
Date: Tue, 17 Sep 2002 14:45:08 -0700
"David S. Miller" wrote:
> Well, it is due to the same problems manfred saw initially,
> namely just a crappy or buggy NAPI driver implementation. :-)
It was due to additional inl()'s and outl()'s in the driver fastpath.
How many? Did the implementation cache the register value in a
software state word or did it read the register each time to write
the IRQ masking bits back?
It is issues like this that make me say "crappy or buggy NAPI
implementation"
Any driver should be able to get the NAPI overhead to max out at
2 PIOs per packet.
And if the performance is really concerning, perhaps add an option to
use MEM space in the 3c59x driver too, IO instructions are constant
cost regardless of how fast the PCI bus being used is :-)
"David S. Miller" wrote:
>
> From: Andrew Morton <[email protected]>
> Date: Tue, 17 Sep 2002 14:45:08 -0700
>
> "David S. Miller" wrote:
> > Well, it is due to the same problems manfred saw initially,
> > namely just a crappy or buggy NAPI driver implementation. :-)
>
> It was due to additional inl()'s and outl()'s in the driver fastpath.
>
> How many? Did the implementation cache the register value in a
> software state word or did it read the register each time to write
> the IRQ masking bits back?
>
Looks like it cached it:
- outw(SetIntrEnb | (inw(ioaddr + 10) & ~StatsFull), ioaddr + EL3_CMD);
vp->intr_enable &= ~StatsFull;
+ outw(vp->intr_enable, ioaddr + EL3_CMD);
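The difference between the two lines of that diff can be illustrated
with a small user-space model that counts port accesses instead of
touching hardware. Everything here (struct fake_nic, fake_inw(),
fake_outw() and the two mask helpers) is hypothetical and is not taken
from the 3c59x driver; the real driver also ORs a command code into the
value it writes, which this sketch leaves out.

```c
#include <stdint.h>

/* Simulated NIC: records how many port reads and writes were issued. */
struct fake_nic {
    uint16_t hw_intr_enable; /* value held by the "hardware" register */
    uint16_t sw_intr_enable; /* driver's cached software copy */
    unsigned pio_reads;
    unsigned pio_writes;
};

uint16_t fake_inw(struct fake_nic *nic)
{
    nic->pio_reads++;
    return nic->hw_intr_enable;
}

void fake_outw(struct fake_nic *nic, uint16_t value)
{
    nic->pio_writes++;
    nic->hw_intr_enable = value;
}

/* Read-modify-write: one port read plus one port write per mask change. */
void mask_bits_rmw(struct fake_nic *nic, uint16_t bits)
{
    uint16_t cur = fake_inw(nic);

    fake_outw(nic, (uint16_t)(cur & ~bits));
}

/* Cached: update the software copy, then issue a single port write. */
void mask_bits_cached(struct fake_nic *nic, uint16_t bits)
{
    nic->sw_intr_enable = (uint16_t)(nic->sw_intr_enable & ~bits);
    fake_outw(nic, nic->sw_intr_enable);
}
```

In this model the cached variant saves exactly one port read per mask or
unmask operation, at the price of keeping the software copy in sync with
the hardware register.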
> It is issues like this that make me say "crappy or buggy NAPI
> implementation"
>
> Any driver should be able to get the NAPI overhead to max out at
> 2 PIOs per packet.
>
> And if the performance is really concerning, perhaps add an option to
> use MEM space in the 3c59x driver too, IO instructions are constant
> cost regardless of how fast the PCI bus being used is :-)
Yup. But deltas are interesting.
From: Jeff Garzik <[email protected]>
Date: Tue, 17 Sep 2002 17:54:42 -0400
David S. Miller wrote:
> Any driver should be able to get the NAPI overhead to max out at
> 2 PIOs per packet.
Just to pick nits... my example went from 2 or 3 IOs [depending on the
presence/absence of a work loop] to 6 IOs.
I mean "2 extra PIOs" not "2 total PIOs".
I think it's doable for just about every driver, even tg3 with its
weird semaphore scheme takes 2 extra PIOs worst case with NAPI.
The semaphore I have to ACK at hw IRQ time anyways, and since
I keep a software copy of the IRQ masking register, mask and unmask
are each one PIO.
Manfred, could you please turn on MMIO (you can select it
via kernel config) and see what the new difference looks like?
I am not so sure with that 6% difference there is no other bug lurking
there; 6% seems too large for an extra two PCI transactions per packet.
If someone could test a different NIC this would be great.
Actually what would be even better is to try something like 20 kpps,
50 kpps, 80 kpps, 100 kpps and 140 kpps and see what we get.
cheers,
jamal
From: jamal <[email protected]>
Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)
I am not so sure with that 6% difference there is no other bug lurking
there; 6% seems too large for an extra two PCI transactions per packet.
{in,out}{b,w,l}() operations have a fixed timing, therefore his
results don't sound that far off.
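To put the 6 % into per-packet terms, here is a small sketch that
converts Manfred's two cpu load figures into extra time per packet. It
assumes the packet rate was exactly 50k packets/sec and that the
Duron 700 runs at 700 MHz; the function names are made up for
illustration.

```c
/* Extra cpu time spent per packet, in microseconds, given the cpu load in
 * percent without and with NAPI and the packet rate. */
double extra_usec_per_packet(double base_pct, double with_pct,
                             double packets_per_sec)
{
    double extra_cpu_fraction = (with_pct - base_pct) / 100.0;

    return extra_cpu_fraction / packets_per_sec * 1e6;
}

/* Convert microseconds to cpu cycles at the given clock in MHz. */
double usec_to_cycles(double usec, double cpu_mhz)
{
    return usec * cpu_mhz;
}
```

With 53.7 % and 59.9 % at 50k packets/sec this gives about 1.24
microseconds per packet, or roughly 870 cycles at 700 MHz. Spread over
two extra PIOs that would be about 0.6 microseconds each.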
It is also one of the reasons I suspect Andrew saw such bad results
with 3c59x, but probably that is not the only reason.
David S. Miller wrote:
> From: Jeff Garzik <[email protected]>
> Date: Tue, 17 Sep 2002 17:54:42 -0400
>
> David S. Miller wrote:
> > Any driver should be able to get the NAPI overhead to max out at
> > 2 PIOs per packet.
>
> Just to pick nits... my example went from 2 or 3 IOs [depending on the
> presence/absence of a work loop] to 6 IOs.
>
> I mean "2 extra PIOs" not "2 total PIOs".
>
> I think it's doable for just about every driver, even tg3 with its
> weird semaphore scheme takes 2 extra PIOs worst case with NAPI.
>
> The semaphore I have to ACK at hw IRQ time anyways, and since
> I keep a software copy of the IRQ masking register, mask and unmask
> are each one PIO.
You're looking at at least one extra get-irq-status too, at least in the
classical 10/100 drivers I'm used to seeing...
Jeff
From: Jeff Garzik <[email protected]>
Date: Tue, 17 Sep 2002 22:11:14 -0400
You're looking at at least one extra get-irq-status too, at least in the
classical 10/100 drivers I'm used to seeing...
How so? The number of ones done in the e1000 NAPI code is the same
(read register until no interesting status bits remain set, same as
pre-NAPI e1000 driver).
For tg3 it's a cheap memory read from the status block not a PIO.
"David S. Miller" wrote:
>
> From: jamal <[email protected]>
> Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)
>
> I am not so sure with that 6% difference there is no other bug lurking
> there; 6% seems too large for an extra two PCI transactions per packet.
>
> {in,out}{b,w,l}() operations have a fixed timing, therefore his
> results don't sound that far off.
>
> It is also one of the reasons I suspect Andrew saw such bad results
> with 3c59x, but probably that is not the only reason.
They weren't "very bad", iirc. Maybe a 5% increase in CPU load.
It was all a long time ago. Will retest if someone sends URLs.
David S. Miller wrote:
> From: Jeff Garzik <[email protected]>
> Date: Tue, 17 Sep 2002 22:11:14 -0400
>
> You're looking at at least one extra get-irq-status too, at least in the
> classical 10/100 drivers I'm used to seeing...
>
> How so? The number of ones done in the e1000 NAPI code is the same
> (read register until no interesting status bits remain set, same as
> pre-NAPI e1000 driver).
>
> For tg3 it's a cheap memory read from the status block not a PIO.
Non-NAPI:
get-irq-stat
ack-irq
get-irq-stat (omit, if no work loop)
NAPI:
get-irq-stat
ack-all-but-rx-irq
mask-rx-irqs
get-irq-stat (omit, if work loop)
...
ack-rx-irqs
get-irq-stat
unmask-rx-irqs
This is the low load / low latency case only. The number of IOs
decreases at higher loads [obviously :)]
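The two sequences above can be written down as tables and counted. This
is only a sketch: treating the steps marked "omit" as optional is one
reading of the list, and the type and function names are made up for
illustration.

```c
#include <stddef.h>

struct io_step {
    const char *name;
    int optional; /* 1 if the list marks the step as omittable */
};

static const struct io_step non_napi_path[] = {
    { "get-irq-stat", 0 },
    { "ack-irq",      0 },
    { "get-irq-stat", 1 },
};

static const struct io_step napi_path[] = {
    { "get-irq-stat",       0 },
    { "ack-all-but-rx-irq", 0 },
    { "mask-rx-irqs",       0 },
    { "get-irq-stat",       1 },
    { "ack-rx-irqs",        0 },
    { "get-irq-stat",       0 },
    { "unmask-rx-irqs",     0 },
};

/* Count the IOs in a path, with or without the optional steps. */
size_t count_ios(const struct io_step *steps, size_t n, int include_optional)
{
    size_t i, total = 0;

    for (i = 0; i < n; i++)
        if (!steps[i].optional || include_optional)
            total++;
    return total;
}

size_t non_napi_ios(int include_optional)
{
    return count_ios(non_napi_path,
                     sizeof(non_napi_path) / sizeof(non_napi_path[0]),
                     include_optional);
}

size_t napi_ios(int include_optional)
{
    return count_ios(napi_path,
                     sizeof(napi_path) / sizeof(napi_path[0]),
                     include_optional);
}
```

Under that reading the non-NAPI path costs 2 or 3 IOs and the NAPI path
costs 6 when its optional status read is left out, which lines up with
the "2 or 3 IOs ... to 6 IOs" quoted earlier in the thread.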
"David S. Miller" <[email protected]> writes:
> From: jamal <[email protected]>
> Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)
>
> I am not so sure with that 6% difference there is no other bug lurking
> there; 6% seems too large for an extra two PCI transactions per packet.
>
> {in,out}{b,w,l}() operations have a fixed timing, therefore his
> results don't sound that far off.
????
I don't see why they should be. If it is a pci device the cost should be
the same as a pci memory I/O. The bus packets are the same. So things like
increasing the pci bus speed should make it take less time.
Plus I have played with calibrating the TSC with outb to port
0x80 and there was enough variation that it was unusable. On some
newer systems it would take twice as long as on some older ones.
Eric
On Wed, 2002-09-18 at 18:27, Eric W. Biederman wrote:
> Plus I have played with calibrating the TSC with outb to port
> 0x80 and there was enough variation that it was unusable. On some
> newer systems it would take twice as long as on some older ones.
port 0x80 isn't going to PCI space.
x86 generally posts mmio writes but not io writes. That's quite measurable.
From: [email protected] (Eric W. Biederman)
Date: 18 Sep 2002 11:27:34 -0600
"David S. Miller" <[email protected]> writes:
> {in,out}{b,w,l}() operations have a fixed timing, therefore his
> results don't sound that far off.
????
I don't see why they should be. If it is a pci device the cost should be
the same as a pci memory I/O. The bus packets are the same. So things like
increasing the pci bus speed should make it take less time.
The x86 processor has a well defined timing for executing inb
etc. instructions, the timing is fixed and is independent of the
speed of the PCI bus the device is on.
On Wed, 2002-09-18 at 21:23, David S. Miller wrote:
> The x86 processor has a well defined timing for executing inb
> etc. instructions, the timing is fixed and is independent of the
> speed of the PCI bus the device is on.
Earth calling Dave Miller
The inb timing depends on the PCI bus. If you want proof, set a Matrox
G400 into no pci retry mode, run a large X load at it and time some inbs;
you should be able to get to about 100 milliseconds for an inb to
execute.
From: Alan Cox <[email protected]>
Date: 18 Sep 2002 21:43:09 +0100
The inb timing depends on the PCI bus. If you want proof, set a Matrox
G400 into no pci retry mode, run a large X load at it and time some inbs;
you should be able to get to about 100 milliseconds for an inb to
execute.
Matrox isn't using inb/outb instructions to IO space, it is being
accessed by X using MEM space which is done using normal load and
store instructions on x86 after the card is mmap()'d into user space.
On Wed, 2002-09-18 at 21:46, David S. Miller wrote:
> From: Alan Cox <[email protected]>
> Date: 18 Sep 2002 21:43:09 +0100
>
> The inb timing depends on the PCI bus. If you want proof, set a Matrox
> G400 into no pci retry mode, run a large X load at it and time some inbs;
> you should be able to get to about 100 milliseconds for an inb to
> execute.
>
> Matrox isn't using inb/outb instructions to IO space, it is being
> accessed by X using MEM space which is done using normal load and
> store instructions on x86 after the card is mmap()'d into user space.
It doesn't matter what XFree86 is doing. That's just to load the PCI bus
and jam it up to prove the point. It'll change your inb timing.
From: Alan Cox <[email protected]>
Date: 18 Sep 2002 22:15:27 +0100
It doesn't matter what XFree86 is doing. That's just to load the PCI bus
and jam it up to prove the point. It'll change your inb timing.
Understood. Maybe a more accurate wording would be "a fixed minimum
timing".
Alan Cox <[email protected]> writes:
> On Wed, 2002-09-18 at 18:27, Eric W. Biederman wrote:
> > Plus I have played with calibrating the TSC with outb to port
> > 0x80 and there was enough variation that it was unusable. On some
> > newer systems it would take twice as long as on some older ones.
>
> port 0x80 isn't going to PCI space.
Agreed. It isn't going anywhere, and it takes it a while to recognize
that.
> x86 generally posts mmio writes but not io writes. That's quite measurable.
The timing difference between posted and non-posted writes
I can see.
Eric
"David S. Miller" <[email protected]> writes:
> From: Alan Cox <[email protected]>
> Date: 18 Sep 2002 22:15:27 +0100
>
> It doesn't matter what XFree86 is doing. That's just to load the PCI bus
> and jam it up to prove the point. It'll change your inb timing.
>
> Understood. Maybe a more accurate wording would be "a fixed minimum
> timing".
Why?
If I do an inb to a PCI-X device running at 133 MHz it should come back
much faster than an inb from my serial port on the ISA bus. What
is the reason for the fixed minimum timing?
Alan asserted there is a posting behavior difference, but that should
not affect reads.
What is different between mmio and pio to a pci device when doing reads
that should make mmio faster?
Eric
On Thu, 2002-09-19 at 16:03, Eric W. Biederman wrote:
> If I do an inb to a PCI-X device running at 133 MHz it should come back
> much faster than an inb from my serial port on the ISA bus. What
> is the reason for the fixed minimum timing?
As far as I can tell the minimum time for the inb/outb is simply the
time it takes the bus to respond. The only difference there is that for
writel rather than outl you won't wait for the write to complete on the
PCI bus; you just dump it into the fifo if it's empty.
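The posted versus non-posted distinction can be captured in a toy cost
model. This is purely illustrative: the struct, the field names and any
latency values plugged into it are assumptions, not measurements, and
the model ignores everything except whether the cpu has to wait for the
bus.

```c
/* Toy model of a single register access. All values are hypothetical. */
struct bus_model {
    double bus_latency_ns; /* time for the device to respond on the bus */
    double cpu_issue_ns;   /* time for the cpu to hand off the access */
    int fifo_empty;        /* 1 if the write fifo can take the write now */
};

/* Non-posted write (outl-like): the cpu waits until the bus completes it. */
double nonposted_write_cost(const struct bus_model *m)
{
    return m->cpu_issue_ns + m->bus_latency_ns;
}

/* Posted write (writel-like): the cpu continues once the write is in the
 * fifo. If the fifo is busy, this model assumes a full bus wait. */
double posted_write_cost(const struct bus_model *m)
{
    if (m->fifo_empty)
        return m->cpu_issue_ns;
    return m->cpu_issue_ns + m->bus_latency_ns;
}

/* Reads cannot be posted: the cpu needs the data, so it always waits. */
double read_cost(const struct bus_model *m)
{
    return m->cpu_issue_ns + m->bus_latency_ns;
}
```

In this model a read costs the same whether it is issued as inl or
readl, which is the point Eric raises, while a write is only cheaper
through MMIO as long as the fifo has room.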