LinuxLists.cc - Info: NAPI performance at "low" loads

2002-09-17 19:48:14

Subject: Info: NAPI performance at "low" loads

NAPI network drivers mask the rx interrupts in their interrupt handler,
and reenable them in dev->poll(). In the worst case, that happens for
every packet. I've tried to measure the overhead of that operation.

The cpu time needed to receive 50k packets/sec:

without NAPI: 53.7 %
with NAPI: 59.9 %

50k packets/sec is the limit for NAPI: at higher packet rates the forced
mitigation kicks in and every interrupt receives more than one packet.

The cpu time was measured by busy-looping in user space; the numbers
should be accurate to within 1 %.
Summary: with my setup, the relative overhead is around 11 % (6.2
percentage points of cpu time).
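
A worked check of the summary figure, as a minimal sketch: the two readings
differ by 6.2 points of cpu time, which is about 11.5 % relative to the
non-NAPI baseline. The function names are illustrative only and are not taken
from the test apps.

```c
/* Absolute difference between two cpu-load readings, in percentage points. */
double overhead_points(double base_pct, double napi_pct)
{
    return napi_pct - base_pct;
}

/* Relative overhead of the NAPI run, in percent of the baseline load. */
double relative_overhead_pct(double base_pct, double napi_pct)
{
    return 100.0 * (napi_pct - base_pct) / base_pct;
}
```

With the figures above, relative_overhead_pct(53.7, 59.9) comes to roughly
11.5, while overhead_points(53.7, 59.9) gives 6.2.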

Could someone try to reproduce my results?

Sender:
# sendpkt <target ip> 1 <10..50, to get a good packet rate>

Receiver:
$ loadtest

Please disable any interrupt mitigation features of your nic, otherwise
the mitigation will dramatically change the needed cpu time.
The sender sends ICMP echo reply packets, evenly spaced by
"memset(,,n*512)" between the syscalls.
The cpu load was measured with a user space app that calls
"memset(,,16384)" in a tight loop, and reports the number of loops per
second.
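
The measurement idea described above can be sketched roughly as follows. This
is not the loadtest source from the URL below, only a guess at its shape: the
names measure_loops_per_sec() and cpu_load_pct() are invented here, and
deriving the load from the ratio of loaded to idle loop rates is an assumption
about how the percentages were obtained.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() */
#endif
#include <string.h>
#include <time.h>

#define BUF_SIZE 16384           /* same size as the memset in the post */

/* Not static, so the compiler cannot discard the memset as dead stores. */
unsigned char load_buf[BUF_SIZE];

/* Call memset() back to back for roughly `seconds` of wall-clock time
 * and return the achieved loops per second. The clock is read on every
 * iteration, which adds a small constant cost to each loop. */
double measure_loops_per_sec(double seconds)
{
    struct timespec start, now;
    unsigned long loops = 0;
    double elapsed;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        memset(load_buf, (int)(loops & 0xff), sizeof(load_buf));
        loops++;
        clock_gettime(CLOCK_MONOTONIC, &now);
        elapsed = (double)(now.tv_sec - start.tv_sec)
                + (double)(now.tv_nsec - start.tv_nsec) / 1e9;
    } while (elapsed < seconds);

    return (double)loops / elapsed;
}

/* Share of cpu time taken away from the loop, in percent, assuming the
 * loop gets all otherwise idle cpu time. */
double cpu_load_pct(double idle_rate, double loaded_rate)
{
    return 100.0 * (1.0 - loaded_rate / idle_rate);
}
```

One would take a reading on an idle machine, a second one while packets are
arriving, and feed both rates to cpu_load_pct().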

I've used a patched tulip driver; the current NAPI driver contains a
loop that severely slows down the nic under such loads.

The patch and my test apps are at

http://www.q-ag.de/~manfred/loadtest

hardware setup:
Duron 700, VIA KT 133
no IO APIC, i.e. slow 8259 XT PIC.
Accton tulip clone, ADMtek comet.
crossover cable
Sender: Celeron 1.13 GHz, rtl8139

--
Manfred

2002-09-17 21:03:42

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: Manfred Spraul <[email protected]>
Date: Tue, 17 Sep 2002 21:53:03 +0200

> Receiver:
> $ loadtest

This appears to be x86 only, sorry I can't test this out for you as
all my boxes are sparc64.

I was actually eager to try your tests out here.

Do you really need to use x86 instructions to do what you
are doing? There are portable pthread mutexes available.

2002-09-17 21:27:18

by Andrew Morton

Subject: Re: Info: NAPI performance at "low" loads

"David S. Miller" wrote:
>
> From: Manfred Spraul <[email protected]>
> Date: Tue, 17 Sep 2002 21:53:03 +0200
>
> Receiver:
> $ loadtest
>
> This appears to be x86 only, sorry I can't test this out for you as
> all my boxes are sparc64.
>
> I was actually eager to try your tests out here.
>
> Do you really need to use x86 instructions to do what you
> are doing? There are portable pthread mutexes available.

There is a similar background loadtester at
http://www.zip.com.au/~akpm/linux/#zc .

It's fairly fancy - I wrote it for measuring networking
efficiency. It doesn't seem to have any PCisms....

(I measured similar regression using an ancient NAPIfied
3c59x a long time ago).

2002-09-17 21:30:42

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: Andrew Morton <[email protected]>
Date: Tue, 17 Sep 2002 14:32:09 -0700

> There is a similar background loadtester at
> http://www.zip.com.au/~akpm/linux/#zc .
>
> It's fairly fancy - I wrote it for measuring networking
> efficiency. It doesn't seem to have any PCisms....

Thanks I'll check it out, but meanwhile I hacked up sparc
specific assembler for manfred's code :-)

> (I measured similar regression using an ancient NAPIfied
> 3c59x a long time ago).

Well, it is due to the same problems manfred saw initially,
namely just a crappy or buggy NAPI driver implementation. :-)

2002-09-17 21:40:19

by Andrew Morton

Subject: Re: Info: NAPI performance at "low" loads

"David S. Miller" wrote:
>
> From: Andrew Morton <[email protected]>
> Date: Tue, 17 Sep 2002 14:32:09 -0700
>
> There is a similar background loadtester at
> http://www.zip.com.au/~akpm/linux/#zc .
>
> It's fairly fancy - I wrote it for measuring networking
> efficiency. It doesn't seem to have any PCisms....
>
> Thanks I'll check it out, but meanwhile I hacked up sparc
> specific assembler for manfred's code :-)
>
> (I measured similar regression using an ancient NAPIfied
> 3c59x a long time ago).
>
> Well, it is due to the same problems manfred saw initially,
> namely just a crappy or buggy NAPI driver implementation. :-)

It was due to additional inl()'s and outl()'s in the driver fastpath.

Testcase was netperf Tx and Rx. Just TCP over 100bT. AFAIK, this overhead
is intrinsic to NAPI. Not to say that its costs outweigh its benefits,
but it's just there.

If someone wants to point me at all the bits and pieces to get a
NAPIfied 3c59x working on 2.5.current I'll retest, and generate
some instruction-level oprofiles.

2002-09-17 21:50:15

by Jeff Garzik

Subject: Re: Info: NAPI performance at "low" loads

David S. Miller wrote:
> Any driver should be able to get the NAPI overhead to max out at
> 2 PIOs per packet.

Just to pick nits... my example went from 2 or 3 IOs [depending on the
presence/absence of a work loop] to 6 IOs.

Feel free to re-read my message and point out where an IO can be
eliminated...

Jeff

2002-09-17 21:43:55

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: Andrew Morton <[email protected]>
Date: Tue, 17 Sep 2002 14:45:08 -0700

> "David S. Miller" wrote:
> > Well, it is due to the same problems manfred saw initially,
> > namely just a crappy or buggy NAPI driver implementation. :-)
>
> It was due to additional inl()'s and outl()'s in the driver fastpath.

How many? Did the implementation cache the register value in a
software state word or did it read the register each time to write
the IRQ masking bits back?

It is issues like this that make me say "crappy or buggy NAPI
implementation"

Any driver should be able to get the NAPI overhead to max out at
2 PIOs per packet.

And if the performance is really concerning, perhaps add an option to
use MEM space in the 3c59x driver too, IO instructions are constant
cost regardless of how fast the PCI bus being used is :-)

2002-09-17 21:54:35

by Andrew Morton

Subject: Re: Info: NAPI performance at "low" loads

"David S. Miller" wrote:
>
> From: Andrew Morton <[email protected]>
> Date: Tue, 17 Sep 2002 14:45:08 -0700
>
> "David S. Miller" wrote:
> > Well, it is due to the same problems manfred saw initially,
> > namely just a crappy or buggy NAPI driver implementation. :-)
>
> It was due to additional inl()'s and outl()'s in the driver fastpath.
>
> How many? Did the implementation cache the register value in a
> software state word or did it read the register each time to write
> the IRQ masking bits back?
>

Looks like it cached it:

- outw(SetIntrEnb | (inw(ioaddr + 10) & ~StatsFull), ioaddr + EL3_CMD);
vp->intr_enable &= ~StatsFull;
+ outw(vp->intr_enable, ioaddr + EL3_CMD);

> It is issues like this that make me say "crappy or buggy NAPI
> implementation"
>
> Any driver should be able to get the NAPI overhead to max out at
> 2 PIOs per packet.
>
> And if the performance is really concerning, perhaps add an option to
> use MEM space in the 3c59x driver too, IO instructions are constant
> cost regardless of how fast the PCI bus being used is :-)

Yup. But deltas are interesting.

2002-09-17 21:53:21

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: Jeff Garzik <[email protected]>
Date: Tue, 17 Sep 2002 17:54:42 -0400

> David S. Miller wrote:
> > Any driver should be able to get the NAPI overhead to max out at
> > 2 PIOs per packet.
>
> Just to pick nits... my example went from 2 or 3 IOs [depending on the
> presence/absence of a work loop] to 6 IOs.

I meant "2 extra PIOs", not "2 total PIOs".

I think it's doable for just about every driver; even tg3, with its
weird semaphore scheme, takes 2 extra PIOs worst case with NAPI.

I have to ACK the semaphore at hw IRQ time anyway, and since
I keep a software copy of the IRQ masking register, mask and unmask
are each one PIO.
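
The difference between a read-modify-write of the mask register and a cached
software copy can be sketched in plain C. This is a user-space toy, not driver
code: struct fake_nic, fake_inw() and fake_outw() are hypothetical stand-ins
that only count accesses, and IRQ_RX is an invented bit value.

```c
#include <stdint.h>

#define IRQ_RX 0x0001u  /* hypothetical rx interrupt enable bit */

struct fake_nic {
    uint16_t hw_intr_mask;  /* stands in for the device's mask register */
    uint16_t intr_enable;   /* software copy kept by the driver */
    unsigned pio_count;     /* number of simulated port accesses */
};

/* Simulated port read: costs one PIO. */
static uint16_t fake_inw(struct fake_nic *nic)
{
    nic->pio_count++;
    return nic->hw_intr_mask;
}

/* Simulated port write: costs one PIO. */
static void fake_outw(struct fake_nic *nic, uint16_t value)
{
    nic->pio_count++;
    nic->hw_intr_mask = value;
}

/* Read-modify-write on the device register: two PIOs per call. */
void mask_rx_rmw(struct fake_nic *nic)
{
    fake_outw(nic, (uint16_t)(fake_inw(nic) & ~IRQ_RX));
}

/* Update the software copy first, then write it out: one PIO per call. */
void mask_rx_cached(struct fake_nic *nic)
{
    nic->intr_enable &= (uint16_t)~IRQ_RX;
    fake_outw(nic, nic->intr_enable);
}

void unmask_rx_cached(struct fake_nic *nic)
{
    nic->intr_enable |= IRQ_RX;
    fake_outw(nic, nic->intr_enable);
}
```

In this model a mask/unmask pair through the cached copy costs two simulated
PIOs in total, which is the "2 extra PIOs" figure; each read-modify-write
costs two on its own.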

2002-09-18 01:00:03

by jamal

Subject: Re: Info: NAPI performance at "low" loads

Manfred, could you please turn on MMIO (you can select it
via kernel config) and see what the new difference looks like?

I am not so sure with that 6% difference there is no other bug lurking
there; 6% seems too large for an extra two PCI transactions per packet.
If someone could test a different NIC this would be great.
Actually what would be even better is to go through something like 20 kpps,
50 kpps, 80 kpps, 100 kpps and 140 kpps and see what we get.

cheers,
jamal
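
For scale, the figures from the first message can be converted into extra cpu
time per packet. A rough sketch, assuming the percentages are shares of one
cpu and the rate is exactly 50,000 packets/sec; the helper names are made up
for this illustration.

```c
/* Extra cpu time per packet in microseconds, given two cpu-load
 * readings (in percent of one cpu) taken at the same packet rate. */
double extra_us_per_packet(double base_pct, double napi_pct, double pkts_per_sec)
{
    double extra_cpu_fraction = (napi_pct - base_pct) / 100.0;

    return extra_cpu_fraction / pkts_per_sec * 1e6;
}

/* The same cost expressed in cpu clock cycles (us * MHz = cycles). */
double extra_cycles_per_packet(double extra_us, double cpu_mhz)
{
    return extra_us * cpu_mhz;
}
```

extra_us_per_packet(53.7, 59.9, 50000.0) is about 1.24 microseconds, or
roughly 870 cycles on the 700 MHz Duron from the hardware list.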

2002-09-18 01:04:21

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: jamal <[email protected]>
Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)

> I am not so sure with that 6% difference there is no other bug lurking
> there; 6% seems too large for an extra two PCI transactions per packet.

{in,out}{b,w,l}() operations have a fixed timing, therefore his
results don't sound that far off.

It is also one of the reasons I suspect Andrew saw such bad results
with 3c59x, but probably that is not the only reason.

2002-09-18 02:06:46

by Jeff Garzik

Subject: Re: Info: NAPI performance at "low" loads

David S. Miller wrote:
> From: Jeff Garzik <[email protected]>
> Date: Tue, 17 Sep 2002 17:54:42 -0400
>
> David S. Miller wrote:
> > Any driver should be able to get the NAPI overhead to max out at
> > 2 PIOs per packet.
>
> Just to pick nits... my example went from 2 or 3 IOs [depending on the
> presence/absence of a work loop] to 6 IOs.
>
> I meant "2 extra PIOs", not "2 total PIOs".
>
> I think it's doable for just about every driver; even tg3, with its
> weird semaphore scheme, takes 2 extra PIOs worst case with NAPI.
>
> I have to ACK the semaphore at hw IRQ time anyway, and since
> I keep a software copy of the IRQ masking register, mask and unmask
> are each one PIO.

You're looking at at least one extra get-irq-status too, at least in the
classical 10/100 drivers I'm used to seeing...

Jeff

2002-09-18 02:10:54

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: Jeff Garzik <[email protected]>
Date: Tue, 17 Sep 2002 22:11:14 -0400

> You're looking at at least one extra get-irq-status too, at least in the
> classical 10/100 drivers I'm used to seeing...

How so? The number of status reads done in the e1000 NAPI code is the same
(read register until no interesting status bits remain set, same as
pre-NAPI e1000 driver).

For tg3 it's a cheap memory read from the status block not a PIO.

2002-09-18 02:11:33

by Andrew Morton

Subject: Re: Info: NAPI performance at "low" loads

"David S. Miller" wrote:
>
> From: jamal <[email protected]>
> Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)
>
> I am not so sure with that 6% difference there is no other bug lurking
> there; 6% seems too large for an extra two PCI transactions per packet.
>
> {in,out}{b,w,l}() operations have a fixed timing, therefore his
> results don't sound that far off.
>
> It is also one of the reasons I suspect Andrew saw such bad results
> with 3c59x, but probably that is not the only reason.

They weren't "very bad", iirc. Maybe a 5% increase in CPU load.

It was all a long time ago. Will retest if someone sends URLs.

2002-09-18 02:32:08

by Jeff Garzik

Subject: Re: Info: NAPI performance at "low" loads

David S. Miller wrote:
> From: Jeff Garzik <[email protected]>
> Date: Tue, 17 Sep 2002 22:11:14 -0400
>
> You're looking at at least one extra get-irq-status too, at least in the
> classical 10/100 drivers I'm used to seeing...
>
> How so? The number of status reads done in the e1000 NAPI code is the same
> (read register until no interesting status bits remain set, same as
> pre-NAPI e1000 driver).
>
> For tg3 it's a cheap memory read from the status block not a PIO.

Non-NAPI:

get-irq-stat
ack-irq
get-irq-stat (omit, if no work loop)

NAPI:

get-irq-stat
ack-all-but-rx-irq
mask-rx-irqs
get-irq-stat (omit, if work loop)
...
ack-rx-irqs
get-irq-stat
unmask-rx-irqs

This is the low load / low latency case only. The number of IOs
decreases at higher loads [obviously :)]
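
The two sequences above can be tallied mechanically. The sketch below encodes
each list as a table and counts the register accesses, taking the "omit"
notes literally; the type and function names are invented for this
illustration. Only the figure of 6 IOs for NAPI appears in the thread, so the
count for the NAPI case without a work loop is just what the literal reading
yields.

```c
#include <stdbool.h>
#include <stddef.h>

enum omit_rule { ALWAYS, OMIT_IF_WORK_LOOP, OMIT_IF_NO_WORK_LOOP };

struct io_step {
    const char *name;
    enum omit_rule rule;
};

/* Non-NAPI interrupt path, as listed in the message. */
static const struct io_step non_napi_steps[] = {
    { "get-irq-stat", ALWAYS },
    { "ack-irq",      ALWAYS },
    { "get-irq-stat", OMIT_IF_NO_WORK_LOOP },
};

/* NAPI path at low load: interrupt handler followed by one poll. */
static const struct io_step napi_steps[] = {
    { "get-irq-stat",       ALWAYS },
    { "ack-all-but-rx-irq", ALWAYS },
    { "mask-rx-irqs",       ALWAYS },
    { "get-irq-stat",       OMIT_IF_WORK_LOOP },
    { "ack-rx-irqs",        ALWAYS },
    { "get-irq-stat",       ALWAYS },
    { "unmask-rx-irqs",     ALWAYS },
};

/* Count the steps that are actually executed for the given driver style. */
static unsigned count_ios(const struct io_step *steps, size_t n, bool work_loop)
{
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        if (steps[i].rule == OMIT_IF_WORK_LOOP && work_loop)
            continue;
        if (steps[i].rule == OMIT_IF_NO_WORK_LOOP && !work_loop)
            continue;
        count++;
    }
    return count;
}

unsigned non_napi_ios(bool work_loop)
{
    return count_ios(non_napi_steps,
                     sizeof(non_napi_steps) / sizeof(non_napi_steps[0]),
                     work_loop);
}

unsigned napi_ios(bool work_loop)
{
    return count_ios(napi_steps,
                     sizeof(napi_steps) / sizeof(napi_steps[0]),
                     work_loop);
}
```

With a work loop this gives 3 IOs without NAPI and 6 with it, which matches
the "2 or 3 IOs ... to 6 IOs" quoted earlier in the thread.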

2002-09-18 17:37:18

by Eric W. Biederman

Subject: Re: Info: NAPI performance at "low" loads

"David S. Miller" <[email protected]> writes:

> From: jamal <[email protected]>
> Date: Tue, 17 Sep 2002 20:57:58 -0400 (EDT)
>
> I am not so sure with that 6% difference there is no other bug lurking
> there; 6% seems too large for an extra two PCI transactions per packet.
>
> {in,out}{b,w,l}() operations have a fixed timing, therefore his
> results don't sound that far off.
????

I don't see why they should be. If it is a pci device the cost should
be the same as a pci memory I/O. The bus packets are the same. So things like
increasing the pci bus speed should make it take less time.

Plus I have played with calibrating the TSC with outb to port
0x80 and there was enough variation that it was unusable. On some
newer systems it would take twice as long as on some older ones.

Eric

2002-09-18 17:43:07

by Alan

Subject: Re: Info: NAPI performance at "low" loads

On Wed, 2002-09-18 at 18:27, Eric W. Biederman wrote:
> Plus I have played with calibrating the TSC with outb to port
> 0x80 and there was enough variation that it was unusable. On some
> newer systems it would take twice as long as on some older ones.

port 0x80 isn't going to PCI space.

x86 generally posts mmio writes but not io writes. That's quite measurable.

2002-09-18 20:28:17

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: [email protected] (Eric W. Biederman)
Date: 18 Sep 2002 11:27:34 -0600

> "David S. Miller" <[email protected]> writes:
>
> > {in,out}{b,w,l}() operations have a fixed timing, therefore his
> > results don't sound that far off.
> ????
>
> I don't see why they should be. If it is a pci device the cost should
> be the same as a pci memory I/O. The bus packets are the same. So things like
> increasing the pci bus speed should make it take less time.

The x86 processor has a well defined timing for executing inb
etc. instructions, the timing is fixed and is independent of the
speed of the PCI bus the device is on.

2002-09-18 20:34:35

by Alan

Subject: Re: Info: NAPI performance at "low" loads

On Wed, 2002-09-18 at 21:23, David S. Miller wrote:
> The x86 processor has a well defined timing for executing inb
> etc. instructions, the timing is fixed and is independent of the
> speed of the PCI bus the device is on.

Earth calling Dave Miller

The inb timing depends on the PCI bus. If you want proof, set a Matrox
G400 into no pci retry mode, run a large X load at it and time some inbs;
you should be able to get to about 100 milliseconds for an inb to
execute.

2002-09-18 20:50:58

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: Alan Cox <[email protected]>
Date: 18 Sep 2002 21:43:09 +0100

> The inb timing depends on the PCI bus. If you want proof, set a Matrox
> G400 into no pci retry mode, run a large X load at it and time some inbs;
> you should be able to get to about 100 milliseconds for an inb to
> execute.

Matrox isn't using inb/outb instructions to IO space, it is being
accessed by X using MEM space which is done using normal load and
store instructions on x86 after the card is mmap()'d into user space.

2002-09-18 21:06:44

by Alan

Subject: Re: Info: NAPI performance at "low" loads

On Wed, 2002-09-18 at 21:46, David S. Miller wrote:
> From: Alan Cox <[email protected]>
> Date: 18 Sep 2002 21:43:09 +0100
>
> The inb timing depends on the PCI bus. If you want proof, set a Matrox
> G400 into no pci retry mode, run a large X load at it and time some inbs;
> you should be able to get to about 100 milliseconds for an inb to
> execute.
>
> Matrox isn't using inb/outb instructions to IO space, it is being
> accessed by X using MEM space which is done using normal load and
> store instructions on x86 after the card is mmap()'d into user space.

It doesn't matter what XFree86 is doing. That's just to load the PCI bus
and jam it up to prove the point. It'll change your inb timing.

2002-09-18 21:28:03

by David Miller

Subject: Re: Info: NAPI performance at "low" loads

From: Alan Cox <[email protected]>
Date: 18 Sep 2002 22:15:27 +0100

> It doesn't matter what XFree86 is doing. That's just to load the PCI bus
> and jam it up to prove the point. It'll change your inb timing.

Understood. Maybe a more accurate wording would be "a fixed minimum
timing".

2002-09-19 15:08:48

by Eric W. Biederman

Subject: Re: Info: NAPI performance at "low" loads

Alan Cox <[email protected]> writes:

> On Wed, 2002-09-18 at 18:27, Eric W. Biederman wrote:
> > Plus I have played with calibrating the TSC with outb to port
> > 0x80 and there was enough variation that it was unusable. On some
> > newer systems it would take twice as long as on some older ones.
>
> port 0x80 isn't going to PCI space.

Agreed. It isn't going anywhere, and it takes it a while to recognize
that.

> x86 generally posts mmio writes but not io writes. That's quite measurable.

The timing difference between posted and non-posted writes
I can see.

Eric

2002-09-19 15:13:27

by Eric W. Biederman

Subject: Re: Info: NAPI performance at "low" loads

"David S. Miller" <[email protected]> writes:

> From: Alan Cox <[email protected]>
> Date: 18 Sep 2002 22:15:27 +0100
>
> It doesn't matter what XFree86 is doing. That's just to load the PCI bus
> and jam it up to prove the point. It'll change your inb timing.
>
> Understood. Maybe a more accurate wording would be "a fixed minimum
> timing".

Why?

If I do an inb to a PCI-X device running at 133 MHz it should come back
much faster than an inb from my serial port on the ISA port. What
is the reason for the fixed minimum timing?

Alan asserted there is a posting behavior difference, but that should
not affect reads.

What is different between mmio and pio to a pci device when doing reads
that should make mmio faster?

Eric

2002-09-19 15:44:13

by Alan

Subject: Re: Info: NAPI performance at "low" loads

On Thu, 2002-09-19 at 16:03, Eric W. Biederman wrote:
> If I do an inb to a PCI-X device running at 133 MHz it should come back
> much faster than an inb from my serial port on the ISA port. What
> is the reason for the fixed minimum timing?

As far as I can tell the minimum time for the inb/outb is simply the
time it takes the bus to respond. The only difference there is that for
writel rather than outl you won't wait for the write to complete on the
PCI bus; you just dump it into the fifo if it's empty.