2007-08-24 13:59:30

by Jan-Bernd Themann

Subject: RFC: issues concerning the next NAPI interface

Hi,

when I tried to get the eHEA driver working with the new interface,
the following issues came up.

1) The current implementation of netif_rx_schedule, netif_rx_complete
   and the net_rx_action have the following problem: netif_rx_schedule
   sets the NAPI_STATE_SCHED flag and adds the NAPI instance to the poll_list.
   net_rx_action checks NAPI_STATE_SCHED, if set it will add the device
   to the poll_list again (as well). netif_rx_complete clears the NAPI_STATE_SCHED.
   If an interrupt handler calls netif_rx_schedule on CPU 2
   after netif_rx_complete has been called on CPU 1 (and the poll function
   has not returned yet), the NAPI instance will be added twice to the
   poll_list (by netif_rx_schedule and net_rx_action). Problems occur when
   netif_rx_complete is called twice for the device (BUG() called).
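
The interleaving in point 1 can be replayed step by step. What follows is
only a toy Python model of that description, not kernel code: the class
Napi, the per-CPU lists and the helper bodies are assumptions made for
illustration, and removing the instance from the list when it is picked up
is a simplification.

```python
# Toy model of the sequence described in point 1; not kernel code.
# Napi, poll_lists and the helper bodies are illustrative assumptions.

class Napi:
    def __init__(self):
        self.sched = False  # stands in for NAPI_STATE_SCHED


def rx_schedule(napi, poll_list):
    """Model of netif_rx_schedule: set the flag and queue the instance."""
    if not napi.sched:
        napi.sched = True
        poll_list.append(napi)


def rx_complete(napi):
    """Model of netif_rx_complete: clear the flag."""
    napi.sched = False


def replay_race():
    napi = Napi()
    poll_lists = {1: [], 2: []}        # one poll_list per CPU

    rx_schedule(napi, poll_lists[1])   # IRQ on CPU 1 schedules the instance
    poll_lists[1].remove(napi)         # net_rx_action on CPU 1 picks it up
    rx_complete(napi)                  # poll() on CPU 1 finishes its work ...
    rx_schedule(napi, poll_lists[2])   # ... IRQ on CPU 2 before poll() returns
    if napi.sched:                     # net_rx_action on CPU 1 sees the flag
        poll_lists[1].append(napi)     # and queues the instance as well
    return napi, poll_lists


napi, lists = replay_race()
queued = sum(lst.count(napi) for lst in lists.values())
print("instance queued", queued, "times")
```

In this model the same instance ends up on two poll lists, which is the
precondition for netif_rx_complete being called twice.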

2) If an ethernet chip supports multiple receive queues, the queues are
   currently all processed on the CPU where the interrupt comes in. This
   is because netif_rx_schedule will always add the rx queue to the CPU's
   napi poll_list. The result under heavy pressure is that all queues will
   gather on the weakest CPU (with highest CPU load) after some time as they
   will stay there until the entire queue is emptied. On SMP systems
   this behaviour is not desired. It should also work well without interrupt
   pinning.
   It would be nice if it were possible to schedule queues to other CPUs, or
   at least to use interrupts to put the queue on another CPU (not nice,
   as you never know which one you will hit).
   I'm not sure how bad the tradeoff would be.

3) On modern systems the incoming packets are processed very fast. Especially
   on SMP systems when we use multiple queues we process only a few packets
   per napi poll cycle. So NAPI does not work very well here and the interrupt
   rate is still high. What we need would be some sort of timer polling mode
   which will schedule a device after a certain amount of time for high load
   situations. With high precision timers this could work well. Current
   usual timers are too slow. A finer granularity would be needed to keep the
   latency down (and queue length moderate).

What do you think?

Thanks,
Jan-Bernd


2007-08-24 15:38:33

by Arthur Kepner

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
> .......
> 3) On modern systems the incoming packets are processed very fast. Especially
>    on SMP systems when we use multiple queues we process only a few packets
>    per napi poll cycle. So NAPI does not work very well here and the interrupt
>    rate is still high. What we need would be some sort of timer polling mode
>    which will schedule a device after a certain amount of time for high load
>    situations. With high precision timers this could work well. Current
>    usual timers are too slow. A finer granularity would be needed to keep the
>    latency down (and queue length moderate).
>

We found the same on ia64-sn systems with tg3 a couple of years
ago. Using simple interrupt coalescing ("don't interrupt until
you've received N packets or M usecs have elapsed") worked
reasonably well in practice. If your h/w supports that (and I'd
guess it does, since it's such a simple thing), you might try
it.
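
The "N packets or M usecs" rule can be sketched in a few lines. This is a
minimal Python model under stated assumptions, not a description of any
particular NIC: the function name and its parameters are hypothetical, and
the timer is assumed to start at the first packet that is still pending.

```python
def coalesced_interrupts(arrivals_us, max_frames, max_usecs):
    """Return the times (in usecs) at which an interrupt would be raised.

    An interrupt fires once max_frames packets are pending, or max_usecs
    after the first pending packet arrived, whichever comes first.
    """
    interrupts = []
    pending = 0
    first_pending = None
    for t in sorted(arrivals_us):
        # A timer that expired for an older batch fires before this arrival.
        if pending and t - first_pending >= max_usecs:
            interrupts.append(first_pending + max_usecs)
            pending = 0
        if pending == 0:
            first_pending = t
        pending += 1
        if pending >= max_frames:
            interrupts.append(t)
            pending = 0
    if pending:
        # Leftover packets are only signalled when the timer runs out.
        interrupts.append(first_pending + max_usecs)
    return interrupts


# Four closely spaced packets share one interrupt ...
print(coalesced_interrupts([0, 10, 20, 30], max_frames=4, max_usecs=100))
# ... while a lone packet waits for the full timeout.
print(coalesced_interrupts([0], max_frames=4, max_usecs=100))
```

With max_frames=1 the model degenerates to one interrupt per packet.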

--
Arthur

2007-08-24 15:48:07

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

Hi,

On Friday 24 August 2007 17:37, [email protected] wrote:
> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
> > .......
> > 3) On modern systems the incoming packets are processed very fast. Especially
> >    on SMP systems when we use multiple queues we process only a few packets
> >    per napi poll cycle. So NAPI does not work very well here and the interrupt
> >    rate is still high. What we need would be some sort of timer polling mode
> >    which will schedule a device after a certain amount of time for high load
> >    situations. With high precision timers this could work well. Current
> >    usual timers are too slow. A finer granularity would be needed to keep the
> >    latency down (and queue length moderate).
> >
>
> We found the same on ia64-sn systems with tg3 a couple of years
> ago. Using simple interrupt coalescing ("don't interrupt until
> you've received N packets or M usecs have elapsed") worked
> reasonably well in practice. If your h/w supports that (and I'd
> guess it does, since it's such a simple thing), you might try
> it.
>

I don't see how this should work. Our latest machines are fast enough that they
simply empty the queue during the first poll iteration (in most cases).
Even if you wait until X packets have been received, it does not help for
the next poll cycle. The average number of packets we process per poll queue
is low. So a timer would be preferable that periodically polls the
queue, without the need of generating a HW interrupt. This would allow us
to wait until a reasonable amount of packets have been received in the meantime
to keep the poll overhead low. This would also be useful in combination
with LRO.

Regards,
Jan-Bernd

2007-08-24 15:53:00

by Stephen Hemminger

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, 24 Aug 2007 17:47:15 +0200
Jan-Bernd Themann <[email protected]> wrote:

> Hi,
>
> On Friday 24 August 2007 17:37, [email protected] wrote:
> > On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
> > > .......
> > > 3) On modern systems the incoming packets are processed very fast. Especially
> > >    on SMP systems when we use multiple queues we process only a few packets
> > >    per napi poll cycle. So NAPI does not work very well here and the interrupt
> > >    rate is still high. What we need would be some sort of timer polling mode
> > >    which will schedule a device after a certain amount of time for high load
> > >    situations. With high precision timers this could work well. Current
> > >    usual timers are too slow. A finer granularity would be needed to keep the
> > > latency down (and queue length moderate).
> > >
> >
> > We found the same on ia64-sn systems with tg3 a couple of years
> > ago. Using simple interrupt coalescing ("don't interrupt until
> > you've received N packets or M usecs have elapsed") worked
> > reasonably well in practice. If your h/w supports that (and I'd
> > guess it does, since it's such a simple thing), you might try
> > it.
> >
>
> I don't see how this should work. Our latest machines are fast enough that they
> simply empty the queue during the first poll iteration (in most cases).
> Even if you wait until X packets have been received, it does not help for
> the next poll cycle. The average number of packets we process per poll queue
> is low. So a timer would be preferable that periodically polls the
> queue, without the need of generating a HW interrupt. This would allow us
> to wait until a reasonable amount of packets have been received in the meantime
> to keep the poll overhead low. This would also be useful in combination
> with LRO.
>

You need hardware support for deferred interrupts. Most devices have it (e1000, sky2, tg3)
and it interacts well with NAPI. It is not a generic thing you want done by the stack,
you want the hardware to hold off interrupts until X packets or Y usecs have expired.

The parameters for controlling it are already in ethtool, the issue is finding a good
default set of values for a wide range of applications and architectures. Maybe some
heuristic based on processor speed would be a good starting point. The dynamic irq
moderation stuff is not widely used because it is too hard to get right.

--
Stephen Hemminger <[email protected]>

2007-08-24 16:45:54

by linas

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
> 3) On modern systems the incoming packets are processed very fast. Especially
>    on SMP systems when we use multiple queues we process only a few packets
>    per napi poll cycle. So NAPI does not work very well here and the interrupt
>    rate is still high.

I saw this too, on a system that is "modern" but not terribly fast, and
only slightly (2-way) smp. (the spidernet)

I experimented with various solutions, none were terribly exciting. The
thing that killed all of them was a crazy test case that someone sprung on
me: They had written a worst-case network ping-pong app: send one
packet, wait for reply, send one packet, etc.

If I waited (indefinitely) for a second packet to show up, the test case
completely stalled (since no second packet would ever arrive). And if I
introduced a timer to wait for a second packet, then I just increased
the latency in the response to the first packet, and this was noticed,
and folks complained.

In the end, I just let it be, and let the system work as a busy-beaver,
with the high interrupt rate. Is this a wise thing to do? I was
thinking that, if the system is under heavy load, then the interrupt
rate would fall, since (for less pathological network loads) more
packets would queue up before the poll was serviced. But I did not
actually measure the interrupt rate under heavy load ...

--linas

2007-08-24 16:51:23

by David Stevens

Subject: Re: RFC: issues concerning the next NAPI interface

Stephen Hemminger <[email protected]> wrote on 08/24/2007 08:52:03 AM:

>
> You need hardware support for deferred interrupts. Most devices have it
> (e1000, sky2, tg3) and it interacts well with NAPI. It is not a generic
> thing you want done by the stack, you want the hardware to hold off
> interrupts until X packets or Y usecs have expired.

For generic hardware that doesn't support it, couldn't you use an estimator
and adjust the timer dynamically in software based on sampled values? Switch
to per-packet interrupts when the receive rate is low...
Actually, that's how I thought NAPI worked before I found out otherwise
(i.e., before I looked :-)).

The hardware-accelerated one is essentially siloing as done by ancient
serial devices on UNIX systems. If you had a tunable for a target count, and
an estimator for the time interval, then switch to per-packet when the
estimator exceeds a tunable max threshold (and also, I suppose, if you near
overflowing the ring on the min timer granularity), you get almost all of
it, right?
Problem is if it increases rapidly, you may drop packets before you notice
that the ring is full in the current estimated interval.

+-DLS


2007-08-24 16:51:39

by linas

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, Aug 24, 2007 at 08:52:03AM -0700, Stephen Hemminger wrote:
>
> You need hardware support for deferred interrupts. Most devices have it (e1000, sky2, tg3)
> and it interacts well with NAPI. It is not a generic thing you want done by the stack,
> you want the hardware to hold off interrupts until X packets or Y usecs have expired.

Just to be clear, in the previous email I posted on this thread, I
described a worst-case network ping-pong test case (send a packet, wait
for reply), and found out that a deferred interrupt scheme just damaged
the performance of the test case. Since the folks who came up with the
test case were adamant, I turned off the deferred interrupts.
While deferred interrupts are an "obvious" solution, I decided that
they weren't a good solution. (And I have no other solution to offer).

--linas

2007-08-24 17:09:36

by Rick Jones

Subject: Re: RFC: issues concerning the next NAPI interface

> Just to be clear, in the previous email I posted on this thread, I
> described a worst-case network ping-pong test case (send a packet, wait
> for reply), and found out that a deferred interrupt scheme just damaged
> the performance of the test case. Since the folks who came up with the
> test case were adamant, I turned off the deferred interrupts.
> While deferred interrupts are an "obvious" solution, I decided that
> they weren't a good solution. (And I have no other solution to offer).

Sounds exactly like the default netperf TCP_RR test and any number of other
benchmarks. The "send a request, wait for reply, send next request, etc."
pattern is a rather common application behaviour after all.

rick jones

2007-08-24 17:17:15

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

Stephen Hemminger wrote:
> On Fri, 24 Aug 2007 17:47:15 +0200
> Jan-Bernd Themann <[email protected]> wrote:
>
>> Hi,
>>
>> On Friday 24 August 2007 17:37, [email protected] wrote:
>>> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
>>>> .......
>>>> 3) On modern systems the incoming packets are processed very fast. Especially
>>>> on SMP systems when we use multiple queues we process only a few packets
>>>> per napi poll cycle. So NAPI does not work very well here and the interrupt
>>>> rate is still high. What we need would be some sort of timer polling mode
>>>> which will schedule a device after a certain amount of time for high load
>>>> situations. With high precision timers this could work well. Current
>>>> usual timers are too slow. A finer granularity would be needed to keep the
>>>> latency down (and queue length moderate).
>>>>
>>> We found the same on ia64-sn systems with tg3 a couple of years
>>> ago. Using simple interrupt coalescing ("don't interrupt until
>>> you've received N packets or M usecs have elapsed") worked
>>> reasonably well in practice. If your h/w supports that (and I'd
>>> guess it does, since it's such a simple thing), you might try
>>> it.
>>>
>> I don't see how this should work. Our latest machines are fast enough that they
>> simply empty the queue during the first poll iteration (in most cases).
>> Even if you wait until X packets have been received, it does not help for
>> the next poll cycle. The average number of packets we process per poll queue
>> is low. So a timer would be preferable that periodically polls the
>> queue, without the need of generating a HW interrupt. This would allow us
>> to wait until a reasonable amount of packets have been received in the meantime
>> to keep the poll overhead low. This would also be useful in combination
>> with LRO.
>>
>
> You need hardware support for deferred interrupts. Most devices have it (e1000, sky2, tg3)
> and it interacts well with NAPI. It is not a generic thing you want done by the stack,
> you want the hardware to hold off interrupts until X packets or Y usecs have expired.

Does hardware interrupt mitigation really interact well with NAPI? In my
experience, holding off interrupts for X packets or Y usecs does more
harm than good; such hardware features are useful only when the OS has
no NAPI-like mechanism.

When tuning NAPI drivers for packets/sec performance (which is a good
indicator of driver performance), I make sure that the driver stays in
NAPI polled mode while it has any rx or tx work to do. If the CPU is
fast enough that all work is always completed on each poll, I have the
driver stay in polled mode until dev->poll() is called N times with no
work being done. This keeps interrupts disabled for reasonable traffic
levels, while minimizing packet processing latency. No need for hardware
interrupt mitigation.
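
The "stay in polled mode until N idle polls" rule can be sketched as a small
state machine. This is an illustrative Python model, not driver code: the
class PollState and its method names are hypothetical, and the points where
a real driver would enable or disable interrupts are only marked in comments.

```python
class PollState:
    """Tracks consecutive idle polls and decides when to leave polled mode."""

    def __init__(self, idle_limit):
        self.idle_limit = idle_limit   # N in the text above
        self.idle_polls = 0
        self.polled_mode = False

    def interrupt(self):
        """An rx/tx interrupt arrived: enter polled mode, interrupts off."""
        self.polled_mode = True
        self.idle_polls = 0

    def poll(self, work_done):
        """Record one poll call; return True to stay in polled mode."""
        if work_done > 0:
            self.idle_polls = 0        # any work resets the idle counter
        else:
            self.idle_polls += 1
            if self.idle_polls >= self.idle_limit:
                self.polled_mode = False   # a driver would re-enable IRQs here
        return self.polled_mode


state = PollState(idle_limit=3)
state.interrupt()
for work in (5, 0, 0, 2, 0, 0, 0):
    print(work, state.poll(work))
```

Only the third consecutive empty poll drops the model back to interrupt
mode; the burst of two packets in the middle resets the counter.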

> The parameters for controlling it are already in ethtool, the issue is finding a good
> default set of values for a wide range of applications and architectures. Maybe some
> heuristic based on processor speed would be a good starting point. The dynamic irq
> moderation stuff is not widely used because it is too hard to get right.

I agree. It would be nice to find a way for the typical user to derive
best values for these knobs for his/her particular system. Perhaps a
tool using pktgen and network device phy internal loopback could be
developed?

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-24 17:45:35

by Shirley Ma

Subject: Re: RFC: issues concerning the next NAPI interface

> Just to be clear, in the previous email I posted on this thread, I
> described a worst-case network ping-pong test case (send a packet, wait
> for reply), and found out that a deferred interrupt scheme just damaged
> the performance of the test case.

When splitting the rx and tx handlers in the IPoIB driver, I found some
performance gain from deferring interrupts on tx, not rx.

Shirley

2007-08-24 18:12:25

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

James Chapman wrote:
> Stephen Hemminger wrote:
>> On Fri, 24 Aug 2007 17:47:15 +0200
>> Jan-Bernd Themann <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> On Friday 24 August 2007 17:37, [email protected] wrote:
>>>> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
>>>>> .......
>>>>> 3) On modern systems the incoming packets are processed very fast.
>>>>> Especially on SMP systems when we use multiple queues we process
>>>>> only a few packets per napi poll cycle. So NAPI does not work very
>>>>> well here and the interrupt rate is still high. What we need would
>>>>> be some sort of timer polling mode which will schedule a device
>>>>> after a certain amount of time for high load situations. With high
>>>>> precision timers this could work well. Current usual timers are too
>>>>> slow. A finer granularity would be needed to keep the latency down
>>>>> (and queue length moderate).
>>>>>
>>>> We found the same on ia64-sn systems with tg3 a couple of years
>>>> ago. Using simple interrupt coalescing ("don't interrupt until
>>>> you've received N packets or M usecs have elapsed") worked
>>>> reasonably well in practice. If your h/w supports that (and I'd
>>>> guess it does, since it's such a simple thing), you might try it.
>>>>
>>> I don't see how this should work. Our latest machines are fast
>>> enough that they simply empty the queue during the first poll
>>> iteration (in most cases). Even if you wait until X packets have
>>> been received, it does not help for the next poll cycle. The average
>>> number of packets we process per poll queue is low. So a timer would
>>> be preferable that periodically polls the queue, without the need of
>>> generating a HW interrupt. This would allow us to wait until a
>>> reasonable amount of packets have been received in the meantime to
>>> keep the poll overhead low. This would also be useful in combination
>>> with LRO.
>>>
>>
>> You need hardware support for deferred interrupts. Most devices have
>> it (e1000, sky2, tg3) and it interacts well with NAPI. It is not a
>> generic thing you want done by the stack, you want the hardware to
>> hold off interrupts until X packets or Y usecs have expired.
>
> Does hardware interrupt mitigation really interact well with NAPI? In
> my experience, holding off interrupts for X packets or Y usecs does
> more harm than good; such hardware features are useful only when the
> OS has no NAPI-like mechanism.
>
> When tuning NAPI drivers for packets/sec performance (which is a good
> indicator of driver performance), I make sure that the driver stays in
> NAPI polled mode while it has any rx or tx work to do. If the CPU is
> fast enough that all work is always completed on each poll, I have the
> driver stay in polled mode until dev->poll() is called N times with no
> work being done. This keeps interrupts disabled for reasonable traffic
> levels, while minimizing packet processing latency. No need for
> hardware interrupt mitigation.
Yes, that was one idea as well. But the problem with that is that
net_rx_action will call the same poll function over and over again in a
row if there are no further network devices. With this approach you
always poll just a very few packets each time. This does not work well
with LRO, as there are no packets to aggregate...
So it would make more sense to wait for a certain time before trying
again.
Second problem: after jiffies has incremented by one in net_rx_action
(after some poll rounds), net_rx_action will quit and return control to
the softIRQ handler. The poll function is called again as the softIRQ
handler thinks there is more work to be done. So even then we do not
wait... After some rounds in the softIRQ handler, we finally wait some
time.

>
>> The parameters for controlling it are already in ethtool, the issue
>> is finding a good default set of values for a wide range of
>> applications and architectures. Maybe some heuristic based on
>> processor speed would be a good starting point. The dynamic irq
>> moderation stuff is not widely used because it is too hard to get
>> right.
>
> I agree. It would be nice to find a way for the typical user to derive
> best values for these knobs for his/her particular system. Perhaps a
> tool using pktgen and network device phy internal loopback could be
> developed?
>


2007-08-24 19:05:29

by Bodo Eggert

Subject: Re: RFC: issues concerning the next NAPI interface

Linas Vepstas <[email protected]> wrote:
> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:

>> 3) On modern systems the incoming packets are processed very fast. Especially
>> on SMP systems when we use multiple queues we process only a few packets
>> per napi poll cycle. So NAPI does not work very well here and the interrupt
>> rate is still high.
>
> I saw this too, on a system that is "modern" but not terribly fast, and
> only slightly (2-way) smp. (the spidernet)
>
> I experimented with various solutions, none were terribly exciting. The
> thing that killed all of them was a crazy test case that someone sprung on
> me: They had written a worst-case network ping-pong app: send one
> packet, wait for reply, send one packet, etc.
>
> If I waited (indefinitely) for a second packet to show up, the test case
> completely stalled (since no second packet would ever arrive). And if I
> introduced a timer to wait for a second packet, then I just increased
> the latency in the response to the first packet, and this was noticed,
> and folks complained.

Possible solution / possible brainfart:

Introduce a timer, but don't start to use it to combine packets unless you
receive n packets within the timeframe. If you receive less than m packets
within one timeframe, stop using the timer. The system should now have a
decent response time when the network is idle, and when the network is
busy, nobody will complain about the latency.-)
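
The two thresholds amount to a hysteresis loop. A minimal Python sketch,
assuming the packet count of each timeframe is already known and that
m_stop is not larger than n_start; the function and parameter names are
made up for illustration:

```python
def timer_modes(packets_per_frame, n_start, m_stop):
    """Return, per timeframe, whether the timer is in use after that frame.

    The timer is switched on once a frame sees at least n_start packets
    and switched off again once a frame sees fewer than m_stop packets.
    """
    use_timer = False
    modes = []
    for count in packets_per_frame:
        if not use_timer and count >= n_start:
            use_timer = True       # busy enough: start combining packets
        elif use_timer and count < m_stop:
            use_timer = False      # quiet again: back to per-packet handling
        modes.append(use_timer)
    return modes


# A ping-pong load (one packet per frame) never turns the timer on.
print(timer_modes([1, 1, 1, 1], n_start=8, m_stop=3))
# A burst turns it on; it stays on until traffic drops below m_stop.
print(timer_modes([1, 8, 5, 3, 1], n_start=8, m_stop=3))
```

Keeping m_stop below n_start avoids flapping between the two modes when the
load hovers around a single threshold.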
--
Funny quotes:
22. When everything's going your way, you're in the wrong lane and going
the wrong way.
Fri?, Spammer: [email protected] [email protected]

2007-08-24 20:25:30

by Stephen Hemminger

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, 24 Aug 2007 21:04:56 +0200
Bodo Eggert <[email protected]> wrote:

> Linas Vepstas <[email protected]> wrote:
> > On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
>
> >> 3) On modern systems the incoming packets are processed very fast. Especially
> >> on SMP systems when we use multiple queues we process only a few packets
> >> per napi poll cycle. So NAPI does not work very well here and the interrupt
> >> rate is still high.
> >
> > I saw this too, on a system that is "modern" but not terribly fast, and
> > only slightly (2-way) smp. (the spidernet)
> >
> > I experimented with various solutions, none were terribly exciting. The
> > thing that killed all of them was a crazy test case that someone sprung on
> > me: They had written a worst-case network ping-pong app: send one
> > packet, wait for reply, send one packet, etc.
> >
> > If I waited (indefinitely) for a second packet to show up, the test case
> > completely stalled (since no second packet would ever arrive). And if I
> > introduced a timer to wait for a second packet, then I just increased
> > the latency in the response to the first packet, and this was noticed,
> > and folks complained.
>
> Possible solution / possible brainfart:
>
> Introduce a timer, but don't start to use it to combine packets unless you
> receive n packets within the timeframe. If you receive less than m packets
> within one timeframe, stop using the timer. The system should now have a
> decent response time when the network is idle, and when the network is
> busy, nobody will complain about the latency.-)

I expect the overhead of OS timers and resolution makes this unsuitable for fast
networks. You need the hardware to do it. On slow networks, it doesn't matter.

--
Stephen Hemminger <[email protected]>

2007-08-24 20:43:15

by linas

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, Aug 24, 2007 at 09:04:56PM +0200, Bodo Eggert wrote:
> Linas Vepstas <[email protected]> wrote:
> > On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
> >> 3) On modern systems the incoming packets are processed very fast. Especially
> >> on SMP systems when we use multiple queues we process only a few packets
> >> per napi poll cycle. So NAPI does not work very well here and the interrupt
> >> rate is still high.
> >
> > worst-case network ping-pong app: send one
> > packet, wait for reply, send one packet, etc.
>
> Possible solution / possible brainfart:
>
> Introduce a timer, but don't start to use it to combine packets unless you
> receive n packets within the timeframe. If you receive less than m packets
> within one timeframe, stop using the timer. The system should now have a
> decent response time when the network is idle, and when the network is
> busy, nobody will complain about the latency.-)

Ohh, that was inspirational. Let me free-associate some wild ideas.

Suppose we keep a running average of the recent packet arrival rate.
Let's say it's 10 per millisecond ("typical" for a gigabit eth running
flat-out). If we could poll the driver at a rate of 10-20 per
millisecond (i.e. letting the OS do other useful work for 0.05 millisec),
then we could potentially service the card without ever having to enable
interrupts on the card, and without hurting latency.

If the packet arrival rate becomes slow enough, we go back to an
interrupt-driven scheme (to keep latency down).

The main problem here is that, even for HZ=1000 machines, this amounts
to 10-20 polls per jiffy. Which, if implemented in kernel, requires
using the high-resolution timers. And, umm, don't the HR timers require
a cpu timer interrupt to make them go? So it's not clear that this is much
of a win.

The eHEA is a 10 gigabit device, so it can expect 80-100 packets per
millisecond for large packets, and even more, say 1K packets per
millisec, for small packets. (Even the spec for my 1Gb spidernet card
claims its internal rate is 1M packets/sec.)

Another possibility is to set HZ to 5000 or 20000 or something humongous
... after all CPUs are now faster! But, since this might be wasteful,
maybe we could make HZ be dynamically variable: have high HZ rates when
there's lots of network/disk activity, and low HZ rates when not. That
means a non-constant jiffy.

If all drivers used interrupt mitigation, then the variable-high
frequency jiffy could take their place, and be more "fair" to everyone.
Most drivers would be polled most of the time when they're busy, and
only use interrupts when they're not.

--linas

2007-08-24 21:12:52

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

Linas Vepstas wrote:
> On Fri, Aug 24, 2007 at 09:04:56PM +0200, Bodo Eggert wrote:
>
>> Linas Vepstas <[email protected]> wrote:
>>
>>> On Fri, Aug 24, 2007 at 03:59:16PM +0200, Jan-Bernd Themann wrote:
>>>
>>>> 3) On modern systems the incoming packets are processed very fast. Especially
>>>> on SMP systems when we use multiple queues we process only a few packets
>>>> per napi poll cycle. So NAPI does not work very well here and the interrupt
>>>> rate is still high.
>>>>
>>> worst-case network ping-pong app: send one
>>> packet, wait for reply, send one packet, etc.
>>>
>> Possible solution / possible brainfart:
>>
>> Introduce a timer, but don't start to use it to combine packets unless you
>> receive n packets within the timeframe. If you receive less than m packets
>> within one timeframe, stop using the timer. The system should now have a
>> decent response time when the network is idle, and when the network is
>> busy, nobody will complain about the latency.-)
>>
>
> Ohh, that was inspirational. Let me free-associate some wild ideas.
>
> Suppose we keep a running average of the recent packet arrival rate,
> Let's say it's 10 per millisecond ("typical" for a gigabit eth running
> flat-out). If we could poll the driver at a rate of 10-20 per
> millisecond (i.e. letting the OS do other useful work for 0.05 millisec),
> then we could potentially service the card without ever having to enable
> interrupts on the card, and without hurting latency.
>
> If the packet arrival rate becomes slow enough, we go back to an
> interrupt-driven scheme (to keep latency down).
>
> The main problem here is that, even for HZ=1000 machines, this amounts
> to 10-20 polls per jiffy. Which, if implemented in kernel, requires
> using the high-resolution timers. And, umm, don't the HR timers require
> a cpu timer interrupt to make them go? So it's not clear that this is much
> of a win.
>
That is indeed a good question. At least for 10G eHEA we see
that the average number of packets/poll cycle is very low.
With high precision timers we could control the poll interval
better and thus make sure we get enough packets on the queue in
high load situations to benefit from LRO while keeping the
latency moderate. When the traffic load is low we could just
stick to plain NAPI. I don't know how expensive hp timers are,
we probably just have to test it (when they are available for
POWER in our case). However, having more packets
per poll run would make LRO more efficient and thus the total
CPU utilization would decrease.

I guess on most systems there are not many different network
cards working in parallel. So if the driver could set the poll
interval for its devices, it could be well optimized depending
on the NIC's characteristics.

Maybe it would be good enough to have a timer that schedules
the device for NAPI (and thus triggers SoftIRQs, which will
trigger NAPI). Whether this timer would be used via a generic
interface or would be implemented as a proprietary solution
would depend on whether other drivers want / need this feature
as well. Drivers / NICs that work fine with plain NAPI don't
have to use the timer :-)

I tried to implement something with "normal" timers, but the result
was anything but great. The timers seem to be far too slow.
I'm not sure if it helps to increase it from 1000HZ to 2500HZ
or more.

Regards,
Jan-Bernd

2007-08-24 21:32:33

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: Jan-Bernd Themann <[email protected]>
Date: Fri, 24 Aug 2007 15:59:16 +0200

>    It would be nice if it were possible to schedule queues to other CPUs, or
>    at least to use interrupts to put the queue on another CPU (not nice,
>    as you never know which one you will hit).
>    I'm not sure how bad the tradeoff would be.

Once the per-cpu NAPI poll queues start needing locks, much of the
gain will be lost. This is strictly what we want to avoid.

We need real facilities for IRQ distribution policies. With that none
of this is an issue.

This is also a platform specific problem with IRQ behavior, the IRQ
distribution scheme you mention would never occur on sparc64 for
example. We use a fixed round-robin distribution of interrupts to
CPUs there, they don't move.

Each scheme has its advantages, but you want a different scheme here
than what is implemented and the fix is therefore not in the
networking :-)

Furthermore, most cards that will be using multi-queue will be
using hashes on the packet headers to choose the MSI-X interrupt
and thus the cpu to be targeted. Those cards will want fixed
instead of dynamic interrupt to cpu distribution schemes as well,
so your problem is not unique and they'll need the same fix as
you do.
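
As a rough illustration of the hashing described above, the sketch below maps a packet's flow tuple to a receive queue index, so that every packet of one flow lands on the same queue (and, with a fixed interrupt-to-cpu distribution, on the same cpu). The struct layout, the use of FNV-1a, and all function names are assumptions made for this example; real multi-queue NICs compute their own hash in hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key; real hardware hashes selected header fields. */
struct flow_key {
	uint32_t src_ip;
	uint32_t dst_ip;
	uint16_t src_port;
	uint16_t dst_port;
};

/* Fold a run of bytes into a running 32-bit FNV-1a hash. */
static uint32_t fnv1a_bytes(uint32_t h, const void *data, size_t len)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;		/* FNV prime */
	}
	return h;
}

/* Hash each field separately so struct padding never enters the hash. */
static uint32_t flow_hash(const struct flow_key *k)
{
	uint32_t h = 2166136261u;	/* FNV offset basis */

	h = fnv1a_bytes(h, &k->src_ip, sizeof(k->src_ip));
	h = fnv1a_bytes(h, &k->dst_ip, sizeof(k->dst_ip));
	h = fnv1a_bytes(h, &k->src_port, sizeof(k->src_port));
	h = fnv1a_bytes(h, &k->dst_port, sizeof(k->dst_port));
	return h;
}

/* Pick a receive queue for the flow; nr_queues must be non-zero. */
static unsigned int flow_to_queue(const struct flow_key *k,
				  unsigned int nr_queues)
{
	return flow_hash(k) % nr_queues;
}
```

Because the queue is a pure function of the flow key, the mapping only stays useful if the queue's interrupt always arrives on the same cpu, which is the fixed distribution argued for above.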

2007-08-24 21:35:30

by linas

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, Aug 24, 2007 at 11:11:56PM +0200, Jan-Bernd Themann wrote:
> (when they are available for
> POWER in our case).

hrtimer worked fine on the powerpc cell arch last summer.
I assume they work on p5 and p6 too, no ??

> I tried to implement something with "normal" timers, but the result
> was anything but great. The timers seem to be far too slow.
> I'm not sure if it helps to increase it from 1000HZ to 2500HZ
> or more.

Heh. Do the math. Even on 1 gigabit cards, that's not enough:

(1 gigabit/sec) x (1 byte/8 bits) x (1 packet/1500 bytes) x (1 sec/1000 jiffies)

is about 83 packets per jiffy (for big packets; even more for small packets,
and more again for 10 gigabit cards). So polling once per jiffy is a
latency disaster.

--linas
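
The arithmetic in the message above can be checked with a few lines of C. This is only a sketch: the function name is made up for the example, and it ignores Ethernet framing overhead (preamble, inter-frame gap), so the real numbers are slightly lower.

```c
#include <stdint.h>

/*
 * Number of back-to-back packets of 'packet_bytes' that arrive during
 * one jiffy on a link running at 'link_bits_per_sec', for a kernel
 * with 'hz' jiffies per second. Integer division rounds down.
 */
static uint64_t packets_per_jiffy(uint64_t link_bits_per_sec,
				  uint64_t packet_bytes, uint64_t hz)
{
	return link_bits_per_sec / 8 / packet_bytes / hz;
}
```

With these assumptions, packets_per_jiffy(1000000000, 1500, 1000) gives 83, and raising HZ to 2500 still leaves 33 full-size packets per tick. A 10 gigabit link at HZ=1000 gives 833, and 64-byte packets on 1 gigabit give 1953.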

2007-08-24 21:38:07

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: Jan-Bernd Themann <[email protected]>
Date: Fri, 24 Aug 2007 15:59:16 +0200

> 1) The current implementation of netif_rx_schedule, netif_rx_complete
>    and the net_rx_action have the following problem: netif_rx_schedule
>    sets the NAPI_STATE_SCHED flag and adds the NAPI instance to the poll_list.
>    netif_rx_action checks NAPI_STATE_SCHED, if set it will add the device
>    to the poll_list again (as well). netif_rx_complete clears the NAPI_STATE_SCHED.
>    If an interrupt handler calls netif_rx_schedule on CPU 2
>    after netif_rx_complete has been called on CPU 1 (and the poll function
>    has not returned yet), the NAPI instance will be added twice to the
>    poll_list (by netif_rx_schedule and net_rx_action). Problems occur when
>    netif_rx_complete is called twice for the device (BUG() called)

Indeed, this is the "who should manage the list" problem.
Probably the answer is that whoever transitions the NAPI_STATE_SCHED
bit from cleared to set should do the list addition.

Patches welcome :-)
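
The ownership rule suggested above ("whoever transitions the bit from cleared to set does the list addition") can be sketched in user space with C11 atomics. This is not the kernel's code: the struct, the flag value and the function names are invented for the example, and the poll list is reduced to a counter. In the kernel the same test is typically written with test_and_set_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_STATE_SCHED 0x1u	/* stand-in for NAPI_STATE_SCHED */

struct napi_sketch {
	atomic_uint state;
	int on_poll_list;	/* how many times we sit on the poll list */
};

/* Returns true only for the caller that flipped SCHED from 0 to 1. */
static bool sketch_schedule_prep(struct napi_sketch *n)
{
	unsigned int old = atomic_fetch_or(&n->state, SKETCH_STATE_SCHED);

	return !(old & SKETCH_STATE_SCHED);
}

/* Only the winner of the 0 -> 1 transition touches the list. */
static void sketch_schedule(struct napi_sketch *n)
{
	if (sketch_schedule_prep(n))
		n->on_poll_list++;	/* a real list add would go here */
}

/* Leave the list first, then clear the bit so others may schedule. */
static void sketch_complete(struct napi_sketch *n)
{
	n->on_poll_list--;
	atomic_fetch_and(&n->state, ~SKETCH_STATE_SCHED);
}
```

With this rule a second schedule attempt while the bit is set is a no-op, so the instance can never be on the list twice. The counter itself is not protected here; the sketch only demonstrates who is allowed to do the addition.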

> 3) On modern systems the incoming packets are processed very fast. Especially
>    on SMP systems when we use multiple queues we process only a few packets
>    per napi poll cycle. So NAPI does not work very well here and the interrupt
>    rate is still high. What we need would be some sort of timer polling mode
>    which will schedule a device after a certain amount of time for high load
>    situations. With high precision timers this could work well. Current
>    usual timers are too slow. A finer granularity would be needed to keep the
>    latency down (and queue length moderate).

This is why minimal levels of HW interrupt mitigation should be enabled
in your chip. If it does not support this, you will indeed need to look
into using high resolution timers or other schemes to alleviate this.

I do not think it deserves a generic core networking helper facility,
the chips that can't mitigate interrupts are few and obscure.

2007-08-24 21:43:56

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: [email protected] (Linas Vepstas)
Date: Fri, 24 Aug 2007 11:45:41 -0500

> In the end, I just let it be, and let the system work as a
> busy-beaver, with the high interrupt rate. Is this a wise thing to
> do?

The tradeoff is always going to be latency vs. throughput.

A sane default should defer enough to catch multiple packets coming in
at something close to line rate, but not so much that latency unduly
suffers.
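
To put rough numbers on that tradeoff, the sketch below computes how long a burst of back-to-back frames takes to arrive at line rate and caps the resulting deferral at a latency bound. The function names and the idea of an explicit cap parameter are assumptions for illustration; the values used are examples, not recommended settings for any chip.

```c
#include <stdint.h>

/* Time in nanoseconds for 'frames' frames of 'frame_bytes' each to
 * arrive back to back on a link of 'link_bits_per_sec'. */
static uint64_t burst_time_ns(uint64_t frames, uint64_t frame_bytes,
			      uint64_t link_bits_per_sec)
{
	return frames * frame_bytes * 8u * 1000000000ull / link_bits_per_sec;
}

/* Defer long enough to catch the burst, but never longer than the
 * latency the caller is willing to add. */
static uint64_t coalesce_delay_ns(uint64_t frames, uint64_t frame_bytes,
				  uint64_t link_bits_per_sec,
				  uint64_t max_latency_ns)
{
	uint64_t t = burst_time_ns(frames, frame_bytes, link_bits_per_sec);

	return t < max_latency_ns ? t : max_latency_ns;
}
```

For example, eight 1500-byte frames take 96 microseconds to arrive at 1 gigabit/sec, so a 50 microsecond cap would limit the deferral, while at 10 gigabit/sec the same burst needs only 9.6 microseconds.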

2007-08-24 21:44:51

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: David Stevens <[email protected]>
Date: Fri, 24 Aug 2007 09:50:58 -0700

> Problem is if it increases rapidly, you may drop packets
> before you notice that the ring is full in the current estimated
> interval.

This is one of many reasons why hardware interrupt mitigation
is really needed for this.

2007-08-24 21:47:24

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: James Chapman <[email protected]>
Date: Fri, 24 Aug 2007 18:16:45 +0100

> Does hardware interrupt mitigation really interact well with NAPI?

It interacts quite excellently.

There was a long saga about this with tg3 and huge SGI numa
systems with large costs for interrupt processing, and the
fix was to do a minimal amount of interrupt mitigation and
this basically cleared up all the problems.

Someone should reference that thread _now_ before this discussion goes
too far and we repeat a lot of information and people like myself have
to stay up all night correcting the misinformation and
misunderstandings that are basically guaranteed for this topic :)

2007-08-24 21:51:21

by linas

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, Aug 24, 2007 at 02:44:36PM -0700, David Miller wrote:
> From: David Stevens <[email protected]>
> Date: Fri, 24 Aug 2007 09:50:58 -0700
>
> > Problem is if it increases rapidly, you may drop packets
> > before you notice that the ring is full in the current estimated
> > interval.
>
> This is one of many reasons why hardware interrupt mitigation
> is really needed for this.

When turning off interrupts, don't turn them *all* off.
Leave the queue-full interrupt always on.

--linas

2007-08-24 22:07:34

by Arthur Kepner

Subject: Re: RFC: issues concerning the next NAPI interface

On Fri, Aug 24, 2007 at 02:47:11PM -0700, David Miller wrote:

> ....
> Someone should reference that thread _now_ before this discussion goes
> too far and we repeat a lot of information ......

Here's part of the thread:
http://marc.info/?t=111595306000001&r=1&w=2

Also, Jamal's paper may be of interest - Google for "when napi comes
to town".

--
Arthur

2007-08-25 02:43:20

by Mitchell Erblich

Subject: Re: RFC: issues concerning the next NAPI interface

Jan-Bernd Themann,

IMO, a box must be aware of the speed of all its interfaces, whether the
interface implements tail-drop or RED or XYZ, the latency to access the
packet, etc.

Then when a packet arrives, a timer is started for interrupt coalescing
and to process waiting packets. If tail-drop is implemented, it is
possible to wait until the input FIFO fills to a specific point before
starting a timer.

This may maximize the number of packets per interrupt. And realize that
the worst case of an interrupt per packet is wirespeed pings (echo
request/reply) of 64 bytes per packet.

Mitchell Erblich




2007-08-26 19:41:36

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

David Miller wrote:
> From: James Chapman <[email protected]>
> Date: Fri, 24 Aug 2007 18:16:45 +0100
>
>> Does hardware interrupt mitigation really interact well with NAPI?
>
> It interacts quite excellently.

If NAPI disables interrupts and keeps them disabled while there are more
packets arriving or more transmits being completed, why do hardware
interrupt mitigation / coalescing features of the network silicon help?

> There was a long saga about this with tg3 and huge SGI numa
> systems with large costs for interrupt processing, and the
> fix was to do a minimal amount of interrupt mitigation and
> this basically cleared up all the problems.
>
> Someone should reference that thread _now_ before this discussion goes
> too far and we repeat a lot of information and people like myself have
> to stay up all night correcting the misinformation and
> misunderstandings that are basically guaranteed for this topic :)

I hope I'm not spreading misinformation. :) The original poster was
observing NAPI going in/out of polled mode because the CPU is fast
enough to process all work per poll. I've seen the same and I'm
suggesting that the NAPI driver keeps itself in polled mode for N polls
or M jiffies after it sees workdone=0. This has always worked for me in
packet forwarding scenarios to maximize packets/sec and minimize
latency. I'm considering putting a patch together to add this as another
tuning knob, hence I'm keen to understand what I'm missing. :)

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-27 01:58:26

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: James Chapman <[email protected]>
Date: Sun, 26 Aug 2007 20:36:20 +0100

> David Miller wrote:
> > From: James Chapman <[email protected]>
> > Date: Fri, 24 Aug 2007 18:16:45 +0100
> >
> >> Does hardware interrupt mitigation really interact well with NAPI?
> >
> > It interacts quite excellently.
>
> If NAPI disables interrupts and keeps them disabled while there are more
> packets arriving or more transmits being completed, why do hardware
> interrupt mitigation / coalescing features of the network silicon help?

Because if your packet rate is low enough such that the cpu can
process the interrupt fast enough and thus only one packet gets
processed per NAPI poll, the cost of going into and out of NAPI mode
dominates the packet processing costs.

2007-08-27 09:47:27

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

On Monday 27 August 2007 03:58, David Miller wrote:
> From: James Chapman <[email protected]>
> Date: Sun, 26 Aug 2007 20:36:20 +0100
>
> > David Miller wrote:
> > > From: James Chapman <[email protected]>
> > > Date: Fri, 24 Aug 2007 18:16:45 +0100
> > >
> > >> Does hardware interrupt mitigation really interact well with NAPI?
> > >
> > > It interacts quite excellently.
> >
> > If NAPI disables interrupts and keeps them disabled while there are more
> > packets arriving or more transmits being completed, why do hardware
> > interrupt mitigation / coalescing features of the network silicon help?
>
> Because if your packet rate is low enough such that the cpu can
> process the interrupt fast enough and thus only one packet gets
> processed per NAPI poll, the cost of going into and out of NAPI mode
> dominates the packet processing costs.

As far as I understand your argument, NAPI is supposed to reduce the
interrupt rate well only for HW with coalescing features.
NAPI itself does not provide a reliable functionality to reduce the
number of interrupts, especially not for systems with only 1 NIC.
NAPI will only wait for some time when the budget is exceeded
and the softIRQs don't call net_rx_action again. This seems to be the case
after 10 rounds. That means NAPI really waits after 300 x 10 packets
have been processed in a row (worst case).

As a matter of fact there is HW that does not have this feature. There seems
to be HW which does not work well with plain NAPI.
This HW performs well if the operating system supports HP timers
(and when they are used either in the device driver or polling engine).

So the question is simply: Do we want drivers that need (benefit from)
a timer based polling support to implement their own timers each,
or should there be a generic support?

Thanks,
Jan-Bernd

2007-08-27 15:51:52

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

David Miller wrote:
> From: James Chapman <[email protected]>
> Date: Sun, 26 Aug 2007 20:36:20 +0100
>
>> David Miller wrote:
>>> From: James Chapman <[email protected]>
>>> Date: Fri, 24 Aug 2007 18:16:45 +0100
>>>
>>>> Does hardware interrupt mitigation really interact well with NAPI?
>>> It interacts quite excellently.
>> If NAPI disables interrupts and keeps them disabled while there are more
>> packets arriving or more transmits being completed, why do hardware
>> interrupt mitigation / coalescing features of the network silicon help?
>
> Because if your packet rate is low enough such that the cpu can
> process the interrupt fast enough and thus only one packet gets
> processed per NAPI poll, the cost of going into and out of NAPI mode
> dominates the packet processing costs.

In the second half of my previous reply (which seems to have been
deleted), I suggest a way to avoid this problem without using hardware
interrupt mitigation / coalescing. Original text is quoted below.

>> I've seen the same and I'm suggesting that the NAPI driver keeps
>> itself in polled mode for N polls or M jiffies after it sees
>> workdone=0. This has always worked for me in packet forwarding
>> scenarios to maximize packets/sec and minimize latency.

To implement this, there's no need for timers, hrtimers or generic NAPI
support that others have suggested. A driver's poll() would set an
internal flag and record the current jiffies value when finding
workdone=0 rather than doing an immediate napi_complete(). Early in
poll() it would test this flag and if set, do a low-cost test to see if
it had any work to do. If no work, it would check the saved jiffies
value and do the napi_complete() only if no work has been done for a
configurable number of jiffies. This keeps interrupts disabled longer at
the expense of many more calls to poll() where no work is done. So
critical to this scheme is modifying the driver's poll() to fastpath the
case of having no work to do while waiting for its local jiffy count to
expire.

Here's an untested patch for tg3 that illustrates the idea.

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 710dccc..59e151b 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3473,6 +3473,24 @@ static int tg3_poll(struct napi_struct *napi,
 	struct tg3_hw_status *sblk = tp->hw_status;
 	int work_done = 0;

+	/* fastpath having no work while we're holding ourself in
+	 * polled mode
+	 */
+	if ((tp->exit_poll_time) && (!tg3_has_work(tp))) {
+		if (time_after(jiffies, tp->exit_poll_time)) {
+			tp->exit_poll_time = 0;
+			/* tell net stack and NIC we're done */
+			netif_rx_complete(netdev, napi);
+			tg3_restart_ints(tp);
+		}
+		return 0;
+	}
+
+	/* if we get here, there might be work to do so disable the
+	 * poll hold fastpath above
+	 */
+	tp->exit_poll_time = 0;
+
 	/* handle link change and other phy events */
 	if (!(tp->tg3_flags &
 	      (TG3_FLAG_USE_LINKCHG_REG |
@@ -3511,11 +3529,11 @@ static int tg3_poll(struct napi_struct *napi,
 	} else
 		sblk->status &= ~SD_STATUS_UPDATED;

-	/* if no more work, tell net stack and NIC we're done */
-	if (!tg3_has_work(tp)) {
-		netif_rx_complete(netdev, napi);
-		tg3_restart_ints(tp);
-	}
+	/* if no more work, set the time in jiffies when we should
+	 * exit polled mode
+	 */
+	if (!tg3_has_work(tp))
+		tp->exit_poll_time = jiffies + 2;

 	return work_done;
 }
diff --git a/drivers/net/tg3.h b/drivers/net/tg3.h
index a6a23bb..a0d24d3 100644
--- a/drivers/net/tg3.h
+++ b/drivers/net/tg3.h
@@ -2163,6 +2163,7 @@ struct tg3 {
 	u32 last_tag;

 	u32 msg_enable;
+	unsigned long exit_poll_time;

 	/* begin "tx thread" cacheline section */
 	void (*write32_tx_mbox) (struct tg3 *, u32,


--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-27 16:14:37

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

On Monday 27 August 2007 17:51, James Chapman wrote:

> In the second half of my previous reply (which seems to have been
> deleted), I suggest a way to avoid this problem without using hardware
> interrupt mitigation / coalescing. Original text is quoted below.
>
> >> I've seen the same and I'm suggesting that the NAPI driver keeps
> >> itself in polled mode for N polls or M jiffies after it sees
> >> workdone=0. This has always worked for me in packet forwarding
> >> scenarios to maximize packets/sec and minimize latency.
>
> To implement this, there's no need for timers, hrtimers or generic NAPI
> support that others have suggested. A driver's poll() would set an
> internal flag and record the current jiffies value when finding
> workdone=0 rather than doing an immediate napi_complete(). Early in
> poll() it would test this flag and if set, do a low-cost test to see if
> it had any work to do. If no work, it would check the saved jiffies
> value and do the napi_complete() only if no work has been done for a
> configurable number of jiffies. This keeps interrupts disabled longer at
> the expense of many more calls to poll() where no work is done. So
> critical to this scheme is modifying the driver's poll() to fastpath the
> case of having no work to do while waiting for its local jiffy count to
> expire.
>

The problem I see with this approach is that the time that passes between
two jiffies might be too long for 10G ethernet adapters. (I tried to implement
a timer based approach with usual timers and the result was a disaster).
HW interrupts or HP timers avoid the jiffy problem as they activate softIRQs
as soon as you call netif_rx_schedule.

2007-08-27 17:06:14

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

Jan-Bernd Themann wrote:
> On Monday 27 August 2007 17:51, James Chapman wrote:
>
>> In the second half of my previous reply (which seems to have been
>> deleted), I suggest a way to avoid this problem without using hardware
>> interrupt mitigation / coalescing. Original text is quoted below.
>>
>> >> I've seen the same and I'm suggesting that the NAPI driver keeps
>> >> itself in polled mode for N polls or M jiffies after it sees
>> >> workdone=0. This has always worked for me in packet forwarding
>> >> scenarios to maximize packets/sec and minimize latency.
>>
>> To implement this, there's no need for timers, hrtimers or generic NAPI
>> support that others have suggested. A driver's poll() would set an
>> internal flag and record the current jiffies value when finding
>> workdone=0 rather than doing an immediate napi_complete(). Early in
>> poll() it would test this flag and if set, do a low-cost test to see if
>> it had any work to do. If no work, it would check the saved jiffies
>> value and do the napi_complete() only if no work has been done for a
>> configurable number of jiffies. This keeps interrupts disabled longer at
>> the expense of many more calls to poll() where no work is done. So
>> critical to this scheme is modifying the driver's poll() to fastpath the
>> case of having no work to do while waiting for its local jiffy count to
>> expire.
>>
>
> The problem I see with this approach is that the time that passes between
> two jiffies might be too long for 10G ethernet adapters.

Why would staying in polled mode for 2 jiffies be too long in the 10G
case? I don't see why 10G makes any difference. Your poll() would be
called as fast as your CPU allows during those 2 jiffies (it would
actually be between 1 and 2 jiffies in practice). It is therefore
critical that the driver's poll() implementation is as efficient as
possible for the "no work" case to minimize the overhead of the extra
poll() calls. Your poll might be called thousands of times in 1-2
jiffies with nothing to do...

> (I tried to implement
> a timer based approach with usual timers and the result was a disaster).
> HW interrupts or HP timers avoid the jiffy problem as they activate softIRQs
> as soon as you call netif_rx_schedule.

My scheme doesn't use timers to do netif_rx_schedule() because the
device stays in polled mode for 1-2 jiffies _after_ it detects it has no
more work. So the device remains scheduled, processing packets as usual.
The device deschedules itself and re-enables its interrupts only when it
has a period of 1-2 jiffies of doing no work.

BTW, I chose 2 jiffies in the example patch just to keep the patch
simple. It might be more for systems with large HZ or those that want to
be even more aggressive at staying in polled mode. I envisage it being
another parameter that can be tweaked using ethtool if people see a
benefit of this scheme.

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-27 20:37:49

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: Jan-Bernd Themann <[email protected]>
Date: Mon, 27 Aug 2007 11:47:01 +0200

> So the question is simply: Do we want drivers that need (benefit
> from) a timer based polling support to implement their own timers
> each, or should there be a generic support?

I'm trying to figure out how an hrtimer implementation would
even work.

Would you start the timer from the chip interrupt handler? If so,
that's taking two steps backwards as you've already taken all of the
overhead of running the interrupt handler.

2007-08-27 21:03:08

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: James Chapman <[email protected]>
Date: Mon, 27 Aug 2007 16:51:29 +0100

> To implement this, there's no need for timers, hrtimers or generic NAPI
> support that others have suggested. A driver's poll() would set an
> internal flag and record the current jiffies value when finding
> workdone=0 rather than doing an immediate napi_complete(). Early in
> poll() it would test this flag and if set, do a low-cost test to see if
> it had any work to do. If no work, it would check the saved jiffies
> value and do the napi_complete() only if no work has been done for a
> configurable number of jiffies. This keeps interrupts disabled longer at
> the expense of many more calls to poll() where no work is done. So
> critical to this scheme is modifying the driver's poll() to fastpath the
> case of having no work to do while waiting for its local jiffy count to
> expire.
>
> Here's an untested patch for tg3 that illustrates the idea.

It's only going to work with hrtimers, these interfaces can
process at least 100,000 per jiffies tick.

And the hrtimer granularity is going to need to be significantly low,
and furthermore you're adding a guaranteed extra interrupt (for the
hrtimer firing) in exactly the cases where what we're trying to avoid is
more interrupts.

If you can make it work, fine, but it's going to need to be at a
minimum disabled when the hrtimer granularity is not sufficient.

But there are huger fish to fry for you I think. Talk to your
platform maintainers and ask for an interface for obtaining
a flat static distribution of interrupts to cpus in order to
support multiqueue NAPI better.

In your previous postings you made arguments saying that the
automatic placement of interrupts to cpus made everything
bunch up on a single cpu and you wanted to propagate the
NAPI work to other cpu's software interrupts from there.

That logic is bogus, because it merely proves that the hardware
interrupt distribution is broken. If it's a bad cpu to run
software interrupts on, it's also a bad cpu to run hardware
interrupts on.

2007-08-27 21:52:59

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

David Miller wrote:
> From: James Chapman <[email protected]>
> Date: Mon, 27 Aug 2007 16:51:29 +0100
>
>> To implement this, there's no need for timers, hrtimers or generic NAPI
>> support that others have suggested. A driver's poll() would set an
>> internal flag and record the current jiffies value when finding
>> workdone=0 rather than doing an immediate napi_complete(). Early in
>> poll() it would test this flag and if set, do a low-cost test to see if
>> it had any work to do. If no work, it would check the saved jiffies
>> value and do the napi_complete() only if no work has been done for a
>> configurable number of jiffies. This keeps interrupts disabled longer at
>> the expense of many more calls to poll() where no work is done. So
>> critical to this scheme is modifying the driver's poll() to fastpath the
>> case of having no work to do while waiting for its local jiffy count to
>> expire.
>>
>> Here's an untested patch for tg3 that illustrates the idea.
>
> It's only going to work with hrtimers, these interfaces can
> process at least 100,000 per jiffies tick.

I don't understand where hrtimers or interface speed comes in. If the
CPU is fast enough to call poll() 100,000 times per jiffies tick, it
means 100,000 wasted poll() calls while the netdev migrates from active
to poll-off state. Hence the need to fastpath the "no work" case in the
netdev's poll(). These extra poll() calls are tolerable if it avoids
NAPI thrashing between poll-on and poll-off states for certain packet rates.

> And the hrtimer granularity is going to need to be significantly low,
> and furthermore you're adding a guaranteed extra interrupt (for the
> hrtimer firing) in exactly the cases where what we're trying to avoid is
> more interrupts.
>
> If you can make it work, fine, but it's going to need to be at a
> minimum disabled when the hrtimer granularity is not sufficient.
>
> But there are huger fish to fry for you I think. Talk to your
> platform maintainers and ask for an interface for obtaining
> a flat static distribution of interrupts to cpus in order to
> support multiqueue NAPI better.
>
> In your previous postings you made arguments saying that the
> automatic placement of interrupts to cpus made everything
> bunch up on a single cpu and you wanted to propagate the
> NAPI work to other cpu's software interrupts from there.

I don't recall saying anything in previous posts about this. Are you
confusing my posts with Jan-Bernd's? Jan-Bernd has been talking about
using hrtimers to _reschedule_ NAPI. My posts are suggesting an
alternative mechanism that keeps NAPI active (with interrupts disabled)
for a jiffy or two after it would otherwise have gone idle in order to
avoid too many interrupts when the packet rate is such that NAPI
thrashes between poll-on and poll-off.

> That logic is bogus, because it merely proves that the hardware
> interrupt distribution is broken. If it's a bad cpu to run
> software interrupts on, it's also a bad cpu to run hardware
> interrupts on.

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-27 21:57:53

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: James Chapman <[email protected]>
Date: Mon, 27 Aug 2007 22:41:43 +0100

> I don't recall saying anything in previous posts about this. Are you
> confusing my posts with Jan-Bernd's?

Yes, my bad.

> Jan-Bernd has been talking about using hrtimers to _reschedule_
> NAPI. My posts are suggesting an alternative mechanism that keeps
> NAPI active (with interrupts disabled) for a jiffy or two after it
> would otherwise have gone idle in order to avoid too many interrupts
> when the packet rate is such that NAPI thrashes between poll-on and
> poll-off.

So in this scheme what runs ->poll() to process incoming packets?
The hrtimer?

2007-08-28 09:23:27

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

David Miller wrote:
> From: James Chapman <[email protected]>
> Date: Mon, 27 Aug 2007 22:41:43 +0100
>
>> I don't recall saying anything in previous posts about this. Are you
>> confusing my posts with Jan-Bernd's?
>
> Yes, my bad.
>
>> Jan-Bernd has been talking about using hrtimers to _reschedule_
>> NAPI. My posts are suggesting an alternative mechanism that keeps
>> NAPI active (with interrupts disabled) for a jiffy or two after it
>> would otherwise have gone idle in order to avoid too many interrupts
>> when the packet rate is such that NAPI thrashes between poll-on and
>> poll-off.
>
> So in this scheme what runs ->poll() to process incoming packets?
> The hrtimer?

No, the regular NAPI networking core calls ->poll() as usual; no timers
are involved. This scheme simply delays the napi_complete() from the
driver so the device stays in the poll list longer. It means that its
->poll() will be called when there is no work to do for 1-2 jiffies,
hence the optimization at the top of ->poll() to efficiently handle that
case. The device's ->poll() is called by the NAPI core until it has
continuously done no work for 1-2 jiffies, at which point it finally
does the netif_rx_complete() and re-enables its interrupts.

If people feel that holding the device in the poll list for 1-2 jiffies
is too long (because there are too many wasted polls), a counter could
be used to delay the netif_rx_complete() by N polls instead. N would
be a value depending on CPU speed. I use the jiffy sampling method
because it results in some natural randomization of the actual delay
depending on when the jiffy value was sampled in relation to the jiffy tick.
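
The counter-based variant mentioned above (delay the completion by N polls instead of 1-2 jiffies) can be modelled in user space as follows. This is a simulation only, not driver code: the struct, its field names and hold_poll() are invented for the example, and leaving polled mode stands in for the netif_rx_complete() call plus re-enabling interrupts.

```c
#include <stdbool.h>

struct hold_sketch {
	unsigned int idle_polls;	/* consecutive polls that found no work */
	unsigned int hold_polls;	/* N: empty polls tolerated before completing */
	bool in_poll_mode;		/* true while the device stays on the poll list */
	unsigned int completes;		/* how often we dropped back to interrupt mode */
};

/*
 * Models one ->poll() call with 'work' packets available and returns
 * the number of packets processed. With hold_polls == 0 the device
 * completes on the first empty poll, as an unmodified driver would.
 */
static unsigned int hold_poll(struct hold_sketch *h, unsigned int work)
{
	if (work) {
		h->idle_polls = 0;	/* any work restarts the hold period */
		return work;
	}

	/* fastpath for "nothing to do": just count the empty poll */
	if (++h->idle_polls > h->hold_polls) {
		h->idle_polls = 0;
		h->in_poll_mode = false;	/* complete, interrupts back on */
		h->completes++;
	}
	return 0;
}
```

With N = 3 the device survives three consecutive empty polls and only completes on the fourth, and any packet seen in between resets the count.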

Here is the tg3 patch again that illustrates the idea. The patch changes
only the ->poll() routine; note how the netif_rx_complete() call is delayed.

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 710dccc..59e151b 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3473,6 +3473,24 @@ static int tg3_poll(struct napi_struct *napi,
 	struct tg3_hw_status *sblk = tp->hw_status;
 	int work_done = 0;

+	/* fastpath having no work while we're holding ourself in
+	 * polled mode
+	 */
+	if ((tp->exit_poll_time) && (!tg3_has_work(tp))) {
+		if (time_after(jiffies, tp->exit_poll_time)) {
+			tp->exit_poll_time = 0;
+			/* tell net stack and NIC we're done */
+			netif_rx_complete(netdev, napi);
+			tg3_restart_ints(tp);
+		}
+		return 0;
+	}
+
+	/* if we get here, there might be work to do so disable the
+	 * poll hold fastpath above
+	 */
+	tp->exit_poll_time = 0;
+
 	/* handle link change and other phy events */
 	if (!(tp->tg3_flags &
 	      (TG3_FLAG_USE_LINKCHG_REG |
@@ -3511,11 +3529,11 @@ static int tg3_poll(struct napi_struct *napi,
 	} else
 		sblk->status &= ~SD_STATUS_UPDATED;

-	/* if no more work, tell net stack and NIC we're done */
-	if (!tg3_has_work(tp)) {
-		netif_rx_complete(netdev, napi);
-		tg3_restart_ints(tp);
-	}
+	/* if no more work, set the time in jiffies when we should
+	 * exit polled mode
+	 */
+	if (!tg3_has_work(tp))
+		tp->exit_poll_time = jiffies + 2;

 	return work_done;
 }
diff --git a/drivers/net/tg3.h b/drivers/net/tg3.h
index a6a23bb..a0d24d3 100644
--- a/drivers/net/tg3.h
+++ b/drivers/net/tg3.h
@@ -2163,6 +2163,7 @@ struct tg3 {
 	u32 last_tag;

 	u32 msg_enable;
+	unsigned long exit_poll_time;

 	/* begin "tx thread" cacheline section */
 	void (*write32_tx_mbox) (struct tg3 *, u32,


--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-28 11:19:35

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

On Monday 27 August 2007 22:37, David Miller wrote:
> From: Jan-Bernd Themann <[email protected]>
> Date: Mon, 27 Aug 2007 11:47:01 +0200
>
> > So the question is simply: Do we want drivers that need (benefit
> > from) a timer based polling support to implement their own timers
> > each, or should there be a generic support?
>
> I'm trying to figure out how an hrtimer implementation would
> even work.
>
> Would you start the timer from the chip interrupt handler? If so,
> that's taking two steps backwards as you've already taken all of the
> overhead of running the interrupt handler.

I'm also still trying to understand how hrtimers work exactly.
The implementation of hrtimers for P6 has not been finished yet, so
I can't do experiments with hrtimers and eHEA now.

I will try the following scheme (once we get hrtimers):
Each device (queue) has a hrtimer.
Schedule the timer in the poll function instead of reactivating IRQs
when a high load situation has been detected and all packets have
been emptied from the receive queue.
The timer function could then just call netif_rx_schedule to register
the rx_queue for NAPI again.

The advantages of this scheme (if it works as I understood it) would be:
- we don't have to modify NAPI
- benefit from fairness among rx_queues / network devices
- the poll function can decide how long to stick to the timer based
polling mode, and when to switch back to its HW IRQs
- the driver can determine the time to wait based on the receive queue
length and speed
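
The scheme above could be modelled roughly as follows. This is a userspace
C sketch, not kernel code: struct rx_queue, rxq_poll(), rxq_timer_fn(),
rxq_schedule() and HIGH_LOAD_THRESHOLD are hypothetical names,
rxq_schedule() merely stands in for netif_rx_schedule(), and the "high load"
test (packets handled in one poll) is only an assumed heuristic.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of one receive queue; all names are hypothetical. */
struct rx_queue {
	int pending;		/* packets waiting in the receive queue */
	bool irq_enabled;	/* hardware interrupt is armed */
	bool timer_armed;	/* per-queue timer is armed instead of the IRQ */
	bool scheduled;		/* queue is on the (modelled) NAPI poll list */
};

#define HIGH_LOAD_THRESHOLD 8	/* assumed heuristic: packets per poll */

/* Stand-in for netif_rx_schedule(): put the queue back on the poll list. */
static void rxq_schedule(struct rx_queue *q)
{
	q->irq_enabled = false;
	q->scheduled = true;
}

/* Timer callback: no packet processing here, only re-register for polling. */
static void rxq_timer_fn(struct rx_queue *q)
{
	q->timer_armed = false;
	rxq_schedule(q);
}

/* Poll: drain up to budget packets, then choose the next wake-up source. */
static int rxq_poll(struct rx_queue *q, int budget)
{
	int work_done = 0;

	assert(budget > 0);
	while (q->pending > 0 && work_done < budget) {
		q->pending--;
		work_done++;
	}

	if (q->pending == 0) {
		q->scheduled = false;		/* leave the poll list */
		if (work_done >= HIGH_LOAD_THRESHOLD)
			q->timer_armed = true;	/* high load: wake up by timer */
		else
			q->irq_enabled = true;	/* low load: back to HW IRQs */
	}
	return work_done;
}
```

If the budget runs out before the queue is empty, the queue simply stays
scheduled and neither the timer nor the IRQ is armed, as with plain NAPI.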

Regards,
Jan-Bernd

2007-08-28 11:21:25

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

Hi

On Monday 27 August 2007 23:02, David Miller wrote:
> But there are huger fish to fry for you I think. Talk to your
> platform maintainers and ask for an interface for obtaining
> a flat static distribution of interrupts to cpus in order to
> support multiqueue NAPI better.
>
> In your previous postings you made arguments saying that the
> automatic placement of interrupts to cpus made everything
> bunch up onto a single cpu and you wanted to propagate the
> NAPI work to other cpu's software interrupts from there.
>
> That logic is bogus, because it merely proves that the hardware
> interrupt distribution is broken. If it's a bad cpu to run
> software interrupts on, it's also a bad cpu to run hardware
> interrupts on.
>

As already mentioned, some mails were mixed up here. To clarify the interrupt
distribution issue, which has nothing to do with the reduction of interrupts:
- Interrupts are distributed round robin on IBM POWER6 processors
- Interrupt distribution can be modified by user/daemons (smp_affinity, pinning)
- NAPI is scheduled on the CPU where the interrupt is caught
- NAPI polls on that CPU as long as poll has packets to process (default)
(David, please correct me if there is a misunderstanding here)

Problem for multi queue driver with interrupt distribution scheme set to
round robin for this simple example:
Assuming we have 2 SLOW CPUs. CPU_1 is under heavy load (applications). CPU_2
is not under heavy load. Now we receive a lot of packets (high load situation).
Receive queue 1 (RQ1) is scheduled on CPU_1. NAPI-Poll does not manage to empty
RQ1 ever, so it stays on CPU_1. The second receive queue (RQ2) is scheduled on
CPU_2. As that CPU is not under heavy load, RQ2 can be emptied, and the next IRQ
for RQ2 will go to CPU_1. Then both RQs are on CPU_1 and will stay there if
no IRQ is forced at some time as both RQs are never emptied completely.

This is a simplified example to demonstrate the problem. The interrupt scheme
is not bogus, it is just an effect you see if you don't use pinning.
The user can avoid this problem by pinning the interrupts to CPUs.
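
The example can be replayed with a small userspace model. This is only a
sketch under the assumptions stated above: struct sim, sim_step() and
queues_on_cpu() are hypothetical names, CPU index 0 stands for the loaded
CPU_1 and index 1 for CPU_2, a queue polled on a loaded CPU is never emptied,
and a queue polled on an unloaded CPU is always emptied and then waits for
its next round-robin IRQ.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   2
#define NR_QUEUES 2
#define NO_CPU    (-1)

struct sim {
	int queue_cpu[NR_QUEUES];	/* CPU polling each queue, or NO_CPU */
	bool cpu_loaded[NR_CPUS];	/* true: poll never empties a queue here */
	int rr_next;			/* next CPU in the round-robin rotation */
};

/* One round: deliver IRQs for idle queues, then let every CPU poll. */
static void sim_step(struct sim *s)
{
	/* Queues in interrupt mode get their next IRQ on the next CPU. */
	for (int q = 0; q < NR_QUEUES; q++) {
		if (s->queue_cpu[q] == NO_CPU) {
			s->queue_cpu[q] = s->rr_next;
			s->rr_next = (s->rr_next + 1) % NR_CPUS;
		}
	}

	/* A queue leaves its CPU only if the poll managed to empty it. */
	for (int q = 0; q < NR_QUEUES; q++) {
		if (!s->cpu_loaded[s->queue_cpu[q]])
			s->queue_cpu[q] = NO_CPU;
	}
}

/* Number of queues currently being polled on the given CPU. */
static int queues_on_cpu(const struct sim *s, int cpu)
{
	int n = 0;

	for (int q = 0; q < NR_QUEUES; q++)
		if (s->queue_cpu[q] == cpu)
			n++;
	return n;
}
```

Starting with both queues idle and rr_next = 0, the first step leaves RQ1
stuck on index 0 while RQ2 is emptied on index 1; the second step delivers
RQ2's next IRQ to index 0, and from then on both queues stay there.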

As pointed out by David, it will be too expensive to schedule NAPI poll on
a different CPU than the one that gets the IRQ. So I guess one solution
is to "force" a HW interrupt when too many RQs are processed on the same CPU
(when no IRQ pinning is used). This is something the driver has to handle.

Regards,
Jan-Bernd

2007-08-28 11:49:14

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

On Tuesday 28 August 2007 11:22, James Chapman wrote:
> > So in this scheme what runs ->poll() to process incoming packets?
> > The hrtimer?
>
> No, the regular NAPI networking core calls ->poll() as usual; no timers
> are involved. This scheme simply delays the napi_complete() from the
> driver so the device stays in the poll list longer. It means that its
> ->poll() will be called when there is no work to do for 1-2 jiffies,
> hence the optimization at the top of ->poll() to efficiently handle that
> case. The device's ->poll() is called by the NAPI core until it has
> continuously done no work for 1-2 jiffies, at which point it finally
> does the netif_rx_complete() and re-enables its interrupts.
>
I'm not sure if I understand your approach correctly.
This approach may reduce the number of interrupts, but it does so
by blocking the CPU for up to 1 jiffy (that can be quite some time
on some platforms). So no other application / tasklet / softIRQ type
can do anything in between. The CPU utilization does not drop at all,
and I thought that is one reason why we try to reduce the number of interrupts.

> If people feel that holding the device in the poll list for 1-2 jiffies
> is too long (because there are too many wasted polls), a counter could
> be used to delay the netif_rx_complete() by N polls instead. N would
> be a value depending on CPU speed. I use the jiffy sampling method
> because it results in some natural randomization of the actual delay
> depending on when the jiffy value was sampled in relation to the jiffy tick.
>

Waiting for N polls seems to make no sense if there are no further network adapters
in that machine. It would take no time to call poll N times in a row when no
new packets arrive. There is no real delay as the net_rx_action function will
do nothing else between the poll calls.

Please correct me if I'm wrong.

Regards,
Jan-Bernd

2007-08-28 12:17:19

by Evgeniy Polyakov

Subject: Re: RFC: issues concerning the next NAPI interface

On Tue, Aug 28, 2007 at 01:48:20PM +0200, Jan-Bernd Themann wrote:
> I'm not sure if I understand your approach correctly.
> This approach may reduce the number of interrupts, but it does so
> by blocking the CPU for up to 1 jiffy (that can be quite some time
> on some platforms). So no other application / tasklet / softIRQ type
> can do anything in between. The CPU utilization does not drop at all,
> and I thought that is one reason why we try to reduce the number of interrupts.

Only the NIC's interrupts are supposed to be stopped; the system will
continue to work as usual, since all others are alive.
Having an hrtimer to reschedule NIC processing can work only if the number
of timer interrupts is much smaller than the number of NIC interrupts, and
if the rate of timer starts/changes (presumably in the NIC's interrupt) is
small too. Otherwise, having too many NIC interrupts will not gain anything
(actually, their number is what is supposed to drop noticeably).

--
Evgeniy Polyakov

2007-08-28 14:57:45

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

Jan-Bernd Themann wrote:
> On Tuesday 28 August 2007 11:22, James Chapman wrote:
>>> So in this scheme what runs ->poll() to process incoming packets?
>>> The hrtimer?
>> No, the regular NAPI networking core calls ->poll() as usual; no timers
>> are involved. This scheme simply delays the napi_complete() from the
>> driver so the device stays in the poll list longer. It means that its
>> ->poll() will be called when there is no work to do for 1-2 jiffies,
>> hence the optimization at the top of ->poll() to efficiently handle that
>> case. The device's ->poll() is called by the NAPI core until it has
>> continuously done no work for 1-2 jiffies, at which point it finally
>> does the netif_rx_complete() and re-enables its interrupts.
>>
> I'm not sure if I understand your approach correctly.
> This approach may reduce the number of interrupts, but it does so
> by blocking the CPU for up to 1 jiffy (that can be quite some time
> on some platforms). So no other application / tasklet / softIRQ type
> can do anything in between.

I think I've misread the reworked NAPI net_rx_action code. I thought
that it ran each device ->poll() just once, rescheduling the NET_RX
softirq again if a device stayed in polled mode. I can see now that it
loops while one or more devices stays in the poll list for up to a
jiffy, just like it always has. So by keeping the device in the poll
list and not consuming quota, net_rx_action() spins until the next jiffy
tick unless another device consumes quota, like you say.
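
The spinning can be illustrated with a deliberately simplified userspace
model of that loop. It is a sketch, not the kernel's net_rx_action():
model_net_rx_action() and struct rx_action_result are hypothetical names,
there is a single device on the poll list that never leaves it, every poll
is assumed to cost a fixed poll_cost_ns, and the loop stops only when the
budget is used up or a jiffy has elapsed.

```c
#include <assert.h>

struct rx_action_result {
	long polls;		/* number of ->poll() calls made */
	long work;		/* total packets processed */
	long elapsed_ns;	/* modelled time spent in the loop */
};

/*
 * Model of one softirq run with a single device that stays on the
 * poll list and handles work_per_poll packets on every call.
 */
static struct rx_action_result
model_net_rx_action(long jiffy_ns, long poll_cost_ns, long budget,
		    long work_per_poll)
{
	struct rx_action_result r = { 0, 0, 0 };

	assert(jiffy_ns > 0 && poll_cost_ns > 0 && work_per_poll >= 0);

	/* Break out when the quota is consumed or the jiffy has ticked. */
	while (budget > 0 && r.elapsed_ns < jiffy_ns) {
		r.polls++;
		r.work += work_per_poll;
		budget -= work_per_poll;
		r.elapsed_ns += poll_cost_ns;
	}
	return r;
}
```

With a 4 ms jiffy and an assumed 1 us per empty poll,
model_net_rx_action(4000000, 1000, 300, 0) makes 4000 polls and runs for
the whole jiffy, because a poll that does no work never consumes quota.
With 64 packets per poll the budget of 300 ends the loop after 5 polls.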

I can only assume that the encouraging results that I get with this
scheme are specific to my test setups (measuring packet forwarding
rates). I agree that it isn't desirable to tie up the CPU for up to a
jiffy in net_rx_action() in order to do this. I need to go away and
rework my ideas. Perhaps it is possible to get the behavior I'm looking
for by somehow special-casing the zero return from ->poll() in
net_rx_action(), but I'm not sure.

Thanks for asking questions.

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-28 20:22:53

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: Jan-Bernd Themann
Date: Tue, 28 Aug 2007 13:19:03 +0200

> I will try the following scheme (once we get hrtimers): Each device
> (queue) has a hrtimer. Schedule the timer in the poll function
> instead of reactivating IRQs when a high load situation has been
> detected and all packets have been emptied from the receive queue.
> The timer function could then just call netif_rx_schedule to
> register the rx_queue for NAPI again.

Interrupt mitigation only works if it helps you avoid interrupts.
This scheme potentially makes more of them happen.

The hrtimer is just another interrupt, a cpu locally triggered one,
but it has much of the same costs nonetheless.

So if you set this timer, it triggers, and no packets arrive, you are
taking more interrupts and doing more work than if you had disabled
NAPI.

In fact, for certain packet rates, your scheme would result in
twice as many interrupts as the current scheme.
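
The factor of two can be reproduced with a small event-count model. This is
a sketch under explicit assumptions, not a measurement: irqs_timer_scheme()
is a hypothetical name, packets arrive strictly periodically every gap_us,
a poll is instantaneous and drains everything that has arrived, the timer is
re-armed after any poll that was entered from the hardware interrupt or that
found packets, and the driver falls back to hardware interrupts after one
empty timer-driven poll. Under the same assumptions plain NAPI takes one
interrupt per packet.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count interrupts (hardware plus timer) taken to receive n_packets
 * under the timer-polling scheme. Packet k arrives at time k * gap_us.
 */
static long irqs_timer_scheme(long n_packets, long gap_us, long timer_us)
{
	long done = 0;		/* packets processed so far */
	long now = 0;		/* modelled time in microseconds */
	long irqs = 0;
	bool timer_mode = false;

	assert(n_packets >= 0 && gap_us > 0 && timer_us > 0);

	while (done < n_packets || timer_mode) {
		if (timer_mode)
			now += timer_us;		/* timer interrupt */
		else
			now = (done + 1) * gap_us;	/* HW IRQ, next packet */
		irqs++;

		/* Packets that have arrived by now, capped at the total. */
		long arrived = now / gap_us;
		if (arrived > n_packets)
			arrived = n_packets;

		/* Keep the timer after a HW IRQ or a poll that found work. */
		timer_mode = !timer_mode || arrived > done;
		done = arrived;
	}
	return irqs;
}
```

With a packet every 100 us and a 50 us timer, 1000 packets cost 2000
interrupts: each hardware interrupt is followed by a timer interrupt that
finds nothing. With a 400 us timer the same traffic costs 252 interrupts,
so the outcome depends entirely on how the timer relates to the packet rate.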

This is one of several reasons why hardware is the only truly proper
place for this kind of logic. Only the hardware can see the packet
arrive, and do the interrupt deferral without any cpu intervention
whatsoever.

2007-08-28 20:26:14

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: Jan-Bernd Themann
Date: Tue, 28 Aug 2007 13:21:09 +0200

> Problem for multi queue driver with interrupt distribution scheme set to
> round robin for this simple example:
> Assuming we have 2 SLOW CPUs. CPU_1 is under heavy load (applications). CPU_2
> is not under heavy load. Now we receive a lot of packets (high load situation).
> Receive queue 1 (RQ1) is scheduled on CPU_1. NAPI-Poll does not manage to empty
> RQ1 ever, so it stays on CPU_1. The second receive queue (RQ2) is scheduled on
> CPU_2. As that CPU is not under heavy load, RQ2 can be emptied, and the next IRQ
> for RQ2 will go to CPU_1. Then both RQs are on CPU_1 and will stay there if
> no IRQ is forced at some time as both RQs are never emptied completely.

Why would RQ2's IRQ move over to CPU_1? That's stupid.

If you lock the IRQs to specific cpus, the load adjusts automatically.
Because packet work, in high load situations, will be processed via
ksoftirqd daemons even if the packet is not for a user application's
socket.

In such a case the scheduler will move the user applications that are
now not able to get their timeslices over to a less busy cpu.

If you keep moving the interrupts around, the scheduler cannot react
properly and it makes the situation worse.

2007-08-28 20:27:53

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: Jan-Bernd Themann
Date: Tue, 28 Aug 2007 13:21:09 +0200

> So I guess one solution is to "force" a HW interrupt when too many
> RQs are processed on the same CPU (when no IRQ pinning is
> used). This is something the driver has to handle.

No, the solution is to lock each interrupt onto one specific
processor and not move it around. Moving them is what's causing
all of the problems.

And you can enforce this policy now in the driver even if just
for testing by calling the set_affinity() interfaces on the
interrupts your driver has.

You can even walk over the cpu_online_map and choose a load
distribution of your liking.
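
A flat static distribution of this kind could be computed as in the sketch
below. It is a userspace illustration only: queue_to_cpu() and
cpu_to_affinity_mask() are hypothetical helpers, the online CPUs are
modelled as a plain 64-bit mask rather than the kernel's cpu_online_map,
and the kernel call that actually applies the affinity is not shown. From
userspace, such a mask is what would be written in hex to
/proc/irq/<irq>/smp_affinity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the CPU for a receive queue by walking the online mask
 * round robin. Returns the CPU number, or -1 if no CPU is online
 * or the queue index is negative.
 */
static int queue_to_cpu(uint64_t online_mask, int queue)
{
	int online = 0;

	for (int cpu = 0; cpu < 64; cpu++)
		if (online_mask & (UINT64_C(1) << cpu))
			online++;
	if (online == 0 || queue < 0)
		return -1;

	/* Select the (queue mod online)-th set bit of the mask. */
	int target = queue % online;
	for (int cpu = 0; cpu < 64; cpu++) {
		if (!(online_mask & (UINT64_C(1) << cpu)))
			continue;
		if (target-- == 0)
			return cpu;
	}
	return -1;
}

/* Single-CPU affinity mask for pinning one interrupt to one CPU. */
static uint64_t cpu_to_affinity_mask(int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	return UINT64_C(1) << cpu;
}
```

With CPUs 0, 2 and 3 online (mask 0xd), queues 0, 1, 2 and 3 map to CPUs
0, 2, 3 and 0. The mapping is fixed once, so the scheduler can react to the
resulting softirq load instead of chasing moving interrupts.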

2007-08-29 07:11:11

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

Hi David

David Miller wrote:
> Interrupt mitigation only works if it helps you avoid interrupts.
> This scheme potentially makes more of them happen.
>
> The hrtimer is just another interrupt, a cpu locally triggered one,
> but it has much of the same costs nonetheless.
>
> So if you set this timer, it triggers, and no packets arrive, you are
> taking more interrupts and doing more work than if you had disabled
> NAPI.
>
> In fact, for certain packet rates, your scheme would result in
> twice as many interrupts as the current scheme
>
That depends on how smartly the driver switches between timer
polling and plain NAPI (depending on the load situation).
> This is one of several reasons why hardware is the only truly proper
> place for this kind of logic. Only the hardware can see the packet
> arrive, and do the interrupt deferral without any cpu intervention
> whatsoever.
>
What I'm trying to improve with this approach is interrupt
mitigation for NICs where the hardware support for interrupt
mitigation is limited. I'm not trying to improve this for NICs
that work well with the means their HW provides. I'm aware of
the fact that this scheme has its tradeoffs and certainly
cannot be as good as a HW approach.
So I'm grateful for any ideas that have fewer tradeoffs and
provide a mechanism to reduce interrupts without depending on
HW support of the NIC.

In the end I want to reduce the CPU utilization. One way
to do that is LRO, which also only works well if there are more
than just a few packets to aggregate. So at least our
driver (eHEA) would benefit from a mix of timer based polling
and plain NAPI (depending on the load situation).

If there is no need for a generic mechanism for this kind of
network adapters, then we can just leave this to each device
driver.




2007-08-29 08:15:39

by James Chapman

Subject: Re: RFC: issues concerning the next NAPI interface

Jan-Bernd Themann wrote:
> Hi David
>
> David Miller wrote:
>> Interrupt mitigation only works if it helps you avoid interrupts.
>> This scheme potentially makes more of them happen.
>>
>> The hrtimer is just another interrupt, a cpu locally triggered one,
>> but it has much of the same costs nonetheless.
>>
>> So if you set this timer, it triggers, and no packets arrive, you are
>> taking more interrupts and doing more work than if you had disabled
>> NAPI.
>>
>> In fact, for certain packet rates, your scheme would result in
>> twice as many interrupts as the current scheme
>>
> That depends on how smartly the driver switches between timer
> polling and plain NAPI (depending on the load situation).
>> This is one of several reasons why hardware is the only truly proper
>> place for this kind of logic. Only the hardware can see the packet
>> arrive, and do the interrupt deferral without any cpu intervention
>> whatsoever.
>>
> What I'm trying to improve with this approach is interrupt
> mitigation for NICs where the hardware support for interrupt
> mitigation is limited. I'm not trying to improve this for NICs
> that work well with the means their HW provides. I'm aware of
> the fact that this scheme has its tradeoffs and certainly
> cannot be as good as a HW approach.
> So I'm grateful for any ideas that have fewer tradeoffs and
> provide a mechanism to reduce interrupts without depending on
> HW support of the NIC.
>
> In the end I want to reduce the CPU utilization. One way
> to do that is LRO, which also only works well if there are more
> than just a few packets to aggregate. So at least our
> driver (eHEA) would benefit from a mix of timer based polling
> and plain NAPI (depending on load situations).

Wouldn't you achieve the same result by enabling hardware interrupt
mitigation in eHEA in combination with NAPI? Presumably a 10G interface
has hardware mitigation features?

> If there is no need for a generic mechanism for this kind of
> network adapters, then we can just leave this to each device
> driver.

I've been looking at this from a different angle. My goal is to optimize
NAPI packet forwarding rates while minimizing packet latency. Using
hardware interrupt mitigation hurts latency so I'm investigating ways to
turn it off without risking NAPI poll on/off thrashing at certain packet
rates.

Jan-Bernd, I think I've found a solution to the issue that you
highlighted with my scheme yesterday and it doesn't involve generating
other interrupts using hrtimers etc. :) Initial results are very
encouraging in my setups. Would you be willing to test it with eHEA? I
don't have a 10G setup. If results are encouraging, I'll post an RFC to
ask for review / feedback from the NAPI experts here. What do you think?

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development

2007-08-29 08:29:30

by David Miller

Subject: Re: RFC: issues concerning the next NAPI interface

From: Jan-Bernd Themann
Date: Wed, 29 Aug 2007 09:10:15 +0200

> In the end I want to reduce the CPU utilization. One way
> to do that is LRO, which also only works well if there are more
> than just a few packets to aggregate. So at least our
> driver (eHEA) would benefit from a mix of timer based polling
> and plain NAPI (depending on load situations).
>
> If there is no need for a generic mechanism for this kind of
> network adapters, then we can just leave this to each device
> driver.

No objections from me either way, if something works then
fine.

Let's come back to this once you have a tested sample implementation
that does what you want, ok?

2007-08-29 08:31:45

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

On Wednesday 29 August 2007 10:29, David Miller wrote:
> From: Jan-Bernd Themann
> Date: Wed, 29 Aug 2007 09:10:15 +0200
>
> > In the end I want to reduce the CPU utilization. One way
> > to do that is LRO, which also only works well if there are more
> > than just a few packets to aggregate. So at least our
> > driver (eHEA) would benefit from a mix of timer based polling
> > and plain NAPI (depending on load situations).
> >
> > If there is no need for a generic mechanism for this kind of
> > network adapters, then we can just leave this to each device
> > driver.
>
> No objections from me either way, if something works then
> fine.
>
> Let's come back to this once you have a tested sample implementation
> that does what you want, ok?

Sounds good

2007-08-29 08:43:29

by Jan-Bernd Themann

Subject: Re: RFC: issues concerning the next NAPI interface

On Wednesday 29 August 2007 10:15, James Chapman wrote:
> Jan-Bernd Themann wrote:
> > What I'm trying to improve with this approach is interrupt
> > mitigation for NICs where the hardware support for interrupt
> > mitigation is limited. I'm not trying to improve this for NICs
> > that work well with the means their HW provides. I'm aware of
> > the fact that this scheme has its tradeoffs and certainly
> > cannot be as good as a HW approach.
> > So I'm grateful for any ideas that have fewer tradeoffs and
> > provide a mechanism to reduce interrupts without depending on
> > HW support of the NIC.
> >
> > In the end I want to reduce the CPU utilization. One way
> > to do that is LRO, which also only works well if there are more
> > than just a few packets to aggregate. So at least our
> > driver (eHEA) would benefit from a mix of timer based polling
> > and plain NAPI (depending on load situations).
>
> Wouldn't you achieve the same result by enabling hardware interrupt
> mitigation in eHEA in combination with NAPI? Presumably a 10G interface
> has hardware mitigation features?

Quote from above: "What I'm trying to improve with this approach
is interrupt mitigation for NICs where the hardware support for
interrupt mitigation is limited"

So guess why I'm doing that ;-)

>
> > If there is no need for a generic mechanism for this kind of
> > network adapters, then we can just leave this to each device
> > driver.
>
> I've been looking at this from a different angle. My goal is to optimize
> NAPI packet forwarding rates while minimizing packet latency. Using
> hardware interrupt mitigation hurts latency so I'm investigating ways to
> turn it off without risking NAPI poll on/off thrashing at certain packet
> rates.
>
> Jan-Bernd, I think I've found a solution to the issue that you
> highlighted with my scheme yesterday and it doesn't involve generating
> other interrupts using hrtimers etc. :) Initial results are very
> encouraging in my setups. Would you be willing to test it with eHEA? I
> don't have a 10G setup. If results are encouraging, I'll post an RFC to
> ask for review / feedback from the NAPI experts here. What do you think?
>

I'm not sure which solution you mean. If you post your RFC, please start
a new thread (with a different title).