2009-10-22 21:41:05

by David Daney

Subject: Irq architecture for multi-core network driver.

My network controller is part of a multicore SOC family[1] with up to 32
cpu cores.

The packets-ready signal from the network controller can trigger an
interrupt on any or all cpus, and this is configurable on a per-cpu basis.

If more than one cpu has the interrupt enabled, they would all get the
interrupt, so if a single packet were to be ready, all cpus could be
interrupted and try to process it. The kernel interrupt management
functions don't seem to give me a good way to manage the interrupts.
More on this later.

My current approach is to add a NAPI instance for each cpu. I start
with the interrupt enabled on a single cpu; when the interrupt
triggers, I mask the interrupt on that cpu and schedule the
napi_poll. When the napi_poll function is entered, I look at the
packet backlog and, if it is above a threshold, I enable the interrupt
on an additional cpu. The process then iterates until the number of
cpus running the napi_poll function can keep the backlog under the
threshold. This all seems to work fairly well.
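
In rough C, that ramp-up looks something like the sketch below. The
cvm_* helpers and the threshold value are hypothetical placeholders
rather than the actual staging driver code; only napi_complete() and
smp_processor_id() are real kernel interfaces here:

#include <linux/netdevice.h>

/* Placeholders for the hardware-specific pieces (not real functions). */
static int cvm_rx_backlog(void);
static int cvm_next_idle_cpu(void);
static void cvm_unmask_rx_irq_on(int cpu);
static int cvm_rx_process(struct napi_struct *napi, int budget);
static int backlog_threshold = 64;	/* arbitrary example value */

static int cvm_napi_poll(struct napi_struct *napi, int budget)
{
	int work_done;

	/* Backlog still above the threshold: bring one more cpu into
	 * the polling pool by unmasking its copy of the interrupt. */
	if (cvm_rx_backlog() > backlog_threshold) {
		int cpu = cvm_next_idle_cpu();
		if (cpu >= 0)
			cvm_unmask_rx_irq_on(cpu);
	}

	work_done = cvm_rx_process(napi, budget);

	if (work_done < budget) {
		/* Backlog drained: stop polling on this cpu and take
		 * interrupts here again. */
		napi_complete(napi);
		cvm_unmask_rx_irq_on(smp_processor_id());
	}
	return work_done;
}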

The main problem I have encountered is how to fit the interrupt
management into the kernel framework. Currently the interrupt source
is connected to a single irq number. I request_irq, and then manage
the masking and unmasking on a per cpu basis by directly manipulating
the interrupt controller's affinity/routing registers. This goes
behind the back of all the kernel's standard interrupt management
routines. I am looking for a better approach.

One thing that comes to mind is that I could assign a different
interrupt number per cpu to the interrupt signal. So instead of
having one irq I would have 32 of them. The driver would then do
request_irq for all 32 irqs, and could call enable_irq and disable_irq
to enable and disable them. The problem with this is that there isn't
really a single packets-ready signal, but instead 16 of them. So if I
go this route I would have 16 (lines) x 32 (cpus) = 512 interrupt
numbers just for the networking hardware, which seems a bit excessive.
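
For illustration, that variant would look roughly like the following
for one of the lines; FIRST_RX_IRQ, the handler name and the error
handling are made up, and only request_irq()/disable_irq() are the
real interfaces:

#include <linux/cpumask.h>
#include <linux/interrupt.h>

#define FIRST_RX_IRQ	64	/* hypothetical base irq for one line */

static irqreturn_t cvm_rx_interrupt(int irq, void *dev_id);

/* One irq number per cpu for a single packets-ready line.
 * Error unwinding is omitted to keep the sketch short. */
static int cvm_request_percpu_irqs(void *dev)
{
	int cpu, err;

	for_each_possible_cpu(cpu) {
		err = request_irq(FIRST_RX_IRQ + cpu, cvm_rx_interrupt, 0,
				  "octeon-rx", dev);
		if (err)
			return err;
		/* Start with only cpu 0 unmasked; the rest get
		 * enable_irq() later as the backlog grows. */
		if (cpu != 0)
			disable_irq(FIRST_RX_IRQ + cpu);
	}
	return 0;
}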

A second possibility is to add something like:

int irq_add_affinity(unsigned int irq, cpumask_t cpumask);

int irq_remove_affinity(unsigned int irq, cpumask_t cpumask);

These would atomically add and remove cpus from an irq's affinity.
This is essentially what my current driver does, but it would be with
a new officially blessed kernel interface.
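
For what it's worth, with such an interface the threshold logic above
would reduce to something like this (sketch only; irq_add_affinity()
and irq_remove_affinity() are the proposed functions, not an existing
kernel API):

/* Usage sketch of the proposed interface. */
static void cvm_irq_enable_on(unsigned int irq, int cpu)
{
	cpumask_t mask;

	cpumask_clear(&mask);
	cpumask_set_cpu(cpu, &mask);
	irq_add_affinity(irq, mask);	/* atomically add this cpu */
}

static void cvm_irq_disable_on(unsigned int irq, int cpu)
{
	cpumask_t mask;

	cpumask_clear(&mask);
	cpumask_set_cpu(cpu, &mask);
	irq_remove_affinity(irq, mask);	/* atomically remove it again */
}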

Any opinions about the best way forward are most welcome.

Thanks,
David Daney

[1]: See: arch/mips/cavium-octeon and drivers/staging/octeon. Yes the
staging driver is ugly, I am working to improve it.


2009-10-22 22:09:00

by Chris Friesen

Subject: Re: Irq architecture for multi-core network driver.

On 10/22/2009 03:40 PM, David Daney wrote:

> The main problem I have encountered is how to fit the interrupt
> management into the kernel framework. Currently the interrupt source
> is connected to a single irq number. I request_irq, and then manage
> the masking and unmasking on a per cpu basis by directly manipulating
> the interrupt controller's affinity/routing registers. This goes
> behind the back of all the kernel's standard interrupt management
> routines. I am looking for a better approach.
>
> One thing that comes to mind is that I could assign a different
> interrupt number per cpu to the interrupt signal. So instead of
> having one irq I would have 32 of them. The driver would then do
> request_irq for all 32 irqs, and could call enable_irq and disable_irq
> to enable and disable them. The problem with this is that there isn't
> really a single packets-ready signal, but instead 16 of them. So If I
> go this route I would have 16(lines) x 32(cpus) = 512 interrupt
> numbers just for the networking hardware, which seems a bit excessive.

Does your hardware do flow-based queues? In this model you have
multiple rx queues and the hardware hashes incoming packets to a single
queue based on the addresses, ports, etc. This ensures that all the
packets of a single connection always get processed in the order they
arrived at the net device.

Typically in this model you have as many interrupts as queues
(presumably 16 in your case). Each queue is assigned an interrupt and
that interrupt is affined to a single core.

The intel igb driver is an example of one that uses this sort of design.
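
Very roughly, the shape of that model is one request_irq() per rx
queue, each named per-queue so that tools like irqbalance can
recognize it, and each then pinned to one core (typically by writing
/proc/irq/<n>/smp_affinity). The structure and names below are
illustrative, not taken from igb:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct example_rx_queue {
	unsigned int irq;
	char name[32];
	/* per-queue ring state would live here */
};

static irqreturn_t example_rx_interrupt(int irq, void *data);

static int example_request_queue_irqs(struct net_device *netdev,
				      struct example_rx_queue *q,
				      int nqueues)
{
	int i, err;

	for (i = 0; i < nqueues; i++) {
		snprintf(q[i].name, sizeof(q[i].name), "%s-rx-%d",
			 netdev->name, i);
		err = request_irq(q[i].irq, example_rx_interrupt, 0,
				  q[i].name, &q[i]);
		if (err)
			return err;	/* unwinding omitted */
	}
	return 0;
}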

Chris

2009-10-22 22:24:48

by David Daney

Subject: Re: Irq architecture for multi-core network driver.

Chris Friesen wrote:
> On 10/22/2009 03:40 PM, David Daney wrote:
>
>> The main problem I have encountered is how to fit the interrupt
>> management into the kernel framework. Currently the interrupt source
>> is connected to a single irq number. I request_irq, and then manage
>> the masking and unmasking on a per cpu basis by directly manipulating
>> the interrupt controller's affinity/routing registers. This goes
>> behind the back of all the kernel's standard interrupt management
>> routines. I am looking for a better approach.
>>
>> One thing that comes to mind is that I could assign a different
>> interrupt number per cpu to the interrupt signal. So instead of
>> having one irq I would have 32 of them. The driver would then do
>> request_irq for all 32 irqs, and could call enable_irq and disable_irq
>> to enable and disable them. The problem with this is that there isn't
>> really a single packets-ready signal, but instead 16 of them. So If I
>> go this route I would have 16(lines) x 32(cpus) = 512 interrupt
>> numbers just for the networking hardware, which seems a bit excessive.
>
> Does your hardware do flow-based queues? In this model you have
> multiple rx queues and the hardware hashes incoming packets to a single
> queue based on the addresses, ports, etc. This ensures that all the
> packets of a single connection always get processed in the order they
> arrived at the net device.
>

Indeed, this is exactly what we have.


> Typically in this model you have as many interrupts as queues
> (presumably 16 in your case). Each queue is assigned an interrupt and
> that interrupt is affined to a single core.

Certainly this is one mode of operation that should be supported, but I
would also like to be able to go for raw throughput and have as many
cores as possible reading from a single queue (like I currently have).

>
> The intel igb driver is an example of one that uses this sort of design.
>

Thanks, I will look at that driver.

David Daney

2009-10-23 07:59:08

by Eric W. Biederman

Subject: Re: Irq architecture for multi-core network driver.

David Daney <[email protected]> writes:

> Chris Friesen wrote:
>> On 10/22/2009 03:40 PM, David Daney wrote:
>>
>>> The main problem I have encountered is how to fit the interrupt
>>> management into the kernel framework. Currently the interrupt source
>>> is connected to a single irq number. I request_irq, and then manage
>>> the masking and unmasking on a per cpu basis by directly manipulating
>>> the interrupt controller's affinity/routing registers. This goes
>>> behind the back of all the kernel's standard interrupt management
>>> routines. I am looking for a better approach.
>>>
>>> One thing that comes to mind is that I could assign a different
>>> interrupt number per cpu to the interrupt signal. So instead of
>>> having one irq I would have 32 of them. The driver would then do
>>> request_irq for all 32 irqs, and could call enable_irq and disable_irq
>>> to enable and disable them. The problem with this is that there isn't
>>> really a single packets-ready signal, but instead 16 of them. So If I
>>> go this route I would have 16(lines) x 32(cpus) = 512 interrupt
>>> numbers just for the networking hardware, which seems a bit excessive.
>>
>> Does your hardware do flow-based queues? In this model you have
>> multiple rx queues and the hardware hashes incoming packets to a single
>> queue based on the addresses, ports, etc. This ensures that all the
>> packets of a single connection always get processed in the order they
>> arrived at the net device.
>>
>
> Indeed, this is exactly what we have.
>
>
>> Typically in this model you have as many interrupts as queues
>> (presumably 16 in your case). Each queue is assigned an interrupt and
>> that interrupt is affined to a single core.
>
> Certainly this is one mode of operation that should be supported, but I would
> also like to be able to go for raw throughput and have as many cores as possible
> reading from a single queue (like I currently have).

I believe TCP will detect false packet drops and ask for unnecessary
retransmits if you have multiple cores processing a single queue,
because you are processing the packets out of order.

Eric

2009-10-23 17:28:11

by Jesse Brandeburg

Subject: Re: Irq architecture for multi-core network driver.

On Fri, Oct 23, 2009 at 12:59 AM, Eric W. Biederman
<[email protected]> wrote:
> David Daney <[email protected]> writes:
>> Certainly this is one mode of operation that should be supported, but I would
>> also like to be able to go for raw throughput and have as many cores as possible
>> reading from a single queue (like I currently have).
>
> I believe will detect false packet drops and ask for unnecessary
> retransmits if you have multiple cores processing a single queue,
> because you are processing the packets out of order.

So, the way the default linux kernel configures today's many-core
server systems is to leave the affinity mask at its default of
0xffffffff, and most current Intel hardware that I have seen, based on
the 5000 chipset (older Core cpus) or the 5500 chipset (used with Core
i7 processors), will allow round-robin interrupts by default. This
kind of sucks for the above unless you run irqbalance or set
smp_affinity by hand.

Yes, I know Arjan and others will say you should always run
irqbalance, but some people don't, and some distros don't ship it
enabled by default (or their version doesn't work for one reason or
another). The question is: should the kernel work better by default
*without* irqbalance loaded, or does it not matter?

I don't believe we should re-enable the kernel irq balancer, but
should we consider only setting a single bit in each new interrupt's
irq affinity? Doing it with a random spread for the initial affinity
would be better than setting them all to one.
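
Something along these lines at irq setup time is what I have in mind
(sketch only, not existing kernel code; random32() is just a stand-in
for whatever spreading policy we would actually pick):

#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/random.h>

/* Give a newly set up irq a single, pseudo-randomly chosen online cpu
 * as its default affinity instead of the all-ones mask. */
static void example_default_affinity(unsigned int irq)
{
	unsigned int n = random32() % num_online_cpus();
	unsigned int cpu = cpumask_first(cpu_online_mask);

	while (n--)
		cpu = cpumask_next(cpu, cpu_online_mask);

	irq_set_affinity(irq, cpumask_of(cpu));
}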

2009-10-23 23:22:38

by Eric W. Biederman

Subject: Re: Irq architecture for multi-core network driver.

Jesse Brandeburg <[email protected]> writes:

> On Fri, Oct 23, 2009 at 12:59 AM, Eric W. Biederman
> <[email protected]> wrote:
>> David Daney <[email protected]> writes:
>>> Certainly this is one mode of operation that should be supported, but I would
>>> also like to be able to go for raw throughput and have as many cores as possible
>>> reading from a single queue (like I currently have).
>>
>> I believe will detect false packet drops and ask for unnecessary
>> retransmits if you have multiple cores processing a single queue,
>> because you are processing the packets out of order.
>
> So, the way the default linux kernel configures today's many core
> server systems is to leave the affinity mask by default at 0xffffffff,
> and most current Intel hardware based on 5000 (older core cpus), or
> 5500 chipset (used with Core i7 processors) that I have seen will
> allow for round robin interrupts by default. This kind of sucks for
> the above unless you run irqbalance or set smp_affinity by hand.

On x86, if you have > 8 cores the hardware does not support any form
of irq balancing. You do have an interesting point, though.

How often, and how much, does irq balancing hurt us?

> Yes, I know Arjan and others will say you should always run
> irqbalance, but some people don't and some distros don't ship it
> enabled by default (or their version doesn't work for one reason or
> another)

irqbalance is actually more likely to move irqs than the hardware.
I have heard promises it won't move network irqs but I have seen
the opposite behavior.

> The question is should the kernel work better by default
> *without* irqbalance loaded, or does it not matter?

Good question. I would aim for the kernel to work better by default.
Ideally we should have a coupling between which sockets applications have
open, which cpus those applications run on, and which core the irqs arrive
at.

> I don't believe we should re-enable the kernel irq balancer, but
> should we consider only setting a single bit in each new interrupt's
> irq affinity? Doing it with a random spread for the initial affinity
> would be better than setting them all to one.

Not a bad idea. The practical problem is that we usually have the
irqs set up before we have the additional cpus. But that isn't
entirely true; I'm thinking mostly of pre-ACPI rules. With ACPI we do
some kind of on-demand setup of the gsi in the device initialization.

How irq threads interact also weighs in here.

Eric

2009-10-24 03:19:08

by David Miller

Subject: Re: Irq architecture for multi-core network driver.

From: Jesse Brandeburg <[email protected]>
Date: Fri, 23 Oct 2009 10:28:10 -0700

> Yes, I know Arjan and others will say you should always run
> irqbalance, but some people don't and some distros don't ship it
> enabled by default (or their version doesn't work for one reason or
> another) The question is should the kernel work better by default
> *without* irqbalance loaded, or does it not matter?

I think requiring irqbalanced for optimal behavior is more
than reasonable.

And since we explicitly took that policy logic out of the
kernel it makes absolutely no sense to put it back there.

It's policy, and policy is (largely) userspace.

2009-10-24 13:23:18

by David Miller

Subject: Re: Irq architecture for multi-core network driver.

From: David Daney <[email protected]>
Date: Thu, 22 Oct 2009 15:24:24 -0700

> Certainly this is one mode of operation that should be supported, but
> I would also like to be able to go for raw throughput and have as many
> cores as possible reading from a single queue (like I currently have).

You can't do this, at least within the same flow, since, as you even
mention in your original posting, this can result in packet
reordering, which we must avoid as much as possible.

2009-10-24 13:26:07

by David Miller

Subject: Re: Irq architecture for multi-core network driver.

From: [email protected] (Eric W. Biederman)
Date: Fri, 23 Oct 2009 16:22:36 -0700

> irqbalance is actually more likely to move irqs than the hardware.
> I have heard promises it won't move network irqs but I have seen
> the opposite behavior.

It knows what network devices are named, and looks for those keys
in /proc/interrupts. Anything named 'ethN' will not be moved, and
if you name them properly on a per-queue basis (ie. 'ethN-RX1' etc.)
it will distribute those interrupts flat amongst the cpus in the
machine.

So if you're doing "silly stuff" and naming your devices by some other
convention, you would end up defeating the detection logic built into
irqbalanced.

Actually, let's not even guess, go check out the sources of the
irqbalanced running on your system and make sure it has the network
device logic in it. :-)

2009-12-16 22:08:16

by Chetan Loke

Subject: Re: Irq architecture for multi-core network driver.

>>
>> Does your hardware do flow-based queues?  In this model you have
>> multiple rx queues and the hardware hashes incoming packets to a single
>> queue based on the addresses, ports, etc. This ensures that all the
>> packets of a single connection always get processed in the order they
>> arrived at the net device.
>>
>
> Indeed, this is exactly what we have.
>
>
>> Typically in this model you have as many interrupts as queues
>> (presumably 16 in your case).  Each queue is assigned an interrupt and
>> that interrupt is affined to a single core.
>

> Certainly this is one mode of operation that should be supported, but I
> would also like to be able to go for raw throughput and have as many cores
> as possible reading from a single queue (like I currently have).
>
Well, you could let the NIC firmware (f/w) handle this. The f/w would
know which interrupt was injected most recently. In other words, it
would have a history of which CPUs would be available. So if some
previously interrupted CPU isn't making good progress, then the
firmware should route the incoming response packets to a different
queue. This way some other CPU will pick it up.


> David Daney
> --
Chetan Loke

2009-12-16 22:42:43

by David Daney

Subject: Re: Irq architecture for multi-core network driver.

Chetan Loke wrote:
>>> Does your hardware do flow-based queues? In this model you have
>>> multiple rx queues and the hardware hashes incoming packets to a single
>>> queue based on the addresses, ports, etc. This ensures that all the
>>> packets of a single connection always get processed in the order they
>>> arrived at the net device.
>>>
>> Indeed, this is exactly what we have.
>>
>>
>>> Typically in this model you have as many interrupts as queues
>>> (presumably 16 in your case). Each queue is assigned an interrupt and
>>> that interrupt is affined to a single core.
>
>> Certainly this is one mode of operation that should be supported, but I
>> would also like to be able to go for raw throughput and have as many cores
>> as possible reading from a single queue (like I currently have).
>>
> Well, you could let the NIC firmware(f/w) handle this. The f/w would
> know which interrupt was just injected recently.In other words it
> would have a history of which CPU's would be available. So if some
> previously interrupted CPU isn't making good progress then the
> firmware should route the incoming response packets to a different
> queue. This way some other CPU will pick it up.
>


It isn't a NIC. There is no firmware. The system interrupt hardware is
what it is and cannot be changed.

My current implementation still has a single input queue configured and
I get a maskable interrupt on a single CPU when packets are available.
If the queue depth increases above a given threshold, I optionally send
an IPI to another CPU to enable NAPI polling on that CPU.

Currently I have a module parameter that controls the maximum number of
CPUs that will have NAPI polling enabled.

This allows me to get multiple CPUs doing receive processing without
having to hack into the lower levels of the system's interrupt
processing code to try to do interrupt steering. Since all the
interrupt service routine was doing was call netif_rx_schedule(), I can
simply do this via smp_call_function_single().
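
The whole cross-CPU kick amounts to roughly the following. The per-cpu
variable and the cvm_* names are made up; smp_call_function_single(),
per_cpu() and napi_schedule() are the real kernel interfaces:

#include <linux/netdevice.h>
#include <linux/percpu.h>
#include <linux/smp.h>

/* One NAPI instance per cpu (hypothetical). */
static DEFINE_PER_CPU(struct napi_struct, cvm_napi);

static void cvm_napi_kick(void *info)
{
	/* Runs on the target cpu, in IPI (interrupt) context. */
	napi_schedule(&per_cpu(cvm_napi, smp_processor_id()));
}

static void cvm_start_polling_on(int cpu)
{
	/* wait == 0: don't block the cpu that noticed the backlog. */
	smp_call_function_single(cpu, cvm_napi_kick, NULL, 0);
}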

David Daney

2009-12-16 23:01:51

by Stephen Hemminger

Subject: Re: Irq architecture for multi-core network driver.

On Wed, 16 Dec 2009 14:30:36 -0800
David Daney <[email protected]> wrote:

> Chetan Loke wrote:
> >>> Does your hardware do flow-based queues? In this model you have
> >>> multiple rx queues and the hardware hashes incoming packets to a single
> >>> queue based on the addresses, ports, etc. This ensures that all the
> >>> packets of a single connection always get processed in the order they
> >>> arrived at the net device.
> >>>
> >> Indeed, this is exactly what we have.
> >>
> >>
> >>> Typically in this model you have as many interrupts as queues
> >>> (presumably 16 in your case). Each queue is assigned an interrupt and
> >>> that interrupt is affined to a single core.
> >
> >> Certainly this is one mode of operation that should be supported, but I
> >> would also like to be able to go for raw throughput and have as many cores
> >> as possible reading from a single queue (like I currently have).
> >>
> > Well, you could let the NIC firmware(f/w) handle this. The f/w would
> > know which interrupt was just injected recently.In other words it
> > would have a history of which CPU's would be available. So if some
> > previously interrupted CPU isn't making good progress then the
> > firmware should route the incoming response packets to a different
> > queue. This way some other CPU will pick it up.
> >
>
>
> It isn's a NIC. There is no firmware. The system interrupt hardware is
> what it is and cannot be changed.
>
> My current implementation still has a single input queue configured and
> I get a maskable interrupt on a single CPU when packets are available.
> If the queue depth increases above a given threshold, I optionally send
> an IPI to another CPU to enable NAPI polling on that CPU.
>
> Currently I have a module parameter that controls the maximum number of
> CPUs that will have NAPI polling enabled.
>
> This allows me to get multiple CPUs doing receive processing without
> having to hack into the lower levels of the system's interrupt
> processing code to try to do interrupt steering. Since all the
> interrupt service routine was doing was call netif_rx_schedule(), I can
> simply do this via smp_call_function_single().

Better to look into the receive packet steering patches that are still
under review (rather than reinventing this just for your driver).

--

2009-12-16 23:37:08

by David Daney

Subject: Re: Irq architecture for multi-core network driver.

Stephen Hemminger wrote:
> On Wed, 16 Dec 2009 14:30:36 -0800
> David Daney <[email protected]> wrote:
>
>> Chetan Loke wrote:
>>>>> Does your hardware do flow-based queues? In this model you have
>>>>> multiple rx queues and the hardware hashes incoming packets to a single
>>>>> queue based on the addresses, ports, etc. This ensures that all the
>>>>> packets of a single connection always get processed in the order they
>>>>> arrived at the net device.
>>>>>
>>>> Indeed, this is exactly what we have.
>>>>
>>>>
>>>>> Typically in this model you have as many interrupts as queues
>>>>> (presumably 16 in your case). Each queue is assigned an interrupt and
>>>>> that interrupt is affined to a single core.
>>>> Certainly this is one mode of operation that should be supported, but I
>>>> would also like to be able to go for raw throughput and have as many cores
>>>> as possible reading from a single queue (like I currently have).
>>>>
>>> Well, you could let the NIC firmware(f/w) handle this. The f/w would
>>> know which interrupt was just injected recently.In other words it
>>> would have a history of which CPU's would be available. So if some
>>> previously interrupted CPU isn't making good progress then the
>>> firmware should route the incoming response packets to a different
>>> queue. This way some other CPU will pick it up.
>>>
>>
>> It isn's a NIC. There is no firmware. The system interrupt hardware is
>> what it is and cannot be changed.
>>
>> My current implementation still has a single input queue configured and
>> I get a maskable interrupt on a single CPU when packets are available.
>> If the queue depth increases above a given threshold, I optionally send
>> an IPI to another CPU to enable NAPI polling on that CPU.
>>
>> Currently I have a module parameter that controls the maximum number of
>> CPUs that will have NAPI polling enabled.
>>
>> This allows me to get multiple CPUs doing receive processing without
>> having to hack into the lower levels of the system's interrupt
>> processing code to try to do interrupt steering. Since all the
>> interrupt service routine was doing was call netif_rx_schedule(), I can
>> simply do this via smp_call_function_single().
>
> Better to look into receive packet steering patches that are still
> under review (rather than reinventing it just for your driver)
>

Indeed. Although it turns out that I can do packet steering in
hardware across up to 16 queues, each with its own irq and thus a
dedicated CPU. So it is unclear to me if the receive packet steering
patches offer much benefit to this hardware.

One concern is the ability to forward as many packets as possible from
a very low number of flows (between 1 and 4). Since it is an
artificial benchmark, we can arbitrarily say that packet reordering is
allowed. The simple hack of doing NAPI polling on all CPUs from a
single queue gives good results. There is no need to remind me that
packet reordering should be avoided; I already know this.

David Daney