I got some comments on v1. Special thanks to Stephen Hemminger for explaining
what reordering is, along with other feedback. Thanks also to everyone else who commented.
v2 has the following improvements:
1) Add a new sysfs interface, /sys/class/net/ethXXX/rx_queueXXX/processing_cpu. The admin
can use it to configure the binding between an RX queue and a cpu number, which makes it
convenient for drivers to use the new capability (see the sketch after this list).
2) Delete the netif_rx_queue function;
3) Optimize the IPI notification: no new notification is sent when the destination's
input_pkt_alien_queue isn't empty.
4) Did lots of testing, mostly focused on the slab allocator (slab/slub/slqb); currently
using SLUB with a big slub_max_order.
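For illustration only, a store handler for such a processing_cpu attribute could look
roughly like the sketch below. The struct netdev_rx_queue with a processing_cpu member and
the handler signature are assumptions made for the sketch; the real sysfs code is not part
of the patch quoted in this mail.

static ssize_t store_processing_cpu(struct netdev_rx_queue *rxq,
				    const char *buf, size_t len)
{
	unsigned long cpu;

	/* parse the cpu number written by the admin */
	if (strict_strtoul(buf, 10, &cpu))
		return -EINVAL;

	if (cpu >= nr_cpu_ids || !cpu_online(cpu))
		return -EINVAL;

	/* the driver reads this field in its NAPI poll routine */
	rxq->processing_cpu = cpu;
	return len;
}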
---
Subject: net: hand off skb list to other cpu to submit to upper layer
From: Zhang Yanmin <[email protected]>
Recently, I have been investigating an ip_forward performance issue with 10G IXGBE NICs.
I set up the test on 2 machines, each with two 10G NICs. The 1st machine sends
packets with pktgen. The 2nd receives the packets on one NIC and forwards them out
through the other NIC.
Initial testing showed that cpu cache sharing has an impact on speed. As the NICs support
multi-queue, I bound the queues to logical cpus on different physical cpus while considering
cache sharing carefully, which gave about a 30~40% improvement.
Compared with the sending speed on the 1st machine, the forwarding speed is still not good,
only about 60% of the sending speed. The IXGBE driver starts NAPI when an interrupt arrives.
With ip_forward=1, the receiver collects a packet and forwards it out immediately. So although
IXGBE collects packets with NAPI, the forwarding work really hurts collection. As IXGBE runs
very fast, it drops packets quickly. The best thing the receiving cpu can do is nothing but
collect packets.
Currently the kernel has the backlog to support a similar capability, but process_backlog still
runs on the receiving cpu. I enhance the backlog by adding a new input_pkt_alien_queue to
softnet_data. The receiving cpu collects packets, links them into an skb list, and delivers
the list to the input_pkt_alien_queue of another cpu. process_backlog picks up the skb list
from input_pkt_alien_queue when input_pkt_queue is empty.
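As a rough illustration, a driver's NAPI poll routine could batch the skbs it pulls off the
hardware ring and hand the whole list to the configured processing cpu through the
raise_netif_irq() interface added below. The struct example_rx_queue, fetch_rx_skb() and the
processing_cpu field are placeholders for driver-specific code, not part of the patch.

static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_rx_queue *rxq =
		container_of(napi, struct example_rx_queue, napi);
	struct sk_buff_head batch;
	struct sk_buff *skb;
	int work = 0;

	skb_queue_head_init(&batch);

	/* only collect packets here; no netif_receive_skb() on this cpu */
	while (work < budget && (skb = fetch_rx_skb(rxq)) != NULL) {
		__skb_queue_tail(&batch, skb);
		work++;
	}

	/*
	 * Hand the list to the processing cpu. If that cpu is offline or
	 * is the local cpu, raise_netif_irq() falls back to the local
	 * backlog queue.
	 */
	raise_netif_irq(rxq->processing_cpu, &batch);

	if (work < budget)
		napi_complete(napi);

	return work;
}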
I tested my patch on top of 2.6.28.5. The improvement is about 43%.
Some questions:
1) Reordering: my method doesn't introduce a reordering issue, because we use an N:1 mapping
between RX queues and a cpu, so all packets of one RX queue are handled by one cpu.
2) What if there is no free cpu to work on packet collection: it depends on cpu resource
allocation. We can bind more RX queues to the same cpu. In my new testing, the forwarding
speed reaches about 4.8M pps (packets per second, with 60-byte packets) on the Nehalem
machine; the 8 packet-processing cpus have almost no idle time, while the receiving cpu is
about 50% idle. I only have 4 old NICs and couldn't test this further.
3) Packet delay: I didn't calculate or measure it, but might measure it later. The
forwarding speed is close to 270M bytes/s. At least sar shows that receiving mostly matches
forwarding. On the sending side, however, the sending speed is higher than the forwarding
speed, although my method shrinks the difference considerably.
4) 10G NICs other than IXGBE: I have no other 10G NICs at the moment.
5) Other kinds of machines working as the forwarder: I tested between a 2*4 Stoakley and a
2*4*2 Nehalem. I reversed the test and found the improvement on Stoakley is less than 30%,
not as big as on Nehalem.
6) Memory utilization: my Nehalem machine has 12GB of memory. To reach the maximum speed,
I tried netdev_max_backlog=400000, which sometimes consumes 10GB of memory.
7) Any impact if the driver enables the new capability but the admin doesn't configure it:
I haven't measured the speed difference yet.
8) What if the receiving cpu collects packets very fast and the processing cpu is slow: we can
start many RX queues on the receiving cpu and bind them to different processing cpus.
Current patch is against 2.6.29-rc7.
Signed-off-by: Zhang Yanmin <[email protected]>
---
--- linux-2.6.29-rc7/include/linux/netdevice.h 2009-03-09 15:20:49.000000000 +0800
+++ linux-2.6.29-rc7_backlog/include/linux/netdevice.h 2009-03-11 10:17:08.000000000 +0800
@@ -1119,6 +1119,9 @@ static inline int unregister_gifconf(uns
/*
* Incoming packets are placed on per-cpu queues so that
* no locking is needed.
+ * On fast networks, incoming packets may sometimes be placed
+ * on another cpu's queue. input_pkt_alien_queue.lock protects
+ * input_pkt_alien_queue.
*/
struct softnet_data
{
@@ -1127,6 +1130,7 @@ struct softnet_data
struct list_head poll_list;
struct sk_buff *completion_queue;
+ struct sk_buff_head input_pkt_alien_queue;
struct napi_struct backlog;
};
@@ -1368,6 +1372,8 @@ extern void dev_kfree_skb_irq(struct sk_
extern void dev_kfree_skb_any(struct sk_buff *skb);
#define HAVE_NETIF_RX 1
+extern int raise_netif_irq(int cpu,
+ struct sk_buff_head *skb_queue);
extern int netif_rx(struct sk_buff *skb);
extern int netif_rx_ni(struct sk_buff *skb);
#define HAVE_NETIF_RECEIVE_SKB 1
--- linux-2.6.29-rc7/net/core/dev.c 2009-03-09 15:20:50.000000000 +0800
+++ linux-2.6.29-rc7_backlog/net/core/dev.c 2009-03-11 10:27:57.000000000 +0800
@@ -1997,6 +1997,114 @@ int netif_rx_ni(struct sk_buff *skb)
EXPORT_SYMBOL(netif_rx_ni);
+static void net_drop_skb(struct sk_buff_head *skb_queue)
+{
+ struct sk_buff *skb = __skb_dequeue(skb_queue);
+
+ while (skb) {
+ __get_cpu_var(netdev_rx_stat).dropped++;
+ kfree_skb(skb);
+ skb = __skb_dequeue(skb_queue);
+ }
+}
+
+static int net_backlog_local_merge(struct sk_buff_head *skb_queue)
+{
+ struct softnet_data *queue;
+ unsigned long flags;
+
+ queue = &__get_cpu_var(softnet_data);
+ if (queue->input_pkt_queue.qlen + skb_queue->qlen <=
+ netdev_max_backlog) {
+
+ local_irq_save(flags);
+ if (!queue->input_pkt_queue.qlen)
+ napi_schedule(&queue->backlog);
+ skb_queue_splice_tail_init(skb_queue, &queue->input_pkt_queue);
+ local_irq_restore(flags);
+
+ return 0;
+ } else {
+ net_drop_skb(skb_queue);
+ return 1;
+ }
+}
+
+static void net_napi_backlog(void *data)
+{
+ struct softnet_data *queue = &__get_cpu_var(softnet_data);
+
+ napi_schedule(&queue->backlog);
+ kfree(data);
+}
+
+static int net_backlog_notify_cpu(int cpu)
+{
+ struct call_single_data *data;
+
+ data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+ if (!data)
+ return -1;
+
+ data->func = net_napi_backlog;
+ data->info = data;
+ data->flags = 0;
+ __smp_call_function_single(cpu, data);
+
+ return 0;
+}
+
+int raise_netif_irq(int cpu, struct sk_buff_head *skb_queue)
+{
+ unsigned long flags;
+ struct softnet_data *queue;
+ int retval, need_notify=0;
+
+ if (!skb_queue || skb_queue_empty(skb_queue))
+ return 0;
+
+ /*
+ * If cpu is offline, we queue skb back to
+ * the queue on current cpu.
+ */
+ if ((unsigned)cpu >= nr_cpu_ids ||
+ !cpu_online(cpu) ||
+ cpu == smp_processor_id()) {
+ net_backlog_local_merge(skb_queue);
+ return 0;
+ }
+
+ queue = &per_cpu(softnet_data, cpu);
+ if (queue->input_pkt_alien_queue.qlen > netdev_max_backlog)
+ goto failed1;
+
+ spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+ if (skb_queue_empty(&queue->input_pkt_alien_queue))
+ need_notify = 1;
+ skb_queue_splice_tail_init(skb_queue,
+ &queue->input_pkt_alien_queue);
+ spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock,
+ flags);
+
+ if (need_notify) {
+ retval = net_backlog_notify_cpu(cpu);
+ if (unlikely(retval))
+ goto failed2;
+ }
+
+ return 0;
+
+failed2:
+ spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+ skb_queue_splice_tail_init(&queue->input_pkt_alien_queue, skb_queue);
+ spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock,
+ flags);
+failed1:
+ net_drop_skb(skb_queue);
+
+ return 1;
+}
+
static void net_tx_action(struct softirq_action *h)
{
struct softnet_data *sd = &__get_cpu_var(softnet_data);
@@ -2336,6 +2444,13 @@ static void flush_backlog(void *arg)
struct net_device *dev = arg;
struct softnet_data *queue = &__get_cpu_var(softnet_data);
struct sk_buff *skb, *tmp;
+ unsigned long flags;
+
+ spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+ skb_queue_splice_tail_init(
+ &queue->input_pkt_alien_queue,
+ &queue->input_pkt_queue );
+ spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock, flags);
skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
if (skb->dev == dev) {
@@ -2594,9 +2709,19 @@ static int process_backlog(struct napi_s
local_irq_disable();
skb = __skb_dequeue(&queue->input_pkt_queue);
if (!skb) {
- __napi_complete(napi);
- local_irq_enable();
- break;
+ if (!skb_queue_empty(&queue->input_pkt_alien_queue)) {
+ spin_lock(&queue->input_pkt_alien_queue.lock);
+ skb_queue_splice_tail_init(
+ &queue->input_pkt_alien_queue,
+ &queue->input_pkt_queue );
+ spin_unlock(&queue->input_pkt_alien_queue.lock);
+
+ skb = __skb_dequeue(&queue->input_pkt_queue);
+ } else {
+ __napi_complete(napi);
+ local_irq_enable();
+ break;
+ }
}
local_irq_enable();
@@ -4985,6 +5110,11 @@ static int dev_cpu_callback(struct notif
local_irq_enable();
/* Process offline CPU's input_pkt_queue */
+ spin_lock(&oldsd->input_pkt_alien_queue.lock);
+ skb_queue_splice_tail_init(&oldsd->input_pkt_alien_queue,
+ &oldsd->input_pkt_queue);
+ spin_unlock(&oldsd->input_pkt_alien_queue.lock);
+
while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
netif_rx(skb);
@@ -5184,10 +5314,13 @@ static int __init net_dev_init(void)
struct softnet_data *queue;
queue = &per_cpu(softnet_data, i);
+
skb_queue_head_init(&queue->input_pkt_queue);
queue->completion_queue = NULL;
INIT_LIST_HEAD(&queue->poll_list);
+ skb_queue_head_init(&queue->input_pkt_alien_queue);
+
queue->backlog.poll = process_backlog;
queue->backlog.weight = weight_p;
queue->backlog.gro_list = NULL;
@@ -5247,6 +5380,7 @@ EXPORT_SYMBOL(netdev_set_master);
EXPORT_SYMBOL(netdev_state_change);
EXPORT_SYMBOL(netif_receive_skb);
EXPORT_SYMBOL(netif_rx);
+EXPORT_SYMBOL(raise_netif_irq);
EXPORT_SYMBOL(register_gifconf);
EXPORT_SYMBOL(register_netdevice);
EXPORT_SYMBOL(register_netdevice_notifier);
"Zhang, Yanmin" <[email protected]> writes:
> I got some comments. Special thanks to Stephen Hemminger for teaching me on
> what reorder is and some other comments. Also thank other guys who raised comments.
>
> v2 has some improvements.
> 1) Add new sysfs interface /sys/class/net/ethXXX/rx_queueXXX/processing_cpu. Admin
> could use it to configure the binding between RX and cpu number. So it's convenient
> for drivers to use the new capability.
Seems very inconvenient to have to configure this by hand. How about
auto selecting one that shares the same LLC or somesuch? Passing
data to anything with the same LLC should be cheap enough.
BTW the standard idea to balance processing over multiple CPUs was to
use MSI-X to multiple CPUs and just use the hash function on the
NIC. Have you considered this for forwarding too? The trick here would
be to try to avoid reordering inside streams as far as possible, but
since the NIC hash should work on flow basis that should be ok.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote:
> "Zhang, Yanmin" <[email protected]> writes:
>
> > I got some comments. Special thanks to Stephen Hemminger for teaching me on
> > what reorder is and some other comments. Also thank other guys who raised comments.
>
>
> >
> > v2 has some improvements.
> > 1) Add new sysfs interface /sys/class/net/ethXXX/rx_queueXXX/processing_cpu. Admin
> > could use it to configure the binding between RX and cpu number. So it's convenient
> > for drivers to use the new capability.
>
> Seems very inconvenient to have to configure this by hand.
A little, but not too much, especially considering that interrupt binding already exists.
> How about
> auto selecting one that shares the same LLC or somesuch?
There are 2 kinds of LLC sharing here.
1) RX/TX share the LLC;
2) All RX share the LLC of some cpus and TX shares the LLC of other cpus.
Item 1) is important, but sometimes item 2) is also important, when the sending speed is
very high and a huge amount of data is in flight, which flushes the cpu cache quickly.
It's hard to distinguish the 2 different scenarios automatically.
> Passing
> data to anything with the same LLC should be cheap enough.
Yes, when the data isn't huge. My forwarding test currently reaches about 270M bytes per
second on Nehalem, and I expect higher if I can get the latest NICs.
> BTW the standard idea to balance processing over multiple CPUs was to
> use MSI-X to multiple CPUs.
Yes. My method still depends on MSI-X and multi-queue. One difference is that I need fewer
than CPU_NUM interrupt vectors, as only some cpus work on packet receiving.
> and just use the hash function on the
> NIC.
Sorry, I don't understand what the hash function of the NIC is. Perhaps the NIC hardware has
something like a hash function to decide the RX queue number based on SRC/DST?
> Have you considered this for forwarding too?
Yes. Originally, I planned to add a tx_num under the same sysfs directory, so the admin could
specify that all packets received from an RX queue should be sent out through a specific TX
queue. struct sk_buff->queue_mapping would then become a union of 2 sub-members, rx_num and
tx_num. But sk_buff->queue_mapping is just a u16, which is a small type. We might use the
most-significant bit of sk_buff->queue_mapping as a flag, since rx_num and tx_num wouldn't
exist at the same time.
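Something like the following sketch is what I have in mind; the helper names and the flag
bit are illustrative only, not code from the posted patch.

#define QM_RX_FLAG	0x8000	/* set: queue_mapping holds an RX queue number */

static inline void skb_set_rx_num(struct sk_buff *skb, u16 rx)
{
	skb->queue_mapping = QM_RX_FLAG | (rx & 0x7fff);
}

static inline void skb_set_tx_num(struct sk_buff *skb, u16 tx)
{
	skb->queue_mapping = tx & 0x7fff;
}

static inline int skb_has_rx_num(const struct sk_buff *skb)
{
	return skb->queue_mapping & QM_RX_FLAG;
}

static inline u16 skb_queue_num(const struct sk_buff *skb)
{
	return skb->queue_mapping & 0x7fff;
}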
> The trick here would
> be to try to avoid reordering inside streams as far as possible,
It's not meant to solve a reordering issue. The starting point is that a 10G NIC is very fast,
so we need some cpus to work on packet receiving exclusively. If they work on other things,
the NIC might drop packets quickly.
The sysfs interface is just to make things easy for NIC drivers. Without the sysfs interface,
driver developers would have to implement it with module parameters, which is painful.
> but
> since the NIC hash should work on flow basis that should be ok.
Yes, the hardware is good at preventing reordering. My method doesn't change the order in the
software layer.
Thanks, Andi.
On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote:
[...]
> > and just use the hash function on the
> > NIC.
> Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something
> like hash function to decide the RX queue number based on SRC/DST?
Yes, that's exactly what they do. This feature is sometimes called
Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft
requires Windows drivers performing RSS to provide the hash value to the
networking stack, so Linux drivers for the same hardware should be able
to do so too.
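The hash these NICs compute is usually the Toeplitz hash from the Microsoft RSS spec: for
every set bit of the input (the 4-tuple, fed MSB first), a sliding 32-bit window of a fixed
secret key is XORed into the result. A standalone sketch of that computation, not code taken
from any driver discussed here (the key must be at least len + 4 bytes, e.g. the usual
40-byte RSS key with a 12-byte IPv4 4-tuple):

#include <stdint.h>

static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data, int len)
{
	/* 32-bit window into the key, starting at the first 4 key bytes */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) | key[3];
	uint32_t hash = 0;
	int byte, bit;

	for (byte = 0; byte < len; byte++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[byte] & (1 << bit))
				hash ^= window;
			/* slide the key window left by one bit */
			window = (window << 1) | ((key[byte + 4] >> bit) & 1);
		}
	}
	return hash;
}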
> > Have you considered this for forwarding too?
> Yes. originally, I plan to add a tx_num under the same sysfs directory, so admin could
> define that all packets received from a RX queue should be sent out from a specific TX queue.
The choice of TX queue can be based on the RX hash so that configuration
is usually unnecessary.
> So struct sk_buff->queue_mapping would be a union of 2 sub-members, rx_num and tx_num. But
> sk_buff->queue_mapping is just a u16 which is a small type. We might use the most-significant
> bit of sk_buff->queue_mapping as a flag as rx_num and tx_num wouldn't exist at the
> same time.
>
> > The trick here would
> > be to try to avoid reordering inside streams as far as possible,
> It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu
> work on packet receiving dedicately. If they work on other things, NIC might drop packets
> quickly.
Aggressive power-saving causes far greater latency than context-
switching under Linux. I believe most 10G NICs have large RX FIFOs to
mitigate against this. Ethernet flow control also helps to prevent
packet loss.
> The sysfs interface is just to facilitate NIC drivers. If there is no the sysfs interface,
> driver developers need implement it with parameters which are painful.
[...]
Or through the ethtool API, which already has some multiqueue control
operations.
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
On Thu, Mar 12, 2009 at 04:16:32PM +0800, Zhang, Yanmin wrote:
>
> > Seems very inconvenient to have to configure this by hand.
> A little, but not too much, especially when we consider there is interrupt binding.
Interrupt binding is something popular for benchmarks, but most users
don't (and shouldn't need to) care. Having it work well out of the box
without special configuration is very important.
>
> > How about
> > auto selecting one that shares the same LLC or somesuch?
> There are 2 kinds of LLC sharing here.
> 1) RX/TX share the LLC;
> 2) All RX share the LLC of some cpus and TX share the LLC of other cpus.
>
> Item 1) is important, but sometimes item 2) is also important when the sending speed is
> very high and huge data is on flight which flushes cpu cache quickly.
> It's hard to distinguish the 2 different scenarioes automatically.
Why is it hard if you know the CPUs?
> > and just use the hash function on the
> > NIC.
> Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something
> like hash function to decide the RX queue number based on SRC/DST?
There's a Microsoft spec for a standard hash function that does this
on NICs and all the serious ones support it these days. The hash
is normally used to select a MSI-X target based on the input header.
I think if that works your manual target shouldn't be necessary.
> > The trick here would
> > be to try to avoid reordering inside streams as far as possible,
> It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu
Point was that any solution shouldn't add more reordering. But when a RSS
hash is used there is no reordering on stream basis.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, 2009-03-12 at 14:08 +0000, Ben Hutchings wrote:
> On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote:
> [...]
> > > and just use the hash function on the
> > > NIC.
> > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something
> > like hash function to decide the RX queue number based on SRC/DST?
>
> Yes, that's exactly what they do. This feature is sometimes called
> Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft
> requires Windows drivers performing RSS to provide the hash value to the
> networking stack, so Linux drivers for the same hardware should be able
> to do so too.
Oh, I didn't know the background. I need to study more about networking.
Thanks for explaining it.
>
> > > Have you considered this for forwarding too?
> > Yes. originally, I plan to add a tx_num under the same sysfs directory, so admin could
> > define that all packets received from a RX queue should be sent out from a specific TX queue.
>
> The choice of TX queue can be based on the RX hash so that configuration
> is usually unnecessary.
I agree. I double-checked the latest code in the net-next-2.6 tree, and the skb_tx_hash
function is enough.
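The underlying idea is just to fold a flow hash onto the number of real TX queues, so packets
of one flow always leave on the same queue. A minimal standalone sketch of that mapping
(illustrative; not the actual skb_tx_hash() code in net-next):

#include <stdint.h>

static uint16_t tx_queue_from_hash(uint32_t hash, uint16_t num_tx_queues)
{
	/* scale the 32-bit hash into [0, num_tx_queues); same flow, same queue */
	return (uint16_t)(((uint64_t)hash * num_tx_queues) >> 32);
}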
>
> > So struct sk_buff->queue_mapping would be a union of 2 sub-members, rx_num and tx_num. But
> > sk_buff->queue_mapping is just a u16 which is a small type. We might use the most-significant
> > bit of sk_buff->queue_mapping as a flag as rx_num and tx_num wouldn't exist at the
> > same time.
> >
> > > The trick here would
> > > be to try to avoid reordering inside streams as far as possible,
> > It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu
> > work on packet receiving dedicately. If they work on other things, NIC might drop packets
> > quickly.
>
> Aggressive power-saving causes far greater latency than context-
> switching under Linux.
Yes, when the NIC is mostly idle. When the NIC is busy, it wouldn't enter a power-saving mode.
For performance testing we usually turn off all power-saving modes. :)
> I believe most 10G NICs have large RX FIFOs to
> mitigate against this. Ethernet flow control also helps to prevent
> packet loss.
I guess the NIC might allocate resources evenly across all queues, at least by default. If we
consider a packet-sending burst with the same SRC/DST, a specific queue might fill up quickly.
I instrumented the driver and kernel to print out packet receiving and forwarding. As the
latest IXGBE driver gets a packet and forwards it immediately, I think most packets are
dropped by hardware because the cpu doesn't collect packets quickly enough while that specific
receive queue is full. By comparing the sending speed and the forwarding speed, we can get the
drop rate easily.
My experiment shows the receiving cpu is more than 50% idle and the cpu does often collect all
packets until the specific queue is empty. I think that's because pktgen switches to a new
SRC/DST to produce another burst that fills other queues quickly.
It's hard to say the cpu is slower than the NIC, because they work on different parts of the
full receiving/processing procedure. But we need the cpu to collect packets ASAP.
> > The sysfs interface is just to facilitate NIC drivers. If there is no the sysfs interface,
> > driver developers need implement it with parameters which are painful.
> [...]
>
> Or through the ethtool API, which already has some multiqueue control
> operations.
That's an alternative way to configure it. Looking at the sample driver patch, the change is
very small.
Thanks for your kind comments.
Yanmin
On Thu, 2009-03-12 at 15:34 +0100, Andi Kleen wrote:
> On Thu, Mar 12, 2009 at 04:16:32PM +0800, Zhang, Yanmin wrote:
> >
> > > Seems very inconvenient to have to configure this by hand.
> > A little, but not too much, especially when we consider there is interrupt binding.
>
> Interrupt binding is something popular for benchmarks, but most users
> don't (and shouldn't need to) care. Having it work well out of the box
> without special configuration is very important.
Thanks, Andi. That's the truth. Now I understand why David Miller is working
on automatic TX queue selection.
One thing I want to clarify is that, with the default configuration, the processing path
still goes through the current automatic selection. That means my method has little impact
on the current automatic selection with the default configuration, except a small cache miss.
Another exception is that IXGBE prefers getting one packet and sending it out
immediately instead of using the backlog.
Even when the new capability is turned on to separate packet receiving and packet
processing, TX selection still follows the current automatic selection. The difference
is that we use a different cpu. The driver can still record the RX number into the skb,
which is used when sending it out.
>
> >
> > > How about
> > > auto selecting one that shares the same LLC or somesuch?
> > There are 2 kinds of LLC sharing here.
> > 1) RX/TX share the LLC;
> > 2) All RX share the LLC of some cpus and TX share the LLC of other cpus.
> >
> > Item 1) is important, but sometimes item 2) is also important when the sending speed is
> > very high and huge data is on flight which flushes cpu cache quickly.
> > It's hard to distinguish the 2 different scenarioes automatically.
>
> Why is it hard if you know the CPUs?
RX binding depends entirely on interrupt binding. If the MSI-X interrupt is sent to cpu A,
cpu A will collect the packets on that RX queue. By default, interrupts aren't bound.
Software knows the LLC sharing of cpu A, but if cpu A receives the interrupt, it can't just
throw packets at other cpus that share its LLC, because it doesn't know whether those cpus
are collecting packets from other RX queues at that moment.
>
> > > and just use the hash function on the
> > > NIC.
> > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something
> > like hash function to decide the RX queue number based on SRC/DST?
>
> There's a Microsoft spec for a standard hash function that does this
> on NICs and all the serious ones support it these days. The hash
> is normally used to select a MSI-X target based on the input header.
Thanks for the explanation. The capability defined by the spec is to choose
an MSI-X vector and provide a hint when sending a cloned packet out. Does the NIC
know how busy a cpu is? I assume not. So the hash tries to distribute packets
into RX queues evenly while also avoiding reordering.
We might say irqbalance could balance the workload so that cpu load is even. My testing
shows that such an even distribution of packets over all cpus isn't good for performance.
>
> I think if that works your manual target shouldn't be necessary.
There are 2 targets with my method: one is the packet-collecting cpu and the other
is the packet-processing cpu.
As the NIC doesn't know how busy a cpu is, why can't we separate the processing?
>
> > > The trick here would
> > > be to try to avoid reordering inside streams as far as possible,
> > It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu
>
> Point was that any solution shouldn't add more reordering. But when a RSS
> hash is used there is no reordering on stream basis.
Yes.
Thanks again.
Yanmin
On Thu, Mar 12, 2009 at 11:43 PM, Zhang, Yanmin
<[email protected]> wrote:
>
> On Thu, 2009-03-12 at 14:08 +0000, Ben Hutchings wrote:
> > On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote:
> > [...]
> > > > and just use the hash function on the
> > > > NIC.
> > > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something
> > > like hash function to decide the RX queue number based on SRC/DST?
> >
> > Yes, that's exactly what they do. This feature is sometimes called
> > Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft
> > requires Windows drivers performing RSS to provide the hash value to the
> > networking stack, so Linux drivers for the same hardware should be able
> > to do so too.
> Oh, I didn't know the background. I need study more about network.
> Thanks for explain it.
>
You'll definitely want to look at the hardware provided hash. We've
been using a 10G NIC which provides a Toeplitz hash (the one defined
by Microsoft) and a software RSS-like capability to move packets from
an interrupting CPU to another for processing. The hash could be used
to index to a set of CPUs, but we also use the hash as a connection
identifier to key into a lookup table to steer packets to the CPU
where the application is running based on the running CPU of the last
recvmsg. Using the device provided hash in this manner is a HUGE win,
as opposed to taking cache misses to get 4-tuple from packet itself to
compute a hash. I posted some patches a while back on our work if
you're interested.
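In outline, that steering amounts to a table keyed by the NIC-provided flow hash that
remembers which cpu last consumed the flow. The names, table size and the fallback below are
assumptions for illustration, not the actual posted patches.

#include <stdint.h>

#define FLOW_TABLE_SIZE	4096	/* power of two */

struct flow_entry {
	uint32_t hash;	/* NIC-provided hash of the flow */
	int	 cpu;	/* cpu of the last recvmsg() for this flow */
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];

/* recvmsg() path: remember where the consumer runs */
static void flow_record_cpu(uint32_t hash, int cpu)
{
	struct flow_entry *e = &flow_table[hash & (FLOW_TABLE_SIZE - 1)];

	e->hash = hash;
	e->cpu = cpu;
}

/* receive path: pick a target cpu without touching the packet headers */
static int flow_steer_cpu(uint32_t hash, int default_cpu)
{
	struct flow_entry *e = &flow_table[hash & (FLOW_TABLE_SIZE - 1)];

	return (e->hash == hash) ? e->cpu : default_cpu;
}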
We are also using multiple RX queues of the 10G device in concert, with
pretty good results. We have noticed that the interrupt overheads
substantially mitigate the benefits. In fact, I would say the
software packet steering has provided the greater benefit (and it's
very useful on our many 1G NICs that don't have multiq!).
Tom
From: Tom Herbert <[email protected]>
Date: Fri, 13 Mar 2009 10:06:56 -0700
> You'll definitely want to look at the hardware provided hash. We've
> been using a 10G NIC which provides a Toeplitz hash (the one defined
> by Microsoft) and a software RSS-like capability to move packets from
> an interrupting CPU to another for processing. The hash could be used
> to index to a set of CPUs, but we also use the hash as a connection
> identifier to key into a lookup table to steer packets to the CPU
> where the application is running based on the running CPU of the last
> recvmsg. Using the device provided hash in this manner is a HUGE win,
> as opposed to taking cache misses to get 4-tuple from packet itself to
> compute a hash. I posted some patches a while back on our work if
> you're interested.
I never understood this.
If you don't let the APIC move the interrupt around, the individual
MSI-X interrupts will steer packets to individual specific CPUS and as
a result the scheduler will migrate tasks over to those cpus since the
wakeup events keep occurring there.
On Fri, Mar 13, 2009 at 11:51 AM, David Miller <[email protected]> wrote:
>
> From: Tom Herbert <[email protected]>
> Date: Fri, 13 Mar 2009 10:06:56 -0700
>
> > You'll definitely want to look at the hardware provided hash. We've
> > been using a 10G NIC which provides a Toeplitz hash (the one defined
> > by Microsoft) and a software RSS-like capability to move packets from
> > an interrupting CPU to another for processing. The hash could be used
> > to index to a set of CPUs, but we also use the hash as a connection
> > identifier to key into a lookup table to steer packets to the CPU
> > where the application is running based on the running CPU of the last
> > recvmsg. Using the device provided hash in this manner is a HUGE win,
> > as opposed to taking cache misses to get 4-tuple from packet itself to
> > compute a hash. I posted some patches a while back on our work if
> > you're interested.
>
> I never understood this.
>
> If you don't let the APIC move the interrupt around, the individual
> MSI-X interrupts will steer packets to individual specific CPUS and as
> a result the scheduler will migrate tasks over to those cpus since the
> wakeup events keep occuring there.
We are trying to follow the scheduler's decisions as opposed to leading
it. This works on very loaded systems, with applications binding to
cpusets, with threads that are receiving on multiple sockets. I
suppose it might be compelling if a NIC could steer packets per flow,
instead of by a hash...
From: Tom Herbert <[email protected]>
Date: Fri, 13 Mar 2009 13:58:53 -0700
> We are trying to follow the decisions scheduler as opposed to
> leading it. This works on very loaded systems, with applications
> binding to cpusets, with threads that are receiving on multiple
> sockets. I suppose it might be compelling if a NIC could steer
> packets per flow, instead of by a hash...
If the hash is good it will distribute the load properly.
If the NIC is sophisticated enough (Sun's Neptune chipset is)
you can even group interrupt distribution by traffic type
and even bind specific ports to interrupt groups.
I really detest all of these software hacks that add overhead
to solve problems the hardware can solve for us.
>
> If the hash is good it will distribute the load properly.
>
> If the NIC is sophisticated enough (Sun's Neptune chipset is)
> you can even group interrupt distribution by traffic type
> and even bind specific ports to interrupt groups.
>
> I really detest all of these software hacks that add overhead
> to solve problems the hardware can solve for us.
>
I appreciate this philosophy, but unfortunately I don't have the
luxury of working with a NIC that solves these problems. The reality
may be that we're trying to squeeze performance out of crappy hardware
to scale on multi-core. Left alone we couldn't get the stack to
scale, but with these "destable hacks" we've gotten 3X or so
improvement in packets per second across both our dumb 1G and 10G
NICs. These gains have translated into tangible application
performance gains, so we'll probably continue to have interest in this
area of development at least for the foreseeable future.
On Fri, 2009-03-13 at 14:01 -0700, Tom Herbert wrote:
> On Fri, Mar 13, 2009 at 11:51 AM, David Miller <[email protected]> wrote:
> >
> > From: Tom Herbert <[email protected]>
> > Date: Fri, 13 Mar 2009 10:06:56 -0700
> >
> > > You'll definitely want to look at the hardware provided hash. We've
> > > been using a 10G NIC which provides a Toeplitz hash (the one defined
> > > by Microsoft) and a software RSS-like capability to move packets from
> > > an interrupting CPU to another for processing. The hash could be used
> > > to index to a set of CPUs, but we also use the hash as a connection
> > > identifier to key into a lookup table to steer packets to the CPU
> > > where the application is running based on the running CPU of the last
> > > recvmsg. Using the device provided hash in this manner is a HUGE win,
> > > as opposed to taking cache misses to get 4-tuple from packet itself to
> > > compute a hash. I posted some patches a while back on our work if
> > > you're interested.
> >
> > I never understood this.
> >
> > If you don't let the APIC move the interrupt around, the individual
> > MSI-X interrupts will steer packets to individual specific CPUS and as
> > a result the scheduler will migrate tasks over to those cpus since the
> > wakeup events keep occuring there.
>
> We are trying to follow the decisions scheduler as opposed to leading
> it. This works on very loaded systems, with applications binding to
> cpusets, with threads that are receiving on multiple sockets. I
> suppose it might be compelling if a NIC could steer packets per flow,
> instead of by a hash...
Depending on the NIC, RX queue selection may be done using a large
number of bits of the hash value and an indirection table or by matching
against specific values in the headers. The SFC4000 supports both of
these, though limited to TCP/IPv4 and UDP/IPv4. I think Neptune may be
more flexible. Of course, both indirection table entries and filter
table entries will be limited resources in any NIC, so allocating these
wholly automatically is an interesting challenge.
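For concreteness, indirection-table based RX queue selection works roughly like the sketch
below: the NIC masks off some low-order bits of the hash, looks them up in a small table
programmed by the host, and the entry names the RX queue. The 128-entry size is only an
assumption here; real table sizes differ per NIC.

#include <stdint.h>

#define RSS_INDIR_SIZE	128

static uint8_t rss_indir_table[RSS_INDIR_SIZE];	/* entry = RX queue number */

/* spread the RX queues evenly over the table: 0, 1, ..., n-1, 0, 1, ... */
static void rss_indir_init(unsigned int num_rx_queues)
{
	unsigned int i;

	for (i = 0; i < RSS_INDIR_SIZE; i++)
		rss_indir_table[i] = i % num_rx_queues;
}

/* what the hardware effectively does for each received packet */
static unsigned int rss_select_rx_queue(uint32_t hash)
{
	return rss_indir_table[hash & (RSS_INDIR_SIZE - 1)];
}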
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
On Fri, 13 Mar 2009 22:10:59 +0000
Ben Hutchings <[email protected]> wrote:
> On Fri, 2009-03-13 at 14:01 -0700, Tom Herbert wrote:
> > On Fri, Mar 13, 2009 at 11:51 AM, David Miller <[email protected]> wrote:
> > >
> > > From: Tom Herbert <[email protected]>
> > > Date: Fri, 13 Mar 2009 10:06:56 -0700
> > >
> > > > You'll definitely want to look at the hardware provided hash. We've
> > > > been using a 10G NIC which provides a Toeplitz hash (the one defined
> > > > by Microsoft) and a software RSS-like capability to move packets from
> > > > an interrupting CPU to another for processing. The hash could be used
> > > > to index to a set of CPUs, but we also use the hash as a connection
> > > > identifier to key into a lookup table to steer packets to the CPU
> > > > where the application is running based on the running CPU of the last
> > > > recvmsg. Using the device provided hash in this manner is a HUGE win,
> > > > as opposed to taking cache misses to get 4-tuple from packet itself to
> > > > compute a hash. I posted some patches a while back on our work if
> > > > you're interested.
> > >
> > > I never understood this.
> > >
> > > If you don't let the APIC move the interrupt around, the individual
> > > MSI-X interrupts will steer packets to individual specific CPUS and as
> > > a result the scheduler will migrate tasks over to those cpus since the
> > > wakeup events keep occuring there.
> >
> > We are trying to follow the decisions scheduler as opposed to leading
> > it. This works on very loaded systems, with applications binding to
> > cpusets, with threads that are receiving on multiple sockets. I
> > suppose it might be compelling if a NIC could steer packets per flow,
> > instead of by a hash...
>
> Depending on the NIC, RX queue selection may be done using a large
> number of bits of the hash value and an indirection table or by matching
> against specific values in the headers. The SFC4000 supports both of
> these, though limited to TCP/IPv4 and UDP/IPv4. I think Neptune may be
> more flexible. Of course, both indirection table entries and filter
> table entries will be limited resources in any NIC, so allocating these
> wholly automatically is an interesting challenge.
>
> Ben.
>
The problem is that without hardware support, handing off the packet
may take more effort than processing it, especially when a cache line
has to bounce to another CPU while trying to keep up with DoS attacks.
It all depends how much processing is required, and the architecture
of the system. The tradeoff would change over time based on processing
speed and optimizing the receive/firewall code.
From: Tom Herbert <[email protected]>
Date: Fri, 13 Mar 2009 14:59:55 -0700
> I appreciate this philosophy, but unfortunately I don't have the
> luxury of working with a NIC that solves these problems. The reality
> may be that we're trying to squeeze performance out of crappy hardware
> to scale on multi-core. Left alone we couldn't get the stack to
> scale, but with these "destable hacks" we've gotten 3X or so
^^^^^^^^
Spelling.
> improvement in packets per second across both our dumb 1G and 10G
> NICs
Do these NICs at least support multiqueue?
On Fri, Mar 13, 2009 at 03:19:13PM -0700, David Miller wrote:
>
> > improvement in packets per second across both our dumb 1G and 10G
> > NICs
>
> Do these NICs at least support multiqueue?
I don't think they do. See the last paragraph in Tom's first
email.
I think we all agree that hacks such as these are only useful
for NICs that either don't support mq or whose number of rx
queues is too small.
The question is how much do we love these NICs :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>> I appreciate this philosophy, but unfortunately I don't have the
>> luxury of working with a NIC that solves these problems. The reality
>> may be that we're trying to squeeze performance out of crappy hardware
>> to scale on multi-core. Left alone we couldn't get the stack to
>> scale, but with these "destable hacks" we've gotten 3X or so
> ^^^^^^^^
>
> Spelling.
>
>> improvement in packets per second across both our dumb 1G and 10G
>> NICs
>
> Do these NICs at least support multiqueue?
>
Yes, we are using a 10G NIC that supports multi-queue. The number of
RX queues supported is half the number of cores on our platform, so
that is going to limit the parallelism. With multi-queue turned on we
do see about 4X improvement in pps over just using a single queue;
this is about the same improvement we see using a single queue with
our software steering techniques (this particular device provides the
Toeplitz hash). Enabling HW multi-queue has somewhat higher CPU
utilization, though; the extra device interrupt load is not coming for
free. We actually use the HW multi-queue in conjunction with our
software steering to get maximum pps (about 20% more).
> Yes, we are using a 10G NIC that supports multi-queue. The number of
> RX queues supported is half the number of cores on our platform, so
> that is going to limit the parallelism. With multi-queue turned on we
The standard wisdom is that you don't necessarily need to transmit
to each core, but rather to each shared mid or last level cache.
Once the data is cache hot (or cache near), distributing it further
in software is comparably cheap.
So this means you don't necessarily need as many queues as cores,
but more as many as big caches.
-Andi
--
[email protected] -- Speaking for myself only.
> We are trying to follow the decisions scheduler as opposed to leading it.
> This works on very loaded systems, with applications binding to cpusets,
One possible solution would be then to just not bind to cpusets and
give the scheduler the freedom it needs instead?
-Andi
--
[email protected] -- Speaking for myself only.
From: Tom Herbert <[email protected]>
Date: Fri, 13 Mar 2009 17:24:10 -0700
> Enabling HW multi-queue has somewhat higher CPU
> utilization though, the extra device interrupt load is not coming for
> free. We actually use the HW multi-queue in conjunction with our
> software steering to get maximum pps (about 20% more).
This is a non-intuitive observation. Using HW multiqueue should be
cheaper than doing it in software, right?
On Fri, Mar 13, 2009 at 07:19:51PM -0700, David Miller wrote:
> From: Tom Herbert <[email protected]>
> Date: Fri, 13 Mar 2009 17:24:10 -0700
>
> > Enabling HW multi-queue has somewhat higher CPU
> > utilization though, the extra device interrupt load is not coming for
> > free. We actually use the HW multi-queue in conjunction with our
> > software steering to get maximum pps (about 20% more).
>
> This is a non-intuitive observation. Using HW multiqueue should be
> cheaper than doing it in software, right?
Shared caches can play games with the numbers, we need to look
at this a bit more.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, Mar 13, 2009 at 7:19 PM, David Miller <[email protected]> wrote:
> From: Tom Herbert <[email protected]>
> Date: Fri, 13 Mar 2009 17:24:10 -0700
>
>> Enabling HW multi-queue has somewhat higher CPU
>> utilization though, the extra device interrupt load is not coming for
>> free. We actually use the HW multi-queue in conjunction with our
>> software steering to get maximum pps (about 20% more).
>
> This is a non-intuitive observation. Using HW multiqueue should be
> cheaper than doing it in software, right?
>
I suppose it may be counter-intuitive, but I am not making a general
claim. I would only suggest that these software hacks could be a very
good approximation or substitute for hardware functionality. This is
a generic way to get more performance out of deficient or lower end
NICs.
From: Tom Herbert <[email protected]>
Date: Sat, 14 Mar 2009 11:15:21 -0700
> I suppose it may be counter-intuitive, but I am not making a general
> claim. I would only suggest that these software hacks could be a very
> good approximation or substitute for hardware functionality. This is
> a generic way to get more performance out of deficient or lower end
> NICs.
They certainly could. Why don't you post the current version
of your patches so we have something concrete to discuss?
On Fri, 2009-03-13 at 10:06 -0700, Tom Herbert wrote:
> On Thu, Mar 12, 2009 at 11:43 PM, Zhang, Yanmin
> <[email protected]> wrote:
> >
> > On Thu, 2009-03-12 at 14:08 +0000, Ben Hutchings wrote:
> > > On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote:
> > > > On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote:
> > > Yes, that's exactly what they do. This feature is sometimes called
> > > Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft
> > > requires Windows drivers performing RSS to provide the hash value to the
> > > networking stack, so Linux drivers for the same hardware should be able
> > > to do so too.
> > Oh, I didn't know the background. I need study more about network.
> > Thanks for explain it.
> >
>
> You'll definitely want to look at the hardware provided hash. We've
> been using a 10G NIC which provides a Toeplitz hash (the one defined
> by Microsoft) and a software RSS-like capability to move packets from
> an interrupting CPU to another for processing. The hash could be used
> to index to a set of CPUs, but we also use the hash as a connection
> identifier to key into a lookup table to steer packets to the CPU
> where the application is running based on the running CPU of the last
> recvmsg.
Your scenario is different from mine. My case is ip_forward, which happens
in the kernel, so no application participates in the forwarding.
I might test application communication on a 10G NIC with my method later.
On Sat, Mar 14, 2009 at 11:45 AM, David Miller <[email protected]> wrote:
>
> From: Tom Herbert <[email protected]>
> Date: Sat, 14 Mar 2009 11:15:21 -0700
>
> > I suppose it may be counter-intuitive, but I am not making a general
> > claim. I would only suggest that these software hacks could be a very
> > good approximation or substitute for hardware functionality. This is
> > a generic way to get more performance out of deficient or lower end
> > NICs.
>
> They certainly could. Why don't you post the current version
> of your patches so we have something concrete to discuss?
I'll do that.