On Tue, 2009-02-24 at 23:31 -0800, David Miller wrote:
> From: "Zhang, Yanmin" <[email protected]>
> Date: Wed, 25 Feb 2009 15:20:23 +0800
>
> > If a machine has a couple of NICs and every NIC has CPU_NUM queues,
> > binding them evenly might cause more cache misses/ping-pong. I didn't
> > test the multiple-receiving-NIC scenario as I couldn't get enough hardware.
>
> In the net-next-2.6 tree, since we mark incoming packets with
> skb_record_rx_queue() properly, we'll make a more favorable choice of
> TX queue.
Thanks for the pointer. I cloned the net-next-2.6 tree; skb_record_rx_queue is a
smart way to implement automatic TX queue selection.
There is no NIC multi-queue standard or RFC available, at least none that I could
find with Google.
Both the new skb_record_rx_queue and the current kernel make an assumption about
multi-queue: if the received packets are related to the outgoing packets, it is
best to send them out on the TX queue with the same index as the RX queue they
arrived on. Put more directly, we should send packets on the same cpu on which we
receive them. The starting point is that this reduces skb and data cache misses.
With a slow NIC the assumption holds. But with a high-speed NIC, especially a 10G
NIC, it no longer seems to.
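For reference, here is a minimal sketch of what a multiqueue driver's RX path does
with skb_record_rx_queue(); the ring structure and names are illustrative only, not
taken from any particular driver:

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	/* Illustrative per-ring state; real drivers keep much more. */
	struct example_rx_ring {
		struct napi_struct napi;
		u16 queue_index;	/* index of this RX queue */
	};

	static void example_receive_skb(struct example_rx_ring *ring,
					struct sk_buff *skb)
	{
		/* Remember which RX queue the packet arrived on, so the
		 * stack can later pick the TX queue with the same index
		 * when forwarding. */
		skb_record_rx_queue(skb, ring->queue_index);
		netif_receive_skb(skb);
	}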
Here is a simple calculation based on real test data from a Nehalem machine and a
Bensley machine. Two machines are used, with the traffic driven by pktgen.
                 send packets
   Machine A  ==============>  Machine B
              <==============
               forward packets back
With the Nehalem machines, I can get 4 million pps (packets per second) with 60-byte
packets, so the rate is about 240 MBytes/s. Nehalem has 2 sockets; every socket has
4 cores and 8 logical cpus, and all logical cpus of a socket share an 8 MByte
last-level cache. That means every physical cpu receives 120 MBytes per second,
about 15 times the last-level cache size.
With the Bensley machine, I can get 1.2M pps, or 72 MBytes/s. That machine has 2
sockets and every socket has a quad-core cpu; each dual-core pair shares a 6 MByte
last-level cache. That means every dual-core pair gets 18 MBytes per second, which
is 3 times the last-level cache size.
So with both Bensley and Nehalem, the cache is flushed very quickly in 10G NIC testing.
Some other kinds of machines have bigger caches. For example, my Montvale Itanium has
2 sockets; every socket has a dual-core cpu with multi-threading, and each core pair
shares a 12 MByte last-level cache. But even that cache is still flushed at least
twice per second.
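To make the arithmetic easy to check, here is a small stand-alone user-space C
program that redoes the calculation with the measured numbers quoted above:

	#include <stdio.h>

	int main(void)
	{
		struct {
			const char *name;
			double pps;		/* measured forwarding rate */
			double cache_domains;	/* number of shared last-level caches */
			double llc_mb;		/* size of one last-level cache, MB */
		} m[] = {
			{ "Nehalem", 4.0e6, 2, 8.0 },	/* one 8MB LLC per socket */
			{ "Bensley", 1.2e6, 4, 6.0 },	/* one 6MB LLC per core pair */
		};
		double pkt_bytes = 60.0;
		int i;

		for (i = 0; i < 2; i++) {
			double mb_per_s = m[i].pps * pkt_bytes / m[i].cache_domains / 1e6;

			printf("%s: %.0f MB/s per cache domain, %.1fx the LLC per second\n",
			       m[i].name, mb_per_s, mb_per_s / m[i].llc_mb);
		}
		return 0;
	}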
If we check the NIC drivers, we can see that they touch only a limited set of sk_buff
fields when collecting packets from the NIC.
It is said that 20G and 30G NICs are already being produced.
So with a high-speed 10G NIC, the old assumption no longer seems to work.
On the other hand, which part causes the biggest cache footprint and the most cache
misses? I don't think the drivers do, because the receiving cpu only touches a few
fields of the sk_buff before passing it to the upper layer.
My patch throws packets to a specific cpu controlled by configuration, which doesn't
cause much cache ping-pong. After the receiving cpu hands packets to the 2nd cpu, it
doesn't need them again. The 2nd cpu takes cache misses, but that doesn't cause cache
ping-pong.
My patch doesn't always conflict with skb_record_rx_queue:
1) It can be configured by the admin;
2) We can call skb_record_rx_queue or a similar function on the 2nd cpu (the cpu that
really processes the packets in process_backlog), so the cache footprint isn't wasted
later when the packets are forwarded out; see the sketch below.
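A minimal sketch of idea 2), assuming a hypothetical helper for the cpu-to-queue
mapping (in a real patch the mapping would come from the admin configuration):

	#include <linux/skbuff.h>
	#include <linux/smp.h>

	/* Hypothetical helper: map the current cpu to a queue index. */
	static u16 cpu_to_queue_hint(int cpu, unsigned int nr_queues)
	{
		return (u16)(cpu % nr_queues);
	}

	/* When the 2nd cpu picks the packet up (e.g. in process_backlog),
	 * stamp it with a queue index matching that cpu, so the later TX
	 * queue selection keeps the cache footprint on this cpu. */
	static void backlog_stamp_rx_queue(struct sk_buff *skb,
					   unsigned int nr_queues)
	{
		skb_record_rx_queue(skb,
				    cpu_to_queue_hint(smp_processor_id(), nr_queues));
	}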
>
> You may want to figure out why that isn't behaving well in your
> case.
I did check the kernel, including slab tuning (I tried slab/slub/slqb and use slub
now), and instrumented the IXGBE driver. Besides careful multi-queue/interrupt
binding, the other approach is simply to use my patch, which improves throughput by
more than 40% on both Nehalem and Bensley.
>
> I don't think we should do any kind of software spreading for such
> capable hardware; it defeats the whole point of supporting the
> multiqueue features.
There is no NIC multi-queue standard or RFC.
Jesse is worried that we might allocate free cores for packet collection while a real
environment keeps all cpus busy. I added more pressure on the sending machine and got
better performance on the forwarding machine, whose cpus are busier than before; the
idle time of some logical cpus is close to 0. But I only have a couple of 10G NICs
and couldn't add enough pressure to make all cpus busy.
Thanks again, for your comments and patience.
Yanmin
From: "Zhang, Yanmin" <[email protected]>
Date: Wed, 04 Mar 2009 17:27:48 +0800
> Both the new skb_record_rx_queue and the current kernel make an
> assumption about multi-queue: if the received packets are related
> to the outgoing packets, it is best to send them out on the TX
> queue with the same index as the RX queue they arrived on. Put
> more directly, we should send packets on the same cpu on which we
> receive them. The starting point is that this reduces skb and data
> cache misses.
We have to use the same TX queue for all packets for the same
connection flow (same src/dst IP address and ports) otherwise
we introduce reordering.
Herbert brought this up, now I have explicitly brought this up,
and you cannot ignore this issue.
You must not knowingly reorder packets, and using different TX
queues for packets within the same flow does that.
On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> From: "Zhang, Yanmin" <[email protected]>
> Date: Wed, 04 Mar 2009 17:27:48 +0800
>
> > Both the new skb_record_rx_queue and the current kernel make an
> > assumption about multi-queue: if the received packets are related
> > to the outgoing packets, it is best to send them out on the TX
> > queue with the same index as the RX queue they arrived on. Put
> > more directly, we should send packets on the same cpu on which we
> > receive them. The starting point is that this reduces skb and data
> > cache misses.
>
> We have to use the same TX queue for all packets for the same
> connection flow (same src/dst IP address and ports) otherwise
> we introduce reordering.
> Herbert brought this up, now I have explicitly brought this up,
> and you cannot ignore this issue.
Thanks. Stephen Hemminger brought it up and explained what reordering is. I answered
in a reply (sorry for not being clear) that mostly we need to spread packets among
RX/TX in a 1:1 or N:1 mapping. For example, all packets received from RX 8 would
always be spread to TX 0.
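A minimal sketch of such a fixed mapping (the table values just illustrate 1:1 plus
N:1 folding, they are not taken from my patch):

	#include <linux/kernel.h>
	#include <linux/types.h>

	/* Fixed RX->TX mapping: RX 0..7 -> TX 0..7 (1:1), RX 8..15 fold
	 * back onto TX 0..7 (N:1).  Because a given RX queue always maps
	 * to the same TX queue, the spreading itself cannot reorder a flow. */
	static const u16 rx_to_tx_map[16] = {
		0, 1, 2, 3, 4, 5, 6, 7,
		0, 1, 2, 3, 4, 5, 6, 7,
	};

	static u16 map_rx_to_tx(u16 rx_queue)
	{
		return rx_to_tx_map[rx_queue % ARRAY_SIZE(rx_to_tx_map)];
	}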
>
> You must not knowingly reorder packets, and using different TX
> queues for packets within the same flow does that.
Thanks for your explanation, which is consistent with Stephen's.
On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> > From: "Zhang, Yanmin" <[email protected]>
> > Date: Wed, 04 Mar 2009 17:27:48 +0800
> >
> > > Both the new skb_record_rx_queue and the current kernel make an
> > > assumption about multi-queue: if the received packets are related
> > > to the outgoing packets, it is best to send them out on the TX
> > > queue with the same index as the RX queue they arrived on. Put
> > > more directly, we should send packets on the same cpu on which we
> > > receive them. The starting point is that this reduces skb and data
> > > cache misses.
> >
> > We have to use the same TX queue for all packets for the same
> > connection flow (same src/dst IP address and ports) otherwise
> > we introduce reordering.
> > Herbert brought this up, now I have explicitly brought this up,
> > and you cannot ignore this issue.
> Thanks. Stephen Hemminger brought it up and explained what reordering
> is. I answered in a reply (sorry for not being clear) that mostly we
> need to spread packets among RX/TX in a 1:1 or N:1 mapping. For example,
> all packets received from RX 8 would always be spread to TX 0.
To make it clearer, I used a 1:1 mapping binding when running the tests on Bensley
(4*2 cores) and Nehalem (2*4*2 logical cpus), so there is no reordering issue. I also
worked out a new patch for the failover path that just drops packets when qlen is
bigger than netdev_max_backlog, so the failover path can't cause reordering.
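That failover change amounts to something like the following sketch (the backlog
queue argument is illustrative; it stands for wherever my patch hands the skb to the
other cpu):

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	/* Sketch only: drop instead of queueing once the remote cpu's
	 * backlog is over netdev_max_backlog, so the failover path can
	 * never become a second, reordering delivery path. */
	static int example_failover_enqueue(struct sk_buff_head *backlog,
					    struct sk_buff *skb)
	{
		if (skb_queue_len(backlog) > netdev_max_backlog) {
			kfree_skb(skb);
			return NET_RX_DROP;
		}

		skb_queue_tail(backlog, skb);
		return NET_RX_SUCCESS;
	}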
>
>
> >
> > You must not knowingly reorder packets, and using different TX
> > queues for packets within the same flow does that.
> Thanks for your explanation, which is consistent with Stephen's.
2009/3/5, Zhang, Yanmin <[email protected]>:
> On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> > > From: "Zhang, Yanmin" <[email protected]>
> > > Date: Wed, 04 Mar 2009 17:27:48 +0800
> > >
> > > > Both the new skb_record_rx_queue and the current kernel make an
> > > > assumption about multi-queue: if the received packets are related
> > > > to the outgoing packets, it is best to send them out on the TX
> > > > queue with the same index as the RX queue they arrived on. Put
> > > > more directly, we should send packets on the same cpu on which we
> > > > receive them. The starting point is that this reduces skb and data
> > > > cache misses.
> > >
> > > We have to use the same TX queue for all packets for the same
> > > connection flow (same src/dst IP address and ports) otherwise
> > > we introduce reordering.
> > > Herbert brought this up, now I have explicitly brought this up,
> > > and you cannot ignore this issue.
> > Thanks. Stephen Hemminger brought it up and explained what reordering
> > is. I answered in a reply (sorry for not being clear) that mostly we
> > need to spread packets among RX/TX in a 1:1 or N:1 mapping. For example,
> > all packets received from RX 8 would always be spread to TX 0.
>
> To make it clearer, I used a 1:1 mapping binding when running the tests on Bensley
> (4*2 cores) and Nehalem (2*4*2 logical cpus), so there is no reordering issue. I also
> worked out a new patch for the failover path that just drops packets when qlen is
> bigger than netdev_max_backlog, so the failover path can't cause reordering.
>
We have not seen this problem in our testing.
We do keep the skb processing on the same CPU from RX to TX.
This is done by setting affinity for the queues and using a custom select_queue:
+static u16 select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	if (dev->real_num_tx_queues && skb_rx_queue_recorded(skb))
+		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+
+	return smp_processor_id() % dev->real_num_tx_queues;
+}
+
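(For reference, in a 2.6.29-era driver such a callback would typically be wired up
through net_device_ops, assuming the two-argument ndo_select_queue signature of that
time; the stub names below are made up for illustration, not our actual driver:)

	#include <linux/netdevice.h>

	/* Stand-in for the driver's real transmit routine. */
	static int example_start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		dev_kfree_skb_any(skb);
		return NETDEV_TX_OK;
	}

	static const struct net_device_ops example_netdev_ops = {
		.ndo_start_xmit   = example_start_xmit,
		.ndo_select_queue = select_queue,	/* the callback shown above */
	};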
The hash based default for selecting TX-queue generates an uneven
spread that is hard to follow with correct affinity.
We have not been able to generate quite as much traffic from the sender.
Sender: (64 byte pkts)
eth5 4.5 k bit/s 3 pps 1233.9 M bit/s 2.632 M pps
Router:
eth0 1077.2 M bit/s 2.298 M pps 1.7 k bit/s 1 pps
eth1 744 bit/s 1 pps 1076.3 M bit/s 2.296 M pps
I'm not sure I like the proposed concept, since it decouples RX
processing from receiving.
There is no point collecting lots of packets just to drop them later
in the qdisc.
In fact this is bad for performance; we just consume cpu for nothing.
It is important to have as strong a correlation as possible between RX
and TX so we don't receive more packets than we can handle. Better to
drop on the interface.
We might start thinking of a way for userland to set the policy for
multiq mapping.
Cheers,
Jens Låås
> > >
> > > You must not knowingly reorder packets, and using different TX
> > > queues for packets within the same flow does that.
> > Thanks for your explanation, which is consistent with Stephen's.
>
>
On Thu, 2009-03-05 at 08:32 +0100, Jens Låås wrote:
> 2009/3/5, Zhang, Yanmin <[email protected]>:
> > On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> > > > From: "Zhang, Yanmin" <[email protected]>
> > > > Date: Wed, 04 Mar 2009 17:27:48 +0800
> > > >
> > > > > Both the new skb_record_rx_queue and the current kernel make an
> > > > > assumption about multi-queue: if the received packets are related
> > > > > to the outgoing packets, it is best to send them out on the TX
> > > > > queue with the same index as the RX queue they arrived on. Put
> > > > > more directly, we should send packets on the same cpu on which we
> > > > > receive them. The starting point is that this reduces skb and data
> > > > > cache misses.
> > > >
> > > > We have to use the same TX queue for all packets for the same
> > > > connection flow (same src/dst IP address and ports) otherwise
> > > > we introduce reordering.
> > > > Herbert brought this up, now I have explicitly brought this up,
> > > > and you cannot ignore this issue.
> > > Thanks. Stephen Hemminger brought it up and explained what reordering
> > > is. I answered in a reply (sorry for not being clear) that mostly we
> > > need to spread packets among RX/TX in a 1:1 or N:1 mapping. For example,
> > > all packets received from RX 8 would always be spread to TX 0.
> >
> > To make it clearer, I used a 1:1 mapping binding when running the tests on
> > Bensley (4*2 cores) and Nehalem (2*4*2 logical cpus), so there is no reordering
> > issue. I also worked out a new patch for the failover path that just drops
> > packets when qlen is bigger than netdev_max_backlog, so the failover path can't
> > cause reordering.
> >
>
> We have not seen this problem in our testing.
Thanks for your valuable input. We need more data on high-speed NICs.
> We do keep the skb processing on the same CPU from RX to TX.
That's the usual approach. I did the same when I began investigating why the
forwarding speed is far lower than the sending speed with a 10G NIC.
> This is done by setting affinity for the queues and using a custom select_queue:
>
> +static u16 select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +	if (dev->real_num_tx_queues && skb_rx_queue_recorded(skb))
> +		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +
> +	return smp_processor_id() % dev->real_num_tx_queues;
> +}
> +
Yes, with this function, and with every NIC having CPU_NUM queues, the skb is
processed on the same cpu from RX to TX.
>
> The hash based default for selecting TX-queue generates an uneven
> spread that is hard to follow with correct affinity.
>
> We have not been able to generate quite as much traffic from the sender.
pktgen in the latest kernel supports multiple threads on the same device. If you
start just one thread the speed is limited. Could you try 4 or 8 threads? The speed
might double.
>
> Sender: (64 byte pkts)
> eth5 4.5 k bit/s 3 pps 1233.9 M bit/s 2.632 M pps
I'm a little confused by the data. Do the first two columns mean IN and the last two
mean OUT? What kind of NICs and machines are these? How big is the last-level cache
of the cpus?
>
> Router:
> eth0 1077.2 M bit/s 2.298 M pps 1.7 k bit/s 1 pps
> eth1 744 bit/s 1 pps 1076.3 M bit/s 2.296 M pps
The forwarding speed is quite close to the sending speed of the sender, so it seems
your machine doesn't need my patch.
In my original case, the forwarding speed is 1.4M pps with careful cpu binding that
takes cpu cache sharing into account, while the sending speed is 2.36M pps. With my
patch the forwarding result becomes 2M pps. The NICs I am using are not the latest.
>
> I'm not sure I like the proposed concept, since it decouples RX
> processing from receiving.
> There is no point collecting lots of packets just to drop them later
> in the qdisc.
> In fact this is bad for performance; we just consume cpu for nothing.
Yes, if the cpu that processes the skbs is very busy and we choose to drop skbs there
rather than in the driver or the NIC hardware, performance might be worse.
A small change to my patch and the driver could reduce that possibility: check qlen
before collecting the 64 packets (assuming the driver collects 64 packets per NAPI
loop). If qlen is larger than netdev_max_backlog, the driver could just return
without doing any real collection.
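As a sketch, in a generic NAPI poll routine (example_backlog_len and
example_enable_rx_irq are hypothetical placeholders for the patch's backlog-length
lookup and the driver's interrupt re-enable):

	#include <linux/netdevice.h>

	/* Hypothetical placeholders, for illustration only. */
	static unsigned int example_backlog_len(void) { return 0; }
	static void example_enable_rx_irq(struct napi_struct *napi) { }

	static int example_poll(struct napi_struct *napi, int budget)
	{
		int work = 0;

		/* If the target cpu's backlog is already over the limit, do
		 * not pull packets off the ring only to drop them later:
		 * finish NAPI and let the NIC drop in hardware when its RX
		 * ring fills up. */
		if (example_backlog_len() > netdev_max_backlog) {
			napi_complete(napi);
			example_enable_rx_irq(napi);
			return 0;
		}

		/* ... normal receive loop, collecting up to `budget` packets
		 * and handing them to the chosen cpu ... */
		return work;
	}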
We need data to tell which approach is better.
> It is important to have as strong a correlation as possible between RX
> and TX so we don't receive more packets than we can handle. Better to
> drop on the interface.
With my small change above, the interface would drop the packets.
>
> We might start thinking of a way for userland to set the policy for
> multiq mapping.
I also think so.
I did more testing with different slab allocators, as the slab allocator has a big
impact on performance. SLQB behaves very differently from SLUB; it seems SLQB (try2)
needs improvements in NUMA allocation/free. For now I use slub_min_objects=64 and
slub_max_order=6 to get the best result on my machine.
Thanks for your comments.
> > > >
> > > > You must not knowingly reorder packets, and using different TX
> > > > queues for packets within the same flow does that.
> > > Thanks for your explanation, which is consistent with Stephen's.
> >