From: PJ Waskiewicz <[email protected]>
This patch adds the CPU index to the upper 8 bits of the queue_mapping in
the skb. This will support 256 CPUs and 256 Tx queues. The reason for
this is that the qdisc layer can obscure which CPU is generating a certain flow
of packets, so network drivers don't have any insight into which CPU generated a
particular packet. If the driver knows which CPU generated the packet,
then it could adjust Rx filtering in the hardware to redirect the packet
back to the CPU that owns the process that generated this packet.
Preventing the cache miss and reschedule of a process to a different CPU is
a big win in network performance, especially at 10 gigabit speeds.
Signed-off-by: PJ Waskiewicz <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
---
net/sched/sch_prio.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 4aa2b45..84bbd10 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -86,6 +86,7 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
}
#endif
+ skb->queue_mapping |= (get_cpu() << 8);
if ((ret = qdisc->enqueue(skb, qdisc)) == NET_XMIT_SUCCESS) {
sch->bstats.bytes += skb->len;
sch->bstats.packets++;
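For illustration, a minimal sketch of the encoding the hunk above performs on
enqueue, written as a standalone helper. The helper name is hypothetical and not
part of the patch; it assumes CONFIG_NETDEVICES_MULTIQUEUE so that the 16-bit
skb->queue_mapping field exists.

#include <linux/skbuff.h>
#include <linux/smp.h>

/* Hypothetical helper: fold the sending CPU's index into bits 15:8 of
 * the 16-bit queue_mapping, leaving the queue index in bits 7:0 alone.
 * get_cpu() disables preemption, so it is paired with put_cpu() here.
 */
static inline void skb_record_tx_cpu(struct sk_buff *skb)
{
	unsigned int cpu = get_cpu();

	skb->queue_mapping |= (cpu & 0xff) << 8;
	put_cpu();
}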
From: PJ Waskiewicz <[email protected]>
This patch makes the consumers of skb->queue_mapping use only the
lower 8 bits, so drivers can use the upper 8 bits for the CPU index.
Signed-off-by: PJ Waskiewicz <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>
---
drivers/net/ixgbe/ixgbe_main.c | 2 +-
include/linux/skbuff.h | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 0d37c90..7b39394 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -3244,7 +3244,7 @@ static int ixgbe_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
unsigned int nr_frags = skb_shinfo(skb)->nr_frags;
len -= skb->data_len;
#ifdef CONFIG_NETDEVICES_MULTIQUEUE
- r_idx = (adapter->num_tx_queues - 1) & skb->queue_mapping;
+ r_idx = (adapter->num_tx_queues - 1) & (skb->queue_mapping & 0xff);
#endif
tx_ring = &adapter->tx_ring[r_idx];
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 299ec4b..0246827 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1672,14 +1672,14 @@ static inline void skb_init_secmark(struct sk_buff *skb)
static inline void skb_set_queue_mapping(struct sk_buff *skb, u16 queue_mapping)
{
#ifdef CONFIG_NETDEVICES_MULTIQUEUE
- skb->queue_mapping = queue_mapping;
+ skb->queue_mapping = (queue_mapping & 0xff);
#endif
}
static inline u16 skb_get_queue_mapping(struct sk_buff *skb)
{
#ifdef CONFIG_NETDEVICES_MULTIQUEUE
- return skb->queue_mapping;
+ return (skb->queue_mapping & 0xff);
#else
return 0;
#endif
@@ -1688,7 +1688,7 @@ static inline u16 skb_get_queue_mapping(struct sk_buff *skb)
static inline void skb_copy_queue_mapping(struct sk_buff *to, const struct sk_buff *from)
{
#ifdef CONFIG_NETDEVICES_MULTIQUEUE
- to->queue_mapping = from->queue_mapping;
+ to->queue_mapping = (from->queue_mapping & 0xff);
#endif
}
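For completeness, a sketch of how a driver could consume the upper 8 bits on the
transmit path under this convention. The accessor names are hypothetical; they are
not part of ixgbe or of this patch, and they also assume CONFIG_NETDEVICES_MULTIQUEUE.
A real driver would use the recovered CPU index to program an Rx filter so that
return traffic for the flow lands on that CPU's Rx queue.

#include <linux/skbuff.h>

/* Hypothetical accessors splitting the 16-bit queue_mapping:
 * queue index in bits 7:0, originating CPU index in bits 15:8.
 */
static inline u16 example_skb_tx_queue(const struct sk_buff *skb)
{
	return skb->queue_mapping & 0xff;
}

static inline u16 example_skb_tx_cpu(const struct sk_buff *skb)
{
	return skb->queue_mapping >> 8;
}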
From: Jeff Kirsher <[email protected]>
Date: Tue, 24 Jun 2008 16:27:12 -0700
> From: PJ Waskiewicz <[email protected]>
>
> This patch adds the CPU index to the upper 8 bits of the queue_mapping in
> the skb. This will support 256 CPUs and 256 Tx queues. The reason for
> this is that the qdisc layer can obscure which CPU is generating a certain flow
> of packets, so network drivers don't have any insight into which CPU generated a
> particular packet. If the driver knows which CPU generated the packet,
> then it could adjust Rx filtering in the hardware to redirect the packet
> back to the CPU that owns the process that generated this packet.
> Preventing the cache miss and reschedule of a process to a different CPU is
> a big win in network performance, especially at 10 gigabit speeds.
>
> Signed-off-by: PJ Waskiewicz <[email protected]>
> Signed-off-by: Jeff Kirsher <[email protected]>
Well:
1) All of this multiqueue stuff is going away with my changes in
net-tx-2.6
2) This CPU number is going to be wrong half of the time.
For #2, when TCP is processing ACKs, it sends packets out from
whichever CPU the ACK landed on.
So you could end up retargeting the RX flow back and forth between
cpus. The cpu that sends the packet into the device layer is far
from being the cpu that the RX flow should be targeted at.
So this idea is pretty seriously flawed and the infrastructure it's
using is going away anyways.
It also adds all sorts of arbitrary limitations (256 queues and cpus? I
already have systems coming to me with more than 256 cpus).
>Well:
>
>1) All of this multiqueue stuff is going away with my changes in
> net-tx-2.6
Do you have a guesstimate of which kernel you'll be targeting these
changes for? I'm very interested in this schedule.
>2) This CPU number is going to be wrong half of the time.
>
>For #2, when TCP is processing ACKs, it sends packets out from
>whichever CPU the ACK landed on.
>
>So you could end up retargeting the RX flow back and forth between
>cpus. The cpu that sends the packet into the device layer is far
>from being the cpu that the RX flow should be targeted at.
>
>So this idea is pretty seriously flawed and the infrastructure it's
>using is going away anyways.
The whole point is to identify the flow and redirect your Rx filtering
to the CPU it came from. So if you did this properly, the ACKs would be
landing on the CPU that originally opened the TCP flow. I don't follow
why the scheduler would move the process if the Rx traffic is returning
to that core.
I also understand the infrastructure is going away. But without knowing
when things are being changed, we can't be expected to delay sending in
patches to this area of the kernel if there is something we see a
potential benefit for. We have seen benefits in our labs from keeping
flows on dedicated CPUs and Tx/Rx queue pairs. I think the concept is
also sensible. But if my method of getting it working in the kernel is
flawed, then that's fine. Is there an alternative we can use, or is
this something we can influence the new Tx flow design with?
>It also adds all sorts of arbitrary limitations (256 queues and cpus? I
>already have systems coming to me with more than 256 cpus).
I was trying to fit this into existing infrastructure to support the
more common case, which is probably 4-64 CPUs. Plus I'm not aware of
any NICs out there that support upwards of 256 Tx queues in the physical
function, but I guess I wouldn't be surprised if there are some on the
market.
-PJ Waskiewicz
From: "Waskiewicz Jr, Peter P" <[email protected]>
Date: Wed, 25 Jun 2008 11:31:01 -0700
> >Well:
> >
> >1) All of this multiqueue stuff is going away with my changes in
> > net-tx-2.6
>
> Do you have a guesstimate of which kernel you'll be targeting these
> changes for? I'm very interested in this schedule.
I want it to hit 2.6.27
> The whole point is to identify the flow and redirect your Rx filtering
> to the CPU it came from. So if you did this properly, the ACKs would be
> landing on the CPU that originally opened the TCP flow. I don't follow
> why the scheduler would move the process if the Rx traffic is returning
> to that core.
The process scheduler monitors where wakeup events occur.
If they predominantly arrive on a different cpu from where
the process is currently scheduled, the scheduler might decide
to move the process there.
This RX retargeting idea is also flawed from another perspective.
Let's say you have a server thread polling on a lot of
connections. And let's say that all of these connections
become extremely active all of a sudden.
If you RX steer all of those flows to the same cpu, we might
as well not do any RX hashing at all. The whole point is
to distribute the network stack input processing across
the machine as much as possible.
If you try to steer RX traffic to cpus that processes are running
on, you will undo almost all of the intended benefit of RX hashing
in many common scenarios.
Don't second-guess the process scheduler we have; it will, or
should, do the right thing in these situations because it can
see what is going on in ways that the NIC and the networking
stack alone simply cannot. It knows about cache effects and tradeoffs
of moving tasks between cores and NUMA nodes.