2007-02-09 00:23:15

by Kok, Auke

Subject: [PATCH 0/2 REVIEW] Multiple transmit/receive queue kernel


Hi,

The following two patches are submitted for review. Please provide feedback
and comments as usual.

This set of patches implements a generic API that gives multiqueue-capable
devices network stack support for assigning flows to individual queues
on the NIC. It provides an interface that lets non-multiqueue devices
coexist in a machine alongside multiqueue devices, and it adds a
modified queueing discipline that maps flows in the stack to individual
queues on the NIC. This is needed to alleviate head-of-line blocking
between flows of different priority: today QoS only prevents HOLB
within the stack, not in the device.
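
For illustration, here is a minimal, hedged sketch of the driver-facing
side of that API. The names NETIF_F_MULTI_QUEUE, hard_start_subqueue_xmit
and the netif_*_subqueue helpers come from patch 1/2; the example_*
functions are hypothetical placeholders for real driver code.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical per-queue ring-full test; a real driver would check its
 * hardware descriptor ring for the given queue. */
static int example_ring_full(struct net_device *dev, int queue)
{
        return 0;
}

/* Transmit on one hardware queue; reached via dev->hard_start_subqueue_xmit */
static int example_subqueue_xmit(struct sk_buff *skb,
                                 struct net_device *dev, int queue)
{
        /* post skb to hardware ring 'queue' here (driver specific) */

        /* Flow control is per subqueue: stop only this queue when its
         * ring fills; the TX clean path later calls
         * netif_wake_subqueue(dev, queue). */
        if (example_ring_full(dev, queue))
                netif_stop_subqueue(dev, queue);

        dev_kfree_skb_any(skb);         /* placeholder for the sketch */
        return NETDEV_TX_OK;
}

static void example_setup(struct net_device *dev)
{
        dev->features |= NETIF_F_MULTI_QUEUE;
        dev->hard_start_subqueue_xmit = example_subqueue_xmit;
        /* dev->egress_subqueue and dev->egress_subqueue_count must also
         * be set up before register_netdevice(), as the e1000 patch does. */
}

The stack-side counterpart, where dev_queue_xmit() asks the qdisc's
map_queue hook for a queue index and locks that subqueue, is in patch 1/2.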

Some Intel PRO/1000 adapters support this feature in hardware, so this
series includes a patch to the e1000 driver that enables it on those
devices. You will need to set up the queues using tc. Limited per-queue
statistics are available through ethtool for devices that support
multiple queues.
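
For reference, one way to read those per-queue counters programmatically;
this is a hedged userspace sketch that assumes only the standard
SIOCETHTOOL ioctl interface (the same data the ethtool utility displays),
with error handling trimmed for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(int argc, char **argv)
{
        const char *ifname = (argc > 1) ? argv[1] : "eth0";
        struct ethtool_drvinfo drvinfo = { .cmd = ETHTOOL_GDRVINFO };
        struct ethtool_gstrings *strings;
        struct ethtool_stats *stats;
        struct ifreq ifr;
        unsigned int i, n;
        int fd;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

        ifr.ifr_data = (void *)&drvinfo;
        if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
                perror("ETHTOOL_GDRVINFO");
                return 1;
        }
        n = drvinfo.n_stats;

        strings = calloc(1, sizeof(*strings) + n * ETH_GSTRING_LEN);
        stats = calloc(1, sizeof(*stats) + n * sizeof(__u64));

        strings->cmd = ETHTOOL_GSTRINGS;
        strings->string_set = ETH_SS_STATS;
        strings->len = n;
        ifr.ifr_data = (void *)strings;
        ioctl(fd, SIOCETHTOOL, &ifr);

        stats->cmd = ETHTOOL_GSTATS;
        stats->n_stats = n;
        ifr.ifr_data = (void *)stats;
        ioctl(fd, SIOCETHTOOL, &ifr);

        /* With CONFIG_E1000_MQ this includes tx_queue_N_packets/bytes
         * and rx_queue_N_packets/bytes entries. */
        for (i = 0; i < n; i++)
                printf("%.*s: %llu\n", ETH_GSTRING_LEN,
                       (char *)strings->data + i * ETH_GSTRING_LEN,
                       (unsigned long long)stats->data[i]);

        free(strings);
        free(stats);
        close(fd);
        return 0;
}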

Regards,

Auke Kok
Peter Waskiewicz Jr.

---
These patches apply against commit b892afd1e60132a981b963929e352eabf3306ba2
of branch 'master' from torvalds/linux-2.6.

You can pull these patches through git:

git-pull git://lost.foo-projects.org/~ahkok/git/netdev-2.6 multiqueue

---
Summary:
Peter Waskiewicz Jr <[email protected]>:
NET: Multiple queue network device support
e1000: Implement the new kernel API for multiqueue TX support.

---
drivers/net/Kconfig | 13 +
drivers/net/e1000/e1000.h | 23 +++
drivers/net/e1000/e1000_ethtool.c | 43 ++++++
drivers/net/e1000/e1000_main.c | 269 +++++++++++++++++++++++++++++++-------
include/linux/netdevice.h | 73 ++++++++++
include/net/sch_generic.h | 3
net/Kconfig | 20 ++
net/core/dev.c | 51 +++++++
net/sched/sch_generic.c | 117 ++++++++++++++++
net/sched/sch_prio.c | 106 ++++++++++++++
10 files changed, 668 insertions(+), 50 deletions(-)


2007-02-09 00:23:37

by Kok, Auke

Subject: [PATCH 2/2] e1000: Implement the new kernel API for multiqueue TX support.


From: Peter Waskiewicz Jr. <[email protected]>

Several newer e1000 chipsets support multiple Rx and Tx queues. Most
commonly, the 82571 family and ESB2LAN support 2 Rx and 2 Tx queues.

Signed-off-by: Peter Waskiewicz Jr. <[email protected]>
Signed-off-by: Auke Kok <[email protected]>
---

drivers/net/Kconfig | 13 ++
drivers/net/e1000/e1000.h | 23 +++
drivers/net/e1000/e1000_ethtool.c | 43 ++++++
drivers/net/e1000/e1000_main.c | 269 +++++++++++++++++++++++++++++++------
4 files changed, 304 insertions(+), 44 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index ad92b6a..2d758ab 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -1988,6 +1988,19 @@ config E1000_DISABLE_PACKET_SPLIT

If in doubt, say N.

+config E1000_MQ
+ bool "Enable Tx/Rx Multiqueue Support (EXPERIMENTAL)"
+ depends on E1000 && NET_MULTI_QUEUE_DEVICE && EXPERIMENTAL
+ help
+ Say Y here if you want to enable multiqueue support for supported
+ e1000 devices. This will enable both transmit and receive queues
+ on devices that support them.
+
+ In order to fully utilize the Tx queue support, use the SCH_PRIO
+ queueing discipline.
+
+ If in doubt, say N.
+
source "drivers/net/ixp2000/Kconfig"

config MYRI_SBUS
diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 689f158..cfcdc9d 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -108,6 +108,10 @@ struct e1000_adapter;
#define E1000_MIN_RXD 80
#define E1000_MAX_82544_RXD 4096

+#ifdef CONFIG_E1000_MQ
+#define E1000_MAX_TX_QUEUES 4
+#endif
+
/* this is the size past which hardware will drop packets when setting LPE=0 */
#define MAXIMUM_ETHERNET_VLAN_SIZE 1522

@@ -168,6 +172,12 @@ struct e1000_buffer {
uint16_t next_to_watch;
};

+#ifdef CONFIG_E1000_MQ
+struct e1000_queue_stats {
+ uint64_t packets;
+ uint64_t bytes;
+};
+#endif

struct e1000_ps_page { struct page *ps_page[PS_PAGE_BUFFERS]; };
struct e1000_ps_page_dma { uint64_t ps_page_dma[PS_PAGE_BUFFERS]; };
@@ -188,9 +198,16 @@ struct e1000_tx_ring {
/* array of buffer information structs */
struct e1000_buffer *buffer_info;

+#ifdef CONFIG_E1000_MQ
+ /* for tx ring cleanup - needed for multiqueue */
+ spinlock_t tx_queue_lock;
+#endif
spinlock_t tx_lock;
uint16_t tdh;
uint16_t tdt;
+#ifdef CONFIG_E1000_MQ
+ struct e1000_queue_stats tx_stats;
+#endif
boolean_t last_tx_tso;
};

@@ -218,6 +235,9 @@ struct e1000_rx_ring {

uint16_t rdh;
uint16_t rdt;
+#ifdef CONFIG_E1000_MQ
+ struct e1000_queue_stats rx_stats;
+#endif
};

#define E1000_DESC_UNUSED(R) \
@@ -271,6 +291,9 @@ struct e1000_adapter {

/* TX */
struct e1000_tx_ring *tx_ring; /* One per active queue */
+#ifdef CONFIG_E1000_MQ
+ struct e1000_tx_ring **cpu_tx_ring; /* per-cpu */
+#endif
unsigned int restart_queue;
unsigned long tx_queue_len;
uint32_t txd_cmd;
diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c
index 44ebc72..c8c1500 100644
--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -105,7 +105,14 @@ static const struct e1000_stats e1000_gstrings_stats[] = {
{ "dropped_smbus", E1000_STAT(stats.mgpdc) },
};

+#ifdef CONFIG_E1000_MQ
+#define E1000_QUEUE_STATS_LEN \
+ (((struct e1000_adapter *)netdev->priv)->num_tx_queues + \
+ ((struct e1000_adapter *)netdev->priv)->num_rx_queues) \
+ * (sizeof(struct e1000_queue_stats) / sizeof(uint64_t))
+#else
#define E1000_QUEUE_STATS_LEN 0
+#endif
#define E1000_GLOBAL_STATS_LEN \
sizeof(e1000_gstrings_stats) / sizeof(struct e1000_stats)
#define E1000_STATS_LEN (E1000_GLOBAL_STATS_LEN + E1000_QUEUE_STATS_LEN)
@@ -1909,6 +1916,11 @@ e1000_get_ethtool_stats(struct net_device *netdev,
struct ethtool_stats *stats, uint64_t *data)
{
struct e1000_adapter *adapter = netdev_priv(netdev);
+#ifdef CONFIG_E1000_MQ
+ uint64_t *queue_stat;
+ int stat_count = sizeof(struct e1000_queue_stats) / sizeof(uint64_t);
+ int j, k;
+#endif
int i;

e1000_update_stats(adapter);
@@ -1917,12 +1929,29 @@ e1000_get_ethtool_stats(struct net_device *netdev,
data[i] = (e1000_gstrings_stats[i].sizeof_stat ==
sizeof(uint64_t)) ? *(uint64_t *)p : *(uint32_t *)p;
}
+#ifdef CONFIG_E1000_MQ
+ for (j = 0; j < adapter->num_tx_queues; j++) {
+ queue_stat = (uint64_t *)&adapter->tx_ring[j].tx_stats;
+ for (k = 0; k < stat_count; k++)
+ data[i + k] = queue_stat[k];
+ i += k;
+ }
+ for (j = 0; j < adapter->num_rx_queues; j++) {
+ queue_stat = (uint64_t *)&adapter->rx_ring[j].rx_stats;
+ for (k = 0; k < stat_count; k++)
+ data[i + k] = queue_stat[k];
+ i += k;
+ }
+#endif
/* BUG_ON(i != E1000_STATS_LEN); */
}

static void
e1000_get_strings(struct net_device *netdev, uint32_t stringset, uint8_t *data)
{
+#ifdef CONFIG_E1000_MQ
+ struct e1000_adapter *adapter = netdev_priv(netdev);
+#endif
uint8_t *p = data;
int i;

@@ -1937,6 +1966,20 @@ e1000_get_strings(struct net_device *netdev, uint32_t stringset, uint8_t *data)
ETH_GSTRING_LEN);
p += ETH_GSTRING_LEN;
}
+#ifdef CONFIG_E1000_MQ
+ for (i = 0; i < adapter->num_tx_queues; i++) {
+ sprintf(p, "tx_queue_%u_packets", i);
+ p += ETH_GSTRING_LEN;
+ sprintf(p, "tx_queue_%u_bytes", i);
+ p += ETH_GSTRING_LEN;
+ }
+ for (i = 0; i < adapter->num_rx_queues; i++) {
+ sprintf(p, "rx_queue_%u_packets", i);
+ p+= ETH_GSTRING_LEN;
+ sprintf(p, "rx_queue_%u_bytes", i);
+ p += ETH_GSTRING_LEN;
+ }
+#endif
/* BUG_ON(p - data != E1000_STATS_LEN * ETH_GSTRING_LEN); */
break;
}
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 619c892..a53f065 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -28,6 +28,10 @@

#include "e1000.h"
#include <net/ip6_checksum.h>
+#ifdef CONFIG_E1000_MQ
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#endif

char e1000_driver_name[] = "e1000";
static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -137,6 +141,9 @@ static void e1000_exit_module(void);
static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent);
static void __devexit e1000_remove(struct pci_dev *pdev);
static int e1000_alloc_queues(struct e1000_adapter *adapter);
+#ifdef CONFIG_E1000_MQ
+static void e1000_setup_queue_mapping(struct e1000_adapter *adapter);
+#endif
static int e1000_sw_init(struct e1000_adapter *adapter);
static int e1000_open(struct net_device *netdev);
static int e1000_close(struct net_device *netdev);
@@ -153,7 +160,13 @@ static void e1000_set_multi(struct net_device *netdev);
static void e1000_update_phy_info(unsigned long data);
static void e1000_watchdog(unsigned long data);
static void e1000_82547_tx_fifo_stall(unsigned long data);
+static int e1000_xmit_frame_ring(struct sk_buff *skb, struct net_device *netdev,
+ struct e1000_tx_ring *tx_ring);
static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+#ifdef CONFIG_E1000_MQ
+static int e1000_subqueue_xmit_frame(struct sk_buff *skb,
+ struct net_device *netdev, int queue);
+#endif
static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
static int e1000_change_mtu(struct net_device *netdev, int new_mtu);
static int e1000_set_mac(struct net_device *netdev, void *p);
@@ -547,6 +560,10 @@ e1000_up(struct e1000_adapter *adapter)
E1000_DESC_UNUSED(ring));
}

+#ifdef CONFIG_E1000_MQ
+ e1000_setup_queue_mapping(adapter);
+#endif
+
adapter->tx_queue_len = netdev->tx_queue_len;

#ifdef CONFIG_E1000_NAPI
@@ -900,7 +917,12 @@ e1000_probe(struct pci_dev *pdev,
pci_set_master(pdev);

err = -ENOMEM;
+#ifdef CONFIG_E1000_MQ
+ netdev = alloc_etherdev(sizeof(struct e1000_adapter) +
+ (sizeof(struct net_device_subqueue) * E1000_MAX_TX_QUEUES));
+#else
netdev = alloc_etherdev(sizeof(struct e1000_adapter));
+#endif
if (!netdev)
goto err_alloc_etherdev;

@@ -934,6 +956,9 @@ e1000_probe(struct pci_dev *pdev,
netdev->open = &e1000_open;
netdev->stop = &e1000_close;
netdev->hard_start_xmit = &e1000_xmit_frame;
+#ifdef CONFIG_E1000_MQ
+ netdev->hard_start_subqueue_xmit = &e1000_subqueue_xmit_frame;
+#endif
netdev->get_stats = &e1000_get_stats;
netdev->set_multicast_list = &e1000_set_multi;
netdev->set_mac_address = &e1000_set_mac;
@@ -1317,8 +1342,44 @@ e1000_sw_init(struct e1000_adapter *adapter)
hw->master_slave = E1000_MASTER_SLAVE;
}

+#ifdef CONFIG_E1000_MQ
+ /* Number of supported queues.
+ * TODO: It's assumed num_rx_queues >= num_tx_queues, since multi-rx
+ * queues are much more interesting. Is it worth coding for the
+ * possibility (however improbable) of num_tx_queues > num_rx_queues?
+ */
+ switch (hw->mac_type) {
+ case e1000_82571:
+ case e1000_82572:
+ case e1000_80003es2lan:
+ adapter->num_tx_queues = 2;
+ adapter->num_rx_queues = 2;
+ break;
+
+ default:
+ /* All hardware before 82571 only have 1 queue each for Rx/Tx.
+ * However, the 82571 family does not have MSI-X, so multi-
+ * queue isn't enabled.
+ * It'd be wise not to mess with this default case. :) */
+ adapter->num_tx_queues = 1;
+ adapter->num_rx_queues = 1;
+ netdev->egress_subqueue_count = 0;
+ break;
+ }
+ adapter->num_rx_queues = min(adapter->num_rx_queues, num_online_cpus());
+ adapter->num_tx_queues = min(adapter->num_tx_queues, num_online_cpus());
+
+ if ((adapter->num_tx_queues > 1) || (adapter->num_rx_queues > 1)) {
+ netdev->egress_subqueue = (struct net_device_subqueue *)((void *)adapter + sizeof(struct e1000_adapter));
+ netdev->egress_subqueue_count = adapter->num_tx_queues;
+ DPRINTK(DRV, INFO, "Multiqueue Enabled: RX queues = %u, "
+ "TX queues = %u\n", adapter->num_rx_queues,
+ adapter->num_tx_queues);
+ }
+#else
adapter->num_tx_queues = 1;
adapter->num_rx_queues = 1;
+#endif

if (e1000_alloc_queues(adapter)) {
DPRINTK(PROBE, ERR, "Unable to allocate memory for queues\n");
@@ -1334,13 +1395,16 @@ e1000_sw_init(struct e1000_adapter *adapter)
set_bit(__LINK_STATE_START, &adapter->polling_netdev[i].state);
}
spin_lock_init(&adapter->tx_queue_lock);
+#ifdef CONFIG_E1000_MQ
+ for (i = 0; i < adapter->num_tx_queues; i++)
+ spin_lock_init(&adapter->tx_ring[i].tx_queue_lock);
+#endif
#endif

atomic_set(&adapter->irq_sem, 1);
spin_lock_init(&adapter->stats_lock);

set_bit(__E1000_DOWN, &adapter->flags);
-
return 0;
}

@@ -1382,9 +1446,28 @@ e1000_alloc_queues(struct e1000_adapter *adapter)
}
memset(adapter->polling_netdev, 0, size);
#endif
+#ifdef CONFIG_E1000_MQ
+ adapter->cpu_tx_ring = alloc_percpu(struct e1000_tx_ring *);
+#endif

return E1000_SUCCESS;
}
+#ifdef CONFIG_E1000_MQ
+static void
+e1000_setup_queue_mapping(struct e1000_adapter *adapter)
+{
+ int i, cpu;
+
+ lock_cpu_hotplug();
+ i = 0;
+ for_each_online_cpu(cpu) {
+ *per_cpu_ptr(adapter->cpu_tx_ring, cpu) =
+ &adapter->tx_ring[i % adapter->num_tx_queues];
+ i++;
+ }
+ unlock_cpu_hotplug();
+}
+#endif

/**
* e1000_open - Called when a network interface is made active
@@ -1636,23 +1719,20 @@ e1000_configure_tx(struct e1000_adapter *adapter)
struct e1000_hw *hw = &adapter->hw;
uint32_t tdlen, tctl, tipg, tarc;
uint32_t ipgr1, ipgr2;
+ int i;

/* Setup the HW Tx Head and Tail descriptor pointers */
-
- switch (adapter->num_tx_queues) {
- case 1:
- default:
- tdba = adapter->tx_ring[0].dma;
- tdlen = adapter->tx_ring[0].count *
+ for (i = 0; i < adapter->num_tx_queues; i++) {
+ tdba = adapter->tx_ring[i].dma;
+ tdlen = adapter->tx_ring[i].count *
sizeof(struct e1000_tx_desc);
- E1000_WRITE_REG(hw, TDLEN, tdlen);
- E1000_WRITE_REG(hw, TDBAH, (tdba >> 32));
- E1000_WRITE_REG(hw, TDBAL, (tdba & 0x00000000ffffffffULL));
- E1000_WRITE_REG(hw, TDT, 0);
- E1000_WRITE_REG(hw, TDH, 0);
- adapter->tx_ring[0].tdh = ((hw->mac_type >= e1000_82543) ? E1000_TDH : E1000_82542_TDH);
- adapter->tx_ring[0].tdt = ((hw->mac_type >= e1000_82543) ? E1000_TDT : E1000_82542_TDT);
- break;
+ E1000_WRITE_REG(hw, TDLEN + (i << 8), tdlen);
+ E1000_WRITE_REG(hw, TDBAH + (i << 8), (tdba >> 32));
+ E1000_WRITE_REG(hw, TDBAL + (i << 8), (tdba & 0x00000000ffffffffULL));
+ E1000_WRITE_REG(hw, TDT + (i << 8), 0);
+ E1000_WRITE_REG(hw, TDH + (i << 8), 0);
+ adapter->tx_ring[i].tdh = ((hw->mac_type >= e1000_82543) ? E1000_TDH + (i << 8) : E1000_82542_TDH);
+ adapter->tx_ring[i].tdt = ((hw->mac_type >= e1000_82543) ? E1000_TDT + (i << 8) : E1000_82542_TDT);
}

/* Set the default values for the Tx Inter Packet Gap timer */
@@ -1999,6 +2079,7 @@ e1000_configure_rx(struct e1000_adapter *adapter)
uint64_t rdba;
struct e1000_hw *hw = &adapter->hw;
uint32_t rdlen, rctl, rxcsum, ctrl_ext;
+ int i;

if (adapter->rx_ps_pages) {
/* this is a 32 byte descriptor */
@@ -2042,20 +2123,67 @@ e1000_configure_rx(struct e1000_adapter *adapter)

/* Setup the HW Rx Head and Tail Descriptor Pointers and
* the Base and Length of the Rx Descriptor Ring */
- switch (adapter->num_rx_queues) {
- case 1:
- default:
- rdba = adapter->rx_ring[0].dma;
- E1000_WRITE_REG(hw, RDLEN, rdlen);
- E1000_WRITE_REG(hw, RDBAH, (rdba >> 32));
- E1000_WRITE_REG(hw, RDBAL, (rdba & 0x00000000ffffffffULL));
- E1000_WRITE_REG(hw, RDT, 0);
- E1000_WRITE_REG(hw, RDH, 0);
- adapter->rx_ring[0].rdh = ((hw->mac_type >= e1000_82543) ? E1000_RDH : E1000_82542_RDH);
- adapter->rx_ring[0].rdt = ((hw->mac_type >= e1000_82543) ? E1000_RDT : E1000_82542_RDT);
- break;
+ for (i = 0; i < adapter->num_rx_queues; i++) {
+ rdba = adapter->rx_ring[i].dma;
+ E1000_WRITE_REG(hw, RDLEN + (i << 8), rdlen);
+ E1000_WRITE_REG(hw, RDBAH + (i << 8), (rdba >> 32));
+ E1000_WRITE_REG(hw, RDBAL + (i << 8), (rdba & 0x00000000ffffffffULL));
+ E1000_WRITE_REG(hw, RDT + (i << 8), 0);
+ E1000_WRITE_REG(hw, RDH + (i << 8), 0);
+ adapter->rx_ring[i].rdh = ((hw->mac_type >= e1000_82543) ? E1000_RDH + (i << 8) : E1000_82542_RDH);
+ adapter->rx_ring[i].rdt = ((hw->mac_type >= e1000_82543) ? E1000_RDT + (i << 8) : E1000_82542_RDT);
}

+#ifdef CONFIG_E1000_MQ
+ if (adapter->num_rx_queues > 1) {
+ u32 random[10];
+ u32 reta, mrqc;
+ int i;
+
+ get_random_bytes(&random[0], 40);
+
+ switch (adapter->num_rx_queues) {
+ default:
+ reta = 0x00800080;
+ mrqc = E1000_MRQC_ENABLE_RSS_2Q;
+ break;
+ }
+
+ /* Fill out redirection table */
+ for (i = 0; i < 32; i++)
+ E1000_WRITE_REG_ARRAY(hw, RETA, i, reta);
+ /* Fill out hash function seeds */
+ for (i = 0; i < 10; i++)
+ E1000_WRITE_REG_ARRAY(hw, RSSRK, i, random[i]);
+
+ mrqc |= (E1000_MRQC_RSS_FIELD_IPV4 |
+ E1000_MRQC_RSS_FIELD_IPV4_TCP);
+
+ E1000_WRITE_REG(hw, MRQC, mrqc);
+
+ /* Multiqueue and packet checksumming are mutually exclusive. */
+ rxcsum = E1000_READ_REG(hw, RXCSUM);
+ rxcsum |= E1000_RXCSUM_PCSD;
+ E1000_WRITE_REG(hw, RXCSUM, rxcsum);
+ } else if (hw->mac_type >= e1000_82543) {
+ /* Enable 82543 Receive Checksum Offload for TCP and UDP */
+ rxcsum = E1000_READ_REG(hw, RXCSUM);
+ if (adapter->rx_csum == TRUE) {
+ rxcsum |= E1000_RXCSUM_TUOFL;
+
+ /* Enable 82571 IPv4 payload checksum for UDP fragments
+ * Must be used in conjunction with packet-split. */
+ if ((hw->mac_type >= e1000_82571) &&
+ (adapter->rx_ps_pages)) {
+ rxcsum |= E1000_RXCSUM_IPPCSE;
+ }
+ } else {
+ rxcsum &= ~E1000_RXCSUM_TUOFL;
+ /* don't need to clear IPPCSE as it defaults to 0 */
+ }
+ E1000_WRITE_REG(hw, RXCSUM, rxcsum);
+ }
+#else
/* Enable 82543 Receive Checksum Offload for TCP and UDP */
if (hw->mac_type >= e1000_82543) {
rxcsum = E1000_READ_REG(hw, RXCSUM);
@@ -2074,6 +2202,7 @@ e1000_configure_rx(struct e1000_adapter *adapter)
}
E1000_WRITE_REG(hw, RXCSUM, rxcsum);
}
+#endif

/* enable early receives on 82573, only takes effect if using > 2048
* byte total frame size. for example only for jumbo frames */
@@ -2555,6 +2684,9 @@ e1000_watchdog(unsigned long data)
struct e1000_tx_ring *txdr = adapter->tx_ring;
uint32_t link, tctl;
int32_t ret_val;
+#ifdef CONFIG_E1000_MQ
+ int i;
+#endif

ret_val = e1000_check_for_link(&adapter->hw);
if ((ret_val == E1000_ERR_PHY) &&
@@ -2652,6 +2784,12 @@ e1000_watchdog(unsigned long data)

netif_carrier_on(netdev);
netif_wake_queue(netdev);
+#ifdef CONFIG_E1000_MQ
+ if (netif_is_multiqueue(netdev))
+ for (i = 0; i < adapter->num_tx_queues; i++)
+ netif_wake_subqueue(netdev, i);
+#endif
+
mod_timer(&adapter->phy_info_timer, jiffies + 2 * HZ);
adapter->smartspeed = 0;
} else {
@@ -3249,10 +3387,10 @@ static int e1000_maybe_stop_tx(struct net_device *netdev,

#define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+e1000_xmit_frame_ring(struct sk_buff *skb, struct net_device *netdev,
+ struct e1000_tx_ring *tx_ring)
{
struct e1000_adapter *adapter = netdev_priv(netdev);
- struct e1000_tx_ring *tx_ring;
unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
unsigned int tx_flags = 0;
@@ -3265,12 +3403,6 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
unsigned int f;
len -= skb->data_len;

- /* This goes back to the question of how to logically map a tx queue
- * to a flow. Right now, performance is impacted slightly negatively
- * if using multiple tx queues. If the stack breaks away from a
- * single qdisc implementation, we can look at this again. */
- tx_ring = adapter->tx_ring;
-
if (unlikely(skb->len <= 0)) {
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
@@ -3425,6 +3557,31 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
return NETDEV_TX_OK;
}

+static int
+e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+ struct e1000_adapter *adapter = netdev_priv(netdev);
+ struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+
+ /* This goes back to the question of how to logically map a tx queue
+ * to a flow. Right now, performance is impacted slightly negatively
+ * if using multiple tx queues. If the stack breaks away from a
+ * single qdisc implementation, we can look at this again. */
+ return (e1000_xmit_frame_ring(skb, netdev, tx_ring));
+}
+
+#ifdef CONFIG_E1000_MQ
+static int
+e1000_subqueue_xmit_frame(struct sk_buff *skb, struct net_device *netdev,
+ int queue)
+{
+ struct e1000_adapter *adapter = netdev_priv(netdev);
+ struct e1000_tx_ring *tx_ring = &adapter->tx_ring[queue];
+
+ return (e1000_xmit_frame_ring(skb, netdev, tx_ring));
+}
+#endif
+
/**
* e1000_tx_timeout - Respond to a Tx Hang
* @netdev: network interface device structure
@@ -3924,6 +4081,7 @@ e1000_clean(struct net_device *poll_dev, int *budget)
struct e1000_adapter *adapter;
int work_to_do = min(*budget, poll_dev->quota);
int tx_cleaned = 0, work_done = 0;
+ int i;

/* Must NOT use netdev_priv macro here. */
adapter = poll_dev->priv;
@@ -3933,18 +4091,27 @@ e1000_clean(struct net_device *poll_dev, int *budget)
goto quit_polling;

/* e1000_clean is called per-cpu. This lock protects
- * tx_ring[0] from being cleaned by multiple cpus
+ * tx_ring[i] from being cleaned by multiple cpus
* simultaneously. A failure obtaining the lock means
- * tx_ring[0] is currently being cleaned anyway. */
- if (spin_trylock(&adapter->tx_queue_lock)) {
- tx_cleaned = e1000_clean_tx_irq(adapter,
- &adapter->tx_ring[0]);
- spin_unlock(&adapter->tx_queue_lock);
+ * tx_ring[i] is currently being cleaned anyway. */
+ for (i = 0; i < adapter->num_tx_queues; i++) {
+#ifdef CONFIG_E1000_MQ
+ if (spin_trylock(&adapter->tx_ring[i].tx_queue_lock)) {
+ tx_cleaned = e1000_clean_tx_irq(adapter,
+ &adapter->tx_ring[i]);
+ spin_unlock(&adapter->tx_ring[i].tx_queue_lock);
+ }
+#else
+ if (spin_trylock(&adapter->tx_queue_lock)) {
+ tx_cleaned = e1000_clean_tx_irq(adapter,
+ &adapter->tx_ring[i]);
+ spin_unlock(&adapter->tx_queue_lock);
+ }
+#endif
+ adapter->clean_rx(adapter, &adapter->rx_ring[i],
+ &work_done, work_to_do);
}

- adapter->clean_rx(adapter, &adapter->rx_ring[0],
- &work_done, work_to_do);
-
*budget -= work_done;
poll_dev->quota -= work_done;

@@ -3992,6 +4159,9 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter,
buffer_info = &tx_ring->buffer_info[i];
cleaned = (i == eop);

+#ifdef CONFIG_E1000_MQ
+ tx_ring->tx_stats.bytes += buffer_info->length;
+#endif
if (cleaned) {
struct sk_buff *skb = buffer_info->skb;
unsigned int segs, bytecount;
@@ -4008,6 +4178,9 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter,
if (unlikely(++i == tx_ring->count)) i = 0;
}

+#ifdef CONFIG_E1000_MQ
+ tx_ring->tx_stats.packets++;
+#endif
eop = tx_ring->buffer_info[i].next_to_watch;
eop_desc = E1000_TX_DESC(*tx_ring, eop);
#ifdef CONFIG_E1000_NAPI
@@ -4269,6 +4442,10 @@ e1000_clean_rx_irq(struct e1000_adapter *adapter,
}
#endif /* CONFIG_E1000_NAPI */
netdev->last_rx = jiffies;
+#ifdef CONFIG_E1000_MQ
+ rx_ring->rx_stats.packets++;
+ rx_ring->rx_stats.bytes += length;
+#endif

next_desc:
rx_desc->status = 0;
@@ -4375,6 +4552,10 @@ e1000_clean_rx_irq_ps(struct e1000_adapter *adapter,

/* Good Receive */
skb_put(skb, length);
+#ifdef CONFIG_E1000_MQ
+ rx_ring->rx_stats.packets++;
+ rx_ring->rx_stats.bytes += skb->len;
+#endif

{
/* this looks ugly, but it seems compiler issues make it



---
Auke Kok <[email protected]>
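
As an aside on the patch above: e1000_setup_queue_mapping() binds a TX
ring to each online CPU via cpu_tx_ring, but the posted e1000_xmit_frame()
still selects adapter->tx_ring directly (see the comment in the patch).
A hedged sketch of how that per-CPU mapping could be consumed instead;
e1000_xmit_frame_percpu is hypothetical, everything else it uses is from
the patch, and the snippet assumes the e1000_main.c context above.

static int e1000_xmit_frame_percpu(struct sk_buff *skb,
                                   struct net_device *netdev)
{
        struct e1000_adapter *adapter = netdev_priv(netdev);
        struct e1000_tx_ring *tx_ring;
        int ret;

        /* get_cpu() pins us to the current CPU while we pick the ring
         * that e1000_setup_queue_mapping() assigned to it. */
        tx_ring = *per_cpu_ptr(adapter->cpu_tx_ring, get_cpu());
        ret = e1000_xmit_frame_ring(skb, netdev, tx_ring);
        put_cpu();
        return ret;
}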

2007-02-09 00:23:37

by Kok, Auke

Subject: [PATCH 1/2] NET: Multiple queue network device support


From: Peter Waskiewicz Jr <[email protected]>

Added an API and associated supporting routines for multiqueue network
devices. This allows network devices supporting multiple TX queues to
configure each queue within the netdevice and manage each queue
independently. Changes to the PRIO Qdisc also allow a user to map
multiple flows to individual TX queues, taking advantage of each queue
on the device.
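
For illustration (not part of this patch), a hedged sketch of a
map_queue implementation for a hypothetical qdisc. The patch's own
implementations are noop/pfifo_fast, which always return 0, and PRIO,
which returns its band-to-queue mapping.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

/* Hypothetical example: spread flows across the device's egress
 * subqueues by skb priority. dev_queue_xmit() calls this through
 * q->ops->map_queue() to learn which subqueue lock to take before
 * enqueueing. */
static int example_map_queue(struct sk_buff *skb, struct Qdisc *sch)
{
        if (!netif_is_multiqueue(sch->dev))
                return 0;
        return skb->priority % sch->dev->egress_subqueue_count;
}

The returned index must stay within [0, dev->egress_subqueue_count),
since it is used directly to index dev->egress_subqueue[] for locking.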

Signed-off-by: Peter Waskiewicz Jr <[email protected]>
Signed-off-by: Auke Kok <[email protected]>
---

include/linux/netdevice.h | 73 ++++++++++++++++++++++++++++
include/net/sch_generic.h | 3 +
net/Kconfig | 20 ++++++++
net/core/dev.c | 51 ++++++++++++++++++++
net/sched/sch_generic.c | 117 +++++++++++++++++++++++++++++++++++++++++++++
net/sched/sch_prio.c | 106 ++++++++++++++++++++++++++++++++++++++---
6 files changed, 364 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2e37f50..c7f94a8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -106,6 +106,16 @@ struct netpoll_info;
#define MAX_HEADER (LL_MAX_HEADER + 48)
#endif

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+
+struct net_device_subqueue
+{
+ /* Give a lock and a flow control state for each queue */
+ unsigned long state;
+ spinlock_t queue_lock ____cacheline_aligned_in_smp;
+};
+#endif
+
/*
* Network device statistics. Akin to the 2.0 ether stats but
* with byte counters.
@@ -339,6 +349,10 @@ struct net_device
#define NETIF_F_GEN_CSUM (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
#define NETIF_F_ALL_CSUM (NETIF_F_IP_CSUM | NETIF_F_GEN_CSUM)

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+#define NETIF_F_MULTI_QUEUE 8192 /* Has multiple TX/RX queues */
+#endif
+
struct net_device *next_sched;

/* Interface index. Unique device identifier */
@@ -532,6 +546,19 @@ struct net_device
struct device dev;
/* space for optional statistics and wireless sysfs groups */
struct attribute_group *sysfs_groups[3];
+
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ /* To retrieve statistics per subqueue - FOR FUTURE USE */
+ struct net_device_stats* (*get_subqueue_stats)(struct net_device *dev,
+ int queue_index);
+
+ /* The TX queue control structures */
+ struct net_device_subqueue *egress_subqueue;
+ int egress_subqueue_count;
+ int (*hard_start_subqueue_xmit)(struct sk_buff *skb,
+ struct net_device *dev,
+ int queue_index);
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
};
#define to_net_dev(d) container_of(d, struct net_device, dev)

@@ -673,6 +700,52 @@ static inline int netif_running(const struct net_device *dev)
return test_bit(__LINK_STATE_START, &dev->state);
}

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+
+/*
+ * Routines to manage the subqueues on a device. We only need start
+ * stop, and a check if it's stopped. All other device management is
+ * done at the overall netdevice level.
+ * Also test the netif state if we're multi-queue at all.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, int queue_index)
+{
+ clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, int queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+ if (netpoll_trap())
+ return;
+#endif
+ set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ int queue_index)
+{
+ return test_bit(__LINK_STATE_XOFF,
+ &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, int queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+ if (netpoll_trap())
+ return;
+#endif
+ if (test_and_clear_bit(__LINK_STATE_XOFF,
+ &dev->egress_subqueue[queue_index].state))
+ __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+ return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
+

/* Use this variant when it is known for sure that it
* is executing from interrupt context.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 8208639..8751ba6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -102,6 +102,9 @@ struct Qdisc_ops

int (*dump)(struct Qdisc *, struct sk_buff *);
int (*dump_stats)(struct Qdisc *, struct gnet_dump *);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int (*map_queue)(struct sk_buff *, struct Qdisc *);
+#endif

struct module *owner;
};
diff --git a/net/Kconfig b/net/Kconfig
index 7dfc949..796edb3 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -38,6 +38,26 @@ source "net/packet/Kconfig"
source "net/unix/Kconfig"
source "net/xfrm/Kconfig"

+config NET_MULTI_QUEUE_DEVICE
+ bool "Multiple queue network device support (EXPERIMENTAL)"
+ depends on NET_SCHED && EXPERIMENTAL
+ help
+ Saying Y here will add support for network devices that have more than
+ one physical TX queue and/or RX queue.
+
+ Multiple queue devices will require qdiscs that understand how to
+ queue to multiple targets. The default packet scheduler will default
+ to the first queue in a device. In other words, if you need the
+ ability to spread traffic across queues, your queueing discipline
+ needs to know how to do that.
+
+ Note that saying Y here will give preferential treatment to multiple
+ queue devices in the network stack. A slight drop in single-queue
+ device performance may be seen.
+
+ Say N here unless you know your network device supports multiple
+ TX and/or RX queues.
+
config INET
bool "TCP/IP networking"
---help---
diff --git a/net/core/dev.c b/net/core/dev.c
index 455d589..42b635c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1477,6 +1477,49 @@ gso:
skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
#endif
if (q->enqueue) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue_index;
+ /* If we're a multi-queue device, get a queue index to lock */
+ if (netif_is_multiqueue(dev))
+ {
+ /* Get the queue index and lock it. */
+ if (likely(q->ops->map_queue)) {
+ queue_index = q->ops->map_queue(skb, q);
+ spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+ rc = q->enqueue(skb, q);
+ /*
+ * Unlock because the underlying qdisc
+ * may queue and send a packet from a
+ * different queue.
+ */
+ spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+ qdisc_run(dev);
+ rc = rc == NET_XMIT_BYPASS
+ ? NET_XMIT_SUCCESS : rc;
+ goto out;
+ } else {
+ printk(KERN_CRIT "Device %s is "
+ "multiqueue, but map_queue is "
+ "undefined in the qdisc!!\n",
+ dev->name);
+ kfree_skb(skb);
+ }
+ } else {
+ /* We're not a multi-queue device. */
+ spin_lock(&dev->queue_lock);
+ q = dev->qdisc;
+ if (q->enqueue) {
+ rc = q->enqueue(skb, q);
+ qdisc_run(dev);
+ spin_unlock(&dev->queue_lock);
+ rc = rc == NET_XMIT_BYPASS
+ ? NET_XMIT_SUCCESS : rc;
+ goto out;
+ }
+ spin_unlock(&dev->queue_lock);
+ }
+#else /* No multi-queue support compiled in */
+
/* Grab device queue */
spin_lock(&dev->queue_lock);
q = dev->qdisc;
@@ -1489,6 +1532,7 @@ gso:
goto out;
}
spin_unlock(&dev->queue_lock);
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
}

/* The device has no queue. Common case for software devices:
@@ -2879,6 +2923,9 @@ int register_netdevice(struct net_device *dev)
struct hlist_head *head;
struct hlist_node *p;
int ret;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int i;
+#endif

BUG_ON(dev_boot_phase);
ASSERT_RTNL();
@@ -2894,6 +2941,10 @@ int register_netdevice(struct net_device *dev)
#ifdef CONFIG_NET_CLS_ACT
spin_lock_init(&dev->ingress_lock);
#endif
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ for (i = 0; i < dev->egress_subqueue_count; i++)
+ spin_lock_init(&dev->egress_subqueue[i].queue_lock);
+#endif

dev->iflink = -1;

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index bc116bd..62ca23e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -42,6 +42,7 @@
to data, participating in scheduling must be additionally
protected with dev->queue_lock spinlock.

+ For single-queue devices:
The idea is the following:
- enqueue, dequeue are serialized via top level device
spinlock dev->queue_lock.
@@ -51,6 +52,12 @@
hence this lock may be made without local bh disabling.

qdisc_tree_lock must be grabbed BEFORE dev->queue_lock!
+
+ For multi-queue devices:
+ - enqueue, dequeue are serialized with individual queue locks,
+ dev->egress_subqueue[queue_index].queue_lock.
+ - Everything else with tree protection with single queue devices apply
+ to multiqueue devices.
*/
DEFINE_RWLOCK(qdisc_tree_lock);

@@ -92,6 +99,9 @@ static inline int qdisc_restart(struct net_device *dev)
{
struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue_index = 0;
+#endif

/* Dequeue packet */
if (((skb = dev->gso_skb)) || ((skb = q->dequeue(q)))) {
@@ -130,6 +140,72 @@ static inline int qdisc_restart(struct net_device *dev)
}

{
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(dev)) {
+ if (likely(q->ops->map_queue)) {
+ queue_index = q->ops->map_queue(skb, q);
+ } else {
+ printk(KERN_CRIT "Device %s is "
+ "multiqueue, but map_queue is "
+ "undefined in the qdisc!!\n",
+ dev->name);
+ goto requeue;
+ }
+ spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+ /* Check top level device, and any sub-device */
+ if ((!netif_queue_stopped(dev)) &&
+ (!netif_subqueue_stopped(dev, queue_index))) {
+ int ret;
+ ret = dev->hard_start_subqueue_xmit(skb, dev, queue_index);
+ if (ret == NETDEV_TX_OK) {
+ if (!nolock) {
+ netif_tx_unlock(dev);
+ }
+ return -1;
+ }
+ if (ret == NETDEV_TX_LOCKED && nolock) {
+ spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+ goto collision;
+ }
+ }
+ /* NETDEV_TX_BUSY - we need to requeue */
+ /* Release the driver */
+ if (!nolock) {
+ netif_tx_unlock(dev);
+ }
+ spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+ q = dev->qdisc;
+ }
+ else
+ {
+ /* We're a single-queue device */
+ /* And release queue */
+ spin_unlock(&dev->queue_lock);
+ if (!netif_queue_stopped(dev)) {
+ int ret;
+
+ ret = dev->hard_start_xmit(skb, dev);
+ if (ret == NETDEV_TX_OK) {
+ if (!nolock) {
+ netif_tx_unlock(dev);
+ }
+ spin_lock(&dev->queue_lock);
+ return -1;
+ }
+ if (ret == NETDEV_TX_LOCKED && nolock) {
+ spin_lock(&dev->queue_lock);
+ goto collision;
+ }
+ }
+ /* NETDEV_TX_BUSY - we need to requeue */
+ /* Release the driver */
+ if (!nolock) {
+ netif_tx_unlock(dev);
+ }
+ spin_lock(&dev->queue_lock);
+ q = dev->qdisc;
+ }
+#else /* CONFIG_NET_MULTI_QUEUE_DEVICE */
/* And release queue */
spin_unlock(&dev->queue_lock);

@@ -157,6 +233,7 @@ static inline int qdisc_restart(struct net_device *dev)
}
spin_lock(&dev->queue_lock);
q = dev->qdisc;
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
}

/* Device kicked us out :(
@@ -175,9 +252,17 @@ requeue:
else
q->ops->requeue(skb, q);
netif_schedule(dev);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(dev))
+ spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+#endif
return 1;
}
BUG_ON((int) q->q.qlen < 0);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(dev))
+ spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+#endif
return q->q.qlen;
}

@@ -287,12 +372,22 @@ static int noop_requeue(struct sk_buff *skb, struct Qdisc* qdisc)
return NET_XMIT_CN;
}

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int noop_map_queue(struct sk_buff *skb, struct Qdisc *qdisc)
+{
+ return 0;
+}
+#endif
+
struct Qdisc_ops noop_qdisc_ops = {
.id = "noop",
.priv_size = 0,
.enqueue = noop_enqueue,
.dequeue = noop_dequeue,
.requeue = noop_requeue,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = noop_map_queue,
+#endif
.owner = THIS_MODULE,
};

@@ -310,6 +405,9 @@ static struct Qdisc_ops noqueue_qdisc_ops = {
.enqueue = noop_enqueue,
.dequeue = noop_dequeue,
.requeue = noop_requeue,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = noop_map_queue,
+#endif
.owner = THIS_MODULE,
};

@@ -356,10 +454,18 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc* qdisc)
struct sk_buff_head *list = qdisc_priv(qdisc);

for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(qdisc->dev))
+ spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock);
+#endif
if (!skb_queue_empty(list + prio)) {
qdisc->q.qlen--;
return __qdisc_dequeue_head(qdisc, list + prio);
}
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(qdisc->dev))
+ spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock);
+#endif
}

return NULL;
@@ -406,6 +512,14 @@ static int pfifo_fast_init(struct Qdisc *qdisc, struct rtattr *opt)
return 0;
}

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int pfifo_fast_map_queue(struct sk_buff *skb, struct Qdisc *qdisc)
+{
+ /* Generic qdisc, so always return queue index 0. */
+ return 0;
+}
+#endif
+
static struct Qdisc_ops pfifo_fast_ops = {
.id = "pfifo_fast",
.priv_size = PFIFO_FAST_BANDS * sizeof(struct sk_buff_head),
@@ -415,6 +529,9 @@ static struct Qdisc_ops pfifo_fast_ops = {
.init = pfifo_fast_init,
.reset = pfifo_fast_reset,
.dump = pfifo_fast_dump,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = pfifo_fast_map_queue,
+#endif
.owner = THIS_MODULE,
};

diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 2567b4c..0e24071 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -43,16 +43,19 @@ struct prio_sched_data
struct tcf_proto *filter_list;
u8 prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int band2queue[TC_PRIO_MAX + 1];
+#endif
};

-
static struct Qdisc *
-prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
+prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr, int *retband)
{
struct prio_sched_data *q = qdisc_priv(sch);
u32 band = skb->priority;
struct tcf_result res;

+ *retband = band;
*qerr = NET_XMIT_BYPASS;
if (TC_H_MAJ(skb->priority) != sch->handle) {
#ifdef CONFIG_NET_CLS_ACT
@@ -68,13 +71,16 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
#else
if (!q->filter_list || tc_classify(skb, q->filter_list, &res)) {
#endif
- if (TC_H_MAJ(band))
+ if (TC_H_MAJ(band)) {
band = 0;
+ *retband = 0;
+ }
return q->queues[q->prio2band[band&TC_PRIO_MAX]];
}
band = res.classid;
}
band = TC_H_MIN(band) - 1;
+ *retband = band;
if (band > q->bands)
return q->queues[q->prio2band[0]];

@@ -86,8 +92,15 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
struct Qdisc *qdisc;
int ret;
+ int band; /* Unused in this function, just here for multi-queue */
+
+ /*
+ * NOTE: if a device is multiqueue, then it is imperative the
+ * underlying qdisc NOT be another PRIO qdisc. Otherwise we're going
+ * to deadlock. Fixme please.
+ */
+ qdisc = prio_classify(skb, sch, &ret, &band);

- qdisc = prio_classify(skb, sch, &ret);
#ifdef CONFIG_NET_CLS_ACT
if (qdisc == NULL) {

@@ -108,14 +121,34 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
return ret;
}

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int
+prio_map_queue(struct sk_buff *skb, struct Qdisc *sch)
+{
+ /*
+ * We'll classify the skb here, so return the Qdisc back through
+ * the function call. The return value is the queue we need to
+ * lock, which will be the value returned through band. This will help
+ * speed up enqueue, since it won't have to do the classify again.
+ */
+ int ret;
+ int band;
+ struct prio_sched_data *q = qdisc_priv(sch);
+
+ sch = prio_classify(skb, sch, &ret, &band);
+ return q->band2queue[band];
+}
+#endif

static int
prio_requeue(struct sk_buff *skb, struct Qdisc* sch)
{
struct Qdisc *qdisc;
int ret;
+ int band;
+
+ qdisc = prio_classify(skb, sch, &ret, &band);

- qdisc = prio_classify(skb, sch, &ret);
#ifdef CONFIG_NET_CLS_ACT
if (qdisc == NULL) {
if (ret == NET_XMIT_BYPASS)
@@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch)
struct sk_buff *skb;
struct prio_sched_data *q = qdisc_priv(sch);
int prio;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue;
+#endif
struct Qdisc *qdisc;

+ /*
+ * If we're multiqueue, the basic approach is try the lock on each
+ * queue. If it's locked, either we're already dequeuing, or enqueue
+ * is doing something. Go to the next band if we're locked. Once we
+ * have a packet, unlock the queue. NOTE: the underlying qdisc CANNOT
+ * be a PRIO qdisc, otherwise we will deadlock. FIXME
+ */
for (prio = 0; prio < q->bands; prio++) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(sch->dev)) {
+ queue = q->band2queue[prio];
+ if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
+ qdisc = q->queues[prio];
+ skb = qdisc->dequeue(qdisc);
+ if (skb) {
+ sch->q.qlen--;
+ skb->priority = prio;
+ spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
+ return skb;
+ }
+ spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
+ }
+ } else {
+ qdisc = q->queues[prio];
+ skb = qdisc->dequeue(qdisc);
+ if (skb) {
+ sch->q.qlen--;
+ skb->priority = prio;
+ return skb;
+ }
+ }
+#else
qdisc = q->queues[prio];
skb = qdisc->dequeue(qdisc);
if (skb) {
sch->q.qlen--;
+ skb->priority = prio;
return skb;
}
+#endif
}
return NULL;
-
}

static unsigned int prio_drop(struct Qdisc* sch)
@@ -205,6 +273,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
struct prio_sched_data *q = qdisc_priv(sch);
struct tc_prio_qopt *qopt = RTA_DATA(opt);
int i;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue;
+ int qmapoffset;
+#endif

if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
return -EINVAL;
@@ -247,6 +319,25 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
}
}
}
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ /* setup queue to band mapping */
+ if (netif_is_multiqueue(sch->dev)) {
+ if (q->bands == sch->dev->egress_subqueue_count)
+ qmapoffset = 0;
+ else
+ qmapoffset = q->bands / sch->dev->egress_subqueue_count;
+
+ queue = 0;
+ for (i = 0; i < q->bands; i++) {
+ q->band2queue[i] = queue;
+ if ( (((i + 1) - queue) == qmapoffset) &&
+ (queue < (sch->dev->egress_subqueue_count - 1))) {
+ queue++;
+ }
+ }
+ }
+
+#endif
return 0;
}

@@ -430,6 +521,9 @@ static struct Qdisc_ops prio_qdisc_ops = {
.destroy = prio_destroy,
.change = prio_tune,
.dump = prio_dump,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = prio_map_queue,
+#endif
.owner = THIS_MODULE,
};




---
Auke Kok <[email protected]>
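
A standalone illustration of the band-to-queue mapping that prio_tune()
sets up in the patch above, using 8 PRIO bands on a device with 2 egress
subqueues (values chosen only for this example):

#include <stdio.h>

int main(void)
{
        int bands = 8, subqueue_count = 2;      /* example values */
        int band2queue[8];
        int queue = 0, qmapoffset, i;

        /* Same logic as the prio_tune() hunk above. */
        qmapoffset = (bands == subqueue_count) ? 0 : bands / subqueue_count;

        for (i = 0; i < bands; i++) {
                band2queue[i] = queue;
                if ((((i + 1) - queue) == qmapoffset) &&
                    (queue < subqueue_count - 1))
                        queue++;
        }

        for (i = 0; i < bands; i++)
                printf("band %d -> queue %d\n", i, band2queue[i]);
        return 0;
}

Running it prints bands 0-3 mapping to queue 0 and bands 4-7 to queue 1,
i.e. the higher-priority bands share the first hardware queue and the
lower-priority bands share the second.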

2007-02-23 09:14:34

by Sreenivasa Honnur

Subject: RE: [PATCH 1/2] NET: Multiple queue network device support

Function "map_queue" returns queue_index as 0 always. There is no
support to return different queue numbers.

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Kok, Auke
Sent: Friday, February 09, 2007 5:40 AM
To: David Miller; Garzik, Jeff; [email protected];
[email protected]
Cc: Kok, Auke; Peter Waskiewicz Jr; Brandeburg, Jesse; Kok, Auke;
Ronciak, John
Subject: [PATCH 1/2] NET: Multiple queue network device support


From: Peter Waskiewicz Jr <[email protected]>

Added an API and associated supporting routines for multiqueue network
devices. This allows network devices supporting multiple TX queues to
configure each queue within the netdevice and manage each queue
independantly. Changes to the PRIO Qdisc also allow a user to map
multiple flows to individual TX queues, taking advantage of each queue
on the device.

Signed-off-by: Peter Waskiewicz Jr <[email protected]>
Signed-off-by: Auke Kok <[email protected]>
---

include/linux/netdevice.h | 73 ++++++++++++++++++++++++++++
include/net/sch_generic.h | 3 +
net/Kconfig | 20 ++++++++
net/core/dev.c | 51 ++++++++++++++++++++
net/sched/sch_generic.c | 117
+++++++++++++++++++++++++++++++++++++++++++++
net/sched/sch_prio.c | 106
++++++++++++++++++++++++++++++++++++++---
6 files changed, 364 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index
2e37f50..c7f94a8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -106,6 +106,16 @@ struct netpoll_info; #define MAX_HEADER
(LL_MAX_HEADER + 48) #endif

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+
+struct net_device_subqueue
+{
+ /* Give a lock and a flow control state for each queue */
+ unsigned long state;
+ spinlock_t queue_lock ____cacheline_aligned_in_smp;
+};
+#endif
+
/*
* Network device statistics. Akin to the 2.0 ether stats but
* with byte counters.
@@ -339,6 +349,10 @@ struct net_device
#define NETIF_F_GEN_CSUM (NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)
#define NETIF_F_ALL_CSUM (NETIF_F_IP_CSUM | NETIF_F_GEN_CSUM)

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+#define NETIF_F_MULTI_QUEUE 8192 /* Has multiple TX/RX queues */
+#endif
+
struct net_device *next_sched;

/* Interface index. Unique device identifier */
@@ -532,6 +546,19 @@ struct net_device
struct device dev;
/* space for optional statistics and wireless sysfs groups */
struct attribute_group *sysfs_groups[3];
+
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ /* To retrieve statistics per subqueue - FOR FUTURE USE */
+ struct net_device_stats* (*get_subqueue_stats)(struct net_device
*dev,
+ int
queue_index);
+
+ /* The TX queue control structures */
+ struct net_device_subqueue *egress_subqueue;
+ int egress_subqueue_count;
+ int (*hard_start_subqueue_xmit)(struct
sk_buff *skb,
+ struct
net_device *dev,
+ int
queue_index);
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
};
#define to_net_dev(d) container_of(d, struct net_device, dev)

@@ -673,6 +700,52 @@ static inline int netif_running(const struct
net_device *dev)
return test_bit(__LINK_STATE_START, &dev->state); }

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+
+/*
+ * Routines to manage the subqueues on a device. We only need start
+ * stop, and a check if it's stopped. All other device management is
+ * done at the overall netdevice level.
+ * Also test the netif state if we're multi-queue at all.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, int
+queue_index) {
+ clear_bit(__LINK_STATE_XOFF,
+&dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, int
+queue_index) { #ifdef CONFIG_NETPOLL_TRAP
+ if (netpoll_trap())
+ return;
+#endif
+ set_bit(__LINK_STATE_XOFF,
&dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+ int queue_index) {
+ return test_bit(__LINK_STATE_XOFF,
+ &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, int
+queue_index) { #ifdef CONFIG_NETPOLL_TRAP
+ if (netpoll_trap())
+ return;
+#endif
+ if (test_and_clear_bit(__LINK_STATE_XOFF,
+
&dev->egress_subqueue[queue_index].state))
+ __netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev) {
+ return (!!(NETIF_F_MULTI_QUEUE & dev->features)); } #endif /*
+CONFIG_NET_MULTI_QUEUE_DEVICE */
+

/* Use this variant when it is known for sure that it
* is executing from interrupt context.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index
8208639..8751ba6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -102,6 +102,9 @@ struct Qdisc_ops

int (*dump)(struct Qdisc *, struct sk_buff
*);
int (*dump_stats)(struct Qdisc *, struct
gnet_dump *);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int (*map_queue)(struct sk_buff *, struct
Qdisc *);
+#endif

struct module *owner;
};
diff --git a/net/Kconfig b/net/Kconfig
index 7dfc949..796edb3 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -38,6 +38,26 @@ source "net/packet/Kconfig"
source "net/unix/Kconfig"
source "net/xfrm/Kconfig"

+config NET_MULTI_QUEUE_DEVICE
+ bool "Multiple queue network device support (EXPERIMENTAL)"
+ depends on NET_SCHED && EXPERIMENTAL
+ help
+ Saying Y here will add support for network devices that have
more than
+ one physical TX queue and/or RX queue.
+
+ Multiple queue devices will require qdiscs that understand how
to
+ queue to multiple targets. The default packet scheduler will
default
+ to the first queue in a device. In other words, if you need
the
+ ability to spread traffic across queues, your queueing
discipline
+ needs to know how to do that.
+
+ Note that saying Y here will give preferential treatment to
multiple
+ queue devices in the network stack. A slight drop in
single-queue
+ device performance may be seen.
+
+ Say N here unless you know your network device supports
multiple
+ TX and/or RX queues.
+
config INET
bool "TCP/IP networking"
---help---
diff --git a/net/core/dev.c b/net/core/dev.c index 455d589..42b635c
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1477,6 +1477,49 @@ gso:
skb->tc_verd = SET_TC_AT(skb->tc_verd,AT_EGRESS);
#endif
if (q->enqueue) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue_index;
+ /* If we're a multi-queue device, get a queue index to
lock */
+ if (netif_is_multiqueue(dev))
+ {
+ /* Get the queue index and lock it. */
+ if (likely(q->ops->map_queue)) {
+ queue_index = q->ops->map_queue(skb, q);
+
spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+ rc = q->enqueue(skb, q);
+ /*
+ * Unlock because the underlying qdisc
+ * may queue and send a packet from a
+ * different queue.
+ */
+
spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+ qdisc_run(dev);
+ rc = rc == NET_XMIT_BYPASS
+ ? NET_XMIT_SUCCESS : rc;
+ goto out;
+ } else {
+ printk(KERN_CRIT "Device %s is "
+ "multiqueue, but map_queue is "
+ "undefined in the qdisc!!\n",
+ dev->name);
+ kfree_skb(skb);
+ }
+ } else {
+ /* We're not a multi-queue device. */
+ spin_lock(&dev->queue_lock);
+ q = dev->qdisc;
+ if (q->enqueue) {
+ rc = q->enqueue(skb, q);
+ qdisc_run(dev);
+ spin_unlock(&dev->queue_lock);
+ rc = rc == NET_XMIT_BYPASS
+ ? NET_XMIT_SUCCESS : rc;
+ goto out;
+ }
+ spin_unlock(&dev->queue_lock);
+ }
+#else /* No multi-queue support compiled in */
+
/* Grab device queue */
spin_lock(&dev->queue_lock);
q = dev->qdisc;
@@ -1489,6 +1532,7 @@ gso:
goto out;
}
spin_unlock(&dev->queue_lock);
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
}

/* The device has no queue. Common case for software devices:
@@ -2879,6 +2923,9 @@ int register_netdevice(struct net_device *dev)
struct hlist_head *head;
struct hlist_node *p;
int ret;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int i;
+#endif

BUG_ON(dev_boot_phase);
ASSERT_RTNL();
@@ -2894,6 +2941,10 @@ int register_netdevice(struct net_device *dev)
#ifdef CONFIG_NET_CLS_ACT
spin_lock_init(&dev->ingress_lock);
#endif
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ for (i = 0; i < dev->egress_subqueue_count; i++)
+ spin_lock_init(&dev->egress_subqueue[i].queue_lock);
+#endif

dev->iflink = -1;

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index
bc116bd..62ca23e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -42,6 +42,7 @@
to data, participating in scheduling must be additionally
protected with dev->queue_lock spinlock.

+ For single-queue devices:
The idea is the following:
- enqueue, dequeue are serialized via top level device
spinlock dev->queue_lock.
@@ -51,6 +52,12 @@
hence this lock may be made without local bh disabling.

qdisc_tree_lock must be grabbed BEFORE dev->queue_lock!
+
+ For multi-queue devices:
+ - enqueue, dequeue are serialized with individual queue locks,
+ dev->egress_subqueue[queue_index].queue_lock.
+ - Everything else with tree protection with single queue devices
apply
+ to multiqueue devices.
*/
DEFINE_RWLOCK(qdisc_tree_lock);

@@ -92,6 +99,9 @@ static inline int qdisc_restart(struct net_device
*dev) {
struct Qdisc *q = dev->qdisc;
struct sk_buff *skb;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue_index = 0;
+#endif

/* Dequeue packet */
if (((skb = dev->gso_skb)) || ((skb = q->dequeue(q)))) { @@
-130,6 +140,72 @@ static inline int qdisc_restart(struct net_device
*dev)
}

{
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(dev)) {
+ if (likely(q->ops->map_queue)) {
+ queue_index =
q->ops->map_queue(skb, q);
+ } else {
+ printk(KERN_CRIT "Device %s is "
+ "multiqueue, but
map_queue is "
+ "undefined in the
qdisc!!\n",
+ dev->name);
+ goto requeue;
+ }
+
spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+ /* Check top level device, and any
sub-device */
+ if ((!netif_queue_stopped(dev)) &&
+ (!netif_subqueue_stopped(dev,
queue_index))) {
+ int ret;
+ ret =
dev->hard_start_subqueue_xmit(skb, dev, queue_index);
+ if (ret == NETDEV_TX_OK) {
+ if (!nolock) {
+
netif_tx_unlock(dev);
+ }
+ return -1;
+ }
+ if (ret == NETDEV_TX_LOCKED &&
nolock) {
+
spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+ goto collision;
+ }
+ }
+ /* NETDEV_TX_BUSY - we need to requeue
*/
+ /* Release the driver */
+ if (!nolock) {
+ netif_tx_unlock(dev);
+ }
+
spin_lock(&dev->egress_subqueue[queue_index].queue_lock);
+ q = dev->qdisc;
+ }
+ else
+ {
+ /* We're a single-queue device */
+ /* And release queue */
+ spin_unlock(&dev->queue_lock);
+ if (!netif_queue_stopped(dev)) {
+ int ret;
+
+ ret = dev->hard_start_xmit(skb,
dev);
+ if (ret == NETDEV_TX_OK) {
+ if (!nolock) {
+
netif_tx_unlock(dev);
+ }
+
spin_lock(&dev->queue_lock);
+ return -1;
+ }
+ if (ret == NETDEV_TX_LOCKED &&
nolock) {
+
spin_lock(&dev->queue_lock);
+ goto collision;
+ }
+ }
+ /* NETDEV_TX_BUSY - we need to requeue
*/
+ /* Release the driver */
+ if (!nolock) {
+ netif_tx_unlock(dev);
+ }
+ spin_lock(&dev->queue_lock);
+ q = dev->qdisc;
+ }
+#else /* CONFIG_NET_MULTI_QUEUE_DEVICE */
/* And release queue */
spin_unlock(&dev->queue_lock);

@@ -157,6 +233,7 @@ static inline int qdisc_restart(struct net_device
*dev)
}
spin_lock(&dev->queue_lock);
q = dev->qdisc;
+#endif /* CONFIG_NET_MULTI_QUEUE_DEVICE */
}

/* Device kicked us out :(
@@ -175,9 +252,17 @@ requeue:
else
q->ops->requeue(skb, q);
netif_schedule(dev);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(dev))
+
spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+#endif
return 1;
}
BUG_ON((int) q->q.qlen < 0);
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(dev))
+
spin_unlock(&dev->egress_subqueue[queue_index].queue_lock);
+#endif
return q->q.qlen;
}

@@ -287,12 +372,22 @@ static int noop_requeue(struct sk_buff *skb,
struct Qdisc* qdisc)
return NET_XMIT_CN;
}

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int noop_map_queue(struct sk_buff *skb, struct Qdisc *qdisc) {
+ return 0;
+}
+#endif
+
struct Qdisc_ops noop_qdisc_ops = {
.id = "noop",
.priv_size = 0,
.enqueue = noop_enqueue,
.dequeue = noop_dequeue,
.requeue = noop_requeue,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = noop_map_queue,
+#endif
.owner = THIS_MODULE,
};

@@ -310,6 +405,9 @@ static struct Qdisc_ops noqueue_qdisc_ops = {
.enqueue = noop_enqueue,
.dequeue = noop_dequeue,
.requeue = noop_requeue,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = noop_map_queue,
+#endif
.owner = THIS_MODULE,
};

@@ -356,10 +454,18 @@ static struct sk_buff *pfifo_fast_dequeue(struct
Qdisc* qdisc)
struct sk_buff_head *list = qdisc_priv(qdisc);

for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(qdisc->dev))
+
spin_lock(&qdisc->dev->egress_subqueue[0].queue_lock);
+#endif
if (!skb_queue_empty(list + prio)) {
qdisc->q.qlen--;
return __qdisc_dequeue_head(qdisc, list + prio);
}
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ if (netif_is_multiqueue(qdisc->dev))
+
spin_unlock(&qdisc->dev->egress_subqueue[0].queue_lock);
+#endif
}

return NULL;
@@ -406,6 +512,14 @@ static int pfifo_fast_init(struct Qdisc *qdisc,
struct rtattr *opt)
return 0;
}

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int pfifo_fast_map_queue(struct sk_buff *skb, struct Qdisc
+*qdisc) {
+ /* Generic qdisc, so always return queue index 0. */
+ return 0;
+}
+#endif
+
static struct Qdisc_ops pfifo_fast_ops = {
.id = "pfifo_fast",
.priv_size = PFIFO_FAST_BANDS * sizeof(struct
sk_buff_head),
@@ -415,6 +529,9 @@ static struct Qdisc_ops pfifo_fast_ops = {
.init = pfifo_fast_init,
.reset = pfifo_fast_reset,
.dump = pfifo_fast_dump,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = pfifo_fast_map_queue,
+#endif
.owner = THIS_MODULE,
};

diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index
2567b4c..0e24071 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -43,16 +43,19 @@ struct prio_sched_data
struct tcf_proto *filter_list;
u8 prio2band[TC_PRIO_MAX+1];
struct Qdisc *queues[TCQ_PRIO_BANDS];
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int band2queue[TC_PRIO_MAX + 1];
+#endif
};

-
static struct Qdisc *
-prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
+prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr, int
+*retband)
{
struct prio_sched_data *q = qdisc_priv(sch);
u32 band = skb->priority;
struct tcf_result res;

+ *retband = band;
*qerr = NET_XMIT_BYPASS;
if (TC_H_MAJ(skb->priority) != sch->handle) { #ifdef
CONFIG_NET_CLS_ACT @@ -68,13 +71,16 @@ prio_classify(struct sk_buff
*skb, struct Qdisc *sch, int *qerr) #else
if (!q->filter_list || tc_classify(skb, q->filter_list,
&res)) { #endif
- if (TC_H_MAJ(band))
+ if (TC_H_MAJ(band)) {
band = 0;
+ *retband = 0;
+ }
return
q->queues[q->prio2band[band&TC_PRIO_MAX]];
}
band = res.classid;
}
band = TC_H_MIN(band) - 1;
+ *retband = band;
if (band > q->bands)
return q->queues[q->prio2band[0]];

@@ -86,8 +92,15 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
struct Qdisc *qdisc;
int ret;
+ int band; /* Unused in this function, just here for multi-queue
*/
+
+ /*
+ * NOTE: if a device is multiqueue, then it is imperative the
+ * underlying qdisc NOT be another PRIO qdisc. Otherwise we're
going
+ * to deadlock. Fixme please.
+ */
+ qdisc = prio_classify(skb, sch, &ret, &band);

- qdisc = prio_classify(skb, sch, &ret);
#ifdef CONFIG_NET_CLS_ACT
if (qdisc == NULL) {

@@ -108,14 +121,34 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc
*sch)
return ret;
}

+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+static int
+prio_map_queue(struct sk_buff *skb, struct Qdisc *sch) {
+ /*
+ * We'll classify the skb here, so return the Qdisc back through
+ * the function call. The return value is the queue we need to
+ * lock, which will be the value returned through band. This
will help
+ * speed up enqueue, since it won't have to do the classify
again.
+ */
+ int ret;
+ int band;
+ struct prio_sched_data *q = qdisc_priv(sch);
+
+ sch = prio_classify(skb, sch, &ret, &band);
+ return q->band2queue[band];
+}
+#endif

static int
prio_requeue(struct sk_buff *skb, struct Qdisc* sch) {
struct Qdisc *qdisc;
int ret;
+ int band;
+
+ qdisc = prio_classify(skb, sch, &ret, &band);

- qdisc = prio_classify(skb, sch, &ret);
#ifdef CONFIG_NET_CLS_ACT
if (qdisc == NULL) {
if (ret == NET_XMIT_BYPASS)
@@ -141,18 +174,53 @@ prio_dequeue(struct Qdisc* sch)
struct sk_buff *skb;
struct prio_sched_data *q = qdisc_priv(sch);
int prio;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue;
+#endif
struct Qdisc *qdisc;

+ /*
+ * If we're multiqueue, the basic approach is try the lock on each
+ * queue.  If it's locked, either we're already dequeuing, or enqueue
+ * is doing something.  Go to the next band if we're locked.  Once we
+ * have a packet, unlock the queue.  NOTE: the underlying qdisc CANNOT
+ * be a PRIO qdisc, otherwise we will deadlock. FIXME
+ */
for (prio = 0; prio < q->bands; prio++) {
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+		if (netif_is_multiqueue(sch->dev)) {
+			queue = q->band2queue[prio];
+			if (spin_trylock(&sch->dev->egress_subqueue[queue].queue_lock)) {
+				qdisc = q->queues[prio];
+				skb = qdisc->dequeue(qdisc);
+				if (skb) {
+					sch->q.qlen--;
+					skb->priority = prio;
+					spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
+					return skb;
+				}
+				spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
+			}
+		} else {
+			qdisc = q->queues[prio];
+			skb = qdisc->dequeue(qdisc);
+			if (skb) {
+				sch->q.qlen--;
+				skb->priority = prio;
+				return skb;
+			}
+		}
+#else
qdisc = q->queues[prio];
skb = qdisc->dequeue(qdisc);
if (skb) {
sch->q.qlen--;
+ skb->priority = prio;
return skb;
}
+#endif
}
return NULL;
-
}

static unsigned int prio_drop(struct Qdisc* sch)
@@ -205,6 +273,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
struct prio_sched_data *q = qdisc_priv(sch);
struct tc_prio_qopt *qopt = RTA_DATA(opt);
int i;
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ int queue;
+ int qmapoffset;
+#endif

if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
return -EINVAL;
@@ -247,6 +319,25 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
}
}
}
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ /* setup queue to band mapping */
+	if (netif_is_multiqueue(sch->dev)) {
+		if (q->bands == sch->dev->egress_subqueue_count)
+			qmapoffset = 0;
+		else
+			qmapoffset = q->bands / sch->dev->egress_subqueue_count;
+
+		queue = 0;
+		for (i = 0; i < q->bands; i++) {
+			q->band2queue[i] = queue;
+			if ( (((i + 1) - queue) == qmapoffset) &&
+			    (queue < (sch->dev->egress_subqueue_count - 1))) {
+				queue++;
+			}
+		}
+	}
+
+#endif
return 0;
}

@@ -430,6 +521,9 @@ static struct Qdisc_ops prio_qdisc_ops = {
.destroy = prio_destroy,
.change = prio_tune,
.dump = prio_dump,
+#ifdef CONFIG_NET_MULTI_QUEUE_DEVICE
+ .map_queue = prio_map_queue,
+#endif
.owner = THIS_MODULE,
};




---
Auke Kok <[email protected]>

2007-02-23 09:15:27

by Sreenivasa Honnur

[permalink] [raw]
Subject: RE: [PATCH 1/2] NET: Multiple queue network device support

The function "map_queue" returns queue index 0; there is no support for
returning different queue indexes.

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Kok, Auke
Sent: Friday, February 09, 2007 5:40 AM
To: David Miller; Garzik, Jeff; [email protected];
[email protected]
Cc: Kok, Auke; Peter Waskiewicz Jr; Brandeburg, Jesse; Kok, Auke;
Ronciak, John
Subject: [PATCH 1/2] NET: Multiple queue network device support


From: Peter Waskiewicz Jr <[email protected]>

Added an API and associated supporting routines for multiqueue network
devices. This allows network devices supporting multiple TX queues to
configure each queue within the netdevice and manage each queue
independently. Changes to the PRIO Qdisc also allow a user to map
multiple flows to individual TX queues, taking advantage of each queue
on the device.

Signed-off-by: Peter Waskiewicz Jr <[email protected]>
Signed-off-by: Auke Kok <[email protected]>
---
Auke Kok <[email protected]>

2007-02-23 19:05:58

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: RE: [PATCH 1/2] NET: Multiple queue network device support

map_queue will always return 0 in the pfifo_fast Qdisc. pfifo_fast is
the default scheduler when a device is brought up, and I didn't want to
make any assumptions about a device's capabilities, so always using
queue 0 seemed the safest choice.

However, the prio qdisc does have a small problem in its queue-to-band
assignment logic. If you have the same number of bands as queues
(e.g. 4 prio bands and 4 Tx queues), then all bands end up assigned to
queue 0. I have a fix for that, which I plan to send once I finish a
documentation piece for this patch's repost; a quick illustration of
the problem is below.
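
As an illustration (a stand-alone user-space sketch, not code from the
patch; the offset-of-one "fix" is only an assumption about what a
repost might do), the band2queue setup loop collapses everything onto
queue 0 whenever qmapoffset ends up as 0:

#include <stdio.h>

/* Reproduction of the band2queue setup loop from prio_tune() above.
 * When bands == queue count the patch sets qmapoffset to 0, so the
 * "((i + 1) - queue) == qmapoffset" test never fires and the queue
 * counter never advances.  Passing an offset of at least 1 (the
 * hypothetical fix) spreads the bands across the queues.
 */
static void map_bands(int bands, int queues, int qmapoffset, int *band2queue)
{
	int i, queue = 0;

	for (i = 0; i < bands; i++) {
		band2queue[i] = queue;
		if ((((i + 1) - queue) == qmapoffset) && (queue < queues - 1))
			queue++;
	}
}

int main(void)
{
	int map[4], i;

	map_bands(4, 4, 0, map);	/* as posted: offset 0 */
	for (i = 0; i < 4; i++)
		printf("posted: band %d -> queue %d\n", i, map[i]);

	map_bands(4, 4, 1, map);	/* assumed fix: offset >= 1 */
	for (i = 0; i < 4; i++)
		printf("fixed:  band %d -> queue %d\n", i, map[i]);
	return 0;
}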

Thanks for the feedback,

PJ Waskiewicz

> -----Original Message-----
> From: Sreenivasa Honnur [mailto:[email protected]]
> Sent: Friday, February 23, 2007 1:01 AM
> To: Kok, Auke-jan H; David Miller; Garzik, Jeff;
> [email protected]; [email protected]
> Cc: Waskiewicz Jr, Peter P; Brandeburg, Jesse; Kok, Auke;
> Ronciak, John
> Subject: RE: [PATCH 1/2] NET: Multiple queue network device support
>
> The function "map_queue" returns queue index 0; there is no support
> for returning different queue indexes.

2007-02-23 19:23:14

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [PATCH 1/2] NET: Multiple queue network device support

On Fri, 23 Feb 2007 04:00:55 -0500
"Sreenivasa Honnur" <[email protected]> wrote:

> The function "map_queue" returns queue index 0; there is no support
> for returning different queue indexes.
>
> +config NET_MULTI_QUEUE_DEVICE
> +	bool "Multiple queue network device support (EXPERIMENTAL)"
> +	depends on NET_SCHED && EXPERIMENTAL
> +	help
> +	  Saying Y here will add support for network devices that have more
> +	  than one physical TX queue and/or RX queue.
> +
> +	  Multiple queue devices will require qdiscs that understand how to
> +	  queue to multiple targets.  The default packet scheduler will
> +	  default to the first queue in a device.  In other words, if you
> +	  need the ability to spread traffic across queues, your queueing
> +	  discipline needs to know how to do that.
> +
> +	  Note that saying Y here will give preferential treatment to
> +	  multiple queue devices in the network stack.  A slight drop in
> +	  single-queue device performance may be seen.
> +
> +	  Say N here unless you know your network device supports multiple
> +	  TX and/or RX queues.
> +

This should not be a user-visible configuration option. It should
either always be part of the kernel API, or be selected by the drivers
that need or want it.

--
Stephen Hemminger <[email protected]>

2007-02-23 19:23:24

by Kok, Auke

[permalink] [raw]
Subject: Re: [PATCH 1/2] NET: Multiple queue network device support

Stephen Hemminger wrote:
> This should not be a user-visible configuration option. It should
> either always be part of the kernel API, or be selected by the drivers
> that need or want it.

Perhaps when it's stable, yes, but right now it's definitely
experimental and, as the help text says, may add a slight overhead for
single-queue devices.


Cheers,

Auke

2007-02-27 01:03:20

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 1/2] NET: Multiple queue network device support

From: "Kok, Auke" <[email protected]>
Date: Thu, 08 Feb 2007 16:09:50 -0800

> From: Peter Waskiewicz Jr <[email protected]>
>
> Added an API and associated supporting routines for multiqueue network
> devices. This allows network devices supporting multiple TX queues to
> configure each queue within the netdevice and manage each queue
> independently. Changes to the PRIO Qdisc also allow a user to map
> multiple flows to individual TX queues, taking advantage of each queue
> on the device.
>
> Signed-off-by: Peter Waskiewicz Jr <[email protected]>
> Signed-off-by: Auke Kok <[email protected]>

Thanks for posting this.

I was wondering if it would be better to have the ->enqueue() perform
the mapping operation, store the queue index selected inside of the
skb, and just use the normal ->hard_start_xmit() to send the packet
out?

The only thing to take care of is the per-queue locking, but that
should be easily doable.

We might be able to do this even without making sk_buff any larger.
For example, I suppose skb->priority might be reducible to
a u16. After a quick audit I really cannot see any legitimate use
of anything larger than 16-bit values, but a more thorough audit
would be necessary to validate this.
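
A minimal sketch of what that could look like, purely for illustration
(the skb->queue_mapping field, the *_sketch function names, and the
e1000 ring lookup are assumptions, not part of the posted patches):

/* Qdisc side: classify once at enqueue time and remember the queue.
 * Error handling and qdisc bookkeeping are omitted.
 */
static int prio_enqueue_sketch(struct sk_buff *skb, struct Qdisc *sch)
{
	struct prio_sched_data *q = qdisc_priv(sch);
	int band, qerr;
	struct Qdisc *child = prio_classify(skb, sch, &qerr, &band);

	skb->queue_mapping = q->band2queue[band];	/* assumed new u16 field */
	return child->enqueue(skb, child);
}

/* Driver side: the normal ->hard_start_xmit() reads the index back
 * instead of taking a queue_index argument.
 */
static int e1000_xmit_frame_sketch(struct sk_buff *skb, struct net_device *dev)
{
	struct e1000_adapter *adapter = netdev_priv(dev);
	struct e1000_tx_ring *tx_ring = &adapter->tx_ring[skb->queue_mapping];

	/* ... post the descriptor on tx_ring as today ... */
	return NETDEV_TX_OK;
}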

2007-02-27 19:38:49

by Waskiewicz Jr, Peter P

[permalink] [raw]
Subject: RE: [PATCH 1/2] NET: Multiple queue network device support

Dave,
	Thanks for the feedback. Please see below for my comments and
questions.

PJ Waskiewicz
Intel Corporation

> Thanks for posting this.
>
> I was wondering if it would be better to have the ->enqueue()
> perform the mapping operation, store the queue index selected
> inside of the skb, and just use the normal
> ->hard_start_xmit() to send the packet out?

One of the reasons I chose to add more functions is the queue
management itself. Specifically, I needed the ability to lock queues,
stop queues (netif_stop_queue), and wake them up independently of the
global queue_lock in the netdevice (see the sketch below). At first I
did think about storing the queue index in the skb, but that won't help
with the queue locking, since those paths are passed only the
netdevice, outside the context of an skb. As for adding the separate
hard_start_subqueue_xmit(), I did that so non-MQ devices keep a clean,
untouched path through dev_queue_xmit() and only MQ devices take the
new codepath. It also removes the need for the device driver to inspect
a queue index in the skb to decide which Tx ring the packet belongs on,
which is good for performance.
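
For illustration only, here is roughly how a driver would use the
per-queue helpers this patch adds; the ring structure, the my_priv()
accessor and the watermark names are made up for the sketch:

/* Stop and wake a single hardware queue without touching the rest of
 * the device.  Only netif_stop_subqueue()/netif_wake_subqueue()/
 * netif_subqueue_stopped() come from the patch; everything else here
 * is a placeholder.
 */
static int my_subqueue_xmit(struct sk_buff *skb, struct net_device *dev,
			    int queue_index)
{
	struct my_tx_ring *ring = &my_priv(dev)->tx_ring[queue_index];

	if (ring->free_descs < MY_TX_LOW_WATER) {
		netif_stop_subqueue(dev, queue_index);
		return NETDEV_TX_BUSY;
	}
	/* ... post skb on this ring ... */
	return NETDEV_TX_OK;
}

static void my_tx_clean(struct net_device *dev, int queue_index)
{
	struct my_tx_ring *ring = &my_priv(dev)->tx_ring[queue_index];

	/* ... reclaim completed descriptors ... */
	if (netif_subqueue_stopped(dev, queue_index) &&
	    ring->free_descs > MY_TX_HIGH_WATER)
		netif_wake_subqueue(dev, queue_index);
}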

As for having ->enqueue() take care of the mapping, I thought about
that for a while. Originally I had ->enqueue() doing exactly that,
until I realized I was not locking the individual subqueues correctly.
The thought process was: once I receive an skb for Tx, I need to lock a
queue, enqueue the skb, call qdisc_run(), then unlock. That grew into:
lock a queue, enqueue, unlock the queue, have dequeue decide which
queue to pull an skb from (since the queue an skb comes from can be
different from the one just enqueued to), lock it, dequeue, unlock, and
return the skb. The whole issue came down to needing to know which
queue to lock before enqueueing; if I just called ->enqueue(), I would
be going in unlocked. I suppose this would be OK if the lock is taken
inside ->enqueue() and released before returning; let me look at that
possibility, sketched below.
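
A rough sketch of the two orderings being compared, not working code;
classify_to_queue() is an assumed helper standing in for the band/queue
lookup:

/* As posted: the caller asks ->map_queue() which subqueue lock to
 * take, then enqueues under that lock.
 *
 *	queue = q->ops->map_queue(skb, q);
 *	spin_lock(&dev->egress_subqueue[queue].queue_lock);
 *	rc = q->enqueue(skb, q);
 *	spin_unlock(&dev->egress_subqueue[queue].queue_lock);
 *	qdisc_run(dev);
 *
 * Alternative under consideration: the qdisc takes and releases the
 * subqueue lock itself, so the caller never needs the index up front.
 */
static int enqueue_with_internal_lock(struct sk_buff *skb, struct Qdisc *sch)
{
	int queue = classify_to_queue(skb, sch);	/* assumed helper */

	spin_lock(&sch->dev->egress_subqueue[queue].queue_lock);
	/* ... add skb to the band that backs this queue ... */
	spin_unlock(&sch->dev->egress_subqueue[queue].queue_lock);
	return NET_XMIT_SUCCESS;
}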

>
> The only thing to take care of is the per-queue locking, but
> that should be easily doable.
>
> We might be able to do this even without making sk_buff any larger.
> For example, I suppose skb->priority might be reducible
> to a u16. After a quick audit I really cannot see any
> legitimate use of anything larger than 16-bit values, but a
> more thorough audit would be necessary to validate this.

The skb->priority field is currently used to determine the PRIO /
pfifo_fast band to place the skb in on enqueue. I still need to think
about whether storing the queue index in the skb would help, given the
queue management functions that run outside of the skb context.

Again, thanks for the feedback. I'll respond once I've looked at moving
the locks into ->enqueue() and removing ->map_queue(). Please note that
the PRIO band-to-queue mapping in this patch has a problem when the
number of bands is <= the number of Tx queues on the NIC. I have a fix,
and will include it in the repost.