2008-11-21 19:33:07

by Zhao, Yu

Subject: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Greetings,

The following patches add support for the SR-IOV capability to the
Linux kernel. With these patches, a PCI device that has the capability
can be turned into multiple devices from the software perspective, which
will benefit KVM and serve other purposes such as QoS and security.

The Physical Function and Virtual Function drivers using the SR-IOV
APIs will come soon!

Major changes from v6 to v7:
1, remove boot-time resource rebalancing support. (Greg KH)
2, emit a uevent when the PF driver is loaded. (Greg KH)
3, put the SR-IOV callback function into 'pci_driver'. (Matthew Wilcox)
4, register the SR-IOV service at the PF loading stage.
5, remove unnecessary APIs (pci_iov_enable/disable).

---

[PATCH 1/13 v7] PCI: enhance pci_ari_enabled()
[PATCH 2/13 v7] PCI: remove unnecessary arg of pci_update_resource()
[PATCH 3/13 v7] PCI: define PCI resource names in an 'enum'
[PATCH 4/13 v7] PCI: remove unnecessary condition check in pci_restore_bars()
[PATCH 5/13 v7] PCI: export __pci_read_base()
[PATCH 6/13 v7] PCI: make pci_alloc_child_bus() be able to handle NULL bridge
[PATCH 7/13 v7] PCI: add a new function to map BAR offset
[PATCH 8/13 v7] PCI: cleanup pci_bus_add_devices()
[PATCH 9/13 v7] PCI: split a new function from pci_bus_add_devices()
[PATCH 10/13 v7] PCI: support the SR-IOV capability
[PATCH 11/13 v7] PCI: reserve bus range for SR-IOV device
[PATCH 12/13 v7] PCI: document the SR-IOV sysfs entries
[PATCH 13/13 v7] PCI: document for SR-IOV user and developer

Cc: Alex Chiang <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Cc: Simon Horman <[email protected]>
Cc: Yinghai Lu <[email protected]>

---

The Single Root I/O Virtualization (SR-IOV) capability defined by PCI-SIG
is intended to let multiple system software instances share PCI hardware
resources. A PCI device that supports this capability can be extended
to one Physical Function plus multiple Virtual Functions. The Physical
Function, which can be considered the "real" PCI device, reflects
the hardware instance and manages all physical resources. Virtual
Functions are associated with a Physical Function and share physical
resources with it. Software can control the allocation of
Virtual Functions via registers encapsulated in the capability structure.

The SR-IOV specification can be found at
http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf

Devices that support SR-IOV are available from the following vendors:
http://download.intel.com/design/network/ProdBrf/320025.pdf
http://www.netxen.com/products/chipsolutions/NX3031.html
http://www.neterion.com/products/x3100.html
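
As a rough illustration of how software locates Virtual Functions through
the capability registers, the sketch below mirrors the arithmetic of
virtfn_bdf() in patch 10: the VF's routing ID is the PF's routing ID plus
the First VF Offset plus the VF Stride times the VF index. It is a
standalone user-space sketch, not kernel code; 'struct pf_info' and
vf_routing_id() are hypothetical names used only here.

```c
#include <stdint.h>

/* Hypothetical stand-in for the PF fields that virtfn_bdf() reads. */
struct pf_info {
	uint8_t  bus;     /* bus number of the Physical Function */
	uint8_t  devfn;   /* device/function number of the PF */
	uint16_t offset;  /* First VF Offset from the SR-IOV capability */
	uint16_t stride;  /* VF Stride from the SR-IOV capability */
};

/* Compute the bus and devfn of VF number 'id' (0-based). */
static void vf_routing_id(const struct pf_info *pf, int id,
			  uint8_t *busnr, uint8_t *devfn)
{
	uint16_t bdf = (uint16_t)((pf->bus << 8) + pf->devfn +
				  pf->offset + pf->stride * id);

	*busnr = bdf >> 8;    /* upper 8 bits: bus number */
	*devfn = bdf & 0xff;  /* lower 8 bits: device/function */
}
```

With a PF at 01:00.0, an offset of 0x80 and a stride of 2 (made-up
values), VF 0 sits at devfn 0x80 on bus 1, while VF 64 overflows onto
bus 2. That overflow is the case where a device occupies more than one
bus number.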


2008-11-21 19:35:25

by Zhao, Yu

Subject: [PATCH 1/13 v7] PCI: enhance pci_ari_enabled()

Change the parameter of pci_ari_enabled() from 'pci_dev' to 'pci_bus'.

ARI forwarding on the bridge mostly concerns the subordinate devices
rather than the bridge itself. So this change will make the function
easier to use.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.h | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9de87e9..1449884 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -162,13 +162,13 @@ struct pci_slot_attribute {
extern void pci_enable_ari(struct pci_dev *dev);
/**
* pci_ari_enabled - query ARI forwarding status
- * @dev: the PCI device
+ * @bus: the PCI bus
*
* Returns 1 if ARI forwarding is enabled, or 0 if not enabled;
*/
-static inline int pci_ari_enabled(struct pci_dev *dev)
+static inline int pci_ari_enabled(struct pci_bus *bus)
{
- return dev->ari_enabled;
+ return bus->self && bus->self->ari_enabled;
}

#endif /* DRIVERS_PCI_H */
--
1.5.6.4
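
To illustrate the effect of the new signature, here is a minimal
user-space sketch with cut-down, hypothetical versions of 'struct
pci_dev' and 'struct pci_bus' (the real kernel structures have many more
fields). It assumes, as in the kernel, that bus->self is NULL for a bus
that has no bridge device, such as a root bus.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-ins for the kernel structures. */
struct pci_dev {
	unsigned int ari_enabled:1;  /* ARI forwarding enabled on this bridge */
};

struct pci_bus {
	struct pci_dev *self;        /* bridge leading to this bus, or NULL */
};

/* Same body as the patched helper: a bus without a bridge reports 0. */
static inline int pci_ari_enabled(struct pci_bus *bus)
{
	return bus->self && bus->self->ari_enabled;
}
```

A caller holding a device can then ask about the bus it sits on, for
example pci_ari_enabled(dev->bus), without having to look up the
upstream bridge or check it for NULL first.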

2008-11-21 19:36:11

by Zhao, Yu

Subject: [PATCH 2/13 v7] PCI: remove unnecessary arg of pci_update_resource()

This cleanup removes the unnecessary argument 'struct resource *res' from
pci_update_resource(), so it takes the same arguments as its companion
functions (pci_assign_resource(), etc.).

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.c | 4 ++--
drivers/pci/setup-res.c | 7 ++++---
include/linux/pci.h | 2 +-
3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 21f2ac6..c408be8 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -377,8 +377,8 @@ pci_restore_bars(struct pci_dev *dev)
return;
}

- for (i = 0; i < numres; i ++)
- pci_update_resource(dev, &dev->resource[i], i);
+ for (i = 0; i < numres; i++)
+ pci_update_resource(dev, i);
}

static struct pci_platform_pm_ops *pci_platform_pm;
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index 2dbd96c..b7ca679 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -26,11 +26,12 @@
#include "pci.h"


-void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno)
+void pci_update_resource(struct pci_dev *dev, int resno)
{
struct pci_bus_region region;
u32 new, check, mask;
int reg;
+ struct resource *res = dev->resource + resno;

/*
* Ignore resources for unimplemented BARs and unused resource slots
@@ -162,7 +163,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno)
} else {
res->flags &= ~IORESOURCE_STARTALIGN;
if (resno < PCI_BRIDGE_RESOURCES)
- pci_update_resource(dev, res, resno);
+ pci_update_resource(dev, resno);
}

return ret;
@@ -197,7 +198,7 @@ int pci_assign_resource_fixed(struct pci_dev *dev, int resno)
dev_err(&dev->dev, "BAR %d: can't allocate %s resource %pR\n",
resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res);
} else if (resno < PCI_BRIDGE_RESOURCES) {
- pci_update_resource(dev, res, resno);
+ pci_update_resource(dev, resno);
}

return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index feb4657..7e7ff03 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -642,7 +642,7 @@ int pcie_get_readrq(struct pci_dev *dev);
int pcie_set_readrq(struct pci_dev *dev, int rq);
int pci_reset_function(struct pci_dev *dev);
int pci_execute_reset_function(struct pci_dev *dev);
-void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno);
+void pci_update_resource(struct pci_dev *dev, int resno);
int __must_check pci_assign_resource(struct pci_dev *dev, int i);
int pci_select_bars(struct pci_dev *dev, unsigned long flags);

--
1.5.6.4

2008-11-21 19:36:37

by Zhao, Yu

Subject: [PATCH 3/13 v7] PCI: define PCI resource names in an 'enum'

This patch moves all definitions of the PCI resource names into an 'enum',
and also replaces some hard-coded resource numbers with symbolic
names. This change eases the introduction of device-specific resources.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci-sysfs.c | 4 +++-
drivers/pci/probe.c | 2 +-
drivers/pci/proc.c | 7 ++++---
include/linux/pci.h | 37 ++++++++++++++++++++++++-------------
4 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 5d72866..0d74851 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -101,11 +101,13 @@ resource_show(struct device * dev, struct device_attribute *attr, char * buf)
struct pci_dev * pci_dev = to_pci_dev(dev);
char * str = buf;
int i;
- int max = 7;
+ int max;
resource_size_t start, end;

if (pci_dev->subordinate)
max = DEVICE_COUNT_RESOURCE;
+ else
+ max = PCI_BRIDGE_RESOURCES;

for (i = 0; i < max; i++) {
struct resource *res = &pci_dev->resource[i];
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 003a9b3..4c5429f 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -423,7 +423,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
child->subordinate = 0xff;

/* Set up default resource pointers and names.. */
- for (i = 0; i < 4; i++) {
+ for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i];
child->resource[i]->name = child->name;
}
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index e1098c3..f6f2a59 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -352,15 +352,16 @@ static int show_device(struct seq_file *m, void *v)
dev->vendor,
dev->device,
dev->irq);
- /* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */
- for (i=0; i<7; i++) {
+
+ /* only print standard and ROM resources to preserve compatibility */
+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
resource_size_t start, end;
pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, "\t%16llx",
(unsigned long long)(start |
(dev->resource[i].flags & PCI_REGION_FLAG_MASK)));
}
- for (i=0; i<7; i++) {
+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
resource_size_t start, end;
pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, "\t%16llx",
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7e7ff03..d455ec8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -82,7 +82,30 @@ enum pci_mmap_state {
#define PCI_DMA_FROMDEVICE 2
#define PCI_DMA_NONE 3

-#define DEVICE_COUNT_RESOURCE 12
+/*
+ * For PCI devices, the region numbers are assigned this way:
+ */
+enum {
+ /* #0-5: standard PCI resources */
+ PCI_STD_RESOURCES,
+ PCI_STD_RESOURCE_END = 5,
+
+ /* #6: expansion ROM resource */
+ PCI_ROM_RESOURCE,
+
+ /* resources assigned to buses behind the bridge */
+#define PCI_BRIDGE_RESOURCE_NUM 4
+
+ PCI_BRIDGE_RESOURCES,
+ PCI_BRIDGE_RESOURCE_END = PCI_BRIDGE_RESOURCES +
+ PCI_BRIDGE_RESOURCE_NUM - 1,
+
+ /* total resources associated with a PCI device */
+ PCI_NUM_RESOURCES,
+
+ /* preserve this for compatibility */
+ DEVICE_COUNT_RESOURCE
+};

typedef int __bitwise pci_power_t;

@@ -268,18 +291,6 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev,
hlist_add_head(&new_cap->next, &pci_dev->saved_cap_space);
}

-/*
- * For PCI devices, the region numbers are assigned this way:
- *
- * 0-5 standard PCI regions
- * 6 expansion ROM
- * 7-10 bridges: address space assigned to buses behind the bridge
- */
-
-#define PCI_ROM_RESOURCE 6
-#define PCI_BRIDGE_RESOURCES 7
-#define PCI_NUM_RESOURCES 11
-
#ifndef PCI_BUS_NUM_RESOURCES
#define PCI_BUS_NUM_RESOURCES 16
#endif
--
1.5.6.4
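
For readers checking that the enum keeps the old numbering, the
standalone sketch below copies the enum from this patch and verifies at
compile time that the implicit values match the '#define's it replaces
(6, 7, 11 and 12). Only the _Static_assert lines are additions; they
assume a C11-capable compiler.

```c
/* Copy of the enum introduced by this patch. */
enum {
	/* #0-5: standard PCI resources */
	PCI_STD_RESOURCES,
	PCI_STD_RESOURCE_END = 5,

	/* #6: expansion ROM resource */
	PCI_ROM_RESOURCE,

	/* resources assigned to buses behind the bridge */
#define PCI_BRIDGE_RESOURCE_NUM 4

	PCI_BRIDGE_RESOURCES,
	PCI_BRIDGE_RESOURCE_END = PCI_BRIDGE_RESOURCES +
				  PCI_BRIDGE_RESOURCE_NUM - 1,

	/* total resources associated with a PCI device */
	PCI_NUM_RESOURCES,

	/* preserve this for compatibility */
	DEVICE_COUNT_RESOURCE
};

/* Added for illustration: the values must equal the removed #defines. */
_Static_assert(PCI_ROM_RESOURCE == 6, "ROM resource moved");
_Static_assert(PCI_BRIDGE_RESOURCES == 7, "bridge resources moved");
_Static_assert(PCI_NUM_RESOURCES == 11, "resource count changed");
_Static_assert(DEVICE_COUNT_RESOURCE == 12, "compat count changed");
```

Because every later enumerator is derived from the one before it, a new
block of device-specific resources can be inserted between
PCI_ROM_RESOURCE and PCI_BRIDGE_RESOURCES and the later values shift
automatically.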

2008-11-21 19:37:04

by Zhao, Yu

Subject: [PATCH 4/13 v7] PCI: remove unnecessary condition check in pci_restore_bars()

Remove the unnecessary check on the number of resources, because
pci_update_resource() checks the availability of each resource itself.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.c | 19 ++-----------------
1 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index c408be8..9d3f793 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -360,24 +360,9 @@ pci_find_parent_resource(const struct pci_dev *dev, struct resource *res)
static void
pci_restore_bars(struct pci_dev *dev)
{
- int i, numres;
-
- switch (dev->hdr_type) {
- case PCI_HEADER_TYPE_NORMAL:
- numres = 6;
- break;
- case PCI_HEADER_TYPE_BRIDGE:
- numres = 2;
- break;
- case PCI_HEADER_TYPE_CARDBUS:
- numres = 1;
- break;
- default:
- /* Should never get here, but just in case... */
- return;
- }
+ int i;

- for (i = 0; i < numres; i++)
+ for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
pci_update_resource(dev, i);
}

--
1.5.6.4

2008-11-21 19:37:40

by Zhao, Yu

Subject: [PATCH 5/13 v7] PCI: export __pci_read_base()

Export __pci_read_base() so it can be used by the whole PCI subsystem.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.h | 9 +++++++++
drivers/pci/probe.c | 20 +++++++++-----------
2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1449884..fd0d087 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -159,6 +159,15 @@ struct pci_slot_attribute {
};
#define to_pci_slot_attr(s) container_of(s, struct pci_slot_attribute, attr)

+enum pci_bar_type {
+ pci_bar_unknown, /* Standard PCI BAR probe */
+ pci_bar_io, /* An io port BAR */
+ pci_bar_mem32, /* A 32-bit memory BAR */
+ pci_bar_mem64, /* A 64-bit memory BAR */
+};
+
+extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
+ struct resource *res, unsigned int reg);
extern void pci_enable_ari(struct pci_dev *dev);
/**
* pci_ari_enabled - query ARI forwarding status
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4c5429f..ae5c7fe 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -135,13 +135,6 @@ static u64 pci_size(u64 base, u64 maxbase, u64 mask)
return size;
}

-enum pci_bar_type {
- pci_bar_unknown, /* Standard PCI BAR probe */
- pci_bar_io, /* An io port BAR */
- pci_bar_mem32, /* A 32-bit memory BAR */
- pci_bar_mem64, /* A 64-bit memory BAR */
-};
-
static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar)
{
if ((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
@@ -156,11 +149,16 @@ static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar)
return pci_bar_mem32;
}

-/*
- * If the type is not unknown, we assume that the lowest bit is 'enable'.
- * Returns 1 if the BAR was 64-bit and 0 if it was 32-bit.
+/**
+ * __pci_read_base - read a PCI BAR
+ * @dev: the PCI device
+ * @type: type of the BAR
+ * @res: resource buffer to be filled in
+ * @pos: BAR position in the config space
+ *
+ * Returns 1 if the BAR is 64-bit, or 0 if 32-bit.
*/
-static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
+int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int pos)
{
u32 l, sz, mask;
--
1.5.6.4

2008-11-21 19:38:15

by Zhao, Yu

Subject: [PATCH 6/13 v7] PCI: make pci_alloc_child_bus() be able to handle NULL bridge

Make pci_alloc_child_bus() able to allocate buses without bridge
devices. Some SR-IOV devices can occupy more than one bus number,
but there are no explicit bridges because they have an internal routing
mechanism.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/probe.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ae5c7fe..cd205fd 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -398,12 +398,10 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
if (!child)
return NULL;

- child->self = bridge;
child->parent = parent;
child->ops = parent->ops;
child->sysdata = parent->sysdata;
child->bus_flags = parent->bus_flags;
- child->bridge = get_device(&bridge->dev);

/* initialize some portions of the bus device, but don't register it
* now as the parent is not properly set up yet. This device will get
@@ -420,6 +418,12 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
child->primary = parent->secondary;
child->subordinate = 0xff;

+ if (!bridge)
+ return child;
+
+ child->self = bridge;
+ child->bridge = get_device(&bridge->dev);
+
/* Set up default resource pointers and names.. */
for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) {
child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i];
--
1.5.6.4

2008-11-21 19:38:34

by Zhao, Yu

Subject: [PATCH 7/13 v7] PCI: add a new function to map BAR offset

Add a function that maps a resource number to its corresponding register,
so callers can get the offset and type of device-specific BARs.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.c | 22 ++++++++++++++++++++++
drivers/pci/pci.h | 2 ++
drivers/pci/setup-res.c | 13 +++++--------
3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9d3f793..9382b5f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2007,6 +2007,28 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags)
return bars;
}

+/**
+ * pci_resource_bar - get position of the BAR associated with a resource
+ * @dev: the PCI device
+ * @resno: the resource number
+ * @type: the BAR type to be filled in
+ *
+ * Returns BAR position in config space, or 0 if the BAR is invalid.
+ */
+int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type)
+{
+ if (resno < PCI_ROM_RESOURCE) {
+ *type = pci_bar_unknown;
+ return PCI_BASE_ADDRESS_0 + 4 * resno;
+ } else if (resno == PCI_ROM_RESOURCE) {
+ *type = pci_bar_mem32;
+ return dev->rom_base_reg;
+ }
+
+ dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno);
+ return 0;
+}
+
static void __devinit pci_no_domains(void)
{
#ifdef CONFIG_PCI_DOMAINS
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fd0d087..3de70d7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -168,6 +168,8 @@ enum pci_bar_type {

extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int reg);
+extern int pci_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type);
extern void pci_enable_ari(struct pci_dev *dev);
/**
* pci_ari_enabled - query ARI forwarding status
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index b7ca679..854d43e 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -31,6 +31,7 @@ void pci_update_resource(struct pci_dev *dev, int resno)
struct pci_bus_region region;
u32 new, check, mask;
int reg;
+ enum pci_bar_type type;
struct resource *res = dev->resource + resno;

/*
@@ -62,17 +63,13 @@ void pci_update_resource(struct pci_dev *dev, int resno)
else
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;

- if (resno < 6) {
- reg = PCI_BASE_ADDRESS_0 + 4 * resno;
- } else if (resno == PCI_ROM_RESOURCE) {
+ reg = pci_resource_bar(dev, resno, &type);
+ if (!reg)
+ return;
+ if (type != pci_bar_unknown) {
if (!(res->flags & IORESOURCE_ROM_ENABLE))
return;
new |= PCI_ROM_ADDRESS_ENABLE;
- reg = dev->rom_base_reg;
- } else {
- /* Hmm, non-standard resource. */
-
- return; /* kill uninitialised var warning */
}

pci_write_config_dword(dev, reg, new);
--
1.5.6.4
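
The mapping done by pci_resource_bar() can be tried outside the kernel
with the sketch below. It keeps the logic from this patch but uses a
cut-down, hypothetical 'struct pci_dev' and prints to stderr in place of
dev_err(). The register offsets 0x10 and 0x30 are the standard
config-space positions of the first BAR and of the expansion ROM BAR in
a type 0 header.

```c
#include <stdio.h>

#define PCI_BASE_ADDRESS_0 0x10  /* first BAR in the config header */
#define PCI_ROM_ADDRESS    0x30  /* expansion ROM BAR, type 0 header */
#define PCI_ROM_RESOURCE   6     /* resource number of the ROM */

enum pci_bar_type {
	pci_bar_unknown,  /* Standard PCI BAR probe */
	pci_bar_io,       /* An io port BAR */
	pci_bar_mem32,    /* A 32-bit memory BAR */
	pci_bar_mem64,    /* A 64-bit memory BAR */
};

/* Hypothetical, cut-down device: only the field used here. */
struct pci_dev {
	int rom_base_reg;  /* config offset of the ROM BAR */
};

/* Returns the BAR position in config space, or 0 if resno is invalid. */
static int pci_resource_bar(struct pci_dev *dev, int resno,
			    enum pci_bar_type *type)
{
	if (resno < PCI_ROM_RESOURCE) {
		*type = pci_bar_unknown;
		return PCI_BASE_ADDRESS_0 + 4 * resno;
	} else if (resno == PCI_ROM_RESOURCE) {
		*type = pci_bar_mem32;
		return dev->rom_base_reg;
	}

	fprintf(stderr, "BAR: invalid resource #%d\n", resno);
	return 0;
}
```

Note that after this patch pci_update_resource() treats any type other
than pci_bar_unknown as the ROM case and requires IORESOURCE_ROM_ENABLE
before writing the register, so a later caller that returns another
type for a new resource would need to revisit that check.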

2008-11-21 19:38:49

by Zhao, Yu

Subject: [PATCH 8/13 v7] PCI: cleanup pci_bus_add_devices()

This cleanup makes pci_bus_add_devices() easier to read.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/bus.c | 55 +++++++++++++++++++++++++++--------------------------
1 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 999cc40..9d800cb 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -71,7 +71,7 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
}

/**
- * add a single device
+ * pci_bus_add_device - add a single device
* @dev: device to add
*
* This adds a single pci device to the global
@@ -105,7 +105,7 @@ int pci_bus_add_device(struct pci_dev *dev)
void pci_bus_add_devices(struct pci_bus *bus)
{
struct pci_dev *dev;
- struct pci_bus *child_bus;
+ struct pci_bus *child;
int retval;

list_for_each_entry(dev, &bus->devices, bus_list) {
@@ -120,39 +120,40 @@ void pci_bus_add_devices(struct pci_bus *bus)
list_for_each_entry(dev, &bus->devices, bus_list) {
BUG_ON(!dev->is_added);

+ child = dev->subordinate;
/*
* If there is an unattached subordinate bus, attach
* it and then scan for unattached PCI devices.
*/
- if (dev->subordinate) {
- if (list_empty(&dev->subordinate->node)) {
- down_write(&pci_bus_sem);
- list_add_tail(&dev->subordinate->node,
- &dev->bus->children);
- up_write(&pci_bus_sem);
- }
- pci_bus_add_devices(dev->subordinate);
-
- /* register the bus with sysfs as the parent is now
- * properly registered. */
- child_bus = dev->subordinate;
- if (child_bus->is_added)
- continue;
- child_bus->dev.parent = child_bus->bridge;
- retval = device_register(&child_bus->dev);
- if (retval)
- dev_err(&dev->dev, "Error registering pci_bus,"
- " continuing...\n");
- else {
- child_bus->is_added = 1;
- retval = device_create_file(&child_bus->dev,
- &dev_attr_cpuaffinity);
- }
+ if (!child)
+ continue;
+ if (list_empty(&child->node)) {
+ down_write(&pci_bus_sem);
+ list_add_tail(&child->node, &dev->bus->children);
+ up_write(&pci_bus_sem);
+ }
+ pci_bus_add_devices(child);
+
+ /*
+ * register the bus with sysfs as the parent is now
+ * properly registered.
+ */
+ if (child->is_added)
+ continue;
+ child->dev.parent = child->bridge;
+ retval = device_register(&child->dev);
+ if (retval)
+ dev_err(&dev->dev, "Error registering pci_bus,"
+ " continuing...\n");
+ else {
+ child->is_added = 1;
+ retval = device_create_file(&child->dev,
+ &dev_attr_cpuaffinity);
if (retval)
dev_err(&dev->dev, "Error creating cpuaffinity"
" file, continuing...\n");

- retval = device_create_file(&child_bus->dev,
+ retval = device_create_file(&child->dev,
&dev_attr_cpulistaffinity);
if (retval)
dev_err(&dev->dev,
--
1.5.6.4

2008-11-21 19:39:34

by Zhao, Yu

Subject: [PATCH 9/13 v7] PCI: split a new function from pci_bus_add_devices()

This patch splits a new function, pci_bus_add_child(), out of
pci_bus_add_devices(). The new function registers a PCI bus with the
device core.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/bus.c | 49 ++++++++++++++++++++++++++++++-------------------
drivers/pci/pci.h | 1 +
2 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 9d800cb..65f5a6f 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -91,6 +91,34 @@ int pci_bus_add_device(struct pci_dev *dev)
}

/**
+ * pci_bus_add_child - add a child bus
+ * @bus: bus to add
+ *
+ * This adds sysfs entries for a single bus
+ */
+int pci_bus_add_child(struct pci_bus *bus)
+{
+ int retval;
+
+ if (bus->bridge)
+ bus->dev.parent = bus->bridge;
+
+ retval = device_register(&bus->dev);
+ if (retval)
+ return retval;
+
+ bus->is_added = 1;
+
+ retval = device_create_file(&bus->dev, &dev_attr_cpuaffinity);
+ if (retval)
+ return retval;
+
+ retval = device_create_file(&bus->dev, &dev_attr_cpulistaffinity);
+
+ return retval;
+}
+
+/**
* pci_bus_add_devices - insert newly discovered PCI devices
* @bus: bus to check for new devices
*
@@ -140,26 +168,9 @@ void pci_bus_add_devices(struct pci_bus *bus)
*/
if (child->is_added)
continue;
- child->dev.parent = child->bridge;
- retval = device_register(&child->dev);
+ retval = pci_bus_add_child(child);
if (retval)
- dev_err(&dev->dev, "Error registering pci_bus,"
- " continuing...\n");
- else {
- child->is_added = 1;
- retval = device_create_file(&child->dev,
- &dev_attr_cpuaffinity);
- if (retval)
- dev_err(&dev->dev, "Error creating cpuaffinity"
- " file, continuing...\n");
-
- retval = device_create_file(&child->dev,
- &dev_attr_cpulistaffinity);
- if (retval)
- dev_err(&dev->dev,
- "Error creating cpulistaffinity"
- " file, continuing...\n");
- }
+ dev_err(&dev->dev, "Error adding bus, continuing\n");
}
}

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3de70d7..315bbe6 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -170,6 +170,7 @@ extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int reg);
extern int pci_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
+extern int pci_bus_add_child(struct pci_bus *bus);
extern void pci_enable_ari(struct pci_dev *dev);
/**
* pci_ari_enabled - query ARI forwarding status
--
1.5.6.4

2008-11-21 19:40:12

by Zhao, Yu

Subject: [PATCH 10/13 v7] PCI: support the SR-IOV capability

Support Single Root I/O Virtualization (SR-IOV) capability.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/Kconfig | 13 ++
drivers/pci/Makefile | 3 +
drivers/pci/iov.c | 491 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-driver.c | 12 +-
drivers/pci/pci.c | 8 +
drivers/pci/pci.h | 51 +++++
drivers/pci/probe.c | 4 +
include/linux/pci.h | 9 +
include/linux/pci_regs.h | 21 ++
9 files changed, 610 insertions(+), 2 deletions(-)
create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index e1ca425..493233e 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -50,3 +50,16 @@ config HT_IRQ
This allows native hypertransport devices to use interrupts.

If unsure say Y.
+
+config PCI_IOV
+ bool "PCI IOV support"
+ depends on PCI
+ select PCI_MSI
+ default n
+ help
+ PCI-SIG I/O Virtualization (IOV) Specifications support.
+ Single Root IOV: allows the Physical Function device driver
+ to enable the hardware capability, so the Virtual Function
+ is accessible via the PCI configuration space using its own
+ Bus, Device and Function Number. Each Virtual Function also
+ has PCI Memory Space to map its own register set.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index af3bfe2..8c7c12d 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,9 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o

obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o

+# PCI IOV support
+obj-$(CONFIG_PCI_IOV) += iov.o
+
#
# Some architectures use the generic PCI setup functions
#
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 0000000..03f62ca
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,491 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2008 Intel Corporation
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ * Single Root IOV 1.0
+ */
+
+#include <linux/ctype.h>
+#include <linux/string.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <asm/page.h>
+#include "pci.h"
+
+
+#define pci_iov_attr(field) \
+static ssize_t iov_##field##_show(struct device *dev, \
+ struct device_attribute *attr, char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sprintf(buf, "%d\n", pdev->iov->field); \
+}
+
+pci_iov_attr(total);
+pci_iov_attr(initial);
+pci_iov_attr(nr_virtfn);
+
+static inline void virtfn_bdf(struct pci_dev *dev, int id, u8 *busnr, u8 *devfn)
+{
+ u16 bdf;
+
+ bdf = (dev->bus->number << 8) + dev->devfn +
+ dev->iov->offset + dev->iov->stride * id;
+ *busnr = bdf >> 8;
+ *devfn = bdf & 0xff;
+}
+
+static int virtfn_add(struct pci_dev *dev, int id)
+{
+ int i;
+ int rc;
+ u8 busnr, devfn;
+ struct pci_dev *virtfn;
+ struct resource *res;
+ resource_size_t size;
+
+ virtfn_bdf(dev, id, &busnr, &devfn);
+
+ virtfn = alloc_pci_dev();
+ if (!virtfn)
+ return -ENOMEM;
+
+ virtfn->bus = pci_find_bus(pci_domain_nr(dev->bus), busnr);
+ BUG_ON(!virtfn->bus);
+ virtfn->sysdata = dev->bus->sysdata;
+ virtfn->dev.parent = dev->dev.parent;
+ virtfn->dev.bus = dev->dev.bus;
+ virtfn->devfn = devfn;
+ virtfn->hdr_type = PCI_HEADER_TYPE_NORMAL;
+ virtfn->multifunction = 0;
+ virtfn->vendor = dev->vendor;
+ pci_read_config_word(dev, dev->iov->cap + PCI_IOV_VF_DID,
+ &virtfn->device);
+ virtfn->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
+ virtfn->error_state = pci_channel_io_normal;
+ virtfn->is_pcie = 1;
+ virtfn->pcie_type = PCI_EXP_TYPE_ENDPOINT;
+ virtfn->dma_mask = 0xffffffff;
+
+ dev_set_name(&virtfn->dev, "%04x:%02x:%02x.%d",
+ pci_domain_nr(virtfn->bus), busnr,
+ PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+ pci_read_config_byte(virtfn, PCI_REVISION_ID, &virtfn->revision);
+ virtfn->class = dev->class;
+ virtfn->current_state = PCI_UNKNOWN;
+ virtfn->irq = 0;
+
+ for (i = 0; i < PCI_IOV_NUM_BAR; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ if (!res->parent)
+ continue;
+ virtfn->resource[i].name = pci_name(virtfn);
+ virtfn->resource[i].flags = res->flags;
+ size = resource_size(res);
+ do_div(size, dev->iov->total);
+ virtfn->resource[i].start = res->start + size * id;
+ virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
+ rc = request_resource(res, &virtfn->resource[i]);
+ BUG_ON(rc);
+ }
+
+ virtfn->subsystem_vendor = dev->subsystem_vendor;
+ pci_read_config_word(virtfn, PCI_SUBSYSTEM_ID,
+ &virtfn->subsystem_device);
+
+ pci_device_add(virtfn, virtfn->bus);
+ rc = pci_bus_add_device(virtfn);
+
+ return rc;
+}
+
+static void virtfn_remove(struct pci_dev *dev, int id)
+{
+ u8 busnr, devfn;
+ struct pci_bus *bus;
+ struct pci_dev *virtfn;
+
+ virtfn_bdf(dev, id, &busnr, &devfn);
+
+ bus = pci_find_bus(pci_domain_nr(dev->bus), busnr);
+ BUG_ON(!bus);
+ virtfn = pci_get_slot(bus, devfn);
+ BUG_ON(!virtfn);
+ pci_dev_put(virtfn);
+ pci_remove_bus_device(virtfn);
+}
+
+static int iov_add_bus(struct pci_bus *bus, int busnr)
+{
+ int i;
+ int rc;
+ struct pci_bus *child;
+
+ for (i = bus->number + 1; i <= busnr; i++) {
+ child = pci_find_bus(pci_domain_nr(bus), i);
+ if (child)
+ continue;
+ child = pci_add_new_bus(bus, NULL, i);
+ if (!child)
+ return -ENOMEM;
+
+ child->subordinate = i;
+ child->dev.parent = bus->bridge;
+ rc = pci_bus_add_child(child);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+
+static void iov_remove_bus(struct pci_bus *bus, int busnr)
+{
+ int i;
+ struct pci_bus *child;
+
+ for (i = bus->number + 1; i <= busnr; i++) {
+ child = pci_find_bus(pci_domain_nr(bus), i);
+ BUG_ON(!child);
+ if (list_empty(&child->devices))
+ pci_remove_bus(child);
+ }
+}
+
+static int iov_enable(struct pci_dev *dev, int nr_virtfn)
+{
+ int i, j;
+ int rc;
+ u8 busnr, devfn;
+ u16 ctrl, offset, stride;
+
+ pci_write_config_word(dev, dev->iov->cap + PCI_IOV_NUM_VF, nr_virtfn);
+ pci_read_config_word(dev, dev->iov->cap + PCI_IOV_VF_OFFSET, &offset);
+ pci_read_config_word(dev, dev->iov->cap + PCI_IOV_VF_STRIDE, &stride);
+
+ if (!offset || (nr_virtfn > 1 && !stride))
+ return -EIO;
+
+ dev->iov->offset = offset;
+ dev->iov->stride = stride;
+
+ virtfn_bdf(dev, nr_virtfn - 1, &busnr, &devfn);
+ if (busnr > dev->bus->subordinate)
+ return -EIO;
+
+ rc = dev->driver->virtual(dev, nr_virtfn);
+ if (rc)
+ return rc;
+
+ pci_read_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, &ctrl);
+ ctrl |= PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE;
+ pci_write_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, ctrl);
+ ssleep(1);
+
+ iov_add_bus(dev->bus, busnr);
+ for (i = 0; i < nr_virtfn; i++) {
+ rc = virtfn_add(dev, i);
+ if (rc)
+ goto failed;
+ }
+
+ dev->iov->nr_virtfn = nr_virtfn;
+
+ return 0;
+
+failed:
+ for (j = 0; j < i; j++)
+ virtfn_remove(dev, j);
+
+ iov_remove_bus(dev->bus, busnr);
+
+ ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE);
+ pci_write_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, ctrl);
+ ssleep(1);
+
+ return rc;
+}
+
+static void iov_disable(struct pci_dev *dev)
+{
+ int i;
+ int rc;
+ u16 ctrl;
+ u8 busnr, devfn;
+
+ if (!dev->iov->nr_virtfn)
+ return;
+
+ rc = dev->driver->virtual(dev, 0);
+ if (rc)
+ return;
+
+ for (i = 0; i < dev->iov->nr_virtfn; i++)
+ virtfn_remove(dev, i);
+
+ virtfn_bdf(dev, dev->iov->nr_virtfn - 1, &busnr, &devfn);
+ iov_remove_bus(dev->bus, busnr);
+
+ pci_read_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, &ctrl);
+ ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE);
+ pci_write_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, ctrl);
+ ssleep(1);
+
+ dev->iov->nr_virtfn = 0;
+}
+
+static ssize_t iov_set_nr_virtfn(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int rc;
+ long nr_virtfn;
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ rc = strict_strtol(buf, 0, &nr_virtfn);
+ if (rc)
+ return rc;
+
+ if (nr_virtfn < 0 || nr_virtfn > pdev->iov->initial)
+ return -EINVAL;
+
+ if (nr_virtfn == pdev->iov->nr_virtfn)
+ return count;
+
+ mutex_lock(&pdev->iov->physfn->iov->lock);
+ iov_disable(pdev);
+
+ if (nr_virtfn)
+ rc = iov_enable(pdev, nr_virtfn);
+ mutex_unlock(&pdev->iov->physfn->iov->lock);
+
+ return rc ? rc : count;
+}
+
+static DEVICE_ATTR(total_virtfn, S_IRUGO, iov_total_show, NULL);
+static DEVICE_ATTR(initial_virtfn, S_IRUGO, iov_initial_show, NULL);
+static DEVICE_ATTR(nr_virtfn, S_IWUSR | S_IRUGO,
+ iov_nr_virtfn_show, iov_set_nr_virtfn);
+
+static struct attribute *iov_attrs[] = {
+ &dev_attr_total_virtfn.attr,
+ &dev_attr_initial_virtfn.attr,
+ &dev_attr_nr_virtfn.attr,
+ NULL
+};
+
+static struct attribute_group iov_attr_group = {
+ .attrs = iov_attrs,
+ .name = "iov",
+};
+
+/**
+ * pci_iov_init - initialize device's SR-IOV capability
+ * @dev: the PCI device
+ *
+ * Returns 0 on success, or negative on failure.
+ *
+ * The major differences between Virtual Function and PCI device are:
+ * 1) the device with multiple bus numbers uses internal routing, so
+ * there is no explicit bridge device in this case.
+ * 2) Virtual Function memory spaces are designated by BARs encapsulated
+ * in the capability structure, and the BARs in Virtual Function PCI
+ * configuration space are read-only zero.
+ */
+int pci_iov_init(struct pci_dev *dev)
+{
+ int i;
+ int pos;
+ u32 pgsz;
+ u16 ctrl, total, initial, offset, stride;
+ struct pci_iov *iov;
+ struct resource *res;
+ struct pci_dev *physfn;
+
+ if (!dev->is_pcie)
+ return -ENODEV;
+
+ if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+ dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
+ return -ENODEV;
+
+ pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_IOV);
+ if (!pos)
+ return -ENODEV;
+
+ pci_read_config_word(dev, pos + PCI_IOV_CTRL, &ctrl);
+ if (ctrl & PCI_IOV_CTRL_VFE) {
+ pci_write_config_word(dev, pos + PCI_IOV_CTRL, 0);
+ ssleep(1);
+ }
+
+ physfn = NULL;
+ if (!list_empty(&dev->bus->devices))
+ list_for_each_entry(physfn, &dev->bus->devices, bus_list)
+ if (physfn->iov)
+ break;
+
+ ctrl = 0;
+ if (!(physfn && physfn->iov) && pci_ari_enabled(dev->bus))
+ ctrl |= PCI_IOV_CTRL_ARI;
+
+ pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl);
+ pci_read_config_word(dev, pos + PCI_IOV_TOTAL_VF, &total);
+ pci_read_config_word(dev, pos + PCI_IOV_INITIAL_VF, &initial);
+ pci_write_config_word(dev, pos + PCI_IOV_NUM_VF, initial);
+ pci_read_config_word(dev, pos + PCI_IOV_VF_OFFSET, &offset);
+ pci_read_config_word(dev, pos + PCI_IOV_VF_STRIDE, &stride);
+
+ if (!total || initial > total || (initial && !offset) ||
+ (initial > 1 && !stride))
+ return -EIO;
+
+ pci_read_config_dword(dev, pos + PCI_IOV_SUP_PGSIZE, &pgsz);
+ i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+ pgsz &= ~((1 << i) - 1);
+ if (!pgsz)
+ return -EIO;
+
+ pgsz &= ~(pgsz - 1);
+ pci_write_config_dword(dev, pos + PCI_IOV_SYS_PGSIZE, pgsz);
+
+ iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+ if (!iov)
+ return -ENOMEM;
+
+ iov->cap = pos;
+ iov->total = total;
+ iov->initial = initial;
+ iov->offset = offset;
+ iov->stride = stride;
+ iov->pgsz = pgsz;
+
+ for (i = 0; i < PCI_IOV_NUM_BAR; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ pos = iov->cap + PCI_IOV_BAR_0 + i * 4;
+ i += __pci_read_base(dev, pci_bar_unknown, res, pos);
+ if (!res->flags)
+ continue;
+ res->end = res->start + resource_size(res) * total - 1;
+ }
+
+ if (physfn && physfn->iov) {
+ pci_dev_get(physfn);
+ iov->physfn = physfn;
+ } else {
+ mutex_init(&iov->lock);
+ iov->physfn = dev;
+ }
+
+ dev->iov = iov;
+
+ return 0;
+}
+
+/**
+ * pci_iov_release - release resources used by the SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_iov_release(struct pci_dev *dev)
+{
+ if (!dev->iov)
+ return;
+
+ if (dev == dev->iov->physfn)
+ mutex_destroy(&dev->iov->lock);
+ else
+ pci_dev_put(dev->iov->physfn);
+
+ kfree(dev->iov);
+}
+
+/**
+ * pci_iov_resource_bar - get position of the SR-IOV BAR
+ * @dev: the PCI device
+ * @resno: the resource number
+ * @type: the BAR type to be filled in
+ *
+ * Returns position of the BAR encapsulated in the SR-IOV capability.
+ */
+int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type)
+{
+ if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCE_END)
+ return 0;
+
+ BUG_ON(!dev->iov);
+
+ *type = pci_bar_unknown;
+ return dev->iov->cap + PCI_IOV_BAR_0 +
+ 4 * (resno - PCI_IOV_RESOURCES);
+}
+
+/**
+ * pci_restore_iov_state - restore the state of the SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_restore_iov_state(struct pci_dev *dev)
+{
+ u16 ctrl;
+
+ if (!dev->iov)
+ return;
+
+ pci_read_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, &ctrl);
+ if (ctrl & PCI_IOV_CTRL_VFE)
+ return;
+
+ pci_write_config_dword(dev, dev->iov->cap + PCI_IOV_SYS_PGSIZE,
+ dev->iov->pgsz);
+ ctrl = 0;
+ if (dev == dev->iov->physfn && pci_ari_enabled(dev->bus))
+ ctrl |= PCI_IOV_CTRL_ARI;
+ pci_write_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, ctrl);
+
+ if (!dev->iov->nr_virtfn)
+ return;
+
+ pci_write_config_word(dev, dev->iov->cap + PCI_IOV_NUM_VF,
+ dev->iov->nr_virtfn);
+ ctrl |= PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE;
+ pci_write_config_word(dev, dev->iov->cap + PCI_IOV_CTRL, ctrl);
+
+ ssleep(1);
+}
+
+/**
+ * pci_iov_register - register the SR-IOV capability
+ * @dev: the PCI device
+ */
+int pci_iov_register(struct pci_dev *dev)
+{
+ int rc;
+
+ if (!dev->iov)
+ return -ENODEV;
+
+ rc = sysfs_create_group(&dev->dev.kobj, &iov_attr_group);
+ if (rc)
+ return rc;
+
+ rc = kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
+
+ return rc;
+}
+
+/**
+ * pci_iov_unregister - unregister the SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_iov_unregister(struct pci_dev *dev)
+{
+ if (!dev->iov)
+ return;
+
+ sysfs_remove_group(&dev->dev.kobj, &iov_attr_group);
+ iov_disable(dev);
+ kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
+}
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index b4cdd69..3d5f3a3 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -234,6 +234,8 @@ __pci_device_probe(struct pci_driver *drv, struct pci_dev *pci_dev)
error = pci_call_probe(drv, pci_dev, id);
if (error >= 0) {
pci_dev->driver = drv;
+ if (drv->virtual)
+ pci_iov_register(pci_dev);
error = 0;
}
}
@@ -262,6 +264,8 @@ static int pci_device_remove(struct device * dev)
struct pci_driver * drv = pci_dev->driver;

if (drv) {
+ if (drv->virtual)
+ pci_iov_unregister(pci_dev);
if (drv->remove)
drv->remove(pci_dev);
pci_dev->driver = NULL;
@@ -292,8 +296,12 @@ static void pci_device_shutdown(struct device *dev)
struct pci_dev *pci_dev = to_pci_dev(dev);
struct pci_driver *drv = pci_dev->driver;

- if (drv && drv->shutdown)
- drv->shutdown(pci_dev);
+ if (drv) {
+ if (drv->virtual)
+ pci_iov_unregister(pci_dev);
+ if (drv->shutdown)
+ drv->shutdown(pci_dev);
+ }
pci_msi_shutdown(pci_dev);
pci_msix_shutdown(pci_dev);
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 9382b5f..ca26e53 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -763,6 +763,7 @@ pci_restore_state(struct pci_dev *dev)
}
pci_restore_pcix_state(dev);
pci_restore_msi_state(dev);
+ pci_restore_iov_state(dev);

return 0;
}
@@ -2017,12 +2018,19 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags)
*/
int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type)
{
+ int reg;
+
if (resno < PCI_ROM_RESOURCE) {
*type = pci_bar_unknown;
return PCI_BASE_ADDRESS_0 + 4 * resno;
} else if (resno == PCI_ROM_RESOURCE) {
*type = pci_bar_mem32;
return dev->rom_base_reg;
+ } else if (resno < PCI_BRIDGE_RESOURCES) {
+ /* device specific resource */
+ reg = pci_iov_resource_bar(dev, resno, type);
+ if (reg)
+ return reg;
}

dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 315bbe6..3113d11 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -183,4 +183,55 @@ static inline int pci_ari_enabled(struct pci_bus *bus)
return bus->self && bus->self->ari_enabled;
}

+/* Single Root I/O Virtualization */
+struct pci_iov {
+ int cap; /* capability position */
+ int status; /* status of SR-IOV */
+ u16 total; /* total VFs associated with the PF */
+ u16 initial; /* initial VFs associated with the PF */
+ u16 nr_virtfn; /* number of VFs available */
+ u16 offset; /* first VF Routing ID offset */
+ u16 stride; /* following VF stride */
+ u32 pgsz; /* page size for BAR alignment */
+ struct pci_dev *physfn; /* lowest numbered PF */
+ struct mutex lock; /* lock for VF bus */
+};
+
+#ifdef CONFIG_PCI_IOV
+extern int pci_iov_init(struct pci_dev *dev);
+extern void pci_iov_release(struct pci_dev *dev);
+extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type);
+extern int pci_iov_register(struct pci_dev *dev);
+extern void pci_iov_unregister(struct pci_dev *dev);
+extern void pci_restore_iov_state(struct pci_dev *dev);
+#else
+static inline int pci_iov_init(struct pci_dev *dev)
+{
+ return -EIO;
+}
+static inline void pci_iov_release(struct pci_dev *dev)
+{
+}
+
+static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type)
+{
+ return 0;
+}
+
+static inline int pci_iov_register(struct pci_dev *dev)
+{
+ return -ENODEV;
+}
+
+static inline void pci_iov_unregister(struct pci_dev *dev)
+{
+}
+
+static inline void pci_restore_iov_state(struct pci_dev *dev)
+{
+}
+#endif /* CONFIG_PCI_IOV */
+
#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index cd205fd..cb26e64 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -785,6 +785,7 @@ static int pci_setup_device(struct pci_dev * dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
pci_vpd_release(dev);
+ pci_iov_release(dev);
}

/**
@@ -968,6 +969,9 @@ static void pci_init_capabilities(struct pci_dev *dev)

/* Alternative Routing-ID Forwarding */
pci_enable_ari(dev);
+
+ /* Single Root I/O Virtualization */
+ pci_iov_init(dev);
}

void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d455ec8..c9046a3 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -93,6 +93,12 @@ enum {
/* #6: expansion ROM resource */
PCI_ROM_RESOURCE,

+ /* device specific resources */
+#ifdef CONFIG_PCI_IOV
+ PCI_IOV_RESOURCES,
+ PCI_IOV_RESOURCE_END = PCI_IOV_RESOURCES + PCI_IOV_NUM_BAR - 1,
+#endif
+
/* resources assigned to buses behind the bridge */
#define PCI_BRIDGE_RESOURCE_NUM 4

@@ -171,6 +177,7 @@ struct pci_cap_saved_state {

struct pcie_link_state;
struct pci_vpd;
+struct pci_iov;

/*
* The pci_dev structure is used to describe PCI devices.
@@ -259,6 +266,7 @@ struct pci_dev {
struct list_head msi_list;
#endif
struct pci_vpd *vpd;
+ struct pci_iov *iov;
};

extern struct pci_dev *alloc_pci_dev(void);
@@ -426,6 +434,7 @@ struct pci_driver {
int (*resume_early) (struct pci_dev *dev);
int (*resume) (struct pci_dev *dev); /* Device woken up */
void (*shutdown) (struct pci_dev *dev);
+ int (*virtual) (struct pci_dev *dev, int nr_virtfn);
struct pm_ext_ops *pm;
struct pci_error_handlers *err_handler;
struct device_driver driver;
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index e5effd4..1d1ade2 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -363,6 +363,7 @@
#define PCI_EXP_TYPE_UPSTREAM 0x5 /* Upstream Port */
#define PCI_EXP_TYPE_DOWNSTREAM 0x6 /* Downstream Port */
#define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCI/PCI-X Bridge */
+#define PCI_EXP_TYPE_RC_END 0x9 /* Root Complex Integrated Endpoint */
#define PCI_EXP_FLAGS_SLOT 0x0100 /* Slot implemented */
#define PCI_EXP_FLAGS_IRQ 0x3e00 /* Interrupt message number */
#define PCI_EXP_DEVCAP 4 /* Device capabilities */
@@ -436,6 +437,7 @@
#define PCI_EXT_CAP_ID_DSN 3
#define PCI_EXT_CAP_ID_PWR 4
#define PCI_EXT_CAP_ID_ARI 14
+#define PCI_EXT_CAP_ID_IOV 16

/* Advanced Error Reporting */
#define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */
@@ -553,4 +555,23 @@
#define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */
#define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */

+/* Single Root I/O Virtualization */
+#define PCI_IOV_CAP 0x04 /* SR-IOV Capabilities */
+#define PCI_IOV_CTRL 0x08 /* SR-IOV Control */
+#define PCI_IOV_CTRL_VFE 0x01 /* VF Enable */
+#define PCI_IOV_CTRL_MSE 0x08 /* VF Memory Space Enable */
+#define PCI_IOV_CTRL_ARI 0x10 /* ARI Capable Hierarchy */
+#define PCI_IOV_STATUS 0x0a /* SR-IOV Status */
+#define PCI_IOV_INITIAL_VF 0x0c /* Initial VFs */
+#define PCI_IOV_TOTAL_VF 0x0e /* Total VFs */
+#define PCI_IOV_NUM_VF 0x10 /* Number of VFs */
+#define PCI_IOV_FUNC_LINK 0x12 /* Function Dependency Link */
+#define PCI_IOV_VF_OFFSET 0x14 /* First VF Offset */
+#define PCI_IOV_VF_STRIDE 0x16 /* Following VF Stride */
+#define PCI_IOV_VF_DID 0x1a /* VF Device ID */
+#define PCI_IOV_SUP_PGSIZE 0x1c /* Supported Page Sizes */
+#define PCI_IOV_SYS_PGSIZE 0x20 /* System Page Size */
+#define PCI_IOV_BAR_0 0x24 /* VF BAR0 */
+#define PCI_IOV_NUM_BAR 6 /* Number of VF BARs */
+
#endif /* LINUX_PCI_REGS_H */
--
1.5.6.4
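
A side note on the System Page Size selection in pci_iov_init() above. The
two mask operations on 'pgsz' first drop every supported size smaller than
the kernel page size, then keep only the lowest remaining bit, i.e. the
smallest supported size that is still at least PAGE_SIZE. The userspace
sketch below mirrors that arithmetic. iov_pick_pgsz() is a hypothetical
helper name, not part of the patch, and it assumes that bit n of the
Supported Page Sizes register stands for a page of 2^(n + 12) bytes.

```c
#include <stdint.h>

/*
 * Userspace model of the page size selection in pci_iov_init().
 * 'supported' is the raw Supported Page Sizes mask, 'page_shift' plays
 * the role of PAGE_SHIFT (assumed to be well below 44 so the shift
 * count stays under 32). Returns the single-bit mask to program into
 * System Page Size, or 0 if no supported size is large enough (the
 * patch returns -EIO in that case).
 */
static uint32_t iov_pick_pgsz(uint32_t supported, unsigned int page_shift)
{
	unsigned int i = page_shift > 12 ? page_shift - 12 : 0;

	/* drop page sizes smaller than the kernel page size */
	supported &= ~((1u << i) - 1);
	if (!supported)
		return 0;

	/* isolate the lowest set bit: the smallest usable page size */
	return supported & ~(supported - 1);
}
```

With a sample mask of 0x553 this yields 0x1 (4 KB) for a page shift of 12
and 0x10 (64 KB) for a page shift of 16.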

2008-11-21 19:40:59

by Zhao, Yu

Subject: [PATCH 11/13 v7] PCI: reserve bus range for SR-IOV device

Reserve a bus range for SR-IOV at the device scanning stage when the
kernel boot parameter 'pci=assign-busses' is used.

Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/iov.c | 24 ++++++++++++++++++++++++
drivers/pci/pci.h | 6 ++++++
drivers/pci/probe.c | 3 +++
3 files changed, 33 insertions(+), 0 deletions(-)
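
For readers following the arithmetic: the number of extra buses depends on
where the last possible VF lands. The sketch below is a userspace model,
not the patch's code. It assumes the usual SR-IOV rule that VF n (counting
from 0) has Routing ID = PF Routing ID + First VF Offset + n * VF Stride,
which is what virtfn_bdf() is expected to compute (its body is not part of
this hunk). The helper names vf_routing_id() and vf_extra_buses() are
hypothetical.

```c
#include <stdint.h>

/* A Routing ID packs the bus number in bits 15:8 and devfn in bits 7:0. */
static uint16_t vf_routing_id(uint8_t pf_bus, uint8_t pf_devfn,
			      uint16_t offset, uint16_t stride,
			      unsigned int n)
{
	uint16_t pf_rid = (uint16_t)((pf_bus << 8) | pf_devfn);

	return (uint16_t)(pf_rid + offset + stride * n);
}

/*
 * Number of buses beyond the PF's own bus that must be reserved so
 * that all 'total' VFs of this PF are reachable.
 */
static int vf_extra_buses(uint8_t pf_bus, uint8_t pf_devfn,
			  uint16_t offset, uint16_t stride, uint16_t total)
{
	uint8_t last_bus;

	if (!total)
		return 0;

	last_bus = vf_routing_id(pf_bus, pf_devfn, offset, stride,
				 total - 1) >> 8;

	return last_bus > pf_bus ? last_bus - pf_bus : 0;
}
```

For a PF at 01:00.0 with offset 0x80 and stride 2, 64 VFs still fit on bus
1, while a 65th VF spills onto bus 2 and requires one reserved bus.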

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 03f62ca..c1a3cea 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -489,3 +489,27 @@ void pci_iov_unregister(struct pci_dev *dev)
iov_disable(dev);
kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
}
+
+/**
+ * pci_iov_bus_range - find bus range used by the Virtual Function
+ * @bus: the PCI bus
+ *
+ * Returns the maximum number of buses (excluding the current one) used by
+ * Virtual Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+ int max = 0;
+ u8 busnr, devfn;
+ struct pci_dev *dev;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ if (!dev->iov)
+ continue;
+ virtfn_bdf(dev, dev->iov->total - 1, &busnr, &devfn);
+ if (busnr > max)
+ max = busnr;
+ }
+
+ return max ? max - bus->number : 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 3113d11..574bbc7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -205,6 +205,7 @@ extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
extern int pci_iov_register(struct pci_dev *dev);
extern void pci_iov_unregister(struct pci_dev *dev);
extern void pci_restore_iov_state(struct pci_dev *dev);
+extern int pci_iov_bus_range(struct pci_bus *bus);
#else
static inline int pci_iov_init(struct pci_dev *dev)
{
@@ -232,6 +233,11 @@ static inline void pci_iov_unregister(struct pci_dev *dev)
static inline void pci_restore_iov_state(struct pci_dev *dev)
{
}
+
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+ return 0;
+}
#endif /* CONFIG_PCI_IOV */

#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index cb26e64..7b591e5 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1074,6 +1074,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
for (devfn = 0; devfn < 0x100; devfn += 8)
pci_scan_slot(bus, devfn);

+ /* Reserve buses for SR-IOV capability. */
+ max += pci_iov_bus_range(bus);
+
/*
* After performing arch-dependent fixup of the bus, look behind
* all PCI-to-PCI bridges on this bus.
--
1.5.6.4

2008-11-21 19:41:29

by Zhao, Yu

Subject: [PATCH 12/13 v7] PCI: document the SR-IOV sysfs entries

Document the SR-IOV sysfs entries.

Signed-off-by: Yu Zhao <[email protected]>

---
Documentation/ABI/testing/sysfs-bus-pci | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)
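
As an illustration of how a management tool might drive these entries, here
is a small userspace sketch. It is not part of the patch. The helper names
(iov_attr_path, iov_read_attr, iov_write_nr_virtfn) are hypothetical, and
it assumes the attributes read back as plain decimal numbers, which the
show functions are expected to produce but which is not visible in this
hunk.

```c
#include <stdio.h>
#include <string.h>

#define PCI_SYSFS_DEVICES "/sys/bus/pci/devices"

/* Build ".../<bdf>/iov/<attr>"; returns 0 on success, -1 if buf is too small. */
static int iov_attr_path(char *buf, size_t len, const char *bdf,
			 const char *attr)
{
	int n = snprintf(buf, len, PCI_SYSFS_DEVICES "/%s/iov/%s", bdf, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read total_virtfn, initial_virtfn or nr_virtfn as a number. */
static int iov_read_attr(const char *bdf, const char *attr, long *val)
{
	char path[256];
	FILE *f;
	int ok;

	if (iov_attr_path(path, sizeof(path), bdf, attr))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%ld", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Request 'nr' VFs (0 disables them); needs write permission on the file. */
static int iov_write_nr_virtfn(const char *bdf, long nr)
{
	char path[256];
	FILE *f;
	int ok;

	if (iov_attr_path(path, sizeof(path), bdf, "nr_virtfn"))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", nr) > 0;
	/* a rejected value (e.g. above InitialVFs) surfaces when flushing */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would typically read initial_virtfn first and only then write a
value in the range 0 to InitialVFs to nr_virtfn.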

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index ceddcff..d66d63d 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -9,3 +9,29 @@ Description:
that some devices may have malformatted data. If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.
+
+What: /sys/bus/pci/devices/.../iov/total_virtfn
+Date: November 2008
+Contact: Yu Zhao <[email protected]>
+Description:
+ This file appears when a device has the SR-IOV capability
+ and the device driver (PF driver) supports this operation.
+ It holds the number of total Virtual Functions (read-only).
+
+What: /sys/bus/pci/devices/.../iov/initial_virtfn
+Date: November 2008
+Contact: Yu Zhao <[email protected]>
+Description:
+ This file appears when a device has the SR-IOV capability
+ and the device driver (PF driver) supports this operation.
+ It holds the number of initial Virtual Functions (read-only).
+
+What: /sys/bus/pci/devices/.../iov/nr_virtfn
+Date: November 2008
+Contact: Yu Zhao <[email protected]>
+Description:
+ This file appears when a device has the SR-IOV capability
+ and the device driver (PF driver) supports this operation.
+ It holds the number of available Virtual Functions, and
+ can be written (0 to InitialVFs) to change the number of
+ Virtual Functions.
--
1.5.6.4

2008-11-21 19:41:44

by Zhao, Yu

Subject: [PATCH 13/13 v7] PCI: document for SR-IOV user and developer

Create a how-to for SR-IOV users and driver developers.

Signed-off-by: Yu Zhao <[email protected]>

---
Documentation/DocBook/kernel-api.tmpl | 1 +
Documentation/PCI/pci-iov-howto.txt | 138 +++++++++++++++++++++++++++++++++
2 files changed, 139 insertions(+), 0 deletions(-)
create mode 100644 Documentation/PCI/pci-iov-howto.txt
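
To make the callback contract from the how-to concrete, here is a userspace
model of what a PF driver's 'virtual' callback might check. Everything in
it is hypothetical: struct pf_model, its queue fields and both function
names are invented for illustration, and the release-then-request order in
pf_model_set_nr_virtfn() is a simplified reading of iov_set_nr_virtfn()
(the real code skips the release step when no VFs are active).

```c
#include <errno.h>

/* Hypothetical PF state; none of these fields come from the patch. */
struct pf_model {
	int total_queues;	/* queue pairs the hardware provides */
	int pf_queues;		/* queue pairs the PF keeps for itself */
	int queues_per_vf;	/* queue pairs handed to each VF */
	int nr_virtfn;		/* VFs currently provisioned */
};

/*
 * Model of the 'virtual' callback: return 0 when the requested number
 * of VFs can be backed by device resources, a negative errno otherwise.
 * A request for 0 VFs reclaims everything.
 */
static int pf_model_virtual(struct pf_model *pf, int nr_virtfn)
{
	if (nr_virtfn < 0)
		return -EINVAL;

	if (nr_virtfn == 0) {
		pf->nr_virtfn = 0;
		return 0;
	}

	if (pf->pf_queues + nr_virtfn * pf->queues_per_vf > pf->total_queues)
		return -ENOSPC;

	pf->nr_virtfn = nr_virtfn;
	return 0;
}

/* Model of a write to nr_virtfn: release the old VFs, then request new ones. */
static int pf_model_set_nr_virtfn(struct pf_model *pf, int nr_virtfn)
{
	int rc = pf_model_virtual(pf, 0);

	if (rc)
		return rc;

	return nr_virtfn ? pf_model_virtual(pf, nr_virtfn) : 0;
}
```

With 16 queue pairs, 2 reserved for the PF and 2 per VF, a request for 7
VFs succeeds while a request for 8 fails and leaves no VFs provisioned.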

diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index 5818ff7..506e611 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c
-->
!Edrivers/pci/probe.c
!Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
</sect1>
<sect1><title>PCI Hotplug Support Library</title>
!Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 0000000..216cecc
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,138 @@
+ PCI Express I/O Virtualization Howto
+ Copyright (C) 2008 Intel Corporation
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as the Physical Function
+(PF) while the virtual devices are referred to as Virtual Functions
+(VFs). Allocation of the VFs can be dynamically controlled by the PF
+via registers encapsulated in the capability. By default, this feature
+is not enabled and the PF behaves as a traditional PCIe device. Once
+it is turned on, each VF's PCI configuration space can be accessed by
+its own Bus, Device and Function Number (Routing ID). Each VF also has
+a PCI Memory Space, which is used to map its register set. The VF
+device driver operates on the register set so that the VF is functional
+and appears as a real PCI device.
+
+2. User Guide
+
+2.1 How can I manage the SR-IOV
+
+If a device has the SR-IOV capability and the device driver (PF driver)
+supports this operation, then there should be some entries under the
+PF's sysfs directory:
+ - /sys/bus/pci/devices/NNNN:BB:DD.F/iov/
+ (NNNN:BB:DD.F is the domain, bus, device and function number)
+
+To change the number of Virtual Functions:
+ - /sys/bus/pci/devices/NNNN:BB:DD.F/iov/nr_virtfn
+ (writing a positive integer to this file changes the number of
+ VFs; writing 0 disables the capability)
+
+The total and initial numbers of VFs can be read from:
+ - /sys/bus/pci/devices/NNNN:BB:DD.F/iov/total_virtfn
+ - /sys/bus/pci/devices/NNNN:BB:DD.F/iov/initial_virtfn
+
+2.2 How can I use the Virtual Functions
+
+The VFs are treated as hot-plugged PCI devices in the kernel, so they
+should work in the same way as real PCI devices. Each VF also requires
+a device driver, just as a normal PCI device does.
+
+3. Developer Guide
+
+3.1 SR-IOV APIs
+
+To use the SR-IOV service, the Physical Function driver needs to declare
+a callback function in its 'struct pci_driver':
+
+ static struct pci_driver dev_driver = {
+ ...
+ .virtual = dev_virtual,
+ ...
+ };
+
+ 'dev_virtual' is a callback function that the SR-IOV service
+ invokes when the number of VFs is changed by the user.
+ The first argument of this callback is the PF itself ('struct pci_dev'),
+ and the second argument is the number of VFs requested. The callback
+ should return 0 if the requested number of VFs is supported and all
+ necessary resources are granted to these VFs; otherwise it should
+ return a negative value indicating the error.
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the API above.
+
+static int __devinit dev_probe(struct pci_dev *dev,
+ const struct pci_device_id *id)
+{
+ ...
+
+ return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+ ...
+}
+
+#ifdef CONFIG_PM
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+ ...
+
+ return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+ pci_restore_state(dev);
+
+ ...
+
+ return 0;
+}
+#endif
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+ ...
+}
+
+static int dev_virtual(struct pci_dev *dev, int nr_virtfn)
+{
+
+ if (nr_virtfn) {
+ /*
+ * allocate device internal resources for VFs.
+ * these resources are device-specific (e.g. rx/tx
+ * queue in the NIC) and necessary to make the VF
+ * functional.
+ */
+ } else {
+ /*
+ * reclaim the VF related resources if any.
+ */
+ }
+
+ return 0;
+}
+
+static struct pci_driver dev_driver = {
+ .name = "SR-IOV Physical Function driver",
+ .id_table = dev_id_table,
+ .probe = dev_probe,
+ .remove = __devexit_p(dev_remove),
+#ifdef CONFIG_PM
+ .suspend = dev_suspend,
+ .resume = dev_resume,
+#endif
+ .shutdown = dev_shutdown,
+ .virtual = dev_virtual
+};
--
1.5.6.4

2008-11-21 21:01:26

by Greg KH

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Sat, Nov 22, 2008 at 02:36:05AM +0800, Yu Zhao wrote:
> Greetings,
>
> Following patches are intended to support SR-IOV capability in the
> Linux kernel. With these patches, people can turn a PCI device with
> the capability into multiple ones from software perspective, which
> will benefit KVM and achieve other purposes such as QoS, security,
> and etc.
>
> The Physical Function and Virtual Function drivers using the SR-IOV
> APIs will come soon!

Thanks for respining these patches, but I think we really need to see a
driver using this in order to get an idea of how it will be used.

Also, the Xen and KVM people need to agree on the userspace interface
here, perhaps also getting some libvirt involvement as well, as they are
going to be the ones having to use this all the time.

thanks,

greg k-h

2008-11-22 07:04:23

by Zhao, Yu

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> On Sat, Nov 22, 2008 at 02:36:05AM +0800, Yu Zhao wrote:
>> Greetings,
>>
>> Following patches are intended to support SR-IOV capability in the
>> Linux kernel. With these patches, people can turn a PCI device with
>> the capability into multiple ones from software perspective, which
>> will benefit KVM and achieve other purposes such as QoS, security,
>> and etc.
>>
>> The Physical Function and Virtual Function drivers using the SR-IOV
>> APIs will come soon!
>
> Thanks for respining these patches, but I think we really need to see a
> driver using this in order to get an idea of how it will be used.

Yes, the PF driver patch and the VF driver for Intel 82576 NIC will be
available next week. Both PF/VF drivers are testing versions. The PF
driver patch is based on the latest kernel IGB driver
(drivers/net/igb/), and uses SR-IOV v7 API. The VF driver is a totally
new NIC driver, it looks like other normal PCI NIC drivers.

> Also, the Xen and KVM people need to agree on the userspace interface
> here, perhaps also getting some libvirt involvement as well, as they are
> going to be the ones having to use this all the time.

I'll keep KVM/Xen people updated on the user level interface and let
them comment based on a real usage model of the Intel 82576 NIC after
PF/VF drivers are available.

Regards,
Yu

2008-11-26 15:00:09

by Zhao, Yu

Subject: [SR-IOV driver example 0/3] introduction

SR-IOV drivers for the Intel 82576 NIC are available. The drivers come
in two parts: a Physical Function driver and a Virtual Function driver.
The PF driver is based on the IGB driver and is used to control the PF,
allocating hardware-specific resources and interfacing with the SR-IOV
core. The VF driver is a new NIC driver that behaves like a traditional
PCI device driver. It works in both host and guest (Xen and KVM)
environments.

These two drivers are testing versions and they are *only* intended to
show how to use the SR-IOV API.

Intel 82576 NIC specification can be found at:
http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

[SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
[SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
[SR-IOV driver example 3/3] VF driver tar ball

2008-11-26 15:09:09

by Zhao, Yu

Subject: [SR-IOV driver example 1/3] PF driver: allocate hardware specific resource

This patch makes the IGB driver allocate hardware resources (rx/tx queues)
for Virtual Functions. All operations in this patch are hardware specific.

---
drivers/net/igb/Makefile | 2 +-
drivers/net/igb/e1000_82575.c | 1 +
drivers/net/igb/e1000_82575.h | 61 ++++
drivers/net/igb/e1000_defines.h | 7 +
drivers/net/igb/e1000_hw.h | 2 +
drivers/net/igb/e1000_regs.h | 13 +
drivers/net/igb/e1000_vf.c | 223 ++++++++++++++
drivers/net/igb/igb.h | 10 +
drivers/net/igb/igb_main.c | 604 ++++++++++++++++++++++++++++++++++++++-
9 files changed, 910 insertions(+), 13 deletions(-)
create mode 100644 drivers/net/igb/e1000_vf.c
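
A brief illustration of the mailbox message encoding defined in
e1000_82575.h below. This is a userspace sketch and not driver code: the
constants are copied from the patch, while the vt_msg_*() helpers are
hypothetical and assume the message type sits in the low bits of the first
mailbox word. Because the ACK bits are a subset of the NACK bits, the
helpers compare against the full top byte rather than testing single bits.

```c
#include <stdint.h>

/* Values copied from the patch (e1000_82575.h). */
#define E1000_VT_MSGTYPE_ACK	0xF0000000u
#define E1000_VT_MSGTYPE_NACK	0xFF000000u
#define E1000_VT_MSGINFO_SHIFT	16
#define E1000_VT_MSGINFO_MASK	(0xFFu << E1000_VT_MSGINFO_SHIFT)
#define E1000_VF_SET_MULTICAST	3

/* Combine a message type with the 8-bit extra info field (bits 23:16). */
static uint32_t vt_msg_make(uint32_t type, uint8_t info)
{
	return type | ((uint32_t)info << E1000_VT_MSGINFO_SHIFT);
}

/* Extract the extra info field from a message word. */
static uint8_t vt_msg_info(uint32_t msg)
{
	return (uint8_t)((msg & E1000_VT_MSGINFO_MASK) >>
			 E1000_VT_MSGINFO_SHIFT);
}

/* Replies are the original message or'd with the ACK or NACK pattern. */
static uint32_t vt_msg_ack(uint32_t msg)
{
	return msg | E1000_VT_MSGTYPE_ACK;
}

static uint32_t vt_msg_nack(uint32_t msg)
{
	return msg | E1000_VT_MSGTYPE_NACK;
}

/* Compare the whole top byte so that a NACK is never mistaken for an ACK. */
static int vt_msg_is_ack(uint32_t msg)
{
	return (msg & E1000_VT_MSGTYPE_NACK) == E1000_VT_MSGTYPE_ACK;
}

static int vt_msg_is_nack(uint32_t msg)
{
	return (msg & E1000_VT_MSGTYPE_NACK) == E1000_VT_MSGTYPE_NACK;
}
```

For example, a multicast request carrying an info value of 5 encodes as
0x00050003, and its acknowledgement as 0xF0050003.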

diff --git a/drivers/net/igb/Makefile b/drivers/net/igb/Makefile
index 1927b3f..ab3944c 100644
--- a/drivers/net/igb/Makefile
+++ b/drivers/net/igb/Makefile
@@ -33,5 +33,5 @@
obj-$(CONFIG_IGB) += igb.o

igb-objs := igb_main.o igb_ethtool.o e1000_82575.o \
- e1000_mac.o e1000_nvm.o e1000_phy.o
+ e1000_mac.o e1000_nvm.o e1000_phy.o e1000_vf.o

diff --git a/drivers/net/igb/e1000_82575.c b/drivers/net/igb/e1000_82575.c
index f5e2e72..bb823ac 100644
--- a/drivers/net/igb/e1000_82575.c
+++ b/drivers/net/igb/e1000_82575.c
@@ -87,6 +87,7 @@ static s32 igb_get_invariants_82575(struct e1000_hw *hw)
case E1000_DEV_ID_82576:
case E1000_DEV_ID_82576_FIBER:
case E1000_DEV_ID_82576_SERDES:
+ case E1000_DEV_ID_82576_QUAD_COPPER:
mac->type = e1000_82576;
break;
default:
diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..8c488ab 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -170,4 +170,65 @@ struct e1000_adv_tx_context_desc {
#define E1000_DCA_TXCTRL_CPUID_SHIFT 24 /* Tx CPUID now in the last byte */
#define E1000_DCA_RXCTRL_CPUID_SHIFT 24 /* Rx CPUID now in the last byte */

+#define MAX_NUM_VFS 8
+
+#define E1000_DTXSWC_VMDQ_LOOPBACK_EN (1 << 31) /* global VF LB enable */
+
+/* Easy defines for setting default pool, would normally be left at zero */
+#define E1000_VT_CTL_DEFAULT_POOL_SHIFT 7
+#define E1000_VT_CTL_DEFAULT_POOL_MASK (0x7 << E1000_VT_CTL_DEFAULT_POOL_SHIFT)
+
+/* Other useful VMD_CTL register defines */
+#define E1000_VT_CTL_DISABLE_DEF_POOL (1 << 29)
+#define E1000_VT_CTL_VM_REPL_EN (1 << 30)
+
+/* Per VM Offload register setup */
+#define E1000_VMOLR_LPE 0x00010000 /* Accept Long packet */
+#define E1000_VMOLR_AUPE 0x01000000 /* Accept untagged packets */
+#define E1000_VMOLR_BAM 0x08000000 /* Accept Broadcast packets */
+#define E1000_VMOLR_MPME 0x10000000 /* Multicast promiscuous mode */
+#define E1000_VMOLR_STRVLAN 0x40000000 /* Vlan stripping enable */
+
+#define E1000_P2VMAILBOX_STS 0x00000001 /* Initiate message send to VF */
+#define E1000_P2VMAILBOX_ACK 0x00000002 /* Ack message recv'd from VF */
+#define E1000_P2VMAILBOX_VFU 0x00000004 /* VF owns the mailbox buffer */
+#define E1000_P2VMAILBOX_PFU 0x00000008 /* PF owns the mailbox buffer */
+
+#define E1000_VLVF_ARRAY_SIZE 32
+#define E1000_VLVF_VLANID_MASK 0x00000FFF
+#define E1000_VLVF_POOLSEL_SHIFT 12
+#define E1000_VLVF_POOLSEL_MASK (0xFF << E1000_VLVF_POOLSEL_SHIFT)
+#define E1000_VLVF_VLANID_ENABLE 0x80000000
+
+#define E1000_VFMAILBOX_SIZE 16 /* 16 32 bit words - 64 bytes */
+
+/* If it's a E1000_VF_* msg then it originates in the VF and is sent to the
+ * PF. The reverse is true if it is E1000_PF_*.
+ * Message ACK's are the value or'd with 0xF0000000
+ */
+#define E1000_VT_MSGTYPE_ACK 0xF0000000 /* Messages below or'd with
+ * this are the ACK */
+#define E1000_VT_MSGTYPE_NACK 0xFF000000 /* Messages below or'd with
+ * this are the NACK */
+#define E1000_VT_MSGINFO_SHIFT 16
+/* bits 23:16 are used for extra info for certain messages */
+#define E1000_VT_MSGINFO_MASK (0xFF << E1000_VT_MSGINFO_SHIFT)
+
+#define E1000_VF_MSGTYPE_REQ_MAC 1 /* VF needs to know its MAC */
+#define E1000_VF_MSGTYPE_VFLR 2 /* VF notifies VFLR to PF */
+#define E1000_VF_SET_MULTICAST 3 /* VF requests PF to set MC addr */
+#define E1000_VF_SET_VLAN 4 /* VF requests PF to set VLAN */
+#define E1000_VF_SET_LPE 5 /* VF requests PF to set VMOLR.LPE */
+
+s32 e1000_send_mail_to_vf(struct e1000_hw *hw, u32 *msg,
+ u32 vf_number, s16 size);
+s32 e1000_receive_mail_from_vf(struct e1000_hw *hw, u32 *msg,
+ u32 vf_number, s16 size);
+void e1000_vmdq_loopback_enable_vf(struct e1000_hw *hw);
+void e1000_vmdq_loopback_disable_vf(struct e1000_hw *hw);
+void e1000_vmdq_replication_enable_vf(struct e1000_hw *hw, u32 enables);
+void e1000_vmdq_replication_disable_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_ack_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_mail_vf(struct e1000_hw *hw, u32*);
+
#endif
diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index ce70068..08f9db0 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -389,6 +389,7 @@
#define E1000_ICR_RXDMT0 0x00000010 /* rx desc min. threshold (0) */
#define E1000_ICR_RXO 0x00000040 /* rx overrun */
#define E1000_ICR_RXT0 0x00000080 /* rx timer intr (ring 0) */
+#define E1000_ICR_VMMB 0x00000100 /* VM MB event */
#define E1000_ICR_MDAC 0x00000200 /* MDIO access complete */
#define E1000_ICR_RXCFG 0x00000400 /* Rx /c/ ordered set */
#define E1000_ICR_GPI_EN0 0x00000800 /* GP Int 0 */
@@ -451,6 +452,7 @@
/* Interrupt Mask Set */
#define E1000_IMS_TXDW E1000_ICR_TXDW /* Transmit desc written back */
#define E1000_IMS_LSC E1000_ICR_LSC /* Link Status Change */
+#define E1000_IMS_VMMB E1000_ICR_VMMB /* Mail box activity */
#define E1000_IMS_RXSEQ E1000_ICR_RXSEQ /* rx sequence error */
#define E1000_IMS_RXDMT0 E1000_ICR_RXDMT0 /* rx desc min. threshold */
#define E1000_IMS_RXT0 E1000_ICR_RXT0 /* rx timer intr */
@@ -768,4 +770,9 @@
#define E1000_GEN_CTL_ADDRESS_SHIFT 8
#define E1000_GEN_POLL_TIMEOUT 640

+#define E1000_WRITE_FLUSH(a) (readl((a)->hw_addr + E1000_STATUS))
+#define E1000_MRQC_ENABLE_MASK 0x00000007
+#define E1000_MRQC_ENABLE_VMDQ 0x00000003
+#define E1000_CTRL_EXT_PFRSTD 0x00004000
+
#endif
diff --git a/drivers/net/igb/e1000_hw.h b/drivers/net/igb/e1000_hw.h
index 99504a6..b57ecfd 100644
--- a/drivers/net/igb/e1000_hw.h
+++ b/drivers/net/igb/e1000_hw.h
@@ -41,6 +41,7 @@ struct e1000_hw;
#define E1000_DEV_ID_82576 0x10C9
#define E1000_DEV_ID_82576_FIBER 0x10E6
#define E1000_DEV_ID_82576_SERDES 0x10E7
+#define E1000_DEV_ID_82576_QUAD_COPPER 0x10E8
#define E1000_DEV_ID_82575EB_COPPER 0x10A7
#define E1000_DEV_ID_82575EB_FIBER_SERDES 0x10A9
#define E1000_DEV_ID_82575GB_QUAD_COPPER 0x10D6
@@ -91,6 +92,7 @@ enum e1000_phy_type {
e1000_phy_gg82563,
e1000_phy_igp_3,
e1000_phy_ife,
+ e1000_phy_vf,
};

enum e1000_bus_type {
diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index 95523af..8a39bbc 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -262,6 +262,19 @@
#define E1000_RETA(_i) (0x05C00 + ((_i) * 4))
#define E1000_RSSRK(_i) (0x05C80 + ((_i) * 4)) /* RSS Random Key - RW Array */

+/* VT Registers */
+#define E1000_MBVFICR 0x00C80 /* Mailbox VF Cause - RWC */
+#define E1000_MBVFIMR 0x00C84 /* Mailbox VF int Mask - RW */
+#define E1000_VFLRE 0x00C88 /* VF Register Events - RWC */
+#define E1000_VFRE 0x00C8C /* VF Receive Enables */
+#define E1000_VFTE 0x00C90 /* VF Transmit Enables */
+#define E1000_DTXSWC 0x03500 /* DMA Tx Switch Control - RW */
+/* These act per VF so an array friendly macro is used */
+#define E1000_P2VMAILBOX(_n) (0x00C00 + (4 * (_n)))
+#define E1000_VMBMEM(_n) (0x00800 + (64 * (_n)))
+#define E1000_VMOLR(_n) (0x05AD0 + (4 * (_n)))
+#define E1000_VLVF(_n) (0x05D00 + (4 * (_n))) /* VLAN Virtual Machine */
+
#define wr32(reg, value) (writel(value, hw->hw_addr + reg))
#define rd32(reg) (readl(hw->hw_addr + reg))
#define wrfl() ((void)rd32(E1000_STATUS))
diff --git a/drivers/net/igb/e1000_vf.c b/drivers/net/igb/e1000_vf.c
new file mode 100644
index 0000000..9e4e566
--- /dev/null
+++ b/drivers/net/igb/e1000_vf.c
@@ -0,0 +1,223 @@
+/*******************************************************************************
+
+ Intel(R) Gigabit Ethernet Linux driver
+ Copyright(c) 2007-2008 Intel Corporation.
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms and conditions of the GNU General Public License,
+ version 2, as published by the Free Software Foundation.
+
+ This program is distributed in the hope it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc.,
+ 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+
+ The full GNU General Public License is included in this distribution in
+ the file called "COPYING".
+
+ Contact Information:
+ e1000-devel Mailing List <[email protected]>
+ Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
+*******************************************************************************/
+
+
+#include <linux/if_ether.h>
+#include <linux/delay.h>
+#include <linux/pci.h>
+#include <linux/netdevice.h>
+
+#include "igb.h"
+
+/**
+ * e1000_send_mail_to_vf - Sends a mailbox message from PF to VF
+ * @hw: pointer to the HW structure
+ * @msg: The message buffer
+ * @vf_number: the VF index
+ * @size: Length of buffer
+ **/
+s32 e1000_send_mail_to_vf(struct e1000_hw *hw, u32 *msg, u32 vf_number,
+ s16 size)
+{
+ u32 p2v_mailbox = rd32(E1000_P2VMAILBOX(vf_number));
+ s32 ret_val = 0;
+ s16 i;
+
+ /*
+ * if the VF owns the mailbox then we can't grab the mailbox buffer
+ * - mostly an indication of a programming error
+ */
+ if (p2v_mailbox & E1000_P2VMAILBOX_VFU) {
+ ret_val = -1;
+ goto out_no_write;
+ } else {
+ /* Take ownership of the buffer */
+ p2v_mailbox |= E1000_P2VMAILBOX_PFU;
+ wr32(E1000_P2VMAILBOX(vf_number), p2v_mailbox);
+ p2v_mailbox = rd32(E1000_P2VMAILBOX(vf_number));
+ /* Make sure we have ownership now... */
+ if (p2v_mailbox & E1000_P2VMAILBOX_VFU) {
+ /*
+ * oops, VF grabbed ownership while we were attempting
+ * to take ownership - avoid the race condition
+ */
+ ret_val = -2;
+ goto out_no_write;
+ }
+ }
+
+ /*
+ * At this point we have established ownership of the PF mailbox memory
+ * buffer. IT IS IMPORTANT THAT THIS OWNERSHIP BE GIVEN UP! Whether
+ * success or failure, the PF ownership bit must be cleared before
+ * exiting this function - so if you change this function keep that
+ * in mind
+ */
+
+ /* Clear PF ownership of the mail box memory buffer */
+ /*
+ * Do this whether success or failure on the wait for ack from
+ * the VF
+ */
+ p2v_mailbox &= ~(E1000_P2VMAILBOX_PFU);
+
+ /* check for overflow */
+ if (size > E1000_VFMAILBOX_SIZE) {
+ ret_val = -3;
+ goto out;
+ }
+
+ /*
+ * copy the caller specified message to the mailbox
+ * memory buffer
+ */
+ for (i = 0; i < size; i++)
+ wr32(((E1000_VMBMEM(vf_number)) + (i * 4)), msg[i]);
+
+ /* Interrupt the VF to tell it a message has been sent */
+ p2v_mailbox |= E1000_P2VMAILBOX_STS;
+
+out:
+ wr32(E1000_P2VMAILBOX(vf_number), p2v_mailbox);
+
+out_no_write:
+ return ret_val;
+
+}
+
+/**
+ * e1000_receive_mail_from_vf - Receives a mailbox message from VF to PF
+ * @hw: pointer to the HW structure
+ * @msg: The message buffer
+ * @vf_number: the VF index
+ * @size: Length of buffer
+ **/
+s32 e1000_receive_mail_from_vf(struct e1000_hw *hw,
+ u32 *msg, u32 vf_number, s16 size)
+{
+ u32 p2v_mailbox = rd32(E1000_P2VMAILBOX(vf_number));
+ s16 i;
+
+ /*
+ * Should we be checking if the VF has set the ownership bit?
+ * I don't know... presumably well written software will set the
+ * VF mailbox memory ownership bit but I can't think of a reason
+ * to call it an error if it doesn't... I'll think 'pon it some more
+ */
+
+ /*
+ * No message ready polling mechanism - the presumption is that
+ * the caller knows there is a message because of the interrupt
+ * ack
+ */
+
+ /*
+ * copy the message from the mailbox memory buffer into the
+ * caller specified buffer
+ */
+ for (i = 0; i < size; i++)
+ msg[i] = rd32(((E1000_VMBMEM(vf_number)) + (i * 4)));
+
+ /*
+ * Acknowledge receipt of the message to the VF and then
+ * we're done
+ */
+ p2v_mailbox |= E1000_P2VMAILBOX_ACK; /* Set PF Ack bit */
+ wr32(E1000_P2VMAILBOX(vf_number), p2v_mailbox);
+
+ return 0; /* Success is the only option */
+}
+
+/**
+ * e1000_vmdq_loopback_enable_vf - Enables VM to VM queue loopback replication
+ * @hw: pointer to the HW structure
+ **/
+void e1000_vmdq_loopback_enable_vf(struct e1000_hw *hw)
+{
+ u32 reg;
+
+ reg = rd32(E1000_DTXSWC);
+ reg |= E1000_DTXSWC_VMDQ_LOOPBACK_EN;
+ wr32(E1000_DTXSWC, reg);
+}
+
+/**
+ * e1000_vmdq_loopback_disable_vf - Disable VM to VM queue loopbk replication
+ * @hw: pointer to the HW structure
+ **/
+void e1000_vmdq_loopback_disable_vf(struct e1000_hw *hw)
+{
+ u32 reg;
+
+ reg = rd32(E1000_DTXSWC);
+ reg &= ~(E1000_DTXSWC_VMDQ_LOOPBACK_EN);
+ wr32(E1000_DTXSWC, reg);
+}
+
+/**
+ * e1000_vmdq_replication_enable_vf - Enable replication of brdcst & multicst
+ * @hw: pointer to the HW structure
+ *
+ * Enables replication of broadcast and multicast packets from the network
+ * to VM's which have their respective broadcast and multicast accept
+ * bits set in the VM Offload Register. This gives the PF driver per
+ * VM granularity control over which VM's get replicated broadcast traffic.
+ **/
+void e1000_vmdq_replication_enable_vf(struct e1000_hw *hw, u32 enables)
+{
+ u32 reg;
+ u32 i;
+
+ for (i = 0; i < MAX_NUM_VFS; i++) {
+ if (enables & (1 << i)) {
+ reg = rd32(E1000_VMOLR(i));
+ reg |= (E1000_VMOLR_AUPE |
+ E1000_VMOLR_BAM |
+ E1000_VMOLR_MPME);
+ wr32(E1000_VMOLR(i), reg);
+ }
+ }
+
+ reg = rd32(E1000_VMD_CTL);
+ reg |= E1000_VT_CTL_VM_REPL_EN;
+ wr32(E1000_VMD_CTL, reg);
+}
+
+/**
+ * e1000_vmdq_replication_disable_vf - Disable replication of brdcst & multicst
+ * @hw: pointer to the HW structure
+ *
+ * Disables replication of broadcast and multicast packets to the VM's.
+ **/
+void e1000_vmdq_replication_disable_vf(struct e1000_hw *hw)
+{
+ u32 reg;
+
+ reg = rd32(E1000_VMD_CTL);
+ reg &= ~(E1000_VT_CTL_VM_REPL_EN);
+ wr32(E1000_VMD_CTL, reg);
+}
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 4ff6f05..81dfd66 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -294,6 +294,16 @@ struct igb_adapter {
unsigned int lro_flushed;
unsigned int lro_no_desc;
#endif
+#ifdef CONFIG_PCI_IOV
+ unsigned int vfs_allocated_count;
+ struct work_struct msg_task;
+ u32 vf_icr;
+ u32 vflre;
+ unsigned char vf_mac_addresses[8][6];
+ u8 vfta_tracking_entry[128];
+ int int0counter;
+ int int1counter;
+#endif
};

#define IGB_FLAG_HAS_MSI (1 << 0)
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 1cbae85..bc063d4 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -62,6 +62,7 @@ static struct pci_device_id igb_pci_tbl[] = {
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82576), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82576_FIBER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82576_SERDES), board_82575 },
+ { PCI_VDEVICE(INTEL, E1000_DEV_ID_82576_QUAD_COPPER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82575EB_COPPER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82575EB_FIBER_SERDES), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82575GB_QUAD_COPPER), board_82575 },
@@ -126,6 +127,19 @@ static void igb_vlan_rx_register(struct net_device *, struct vlan_group *);
static void igb_vlan_rx_add_vid(struct net_device *, u16);
static void igb_vlan_rx_kill_vid(struct net_device *, u16);
static void igb_restore_vlan(struct igb_adapter *);
+#ifdef CONFIG_PCI_IOV
+static void igb_msg_task(struct work_struct *);
+int igb_send_msg_to_vf(struct igb_adapter *, u32 *, u32);
+static int igb_get_vf_msg_ack(struct igb_adapter *, u32);
+static int igb_rcv_msg_from_vf(struct igb_adapter *, u32);
+static int igb_set_pf_mac(struct net_device *, int, u8*);
+static void igb_enable_pf_queues(struct igb_adapter *adapter);
+static void igb_set_vf_vmolr(struct igb_adapter *adapter, int vfn);
+void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
+static int igb_vmm_control(struct igb_adapter *, bool);
+static int igb_set_vf_mac(struct net_device *, int, u8*);
+static void igb_mbox_handler(struct igb_adapter *);
+#endif

static int igb_suspend(struct pci_dev *, pm_message_t);
#ifdef CONFIG_PM
@@ -169,7 +183,7 @@ static struct pci_driver igb_driver = {
.resume = igb_resume,
#endif
.shutdown = igb_shutdown,
- .err_handler = &igb_err_handler
+ .err_handler = &igb_err_handler,
};

static int global_quad_port_a; /* global quad port a indication */
@@ -292,6 +306,11 @@ static void igb_assign_vector(struct igb_adapter *adapter, int rx_queue,
u32 msixbm = 0;
struct e1000_hw *hw = &adapter->hw;
u32 ivar, index;
+#ifdef CONFIG_PCI_IOV
+ u32 rbase_offset = adapter->vfs_allocated_count;
+#else
+ u32 rbase_offset = 0;
+#endif

switch (hw->mac.type) {
case e1000_82575:
@@ -316,9 +335,9 @@ static void igb_assign_vector(struct igb_adapter *adapter, int rx_queue,
a vector number along with a "valid" bit. Sadly, the layout
of the table is somewhat counterintuitive. */
if (rx_queue > IGB_N0_QUEUE) {
- index = (rx_queue & 0x7);
+ index = ((rx_queue + rbase_offset) & 0x7);
ivar = array_rd32(E1000_IVAR0, index);
- if (rx_queue < 8) {
+ if ((rx_queue + rbase_offset) < 8) {
/* vector goes into low byte of register */
ivar = ivar & 0xFFFFFF00;
ivar |= msix_vector | E1000_IVAR_VALID;
@@ -331,9 +350,9 @@ static void igb_assign_vector(struct igb_adapter *adapter, int rx_queue,
array_wr32(E1000_IVAR0, index, ivar);
}
if (tx_queue > IGB_N0_QUEUE) {
- index = (tx_queue & 0x7);
+ index = ((tx_queue + rbase_offset) & 0x7);
ivar = array_rd32(E1000_IVAR0, index);
- if (tx_queue < 8) {
+ if ((tx_queue + rbase_offset) < 8) {
/* vector goes into second byte of register */
ivar = ivar & 0xFFFF00FF;
ivar |= (msix_vector | E1000_IVAR_VALID) << 8;
@@ -419,6 +438,10 @@ static void igb_configure_msix(struct igb_adapter *adapter)
case e1000_82576:
tmp = (vector++ | E1000_IVAR_VALID) << 8;
wr32(E1000_IVAR_MISC, tmp);
+#ifdef CONFIG_PCI_IOV
+ if (adapter->vfs_allocated_count > 0)
+ wr32(E1000_MBVFIMR, 0xFF);
+#endif

adapter->eims_enable_mask = (1 << (vector)) - 1;
adapter->eims_other = 1 << (vector - 1);
@@ -440,6 +463,11 @@ static int igb_request_msix(struct igb_adapter *adapter)
{
struct net_device *netdev = adapter->netdev;
int i, err = 0, vector = 0;
+#ifdef CONFIG_PCI_IOV
+ u32 rbase_offset = adapter->vfs_allocated_count;
+#else
+ u32 rbase_offset = 0;
+#endif

vector = 0;

@@ -451,7 +479,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
&(adapter->tx_ring[i]));
if (err)
goto out;
- ring->itr_register = E1000_EITR(0) + (vector << 2);
+ ring->itr_register = E1000_EITR(0 + rbase_offset) + (vector << 2);
ring->itr_val = 976; /* ~4000 ints/sec */
vector++;
}
@@ -466,7 +494,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
&(adapter->rx_ring[i]));
if (err)
goto out;
- ring->itr_register = E1000_EITR(0) + (vector << 2);
+ ring->itr_register = E1000_EITR(0 + rbase_offset) + (vector << 2);
ring->itr_val = adapter->itr;
/* overwrite the poll routine for MSIX, we've already done
* netif_napi_add */
@@ -649,7 +677,11 @@ static void igb_irq_enable(struct igb_adapter *adapter)
wr32(E1000_EIAC, adapter->eims_enable_mask);
wr32(E1000_EIAM, adapter->eims_enable_mask);
wr32(E1000_EIMS, adapter->eims_enable_mask);
+#ifdef CONFIG_PCI_IOV
+ wr32(E1000_IMS, (E1000_IMS_LSC | E1000_IMS_VMMB));
+#else
wr32(E1000_IMS, E1000_IMS_LSC);
+#endif
} else {
wr32(E1000_IMS, IMS_ENABLE_MASK);
wr32(E1000_IAM, IMS_ENABLE_MASK);
@@ -773,6 +805,16 @@ int igb_up(struct igb_adapter *adapter)
if (adapter->msix_entries)
igb_configure_msix(adapter);

+#ifdef CONFIG_PCI_IOV
+ if (adapter->vfs_allocated_count > 0) {
+ igb_vmm_control(adapter, true);
+ igb_set_pf_mac(adapter->netdev,
+ adapter->vfs_allocated_count,
+ hw->mac.addr);
+ igb_enable_pf_queues(adapter);
+ }
+#endif
+
/* Clear any pending interrupts. */
rd32(E1000_ICR);
igb_irq_enable(adapter);
@@ -1189,6 +1231,9 @@ static int __devinit igb_probe(struct pci_dev *pdev,

INIT_WORK(&adapter->reset_task, igb_reset_task);
INIT_WORK(&adapter->watchdog_task, igb_watchdog_task);
+#ifdef CONFIG_PCI_IOV
+ INIT_WORK(&adapter->msg_task, igb_msg_task);
+#endif

/* Initialize link & ring properties that are user-changeable */
adapter->tx_ring->count = 256;
@@ -1404,8 +1449,13 @@ static int __devinit igb_sw_init(struct igb_adapter *adapter)

/* Number of supported queues. */
/* Having more queues than CPUs doesn't make sense. */
+#ifdef CONFIG_PCI_IOV
+ adapter->num_rx_queues = 1;
+ adapter->num_tx_queues = 1;
+#else
adapter->num_rx_queues = min((u32)IGB_MAX_RX_QUEUES, (u32)num_online_cpus());
adapter->num_tx_queues = min(IGB_MAX_TX_QUEUES, num_online_cpus());
+#endif

/* This call may decrease the number of queues depending on
* interrupt mode. */
@@ -1469,6 +1519,16 @@ static int igb_open(struct net_device *netdev)
* clean_rx handler before we do so. */
igb_configure(adapter);

+#ifdef CONFIG_PCI_IOV
+ if (adapter->vfs_allocated_count > 0) {
+ igb_vmm_control(adapter, true);
+ igb_set_pf_mac(netdev,
+ adapter->vfs_allocated_count,
+ hw->mac.addr);
+ igb_enable_pf_queues(adapter);
+ }
+#endif
+
err = igb_request_irq(adapter);
if (err)
goto err_req_irq;
@@ -1623,9 +1683,14 @@ static void igb_configure_tx(struct igb_adapter *adapter)
u32 tctl;
u32 txdctl, txctrl;
int i;
+#ifdef CONFIG_PCI_IOV
+ u32 rbase_offset = adapter->vfs_allocated_count;
+#else
+ u32 rbase_offset = 0;
+#endif

- for (i = 0; i < adapter->num_tx_queues; i++) {
- struct igb_ring *ring = &(adapter->tx_ring[i]);
+ for (i = rbase_offset; i < (adapter->num_tx_queues + rbase_offset); i++) {
+ struct igb_ring *ring = &(adapter->tx_ring[i - rbase_offset]);

wr32(E1000_TDLEN(i),
ring->count * sizeof(struct e1000_tx_desc));
@@ -1772,6 +1837,12 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
u32 rctl;
u32 srrctl = 0;
int i;
+#ifdef CONFIG_PCI_IOV
+ u32 rbase_offset = adapter->vfs_allocated_count;
+ u32 vmolr;
+#else
+ u32 rbase_offset = 0;
+#endif

rctl = rd32(E1000_RCTL);

@@ -1794,6 +1865,7 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
rctl &= ~E1000_RCTL_LPE;
else
rctl |= E1000_RCTL_LPE;
+#ifndef CONFIG_PCI_IOV
if (adapter->rx_buffer_len <= IGB_RXBUFFER_2048) {
/* Setup buffer sizes */
rctl &= ~E1000_RCTL_SZ_4096;
@@ -1818,9 +1890,12 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
break;
}
} else {
+#endif
rctl &= ~E1000_RCTL_BSEX;
srrctl = adapter->rx_buffer_len >> E1000_SRRCTL_BSIZEPKT_SHIFT;
+#ifndef CONFIG_PCI_IOV
}
+#endif

/* 82575 and greater support packet-split where the protocol
* header is placed in skb->data and the packet data is
@@ -1836,13 +1911,36 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
srrctl |= adapter->rx_ps_hdr_size <<
E1000_SRRCTL_BSIZEHDRSIZE_SHIFT;
srrctl |= E1000_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
+#ifdef CONFIG_PCI_IOV
+ srrctl |= 0x80000000;
+#endif
} else {
adapter->rx_ps_hdr_size = 0;
srrctl |= E1000_SRRCTL_DESCTYPE_ADV_ONEBUF;
}

- for (i = 0; i < adapter->num_rx_queues; i++)
+ for (i = rbase_offset; i < (adapter->num_rx_queues + rbase_offset); i++) {
wr32(E1000_SRRCTL(i), srrctl);
+#ifdef CONFIG_PCI_IOV
+ if ((rctl & E1000_RCTL_LPE) && adapter->vfs_allocated_count > 0 ) {
+ vmolr = rd32(E1000_VMOLR(i));
+ vmolr |= E1000_VMOLR_LPE;
+ wr32(E1000_VMOLR(i), vmolr);
+ }
+#endif
+ }
+
+#ifdef CONFIG_PCI_IOV
+ /* Attention!!! For SR-IOV PF driver operations you must enable
+ * queue drop for the queue 0 or the PF driver will *never* receive
+ * any traffic on its own default queue, which will be equal to the
+ * number of VFs enabled.
+ */
+ if (adapter->vfs_allocated_count > 0) {
+ srrctl = rd32(E1000_SRRCTL(0));
+ wr32(E1000_SRRCTL(0), (srrctl | 0x80000000));
+ }
+#endif

wr32(E1000_RCTL, rctl);
}
@@ -1860,6 +1958,11 @@ static void igb_configure_rx(struct igb_adapter *adapter)
u32 rctl, rxcsum;
u32 rxdctl;
int i;
+#ifdef CONFIG_PCI_IOV
+ u32 rbase_offset = adapter->vfs_allocated_count;
+#else
+ u32 rbase_offset = 0;
+#endif

/* disable receives while setting up the descriptors */
rctl = rd32(E1000_RCTL);
@@ -1872,8 +1975,8 @@ static void igb_configure_rx(struct igb_adapter *adapter)

/* Setup the HW Rx Head and Tail Descriptor Pointers and
* the Base and Length of the Rx Descriptor Ring */
- for (i = 0; i < adapter->num_rx_queues; i++) {
- struct igb_ring *ring = &(adapter->rx_ring[i]);
+ for (i = rbase_offset; i < (adapter->num_rx_queues + rbase_offset); i++) {
+ struct igb_ring *ring = &(adapter->rx_ring[i - rbase_offset]);
rdba = ring->dma;
wr32(E1000_RDBAL(i),
rdba & 0x00000000ffffffffULL);
@@ -2268,8 +2371,25 @@ static void igb_set_multi(struct net_device *netdev)
memcpy(mta_list + (i*ETH_ALEN), mc_ptr->dmi_addr, ETH_ALEN);
mc_ptr = mc_ptr->next;
}
+#ifdef CONFIG_PCI_IOV
+ if (adapter->vfs_allocated_count > 0) {
+ igb_update_mc_addr_list_82575(hw, mta_list, i,
+ adapter->vfs_allocated_count + 1,
+ mac->rar_entry_count);
+ igb_set_mc_list_pools(adapter, hw, i, mac->rar_entry_count);
+ /* TODO - if this is done after VF's are loaded and have their MC
+ * addresses set then we need to restore their entries in the MTA.
+ * This means we have to save them in the adapter structure somewhere
+ * so that we can retrieve them when this particular event occurs
+ */
+ } else {
+ igb_update_mc_addr_list_82575(hw, mta_list, i, 1,
+ mac->rar_entry_count);
+ }
+#else
igb_update_mc_addr_list_82575(hw, mta_list, i, 1,
mac->rar_entry_count);
+#endif
kfree(mta_list);
}

@@ -3274,6 +3394,22 @@ static irqreturn_t igb_msix_other(int irq, void *data)
struct e1000_hw *hw = &adapter->hw;
u32 icr = rd32(E1000_ICR);

+#ifdef CONFIG_PCI_IOV
+ adapter->int0counter++;
+
+ /* Check for a mailbox event */
+ if (icr & E1000_ICR_VMMB) {
+ adapter->vf_icr = rd32(E1000_MBVFICR);
+ /* Clear the bits */
+ wr32(E1000_MBVFICR, adapter->vf_icr);
+ E1000_WRITE_FLUSH(hw);
+ adapter->vflre = rd32(E1000_VFLRE);
+ wr32(E1000_VFLRE, adapter->vflre);
+ E1000_WRITE_FLUSH(hw);
+ igb_mbox_handler(adapter);
+ }
+#endif
+
/* reading ICR causes bit 31 of EICR to be cleared */
if (!(icr & E1000_ICR_LSC))
goto no_link_interrupt;
@@ -3283,6 +3419,11 @@ static irqreturn_t igb_msix_other(int irq, void *data)
mod_timer(&adapter->watchdog_timer, jiffies + 1);

no_link_interrupt:
+#ifdef CONFIG_PCI_IOV
+ if (adapter->vfs_allocated_count != 0)
+ wr32(E1000_IMS, E1000_IMS_LSC | E1000_IMS_VMMB);
+ else
+#endif
wr32(E1000_IMS, E1000_IMS_LSC);
wr32(E1000_EIMS, adapter->eims_other);

@@ -3342,6 +3483,10 @@ static irqreturn_t igb_msix_rx(int irq, void *data)
* previous interrupt.
*/

+#ifdef CONFIG_PCI_IOV
+ adapter->int1counter++;
+#endif
+
igb_write_itr(rx_ring);

if (netif_rx_schedule_prep(adapter->netdev, &rx_ring->napi))
@@ -4192,6 +4337,9 @@ static void igb_vlan_rx_add_vid(struct net_device *netdev, u16 vid)
vfta = array_rd32(E1000_VFTA, index);
vfta |= (1 << (vid & 0x1F));
igb_write_vfta(&adapter->hw, index, vfta);
+#ifdef CONFIG_PCI_IOV
+ adapter->vfta_tracking_entry[index] = (u8)vfta;
+#endif
}

static void igb_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
@@ -4219,6 +4367,9 @@ static void igb_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
vfta = array_rd32(E1000_VFTA, index);
vfta &= ~(1 << (vid & 0x1F));
igb_write_vfta(&adapter->hw, index, vfta);
+#ifdef CONFIG_PCI_IOV
+ adapter->vfta_tracking_entry[index] = (u8)vfta;
+#endif
}

static void igb_restore_vlan(struct igb_adapter *adapter)
@@ -4529,4 +4680,433 @@ static void igb_io_resume(struct pci_dev *pdev)

}

+#ifdef CONFIG_PCI_IOV
+static void igb_set_vf_multicasts(struct igb_adapter *adapter,
+ u32 *msgbuf, u32 vf)
+{
+ struct e1000_hw *hw = &adapter->hw;
+ int n = (msgbuf[0] & E1000_VT_MSGINFO_MASK) >> E1000_VT_MSGINFO_SHIFT;
+ int i;
+ u32 hash_value;
+ u8 *p = (u8 *)&msgbuf[1];
+
+ /* VFs are limited to using the MTA hash table for their multicast
+ * addresses */
+ for (i = 0; i < n; i++) {
+ hash_value = igb_hash_mc_addr(hw, p);
+ printk("Adding MC Addr: %2.2X:%2.2X:%2.2X:%2.2X:%2.2X:%2.2X\n"
+ "for VF %d\n",
+ p[0],
+ p[1],
+ p[2],
+ p[3],
+ p[4],
+ p[5],
+ vf);
+ printk("Hash value = 0x%03X\n", hash_value);
+ igb_mta_set(hw, hash_value);
+ p += ETH_ALEN;
+ }
+}
+
+static void igb_set_vf_vlan(struct igb_adapter *adapter,
+ u32 *msgbuf, u32 vf)
+{
+ struct e1000_hw *hw = &adapter->hw;
+ int add = (msgbuf[0] & E1000_VT_MSGINFO_MASK) >> E1000_VT_MSGINFO_SHIFT;
+ int vid = (msgbuf[1] & E1000_VLVF_VLANID_MASK);
+ u32 reg, index, vfta;
+ int i;
+
+ if (add) {
+ /* See if a vlan filter for this id is already
+ * set and enabled */
+ for (i = 0; i < E1000_VLVF_ARRAY_SIZE; i++) {
+ reg = rd32(E1000_VLVF(i));
+ if ((reg & E1000_VLVF_VLANID_ENABLE) &&
+ vid == (reg & E1000_VLVF_VLANID_MASK))
+ break;
+ }
+ if (i < E1000_VLVF_ARRAY_SIZE) {
+ /* Found an enabled entry with the same VLAN
+ * ID. Just enable the pool select bit for
+ * this requesting VF
+ */
+ reg |= 1 << (E1000_VLVF_POOLSEL_SHIFT + vf);
+ wr32(E1000_VLVF(i), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ } else {
+ /* Did not find a matching VLAN ID filter entry
+ * that was also enabled. Search for a free
+ * filter entry, i.e. one without the enable
+ * bit set
+ */
+ for (i = 0; i < E1000_VLVF_ARRAY_SIZE; i++) {
+ reg = rd32(E1000_VLVF(i));
+ if (!(reg & E1000_VLVF_VLANID_ENABLE))
+ break;
+ }
+ if (i == E1000_VLVF_ARRAY_SIZE) {
+ /* oops, no free entry, send nack */
+ msgbuf[0] |= E1000_VT_MSGTYPE_NACK;
+ } else {
+ /* add VID to filter table */
+ index = (vid >> 5) & 0x7F;
+ vfta = array_rd32(E1000_VFTA, index);
+ vfta |= (1 << (vid & 0x1F));
+ igb_write_vfta(hw, index, vfta);
+ reg |= vid;
+ reg |= 1 << (E1000_VLVF_POOLSEL_SHIFT + vf);
+ reg |= E1000_VLVF_VLANID_ENABLE;
+ wr32(E1000_VLVF(i), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ }
+ }
+ } else {
+ /* Find the vlan filter for this id */
+ for (i = 0; i < E1000_VLVF_ARRAY_SIZE; i++) {
+ reg = rd32(E1000_VLVF(i));
+ if ((reg & E1000_VLVF_VLANID_ENABLE) &&
+ vid == (reg & E1000_VLVF_VLANID_MASK))
+ break;
+ }
+ if (i == E1000_VLVF_ARRAY_SIZE) {
+ /* oops, not found. send nack */
+ msgbuf[0] |= E1000_VT_MSGTYPE_NACK;
+ } else {
+ u32 pool_sel;
+ /* Check to see if the entry belongs to more than one
+ * pool. If so just reset this VF's pool select bit
+ */
+ /* mask off the pool select bits */
+ pool_sel = (reg & E1000_VLVF_POOLSEL_MASK) >>
+ E1000_VLVF_POOLSEL_SHIFT;
+ /* reset this VF's pool select bit */
+ pool_sel &= ~(1 << vf);
+ /* check if other pools are set */
+ if (pool_sel != 0) {
+ reg &= ~(E1000_VLVF_POOLSEL_MASK);
+ reg |= pool_sel;
+ } else {
+ /* just disable the whole entry */
+ reg = 0;
+ /* remove VID from filter table *IF AND
+ * ONLY IF!!!* this entry was enabled for
+ * VFs only through a write to the VFTA
+ * table a few lines above here in this
+ * function. If this VFTA entry was added
+ * through the rx_add_vid function then
+ * we can't delete it here. */
+ index = (vid >> 5) & 0x7F;
+ if (adapter->vfta_tracking_entry[index] == 0) {
+ vfta = array_rd32(E1000_VFTA, index);
+ vfta &= ~(1 << (vid & 0x1F));
+ igb_write_vfta(hw, index, vfta);
+ }
+ }
+ wr32(E1000_VLVF(i), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ }
+ }
+}
+
+static void igb_msg_task(struct work_struct *work)
+{
+ struct igb_adapter *adapter;
+ struct e1000_hw *hw;
+ u32 bit, vf, vfr;
+ u32 vflre;
+ u32 vf_icr;
+
+ adapter = container_of(work, struct igb_adapter, msg_task);
+ hw = &adapter->hw;
+
+ vflre = adapter->vflre;
+ vf_icr = adapter->vf_icr;
+
+ /* Now that we have salted away local values of these events
+ * for processing we can enable the interrupt so more events
+ * can be captured
+ */
+
+ wr32(E1000_IMS, E1000_IMS_VMMB);
+
+ if (vflre & 0xFF) {
+ printk("VFLR Event %2.2X\n", vflre);
+ vfr = rd32(E1000_VFRE);
+ wr32(E1000_VFRE, vfr | vflre);
+ E1000_WRITE_FLUSH(hw);
+ vfr = rd32(E1000_VFTE);
+ wr32(E1000_VFTE, vfr | vflre);
+ E1000_WRITE_FLUSH(hw);
+ }
+
+ if (!vf_icr)
+ return;
+
+ /* Check for message acks from VF first as that may affect
+ * pending messages to the VF
+ */
+ for (bit = 1, vf = 0; bit < 0x100; bit <<= 1, vf++) {
+ if ((bit << 16) & vf_icr)
+ igb_get_vf_msg_ack(adapter, vf);
+ }
+
+ /* Check for message sent from a VF */
+ for (bit = 1, vf = 0; bit < 0x100; bit <<= 1, vf++) {
+ if (bit & vf_icr)
+ igb_rcv_msg_from_vf(adapter, vf);
+ }
+}
+
+int igb_send_msg_to_vf(struct igb_adapter *adapter, u32 *msg, u32 vfn)
+{
+ struct e1000_hw *hw = &adapter->hw;
+
+ return e1000_send_mail_to_vf(hw, msg, vfn, 16);
+}
+
+static int igb_get_vf_msg_ack(struct igb_adapter *adapter, u32 vf)
+{
+ return 0;
+}
+
+static int igb_rcv_msg_from_vf(struct igb_adapter *adapter, u32 vf)
+{
+ u32 msgbuf[E1000_VFMAILBOX_SIZE];
+ struct net_device *netdev = adapter->netdev;
+ struct e1000_hw *hw = &adapter->hw;
+ u32 reg;
+ s32 retval;
+ int err = 0;
+
+ retval = e1000_receive_mail_from_vf(hw, msgbuf, vf, 16);
+
+ switch ((msgbuf[0] & 0xFFFF)) {
+ case E1000_VF_MSGTYPE_REQ_MAC:
+ {
+ unsigned char *p;
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ p = (unsigned char *)&msgbuf[1];
+ memcpy(p, adapter->vf_mac_addresses[vf], ETH_ALEN);
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf))
+ == 0) {
+ printk(KERN_INFO "Sending MAC Address %2.2x:%2.2x:"
+ "%2.2x:%2.2x:%2.2x:%2.2x to VF %d\n",
+ p[0], p[1], p[2], p[3], p[4], p[5], vf);
+ igb_set_vf_mac(netdev,
+ vf,
+ adapter->vf_mac_addresses[vf]);
+ igb_set_vf_vmolr(adapter, vf);
+ }
+ else {
+ printk(KERN_ERR "Error %d Sending MAC Address to VF\n",
+ err);
+ }
+ }
+ break;
+ case E1000_VF_MSGTYPE_VFLR:
+ {
+ u32 vfe = rd32(E1000_VFTE);
+ vfe |= (1 << vf);
+ wr32(E1000_VFTE, vfe);
+ vfe = rd32(E1000_VFRE);
+ vfe |= (1 << vf);
+ wr32(E1000_VFRE, vfe);
+ printk(KERN_INFO "Enabling VFTE and VFRE for vf %d\n",
+ vf);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf))
+ != 0)
+ printk(KERN_ERR "Error %d Sending VFLR Ack "
+ "to VF\n", err);
+ }
+ break;
+ case E1000_VF_SET_MULTICAST:
+ igb_set_vf_multicasts(adapter, msgbuf, vf);
+ break;
+ case E1000_VF_SET_LPE:
+ /* Make sure global LPE is set */
+ reg = rd32(E1000_RCTL);
+ reg |= E1000_RCTL_LPE;
+ wr32(E1000_RCTL, reg);
+ /* Set per VM LPE */
+ reg = rd32(E1000_VMOLR(vf));
+ reg |= E1000_VMOLR_LPE;
+ wr32(E1000_VMOLR(vf), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf)) != 0)
+ printk(KERN_ERR "Error %d Sending set VMOLR LPE Ack "
+ "to VF\n", err);
+ break;
+ case E1000_VF_SET_VLAN:
+ igb_set_vf_vlan(adapter, msgbuf, vf);
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf)) != 0)
+ printk(KERN_ERR "Error %d Sending set VLAN ID Ack "
+ "to VF\n", err);
+ break;
+ default:
+ if ((msgbuf[0] & 0xFF000000) != E1000_VT_MSGTYPE_ACK &&
+ (msgbuf[0] & 0xFF000000) != E1000_VT_MSGTYPE_NACK)
+ printk(KERN_ERR "Unhandled Msg %8.8x\n", msgbuf[0]);
+ break;
+ }
+
+ return retval;
+}
+
+static void igb_mbox_handler(struct igb_adapter *adapter)
+{
+ schedule_work(&adapter->msg_task);
+}
+
+#define E1000_RAH(_i) (((_i) <= 15) ? (0x05404 + ((_i) * 8)) : (0x054E4 + (((_i) - 16) * 8)))
+
+static int igb_set_pf_mac(struct net_device *netdev, int queue, u8*mac_addr)
+{
+ struct igb_adapter *adapter;
+ struct e1000_hw *hw;
+ u32 reg_data;
+
+ adapter = netdev_priv(netdev);
+ hw = &adapter->hw;
+
+ /* point the pool selector for our default MAC entry to
+ * the right pool, which is equal to the number of vfs enabled.
+ */
+ reg_data = rd32(E1000_RAH(0));
+ reg_data |= (1 << (18 + queue));
+ wr32(E1000_RAH(0), reg_data);
+
+ return 0;
+}
+
+static void igb_set_vf_vmolr(struct igb_adapter *adapter, int vfn)
+{
+ struct e1000_hw *hw = &adapter->hw;
+ u32 reg_data;
+
+ reg_data = rd32(E1000_VMOLR(vfn));
+ reg_data |= 0xF << 24; /* aupe, rompe, rope, bam */
+ reg_data |= E1000_VMOLR_STRVLAN; /* Strip vlan tags */
+ wr32(E1000_VMOLR(vfn), reg_data);
+}
+
+static int igb_set_vf_mac(struct net_device *netdev,
+ int vf,
+ unsigned char *mac_addr)
+{
+ struct igb_adapter *adapter;
+ struct e1000_hw *hw;
+ u32 reg_data;
+ int rar_entry = vf + 1; /* VF MAC addresses start at entry 1 */
+
+ adapter = netdev_priv(netdev);
+ hw = &adapter->hw;
+
+ igb_rar_set(hw, mac_addr, rar_entry);
+
+ memcpy(adapter->vf_mac_addresses[vf], mac_addr, 6);
+
+ reg_data = rd32(E1000_RAH(rar_entry));
+ reg_data |= (1 << (18 + vf));
+ wr32(E1000_RAH(rar_entry), reg_data);
+
+ return 0;
+}
+
+static int igb_vmm_control(struct igb_adapter *adapter, bool enable)
+{
+ struct e1000_hw *hw;
+ u32 reg_data;
+
+ hw = &adapter->hw;
+
+ if (enable) {
+ /* Enable multi-queue */
+ reg_data = rd32(E1000_MRQC);
+ reg_data &= ~E1000_MRQC_ENABLE_MASK;
+ reg_data |= E1000_MRQC_ENABLE_VMDQ;
+ wr32(E1000_MRQC, reg_data);
+ /* VF's need PF reset indication before they
+ * can send/receive mail */
+ reg_data = rd32(E1000_CTRL_EXT);
+ reg_data |= E1000_CTRL_EXT_PFRSTD;
+ wr32(E1000_CTRL_EXT, reg_data);
+
+ /* Set the default pool for the PF's first queue */
+ reg_data = rd32(E1000_VMD_CTL);
+ reg_data &= ~(E1000_VMD_CTL | E1000_VT_CTL_DISABLE_DEF_POOL);
+ reg_data |= adapter->vfs_allocated_count <<
+ E1000_VT_CTL_DEFAULT_POOL_SHIFT;
+ wr32(E1000_VMD_CTL, reg_data);
+
+ e1000_vmdq_loopback_enable_vf(hw);
+ e1000_vmdq_replication_enable_vf(hw, 0xFF);
+ } else {
+ e1000_vmdq_loopback_disable_vf(hw);
+ e1000_vmdq_replication_disable_vf(hw);
+ }
+
+ return 0;
+}
+
+static void igb_enable_pf_queues(struct igb_adapter *adapter)
+{
+ u64 rdba;
+ int i;
+ u32 rbase_offset = adapter->vfs_allocated_count;
+ struct e1000_hw *hw = &adapter->hw;
+ u32 rxdctl;
+
+ for (i = rbase_offset;
+ i < (adapter->num_rx_queues + rbase_offset); i++) {
+ struct igb_ring *ring = &adapter->rx_ring[i - rbase_offset];
+ rdba = ring->dma;
+
+ rxdctl = rd32(E1000_RXDCTL(i));
+ rxdctl |= E1000_RXDCTL_QUEUE_ENABLE;
+ rxdctl &= 0xFFF00000;
+ rxdctl |= IGB_RX_PTHRESH;
+ rxdctl |= IGB_RX_HTHRESH << 8;
+ rxdctl |= IGB_RX_WTHRESH << 16;
+ wr32(E1000_RXDCTL(i), rxdctl);
+ printk("RXDCTL%d == %8.8x\n", i, rxdctl);
+
+ wr32(E1000_RDBAL(i),
+ rdba & 0x00000000ffffffffULL);
+ wr32(E1000_RDBAH(i), rdba >> 32);
+ wr32(E1000_RDLEN(i),
+ ring->count * sizeof(union e1000_adv_rx_desc));
+
+ writel(ring->next_to_use, adapter->hw.hw_addr + ring->tail);
+ writel(ring->next_to_clean, adapter->hw.hw_addr + ring->head);
+ }
+}
+
+void igb_set_mc_list_pools(struct igb_adapter *adapter,
+ struct e1000_hw *hw,
+ int entry_count, u16 total_rar_filters)
+{
+ u32 reg_data;
+ int i;
+ int pool = adapter->vfs_allocated_count;
+
+ for (i = adapter->vfs_allocated_count + 1; i < total_rar_filters; i++) {
+ reg_data = rd32(E1000_RAH(i));
+ reg_data |= (1 << (18 + pool));
+ wr32(E1000_RAH(i), reg_data);
+ entry_count--;
+ if (!entry_count)
+ break;
+ }
+
+ reg_data = rd32(E1000_VMOLR(pool));
+ /* Set bit 25 for this pool in the VM Offload register so that
+ * it can accept packets that match the MTA table */
+ reg_data |= (1 << 25);
+ wr32(E1000_VMOLR(pool), reg_data);
+}
+#endif
+
/* igb_main.c */
--
1.5.4.4
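
For readers following the mailbox code in the patch above, the ownership
handshake in e1000_send_mail_to_vf() can be sketched as a small userspace
model. This is only an illustration under stated assumptions: the bit
values, the structure and the function name model_send() are hypothetical
(the real E1000_P2VMAILBOX_* definitions are not part of this hunk), and
plain memory stands in for the device registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bit layout, for illustration only. */
#define MBX_STS   0x1u  /* PF tells the VF a message was sent */
#define MBX_VFU   0x4u  /* VF currently owns the buffer */
#define MBX_PFU   0x8u  /* PF currently owns the buffer */
#define MBX_WORDS 16    /* buffer size in 32-bit words (assumed) */

struct model_mailbox {
	uint32_t ctrl;            /* stands in for P2VMAILBOX(n) */
	uint32_t mem[MBX_WORDS];  /* stands in for VMBMEM(n) */
};

/*
 * Returns 0 on success, -1 if the VF owns the buffer, -3 if the
 * message does not fit. The -2 "lost the race" case of the patch
 * needs a second agent and cannot occur in this single-threaded model.
 */
int model_send(struct model_mailbox *mbx, const uint32_t *msg, size_t words)
{
	int ret = 0;
	size_t i;

	if (mbx->ctrl & MBX_VFU)
		return -1;

	mbx->ctrl |= MBX_PFU;           /* take ownership */

	if (words > MBX_WORDS) {
		ret = -3;
	} else {
		for (i = 0; i < words; i++)
			mbx->mem[i] = msg[i];
		mbx->ctrl |= MBX_STS;   /* notify the VF */
	}

	mbx->ctrl &= ~MBX_PFU;          /* always give ownership back */
	return ret;
}
```

The point of the model is the invariant stressed in the patch comment:
whichever path is taken after ownership is claimed, the PF ownership bit
is cleared before the function returns.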

2008-11-26 15:18:57

by Zhao, Yu

Subject: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

This patch integrates the IGB driver with the SR-IOV core and shows how
the SR-IOV API is used to support the capability. Integrating a PF
driver with the SR-IOV core takes little effort: the core handles
everything defined by the SR-IOV standard, and the PF driver only has to
allocate and release device-specific resources once it gets the
necessary information (i.e. the number of Virtual Functions) from
the callback function.

---
drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index bc063d4..b8c7dc6 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
static int igb_vmm_control(struct igb_adapter *, bool);
static int igb_set_vf_mac(struct net_device *, int, u8*);
static void igb_mbox_handler(struct igb_adapter *);
+static int igb_virtual(struct pci_dev *, int);
#endif

static int igb_suspend(struct pci_dev *, pm_message_t);
@@ -184,6 +185,9 @@ static struct pci_driver igb_driver = {
#endif
.shutdown = igb_shutdown,
.err_handler = &igb_err_handler,
+#ifdef CONFIG_PCI_IOV
+ .virtual = igb_virtual
+#endif
};

static int global_quad_port_a; /* global quad port a indication */
@@ -5107,6 +5111,32 @@ void igb_set_mc_list_pools(struct igb_adapter *adapter,
reg_data |= (1 << 25);
wr32(E1000_VMOLR(pool), reg_data);
}
+
+static int
+igb_virtual(struct pci_dev *pdev, int nr_virtfn)
+{
+ unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ struct igb_adapter *adapter = netdev_priv(netdev);
+ int i;
+
+ if (nr_virtfn > 7)
+ return -EINVAL;
+
+ if (nr_virtfn) {
+ for (i = 0; i < nr_virtfn; i++) {
+ printk(KERN_INFO "SR-IOV: VF %d is enabled\n", i);
+ my_mac_addr[5] = (unsigned char)i;
+ igb_set_vf_mac(netdev, i, my_mac_addr);
+ igb_set_vf_vmolr(adapter, i);
+ }
+ } else
+ printk(KERN_INFO "SR-IOV is disabled\n");
+
+ adapter->vfs_allocated_count = nr_virtfn;
+
+ return 0;
+}
#endif

/* igb_main.c */
--
1.5.4.4
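
The loop in igb_virtual() gives each VF a placeholder MAC address by overwriting the last octet of a fixed base address with the VF index. A small user-space sketch of that derivation follows; the helper names are hypothetical and the two predicates are only there to inspect the result.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* Copy the base address and set the last octet to the VF index,
 * as the loop in igb_virtual() does with my_mac_addr[5]. */
static void demo_vf_mac(const uint8_t base[DEMO_ETH_ALEN], unsigned int vf,
                        uint8_t out[DEMO_ETH_ALEN])
{
    memcpy(out, base, DEMO_ETH_ALEN);
    out[DEMO_ETH_ALEN - 1] = (uint8_t)vf;
}

/* Bit 0 of the first octet is the multicast (group) bit. */
static int demo_mac_is_unicast(const uint8_t mac[DEMO_ETH_ALEN])
{
    return !(mac[0] & 0x01);
}

/* Bit 1 of the first octet marks a locally administered address. */
static int demo_mac_is_local(const uint8_t mac[DEMO_ETH_ALEN])
{
    return !!(mac[0] & 0x02);
}
```

The base address used in the patch (00:DE:AD:BE:EF:FF) is unicast but does not have the locally administered bit set, so a non-example driver would probably want a different scheme for assigning VF addresses.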

2008-11-26 15:37:55

by Zhao, Yu

Subject: [SR-IOV driver example 3/3] VF driver tar ball

The attachment is the VF driver for the Intel 82576 NIC. Since a VF
appears as a normal PCI device, this VF driver is no different from
other PCI NIC drivers. It handles interrupts, DMA operations, etc. to
receive and transmit packets.

The design of the VF internals is up to the hardware vendor, so the
VF may have a totally different register set from the PF. This means
the VF driver may need its own logic (rather than deriving it from the
PF driver) to handle the hardware-specific parts.


Attachments:
igbvf-0.5.2.tar.gz (138.11 kB)

2008-11-26 17:01:28

by Greg KH

Subject: Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> This patch integrates the IGB driver with the SR-IOV core. It shows how
> the SR-IOV API is used to support the capability. Obviously people does
> not need to put much effort to integrate the PF driver with SR-IOV core.
> All SR-IOV standard stuff are handled by SR-IOV core and PF driver only
> concerns the device specific resource allocation and deallocation once it
> gets the necessary information (i.e. number of Virtual Functions) from
> the callback function.
>
> ---
> drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
> 1 files changed, 30 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> index bc063d4..b8c7dc6 100644
> --- a/drivers/net/igb/igb_main.c
> +++ b/drivers/net/igb/igb_main.c
> @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
> static int igb_vmm_control(struct igb_adapter *, bool);
> static int igb_set_vf_mac(struct net_device *, int, u8*);
> static void igb_mbox_handler(struct igb_adapter *);
> +static int igb_virtual(struct pci_dev *, int);
> #endif
>
> static int igb_suspend(struct pci_dev *, pm_message_t);
> @@ -184,6 +185,9 @@ static struct pci_driver igb_driver = {
> #endif
> .shutdown = igb_shutdown,
> .err_handler = &igb_err_handler,
> +#ifdef CONFIG_PCI_IOV
> + .virtual = igb_virtual
> +#endif

#ifdef should not be needed, right?

> };
>
> static int global_quad_port_a; /* global quad port a indication */
> @@ -5107,6 +5111,32 @@ void igb_set_mc_list_pools(struct igb_adapter *adapter,
> reg_data |= (1 << 25);
> wr32(E1000_VMOLR(pool), reg_data);
> }
> +
> +static int
> +igb_virtual(struct pci_dev *pdev, int nr_virtfn)
> +{
> + unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
> + struct net_device *netdev = pci_get_drvdata(pdev);
> + struct igb_adapter *adapter = netdev_priv(netdev);
> + int i;
> +
> + if (nr_virtfn > 7)
> + return -EINVAL;

Why the check for 7? Is that the max virtual functions for this card?
Shouldn't that be a define somewhere so it's easier to fix in future
versions of this hardware? :)

> +
> + if (nr_virtfn) {
> + for (i = 0; i < nr_virtfn; i++) {
> + printk(KERN_INFO "SR-IOV: VF %d is enabled\n", i);

Use dev_info() please, that shows the exact pci device and driver that
emitted the message.

> + my_mac_addr[5] = (unsigned char)i;
> + igb_set_vf_mac(netdev, i, my_mac_addr);
> + igb_set_vf_vmolr(adapter, i);
> + }
> + } else
> + printk(KERN_INFO "SR-IOV is disabled\n");

Is that really true? (oh, use dev_info as well.) What happens if you
had called this with "5" and then later with "0", you never destroyed
those existing virtual functions, yet the code does:

> + adapter->vfs_allocated_count = nr_virtfn;

Which makes the driver think they are not present. What happens when
the driver later goes to shut down? Are those resources freed up
properly?

thanks,

greg k-h
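
The concern raised above (a call with 5 followed by a call with 0 leaves the earlier VFs configured but forgotten) can be illustrated with a small user-space model. This is only a sketch under assumptions: struct demo_adapter and the setup/teardown helpers are hypothetical stand-ins for the real per-VF hardware state, and the limit of 7 is the one hard-coded in the patch.

```c
#include <errno.h>

#define DEMO_MAX_VFS 7  /* limit hard-coded in the patch's igb_virtual() */

struct demo_adapter {
    int vfs_allocated_count;      /* VFs currently set up */
    int vf_active[DEMO_MAX_VFS];  /* stand-in for per-VF hardware state */
};

static void demo_vf_setup(struct demo_adapter *a, int vf)
{
    a->vf_active[vf] = 1;
}

static void demo_vf_teardown(struct demo_adapter *a, int vf)
{
    a->vf_active[vf] = 0;
}

/* Move from the current VF count to nr_virtfn, releasing VFs that are
 * no longer wanted before the count is overwritten. */
static int demo_set_num_vfs(struct demo_adapter *a, int nr_virtfn)
{
    int i;

    if (nr_virtfn < 0 || nr_virtfn > DEMO_MAX_VFS)
        return -EINVAL;

    /* Shrinking: tear down every VF at or above the new count. */
    for (i = a->vfs_allocated_count - 1; i >= nr_virtfn; i--)
        demo_vf_teardown(a, i);

    /* Growing: set up only the newly requested VFs. */
    for (i = a->vfs_allocated_count; i < nr_virtfn; i++)
        demo_vf_setup(a, i);

    a->vfs_allocated_count = nr_virtfn;
    return 0;
}
```

The point of the model is the ordering: the old count is still needed to find what must be released, so it is updated last.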

2008-11-26 17:01:45

by Greg KH

Subject: Re: [SR-IOV driver example 0/3] introduction

On Wed, Nov 26, 2008 at 10:03:03PM +0800, Yu Zhao wrote:
> SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> of the drivers: Physical Function driver and Virtual Function driver.
> The PF driver is based on the IGB driver and is used to control PF to
> allocate hardware specific resources and interface with the SR-IOV core.
> The VF driver is a new NIC driver that is same as the traditional PCI
> device driver. It works in both the host and the guest (Xen and KVM)
> environment.
>
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.

That's funny, as some distros are already shipping this driver. You
might want to tell them that this is an "example only" driver and not to
be used "for real"... :(

greg k-h

2008-11-26 17:02:04

by Greg KH

Subject: Re: [SR-IOV driver example 3/3] VF driver tar ball

On Wed, Nov 26, 2008 at 10:40:43PM +0800, Yu Zhao wrote:
> The attachment is the VF driver for Intel 82576 NIC.

Please don't attach things as tarballs, we can't review or easily read
them at all.

Care to resend it?

thanks,

greg k-h

2008-11-26 17:54:49

by Chris Wright

Subject: Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

* Greg KH ([email protected]) wrote:
> > +static int
> > +igb_virtual(struct pci_dev *pdev, int nr_virtfn)
> > +{
> > + unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
> > + struct net_device *netdev = pci_get_drvdata(pdev);
> > + struct igb_adapter *adapter = netdev_priv(netdev);
> > + int i;
> > +
> > + if (nr_virtfn > 7)
> > + return -EINVAL;
>
> Why the check for 7? Is that the max virtual functions for this card?
> Shouldn't that be a define somewhere so it's easier to fix in future
> versions of this hardware? :)

IIRC it's 8 for the card, 1 reserved for PF. I think both notions
should be captured w/ commented constants.

thanks,
-chris
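
One way to capture both notions with commented constants might look like the sketch below. The macro and function names are hypothetical; the numbers (8 pools, 1 reserved for the PF) are the ones recalled in this message.

```c
#include <errno.h>

/* Receive pools provided by the device, as recalled in this thread. */
#define DEMO_82576_MAX_POOLS 8
/* Pools kept back for the Physical Function itself. */
#define DEMO_82576_PF_POOLS 1
/* Whatever is left can be handed out to Virtual Functions. */
#define DEMO_82576_MAX_VFS (DEMO_82576_MAX_POOLS - DEMO_82576_PF_POOLS)

/* Replace the bare "nr_virtfn > 7" test with a named limit and also
 * reject negative counts. */
static int demo_check_nr_virtfn(int nr_virtfn)
{
    if (nr_virtfn < 0 || nr_virtfn > DEMO_82576_MAX_VFS)
        return -EINVAL;
    return 0;
}
```

A later hardware revision with more pools would then only need the first constant changed.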

2008-11-26 19:27:32

by Nakajima, Jun

Subject: RE: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

On 11/26/2008 8:58:59 AM, Greg KH wrote:
> On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> > This patch integrates the IGB driver with the SR-IOV core. It shows
> > how the SR-IOV API is used to support the capability. Obviously
> > people does not need to put much effort to integrate the PF driver
> > with SR-IOV core. All SR-IOV standard stuff are handled by SR-IOV
> > core and PF driver once it gets the necessary information (i.e.
> > number of Virtual
> > Functions) from the callback function.
> >
> > ---
> > drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
> > 1 files changed, 30 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> > index bc063d4..b8c7dc6 100644
> > --- a/drivers/net/igb/igb_main.c
> > +++ b/drivers/net/igb/igb_main.c
> > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
> > static int igb_vmm_control(struct igb_adapter *, bool);
> > static int igb_set_vf_mac(struct net_device *, int, u8*);
> > static void igb_mbox_handler(struct igb_adapter *);
> > +static int igb_virtual(struct pci_dev *, int);
> > #endif
> >
> > static int igb_suspend(struct pci_dev *, pm_message_t);
> > @@ -184,6 +185,9 @@ static struct pci_driver igb_driver = {
> > #endif
> > .shutdown = igb_shutdown,
> > .err_handler = &igb_err_handler,
> > +#ifdef CONFIG_PCI_IOV
> > + .virtual = igb_virtual
> > +#endif
>
> #ifdef should not be needed, right?
>

Good point. I think this is because the driver is expected to build on older kernels also, but the problem is that the driver (and probably others) is broken unless the kernel is built with CONFIG_PCI_IOV because of the following hunk, for example.

However, we don't want to use #ifdef for the (*virtual) field in the header. One option would be to define a constant like the following along with those changes.
#define PCI_DEV_IOV

Any better idea?

Thanks,
Jun Nakajima | Intel Open Source Technology Center

----
@@ -259,6 +266,7 @@ struct pci_dev {
struct list_head msi_list;
#endif
struct pci_vpd *vpd;
+ struct pci_iov *iov;
};

extern struct pci_dev *alloc_pci_dev(void);
@@ -426,6 +434,7 @@ struct pci_driver {
int (*resume_early) (struct pci_dev *dev);
int (*resume) (struct pci_dev *dev); /* Device woken up */
void (*shutdown) (struct pci_dev *dev);
+ int (*virtual) (struct pci_dev *dev, int nr_virtfn);
struct pm_ext_ops *pm;
struct pci_error_handlers *err_handler;
struct device_driver driver;

2008-11-26 19:57:20

by Greg KH

Subject: Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

On Wed, Nov 26, 2008 at 11:27:10AM -0800, Nakajima, Jun wrote:
> On 11/26/2008 8:58:59 AM, Greg KH wrote:
> > On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> > > This patch integrates the IGB driver with the SR-IOV core. It shows
> > > how the SR-IOV API is used to support the capability. Obviously
> > > people does not need to put much effort to integrate the PF driver
> > > with SR-IOV core. All SR-IOV standard stuff are handled by SR-IOV
> > > core and PF driver once it gets the necessary information (i.e.
> > > number of Virtual
> > > Functions) from the callback function.
> > >
> > > ---
> > > drivers/net/igb/igb_main.c | 30 ++++++++++++++++++++++++++++++
> > > 1 files changed, 30 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
> > > index bc063d4..b8c7dc6 100644
> > > --- a/drivers/net/igb/igb_main.c
> > > +++ b/drivers/net/igb/igb_main.c
> > > @@ -139,6 +139,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
> > > static int igb_vmm_control(struct igb_adapter *, bool);
> > > static int igb_set_vf_mac(struct net_device *, int, u8*);
> > > static void igb_mbox_handler(struct igb_adapter *);
> > > +static int igb_virtual(struct pci_dev *, int);
> > > #endif
> > >
> > > static int igb_suspend(struct pci_dev *, pm_message_t);
> > > @@ -184,6 +185,9 @@ static struct pci_driver igb_driver = {
> > > #endif
> > > .shutdown = igb_shutdown,
> > > .err_handler = &igb_err_handler,
> > > +#ifdef CONFIG_PCI_IOV
> > > + .virtual = igb_virtual
> > > +#endif
> >
> > #ifdef should not be needed, right?
> >
>
> Good point. I think this is because the driver is expected to build on
> older kernels also,

That should not be an issue for patches that are being submitted, right?

And if this is the case, shouldn't it be called out in the changelog
entry?

> but the problem is that the driver (and probably others) is broken
> unless the kernel is built with CONFIG_PCI_IOV because of the
> following hunk, for example.
>
> However, we don't want to use #ifdef for the (*virtual) field in the
> header. One option would be to define a constant like the following
> along with those changes.
> #define PCI_DEV_IOV
>
> Any better idea?

Just always declare it in your driver, which will be added _after_ this
field gets added to the kernel tree as well. It's not a big deal, just
an ordering of patches issue.

Because remember, don't add #ifdefs to drivers, they should not be
needed at all.

thanks,

greg k-h
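
The idea of keeping the callback field unconditional, so that no driver needs an #ifdef around it, can be sketched in user space as follows. All names here are hypothetical (the patch calls the field `virtual`; the sketch uses `virtual_fn`), and returning -ENODEV for a driver without the callback is an assumption rather than what the SR-IOV core actually does.

```c
#include <errno.h>
#include <stddef.h>

struct demo_dev {
    int id;
};

struct demo_driver {
    void (*shutdown)(struct demo_dev *dev);
    /* Always part of the struct, whether or not the core has SR-IOV
     * support built in. Drivers without VF support leave it NULL. */
    int (*virtual_fn)(struct demo_dev *dev, int nr_virtfn);
};

/* The core checks for a missing callback instead of relying on
 * conditional compilation in each driver. */
static int demo_core_set_virtfn(const struct demo_driver *drv,
                                struct demo_dev *dev, int nr_virtfn)
{
    if (!drv->virtual_fn)
        return -ENODEV;
    return drv->virtual_fn(dev, nr_virtfn);
}

/* Example callback that accepts up to 7 VFs, like the patch. */
static int demo_virtual(struct demo_dev *dev, int nr_virtfn)
{
    (void)dev;
    return (nr_virtfn >= 0 && nr_virtfn <= 7) ? 0 : -EINVAL;
}

static const struct demo_driver demo_drv_with_iov = {
    .shutdown = NULL,
    .virtual_fn = demo_virtual,
};

static const struct demo_driver demo_drv_plain = {
    .shutdown = NULL,
    .virtual_fn = NULL,
};
```

Both driver definitions compile unchanged in either configuration; only the core decides what to do with the pointer.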

2008-11-26 20:15:17

by Jeff Garzik

Subject: Re: [SR-IOV driver example 0/3] introduction

Yu Zhao wrote:
> SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> of the drivers: Physical Function driver and Virtual Function driver.
> The PF driver is based on the IGB driver and is used to control PF to
> allocate hardware specific resources and interface with the SR-IOV core.
> The VF driver is a new NIC driver that is same as the traditional PCI
> device driver. It works in both the host and the guest (Xen and KVM)
> environment.
>
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.
>
> Intel 82576 NIC specification can be found at:
> http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
>
> [SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
> [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
> [SR-IOV driver example 3/3] VF driver tar ball

Please copy [email protected] on all network-related patches. This
is where the network developers live, and all patches on this list are
automatically archived for review and handling at
http://patchwork.ozlabs.org/project/netdev/list/

Jeff


2008-12-01 12:35:21

by Zhao, Yu

Subject: Re: [SR-IOV driver example 0/3] introduction

On Thu, Nov 27, 2008 at 04:14:48AM +0800, Jeff Garzik wrote:
> Yu Zhao wrote:
> > SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> > of the drivers: Physical Function driver and Virtual Function driver.
> > The PF driver is based on the IGB driver and is used to control PF to
> > allocate hardware specific resources and interface with the SR-IOV core.
> > The VF driver is a new NIC driver that is same as the traditional PCI
> > device driver. It works in both the host and the guest (Xen and KVM)
> > environment.
> >
> > These two drivers are testing versions and they are *only* intended to
> > show how to use SR-IOV API.
> >
> > Intel 82576 NIC specification can be found at:
> > http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
> >
> > [SR-IOV driver example 1/3] PF driver: allocate hardware specific resource
> > [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core
> > [SR-IOV driver example 3/3] VF driver tar ball
>
> Please copy [email protected] on all network-related patches. This
> is where the network developers live, and all patches on this list are
> automatically archived for review and handling at
> http://patchwork.ozlabs.org/project/netdev/list/

Will do.

Thanks,
Yu

2008-12-01 12:40:28

by Zhao, Yu

Subject: Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

On Thu, Nov 27, 2008 at 12:58:59AM +0800, Greg KH wrote:
> On Wed, Nov 26, 2008 at 10:21:56PM +0800, Yu Zhao wrote:
> > + my_mac_addr[5] = (unsigned char)i;
> > + igb_set_vf_mac(netdev, i, my_mac_addr);
> > + igb_set_vf_vmolr(adapter, i);
> > + }
> > + } else
> > + printk(KERN_INFO "SR-IOV is disabled\n");
>
> Is that really true? (oh, use dev_info as well.) What happens if you
> had called this with "5" and then later with "0", you never destroyed
> those existing virtual functions, yet the code does:
>
> > + adapter->vfs_allocated_count = nr_virtfn;
>
> Which makes the driver think they are not present. What happens when
> the driver later goes to shut down? Are those resources freed up
> properly?

For now we hard-code the tx/rx queue allocation, so this doesn't
matter. Eventually this will become dynamic allocation: when the number
of VFs changes, the corresponding resources will need to be freed.

I'll put more comments here.

Thanks,
Yu

2008-12-01 12:42:21

by Zhao, Yu

Subject: Re: [SR-IOV driver example 2/3] PF driver: integrate with SR-IOV core

On Thu, Nov 27, 2008 at 01:54:27AM +0800, Chris Wright wrote:
> * Greg KH ([email protected]) wrote:
> > > +static int
> > > +igb_virtual(struct pci_dev *pdev, int nr_virtfn)
> > > +{
> > > + unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
> > > + struct net_device *netdev = pci_get_drvdata(pdev);
> > > + struct igb_adapter *adapter = netdev_priv(netdev);
> > > + int i;
> > > +
> > > + if (nr_virtfn > 7)
> > > + return -EINVAL;
> >
> > Why the check for 7? Is that the max virtual functions for this card?
> > Shouldn't that be a define somewhere so it's easier to fix in future
> > versions of this hardware? :)
>
> IIRC it's 8 for the card, 1 reserved for PF. I think both notions
> should be captured w/ commented constants.

You remember correctly! I'll put some comments there as suggested.

Thanks,
Yu

2008-12-01 12:50:50

by Zhao, Yu

Subject: Re: [SR-IOV driver example 0/3] introduction

On Thu, Nov 27, 2008 at 12:59:33AM +0800, Greg KH wrote:
> On Wed, Nov 26, 2008 at 10:03:03PM +0800, Yu Zhao wrote:
> > SR-IOV drivers of Intel 82576 NIC are available. There are two parts
> > of the drivers: Physical Function driver and Virtual Function driver.
> > The PF driver is based on the IGB driver and is used to control PF to
> > allocate hardware specific resources and interface with the SR-IOV core.
> > The VF driver is a new NIC driver that is same as the traditional PCI
> > device driver. It works in both the host and the guest (Xen and KVM)
> > environment.
> >
> > These two drivers are testing versions and they are *only* intended to
> > show how to use SR-IOV API.
>
> That's funny, as some distros are already shipping this driver. You
> might want to tell them that this is an "example only" driver and not to
> be used "for real"... :(

Maybe they are shipping another version, not this one. This one is really
an experimental patch; it was created just a week ago...

2008-12-02 05:23:44

by Zhao, Yu

Subject: [SR-IOV driver example 0/3 resend] introduction

SR-IOV drivers for the Intel 82576 NIC are available. They come in two
parts: a Physical Function driver and a Virtual Function driver.
The PF driver is based on the IGB driver; it controls the PF to
allocate hardware-specific resources and interfaces with the SR-IOV core.
The VF driver is a new NIC driver that works the same way as a traditional
PCI device driver. It works in both host and guest (Xen and KVM)
environments.

These two drivers are test versions and are *only* intended to
show how to use the SR-IOV API.

Intel 82576 NIC specification can be found at:
http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

[SR-IOV driver example 0/3 resend] introduction
[SR-IOV driver example 1/3 resend] PF driver: hardware specific operations
[SR-IOV driver example 2/3 resend] PF driver: integrate with SR-IOV core
[SR-IOV driver example 3/3 resend] VF driver: an independent PCI NIC driver

2008-12-02 05:36:41

by Zhao, Yu

Subject: [SR-IOV driver example 1/3 resend] PF driver: hardware specific operations

This patch makes the IGB driver allocate hardware resources (rx/tx queues)
for Virtual Functions. All operations in this patch are hardware specific.
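
In this patch the PF's own queues are addressed with an offset of vfs_allocated_count (the rbase_offset variable), and the IVAR lookup in igb_assign_vector() uses the offset queue number. The sketch below models only that index arithmetic in user space; the struct and function names are hypothetical, and what the hardware does with each half of the IVAR register is left out.

```c
#include <stdint.h>

struct demo_ivar_slot {
    uint32_t index;  /* IVAR register index: low three bits of the queue */
    int first_half;  /* non-zero when the hardware queue number is below 8 */
};

/* Translate a PF-relative queue number into the IVAR register index
 * and half that the patch selects after adding the VF offset. */
static struct demo_ivar_slot demo_ivar_lookup(uint32_t pf_queue,
                                              uint32_t vfs_allocated)
{
    uint32_t hw_queue = pf_queue + vfs_allocated;
    struct demo_ivar_slot slot = { hw_queue & 0x7, hw_queue < 8 };

    return slot;
}
```

For example, with 7 VFs allocated, PF queue 0 maps to hardware queue 7 (index 7, first half), while PF queue 1 maps to hardware queue 8 (index 0, second half).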

From: Intel Corporation, LAN Access Division <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/net/igb/Makefile | 2 +-
drivers/net/igb/e1000_82575.c | 1 +
drivers/net/igb/e1000_82575.h | 61 +++++
drivers/net/igb/e1000_defines.h | 7 +
drivers/net/igb/e1000_hw.h | 2 +
drivers/net/igb/e1000_regs.h | 13 +
drivers/net/igb/igb.h | 8 +
drivers/net/igb/igb_main.c | 567 +++++++++++++++++++++++++++++++++++++-
drivers/pci/iov.c | 6 +-
9 files changed, 649 insertions(+), 18 deletions(-)

diff --git a/drivers/net/igb/Makefile b/drivers/net/igb/Makefile
index 1927b3f..ab3944c 100644
--- a/drivers/net/igb/Makefile
+++ b/drivers/net/igb/Makefile
@@ -33,5 +33,5 @@
obj-$(CONFIG_IGB) += igb.o

igb-objs := igb_main.o igb_ethtool.o e1000_82575.o \
- e1000_mac.o e1000_nvm.o e1000_phy.o
+ e1000_mac.o e1000_nvm.o e1000_phy.o e1000_vf.o

diff --git a/drivers/net/igb/e1000_82575.c b/drivers/net/igb/e1000_82575.c
index f5e2e72..bb823ac 100644
--- a/drivers/net/igb/e1000_82575.c
+++ b/drivers/net/igb/e1000_82575.c
@@ -87,6 +87,7 @@ static s32 igb_get_invariants_82575(struct e1000_hw *hw)
case E1000_DEV_ID_82576:
case E1000_DEV_ID_82576_FIBER:
case E1000_DEV_ID_82576_SERDES:
+ case E1000_DEV_ID_82576_QUAD_COPPER:
mac->type = e1000_82576;
break;
default:
diff --git a/drivers/net/igb/e1000_82575.h b/drivers/net/igb/e1000_82575.h
index c1928b5..8c488ab 100644
--- a/drivers/net/igb/e1000_82575.h
+++ b/drivers/net/igb/e1000_82575.h
@@ -170,4 +170,65 @@ struct e1000_adv_tx_context_desc {
#define E1000_DCA_TXCTRL_CPUID_SHIFT 24 /* Tx CPUID now in the last byte */
#define E1000_DCA_RXCTRL_CPUID_SHIFT 24 /* Rx CPUID now in the last byte */

+#define MAX_NUM_VFS 8
+
+#define E1000_DTXSWC_VMDQ_LOOPBACK_EN (1 << 31) /* global VF LB enable */
+
+/* Easy defines for setting default pool, would normally be left at zero */
+#define E1000_VT_CTL_DEFAULT_POOL_SHIFT 7
+#define E1000_VT_CTL_DEFAULT_POOL_MASK (0x7 << E1000_VT_CTL_DEFAULT_POOL_SHIFT)
+
+/* Other useful VMD_CTL register defines */
+#define E1000_VT_CTL_DISABLE_DEF_POOL (1 << 29)
+#define E1000_VT_CTL_VM_REPL_EN (1 << 30)
+
+/* Per VM Offload register setup */
+#define E1000_VMOLR_LPE 0x00010000 /* Accept Long packet */
+#define E1000_VMOLR_AUPE 0x01000000 /* Accept untagged packets */
+#define E1000_VMOLR_BAM 0x08000000 /* Accept Broadcast packets */
+#define E1000_VMOLR_MPME 0x10000000 /* Multicast promiscuous mode */
+#define E1000_VMOLR_STRVLAN 0x40000000 /* Vlan stripping enable */
+
+#define E1000_P2VMAILBOX_STS 0x00000001 /* Initiate message send to VF */
+#define E1000_P2VMAILBOX_ACK 0x00000002 /* Ack message recv'd from VF */
+#define E1000_P2VMAILBOX_VFU 0x00000004 /* VF owns the mailbox buffer */
+#define E1000_P2VMAILBOX_PFU 0x00000008 /* PF owns the mailbox buffer */
+
+#define E1000_VLVF_ARRAY_SIZE 32
+#define E1000_VLVF_VLANID_MASK 0x00000FFF
+#define E1000_VLVF_POOLSEL_SHIFT 12
+#define E1000_VLVF_POOLSEL_MASK (0xFF << E1000_VLVF_POOLSEL_SHIFT)
+#define E1000_VLVF_VLANID_ENABLE 0x80000000
+
+#define E1000_VFMAILBOX_SIZE 16 /* 16 32 bit words - 64 bytes */
+
+/* If it's a E1000_VF_* msg then it originates in the VF and is sent to the
+ * PF. The reverse is true if it is E1000_PF_*.
+ * Message ACK's are the value or'd with 0xF0000000
+ */
+#define E1000_VT_MSGTYPE_ACK 0xF0000000 /* Messages below or'd with
+ * this are the ACK */
+#define E1000_VT_MSGTYPE_NACK 0xFF000000 /* Messages below or'd with
+ * this are the NACK */
+#define E1000_VT_MSGINFO_SHIFT 16
+/* bits 23:16 are used for extra info for certain messages */
+#define E1000_VT_MSGINFO_MASK (0xFF << E1000_VT_MSGINFO_SHIFT)
+
+#define E1000_VF_MSGTYPE_REQ_MAC 1 /* VF needs to know its MAC */
+#define E1000_VF_MSGTYPE_VFLR 2 /* VF notifies VFLR to PF */
+#define E1000_VF_SET_MULTICAST 3 /* VF requests PF to set MC addr */
+#define E1000_VF_SET_VLAN 4 /* VF requests PF to set VLAN */
+#define E1000_VF_SET_LPE 5 /* VF requests PF to set VMOLR.LPE */
+
+s32 e1000_send_mail_to_vf(struct e1000_hw *hw, u32 *msg,
+ u32 vf_number, s16 size);
+s32 e1000_receive_mail_from_vf(struct e1000_hw *hw, u32 *msg,
+ u32 vf_number, s16 size);
+void e1000_vmdq_loopback_enable_vf(struct e1000_hw *hw);
+void e1000_vmdq_loopback_disable_vf(struct e1000_hw *hw);
+void e1000_vmdq_replication_enable_vf(struct e1000_hw *hw, u32 enables);
+void e1000_vmdq_replication_disable_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_ack_vf(struct e1000_hw *hw);
+bool e1000_check_for_pf_mail_vf(struct e1000_hw *hw, u32*);
+
#endif
diff --git a/drivers/net/igb/e1000_defines.h b/drivers/net/igb/e1000_defines.h
index ce70068..08f9db0 100644
--- a/drivers/net/igb/e1000_defines.h
+++ b/drivers/net/igb/e1000_defines.h
@@ -389,6 +389,7 @@
#define E1000_ICR_RXDMT0 0x00000010 /* rx desc min. threshold (0) */
#define E1000_ICR_RXO 0x00000040 /* rx overrun */
#define E1000_ICR_RXT0 0x00000080 /* rx timer intr (ring 0) */
+#define E1000_ICR_VMMB 0x00000100 /* VM MB event */
#define E1000_ICR_MDAC 0x00000200 /* MDIO access complete */
#define E1000_ICR_RXCFG 0x00000400 /* Rx /c/ ordered set */
#define E1000_ICR_GPI_EN0 0x00000800 /* GP Int 0 */
@@ -451,6 +452,7 @@
/* Interrupt Mask Set */
#define E1000_IMS_TXDW E1000_ICR_TXDW /* Transmit desc written back */
#define E1000_IMS_LSC E1000_ICR_LSC /* Link Status Change */
+#define E1000_IMS_VMMB E1000_ICR_VMMB /* Mail box activity */
#define E1000_IMS_RXSEQ E1000_ICR_RXSEQ /* rx sequence error */
#define E1000_IMS_RXDMT0 E1000_ICR_RXDMT0 /* rx desc min. threshold */
#define E1000_IMS_RXT0 E1000_ICR_RXT0 /* rx timer intr */
@@ -768,4 +770,9 @@
#define E1000_GEN_CTL_ADDRESS_SHIFT 8
#define E1000_GEN_POLL_TIMEOUT 640

+#define E1000_WRITE_FLUSH(a) (readl((a)->hw_addr + E1000_STATUS))
+#define E1000_MRQC_ENABLE_MASK 0x00000007
+#define E1000_MRQC_ENABLE_VMDQ 0x00000003
+#define E1000_CTRL_EXT_PFRSTD 0x00004000
+
#endif
diff --git a/drivers/net/igb/e1000_hw.h b/drivers/net/igb/e1000_hw.h
index 99504a6..b57ecfd 100644
--- a/drivers/net/igb/e1000_hw.h
+++ b/drivers/net/igb/e1000_hw.h
@@ -41,6 +41,7 @@ struct e1000_hw;
#define E1000_DEV_ID_82576 0x10C9
#define E1000_DEV_ID_82576_FIBER 0x10E6
#define E1000_DEV_ID_82576_SERDES 0x10E7
+#define E1000_DEV_ID_82576_QUAD_COPPER 0x10E8
#define E1000_DEV_ID_82575EB_COPPER 0x10A7
#define E1000_DEV_ID_82575EB_FIBER_SERDES 0x10A9
#define E1000_DEV_ID_82575GB_QUAD_COPPER 0x10D6
@@ -91,6 +92,7 @@ enum e1000_phy_type {
e1000_phy_gg82563,
e1000_phy_igp_3,
e1000_phy_ife,
+ e1000_phy_vf,
};

enum e1000_bus_type {
diff --git a/drivers/net/igb/e1000_regs.h b/drivers/net/igb/e1000_regs.h
index 95523af..8a39bbc 100644
--- a/drivers/net/igb/e1000_regs.h
+++ b/drivers/net/igb/e1000_regs.h
@@ -262,6 +262,19 @@
#define E1000_RETA(_i) (0x05C00 + ((_i) * 4))
#define E1000_RSSRK(_i) (0x05C80 + ((_i) * 4)) /* RSS Random Key - RW Array */

+/* VT Registers */
+#define E1000_MBVFICR 0x00C80 /* Mailbox VF Cause - RWC */
+#define E1000_MBVFIMR 0x00C84 /* Mailbox VF int Mask - RW */
+#define E1000_VFLRE 0x00C88 /* VF Register Events - RWC */
+#define E1000_VFRE 0x00C8C /* VF Receive Enables */
+#define E1000_VFTE 0x00C90 /* VF Transmit Enables */
+#define E1000_DTXSWC 0x03500 /* DMA Tx Switch Control - RW */
+/* These act per VF so an array friendly macro is used */
+#define E1000_P2VMAILBOX(_n) (0x00C00 + (4 * (_n)))
+#define E1000_VMBMEM(_n) (0x00800 + (64 * (_n)))
+#define E1000_VMOLR(_n) (0x05AD0 + (4 * (_n)))
+#define E1000_VLVF(_n) (0x05D00 + (4 * (_n))) /* VLAN Virtual Machine */
+
#define wr32(reg, value) (writel(value, hw->hw_addr + reg))
#define rd32(reg) (readl(hw->hw_addr + reg))
#define wrfl() ((void)rd32(E1000_STATUS))
diff --git a/drivers/net/igb/igb.h b/drivers/net/igb/igb.h
index 4ff6f05..47d474e 100644
--- a/drivers/net/igb/igb.h
+++ b/drivers/net/igb/igb.h
@@ -294,6 +294,14 @@ struct igb_adapter {
unsigned int lro_flushed;
unsigned int lro_no_desc;
#endif
+ unsigned int vfs_allocated_count;
+ struct work_struct msg_task;
+ u32 vf_icr;
+ u32 vflre;
+ unsigned char vf_mac_addresses[8][6];
+ u8 vfta_tracking_entry[128];
+ int int0counter;
+ int int1counter;
};

#define IGB_FLAG_HAS_MSI (1 << 0)
diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index 1cbae85..f0361ef 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -62,6 +62,7 @@ static struct pci_device_id igb_pci_tbl[] = {
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82576), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82576_FIBER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82576_SERDES), board_82575 },
+ { PCI_VDEVICE(INTEL, E1000_DEV_ID_82576_QUAD_COPPER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82575EB_COPPER), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82575EB_FIBER_SERDES), board_82575 },
{ PCI_VDEVICE(INTEL, E1000_DEV_ID_82575GB_QUAD_COPPER), board_82575 },
@@ -126,6 +127,17 @@ static void igb_vlan_rx_register(struct net_device *, struct vlan_group *);
static void igb_vlan_rx_add_vid(struct net_device *, u16);
static void igb_vlan_rx_kill_vid(struct net_device *, u16);
static void igb_restore_vlan(struct igb_adapter *);
+static void igb_msg_task(struct work_struct *);
+int igb_send_msg_to_vf(struct igb_adapter *, u32 *, u32);
+static int igb_get_vf_msg_ack(struct igb_adapter *, u32);
+static int igb_rcv_msg_from_vf(struct igb_adapter *, u32);
+static int igb_set_pf_mac(struct net_device *, int, u8*);
+static void igb_enable_pf_queues(struct igb_adapter *adapter);
+static void igb_set_vf_vmolr(struct igb_adapter *adapter, int vfn);
+void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
+static int igb_vmm_control(struct igb_adapter *, bool);
+static int igb_set_vf_mac(struct net_device *, int, u8*);
+static void igb_mbox_handler(struct igb_adapter *);

static int igb_suspend(struct pci_dev *, pm_message_t);
#ifdef CONFIG_PM
@@ -169,7 +181,7 @@ static struct pci_driver igb_driver = {
.resume = igb_resume,
#endif
.shutdown = igb_shutdown,
- .err_handler = &igb_err_handler
+ .err_handler = &igb_err_handler,
};

static int global_quad_port_a; /* global quad port a indication */
@@ -292,6 +304,7 @@ static void igb_assign_vector(struct igb_adapter *adapter, int rx_queue,
u32 msixbm = 0;
struct e1000_hw *hw = &adapter->hw;
u32 ivar, index;
+ u32 rbase_offset = adapter->vfs_allocated_count;

switch (hw->mac.type) {
case e1000_82575:
@@ -316,9 +329,9 @@ static void igb_assign_vector(struct igb_adapter *adapter, int rx_queue,
a vector number along with a "valid" bit. Sadly, the layout
of the table is somewhat counterintuitive. */
if (rx_queue > IGB_N0_QUEUE) {
- index = (rx_queue & 0x7);
+ index = ((rx_queue + rbase_offset) & 0x7);
ivar = array_rd32(E1000_IVAR0, index);
- if (rx_queue < 8) {
+ if ((rx_queue + rbase_offset) < 8) {
/* vector goes into low byte of register */
ivar = ivar & 0xFFFFFF00;
ivar |= msix_vector | E1000_IVAR_VALID;
@@ -331,9 +344,9 @@ static void igb_assign_vector(struct igb_adapter *adapter, int rx_queue,
array_wr32(E1000_IVAR0, index, ivar);
}
if (tx_queue > IGB_N0_QUEUE) {
- index = (tx_queue & 0x7);
+ index = ((tx_queue + rbase_offset) & 0x7);
ivar = array_rd32(E1000_IVAR0, index);
- if (tx_queue < 8) {
+ if ((tx_queue + rbase_offset) < 8) {
/* vector goes into second byte of register */
ivar = ivar & 0xFFFF00FF;
ivar |= (msix_vector | E1000_IVAR_VALID) << 8;
@@ -419,6 +432,8 @@ static void igb_configure_msix(struct igb_adapter *adapter)
case e1000_82576:
tmp = (vector++ | E1000_IVAR_VALID) << 8;
wr32(E1000_IVAR_MISC, tmp);
+ if (adapter->vfs_allocated_count > 0)
+ wr32(E1000_MBVFIMR, 0xFF);

adapter->eims_enable_mask = (1 << (vector)) - 1;
adapter->eims_other = 1 << (vector - 1);
@@ -440,6 +455,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
{
struct net_device *netdev = adapter->netdev;
int i, err = 0, vector = 0;
+ u32 rbase_offset = adapter->vfs_allocated_count;

vector = 0;

@@ -451,7 +467,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
&(adapter->tx_ring[i]));
if (err)
goto out;
- ring->itr_register = E1000_EITR(0) + (vector << 2);
+ ring->itr_register = E1000_EITR(0 + rbase_offset) + (vector << 2);
ring->itr_val = 976; /* ~4000 ints/sec */
vector++;
}
@@ -466,7 +482,7 @@ static int igb_request_msix(struct igb_adapter *adapter)
&(adapter->rx_ring[i]));
if (err)
goto out;
- ring->itr_register = E1000_EITR(0) + (vector << 2);
+ ring->itr_register = E1000_EITR(0 + rbase_offset) + (vector << 2);
ring->itr_val = adapter->itr;
/* overwrite the poll routine for MSIX, we've already done
* netif_napi_add */
@@ -649,7 +665,11 @@ static void igb_irq_enable(struct igb_adapter *adapter)
wr32(E1000_EIAC, adapter->eims_enable_mask);
wr32(E1000_EIAM, adapter->eims_enable_mask);
wr32(E1000_EIMS, adapter->eims_enable_mask);
+#ifdef CONFIG_PCI_IOV
+ wr32(E1000_IMS, (E1000_IMS_LSC | E1000_IMS_VMMB));
+#else
wr32(E1000_IMS, E1000_IMS_LSC);
+#endif
} else {
wr32(E1000_IMS, IMS_ENABLE_MASK);
wr32(E1000_IAM, IMS_ENABLE_MASK);
@@ -773,6 +793,14 @@ int igb_up(struct igb_adapter *adapter)
if (adapter->msix_entries)
igb_configure_msix(adapter);

+ if (adapter->vfs_allocated_count > 0) {
+ igb_vmm_control(adapter, true);
+ igb_set_pf_mac(adapter->netdev,
+ adapter->vfs_allocated_count,
+ hw->mac.addr);
+ igb_enable_pf_queues(adapter);
+ }
+
/* Clear any pending interrupts. */
rd32(E1000_ICR);
igb_irq_enable(adapter);
@@ -1189,6 +1217,7 @@ static int __devinit igb_probe(struct pci_dev *pdev,

INIT_WORK(&adapter->reset_task, igb_reset_task);
INIT_WORK(&adapter->watchdog_task, igb_watchdog_task);
+ INIT_WORK(&adapter->msg_task, igb_msg_task);

/* Initialize link & ring properties that are user-changeable */
adapter->tx_ring->count = 256;
@@ -1404,8 +1433,13 @@ static int __devinit igb_sw_init(struct igb_adapter *adapter)

/* Number of supported queues. */
/* Having more queues than CPUs doesn't make sense. */
+#ifdef CONFIG_PCI_IOV
+ adapter->num_rx_queues = 1;
+ adapter->num_tx_queues = 1;
+#else
adapter->num_rx_queues = min((u32)IGB_MAX_RX_QUEUES, (u32)num_online_cpus());
adapter->num_tx_queues = min(IGB_MAX_TX_QUEUES, num_online_cpus());
+#endif

/* This call may decrease the number of queues depending on
* interrupt mode. */
@@ -1469,6 +1503,14 @@ static int igb_open(struct net_device *netdev)
* clean_rx handler before we do so. */
igb_configure(adapter);

+ if (adapter->vfs_allocated_count > 0) {
+ igb_vmm_control(adapter, true);
+ igb_set_pf_mac(netdev,
+ adapter->vfs_allocated_count,
+ hw->mac.addr);
+ igb_enable_pf_queues(adapter);
+ }
+
err = igb_request_irq(adapter);
if (err)
goto err_req_irq;
@@ -1623,9 +1665,10 @@ static void igb_configure_tx(struct igb_adapter *adapter)
u32 tctl;
u32 txdctl, txctrl;
int i;
+ u32 rbase_offset = adapter->vfs_allocated_count;

- for (i = 0; i < adapter->num_tx_queues; i++) {
- struct igb_ring *ring = &(adapter->tx_ring[i]);
+ for (i = rbase_offset; i < (adapter->num_tx_queues + rbase_offset); i++) {
+ struct igb_ring *ring = &(adapter->tx_ring[i - rbase_offset]);

wr32(E1000_TDLEN(i),
ring->count * sizeof(struct e1000_tx_desc));
@@ -1772,6 +1815,8 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
u32 rctl;
u32 srrctl = 0;
int i;
+ u32 rbase_offset = adapter->vfs_allocated_count;
+ u32 vmolr;

rctl = rd32(E1000_RCTL);

@@ -1794,6 +1839,7 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
rctl &= ~E1000_RCTL_LPE;
else
rctl |= E1000_RCTL_LPE;
+#ifndef CONFIG_PCI_IOV
if (adapter->rx_buffer_len <= IGB_RXBUFFER_2048) {
/* Setup buffer sizes */
rctl &= ~E1000_RCTL_SZ_4096;
@@ -1818,9 +1864,12 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
break;
}
} else {
+#endif
rctl &= ~E1000_RCTL_BSEX;
srrctl = adapter->rx_buffer_len >> E1000_SRRCTL_BSIZEPKT_SHIFT;
+#ifndef CONFIG_PCI_IOV
}
+#endif

/* 82575 and greater support packet-split where the protocol
* header is placed in skb->data and the packet data is
@@ -1836,13 +1885,32 @@ static void igb_setup_rctl(struct igb_adapter *adapter)
srrctl |= adapter->rx_ps_hdr_size <<
E1000_SRRCTL_BSIZEHDRSIZE_SHIFT;
srrctl |= E1000_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
+#ifdef CONFIG_PCI_IOV
+ srrctl |= 0x80000000;
+#endif
} else {
adapter->rx_ps_hdr_size = 0;
srrctl |= E1000_SRRCTL_DESCTYPE_ADV_ONEBUF;
}

- for (i = 0; i < adapter->num_rx_queues; i++)
+ for (i = rbase_offset; i < (adapter->num_rx_queues + rbase_offset); i++) {
wr32(E1000_SRRCTL(i), srrctl);
+ if ((rctl & E1000_RCTL_LPE) && adapter->vfs_allocated_count > 0 ) {
+ vmolr = rd32(E1000_VMOLR(i));
+ vmolr |= E1000_VMOLR_LPE;
+ wr32(E1000_VMOLR(i), vmolr);
+ }
+ }
+
+ /* Attention!!! For SR-IOV PF driver operations you must enable
+ * queue drop for queue 0 or the PF driver will *never* receive
+ * any traffic on its own default queue, which will be equal to the
+ * number of VFs enabled.
+ */
+ if (adapter->vfs_allocated_count > 0) {
+ srrctl = rd32(E1000_SRRCTL(0));
+ wr32(E1000_SRRCTL(0), (srrctl | 0x80000000));
+ }

wr32(E1000_RCTL, rctl);
}
@@ -1860,6 +1928,7 @@ static void igb_configure_rx(struct igb_adapter *adapter)
u32 rctl, rxcsum;
u32 rxdctl;
int i;
+ u32 rbase_offset = adapter->vfs_allocated_count;

/* disable receives while setting up the descriptors */
rctl = rd32(E1000_RCTL);
@@ -1872,8 +1941,8 @@ static void igb_configure_rx(struct igb_adapter *adapter)

/* Setup the HW Rx Head and Tail Descriptor Pointers and
* the Base and Length of the Rx Descriptor Ring */
- for (i = 0; i < adapter->num_rx_queues; i++) {
- struct igb_ring *ring = &(adapter->rx_ring[i]);
+ for (i = rbase_offset; i < (adapter->num_rx_queues + rbase_offset); i++) {
+ struct igb_ring *ring = &(adapter->rx_ring[i - rbase_offset]);
rdba = ring->dma;
wr32(E1000_RDBAL(i),
rdba & 0x00000000ffffffffULL);
@@ -2268,8 +2337,20 @@ static void igb_set_multi(struct net_device *netdev)
memcpy(mta_list + (i*ETH_ALEN), mc_ptr->dmi_addr, ETH_ALEN);
mc_ptr = mc_ptr->next;
}
- igb_update_mc_addr_list_82575(hw, mta_list, i, 1,
- mac->rar_entry_count);
+ if (adapter->vfs_allocated_count > 0) {
+ igb_update_mc_addr_list_82575(hw, mta_list, i,
+ adapter->vfs_allocated_count + 1,
+ mac->rar_entry_count);
+ igb_set_mc_list_pools(adapter, hw, i, mac->rar_entry_count);
+ /* TODO - if this is done after VF's are loaded and have their MC
+ * addresses set then we need to restore their entries in the MTA.
+ * This means we have to save them in the adapter structure somewhere
+ * so that we can retrieve them when this particular event occurs
+ */
+ } else
+ igb_update_mc_addr_list_82575(hw, mta_list, i, 1,
+ mac->rar_entry_count);
+
kfree(mta_list);
}

@@ -3274,6 +3355,22 @@ static irqreturn_t igb_msix_other(int irq, void *data)
struct e1000_hw *hw = &adapter->hw;
u32 icr = rd32(E1000_ICR);

+#ifdef CONFIG_PCI_IOV
+ adapter->int0counter++;
+
+ /* Check for a mailbox event */
+ if (icr & E1000_ICR_VMMB) {
+ adapter->vf_icr = rd32(E1000_MBVFICR);
+ /* Clear the bits */
+ wr32(E1000_MBVFICR, adapter->vf_icr);
+ E1000_WRITE_FLUSH(hw);
+ adapter->vflre = rd32(E1000_VFLRE);
+ wr32(E1000_VFLRE, adapter->vflre);
+ E1000_WRITE_FLUSH(hw);
+ igb_mbox_handler(adapter);
+ }
+#endif
+
/* reading ICR causes bit 31 of EICR to be cleared */
if (!(icr & E1000_ICR_LSC))
goto no_link_interrupt;
@@ -3283,7 +3380,10 @@ static irqreturn_t igb_msix_other(int irq, void *data)
mod_timer(&adapter->watchdog_timer, jiffies + 1);

no_link_interrupt:
- wr32(E1000_IMS, E1000_IMS_LSC);
+ if (adapter->vfs_allocated_count)
+ wr32(E1000_IMS, E1000_IMS_LSC | E1000_IMS_VMMB);
+ else
+ wr32(E1000_IMS, E1000_IMS_LSC);
wr32(E1000_EIMS, adapter->eims_other);

return IRQ_HANDLED;
@@ -3342,6 +3442,10 @@ static irqreturn_t igb_msix_rx(int irq, void *data)
* previous interrupt.
*/

+#ifdef CONFIG_PCI_IOV
+ adapter->int1counter++;
+#endif
+
igb_write_itr(rx_ring);

if (netif_rx_schedule_prep(adapter->netdev, &rx_ring->napi))
@@ -4192,6 +4296,9 @@ static void igb_vlan_rx_add_vid(struct net_device *netdev, u16 vid)
vfta = array_rd32(E1000_VFTA, index);
vfta |= (1 << (vid & 0x1F));
igb_write_vfta(&adapter->hw, index, vfta);
+#ifdef CONFIG_PCI_IOV
+ adapter->vfta_tracking_entry[index] = (u8)vfta;
+#endif
}

static void igb_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
@@ -4219,6 +4326,9 @@ static void igb_vlan_rx_kill_vid(struct net_device *netdev, u16 vid)
vfta = array_rd32(E1000_VFTA, index);
vfta &= ~(1 << (vid & 0x1F));
igb_write_vfta(&adapter->hw, index, vfta);
+#ifdef CONFIG_PCI_IOV
+ adapter->vfta_tracking_entry[index] = (u8)vfta;
+#endif
}

static void igb_restore_vlan(struct igb_adapter *adapter)
@@ -4529,4 +4639,431 @@ static void igb_io_resume(struct pci_dev *pdev)

}

+static void igb_set_vf_multicasts(struct igb_adapter *adapter,
+ u32 *msgbuf, u32 vf)
+{
+ struct e1000_hw *hw = &adapter->hw;
+ int n = (msgbuf[0] & E1000_VT_MSGINFO_MASK) >> E1000_VT_MSGINFO_SHIFT;
+ int i;
+ u32 hash_value;
+ u8 *p = (u8 *)&msgbuf[1];
+
+ /* VFs are limited to using the MTA hash table for their multicast
+ * addresses */
+ for (i = 0; i < n; i++) {
+ hash_value = igb_hash_mc_addr(hw, p);
+ printk("Adding MC Addr: %2.2X:%2.2X:%2.2X:%2.2X:%2.2X:%2.2X\n"
+ "for VF %d\n",
+ p[0],
+ p[1],
+ p[2],
+ p[3],
+ p[4],
+ p[5],
+ vf);
+ printk("Hash value = 0x%03X\n", hash_value);
+ igb_mta_set(hw, hash_value);
+ p += ETH_ALEN;
+ }
+}
+
+static void igb_set_vf_vlan(struct igb_adapter *adapter,
+ u32 *msgbuf, u32 vf)
+{
+ struct e1000_hw *hw = &adapter->hw;
+ int add = (msgbuf[0] & E1000_VT_MSGINFO_MASK) >> E1000_VT_MSGINFO_SHIFT;
+ int vid = (msgbuf[1] & E1000_VLVF_VLANID_MASK);
+ u32 reg, index, vfta;
+ int i;
+
+ if (add) {
+ /* See if a vlan filter for this id is already
+ * set and enabled */
+ for(i = 0; i < E1000_VLVF_ARRAY_SIZE; i++) {
+ reg = rd32(E1000_VLVF(i));
+ if ((reg & E1000_VLVF_VLANID_ENABLE) &&
+ vid == (reg & E1000_VLVF_VLANID_MASK))
+ break;
+ }
+ if (i < E1000_VLVF_ARRAY_SIZE) {
+ /* Found an enabled entry with the same VLAN
+ * ID. Just enable the pool select bit for
+ * this requesting VF
+ */
+ reg |= 1 << (E1000_VLVF_POOLSEL_SHIFT + vf);
+ wr32(E1000_VLVF(i), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ } else {
+ /* Did not find a matching VLAN ID filter entry
+ * that was also enabled. Search for a free
+ * filter entry, i.e. one without the enable
+ * bit set
+ */
+ for(i = 0; i < E1000_VLVF_ARRAY_SIZE; i++) {
+ reg = rd32(E1000_VLVF(i));
+ if (!(reg & E1000_VLVF_VLANID_ENABLE))
+ break;
+ }
+ if (i == E1000_VLVF_ARRAY_SIZE) {
+ /* oops, no free entry, send nack */
+ msgbuf[0] |= E1000_VT_MSGTYPE_NACK;
+ } else {
+ /* add VID to filter table */
+ index = (vid >> 5) & 0x7F;
+ vfta = array_rd32(E1000_VFTA, index);
+ vfta |= (1 << (vid & 0x1F));
+ igb_write_vfta(hw, index, vfta);
+ reg |= vid;
+ reg |= 1 << (E1000_VLVF_POOLSEL_SHIFT + vf);
+ reg |= E1000_VLVF_VLANID_ENABLE;
+ wr32(E1000_VLVF(i), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ }
+ }
+ } else {
+ /* Find the vlan filter for this id */
+ for(i = 0; i < E1000_VLVF_ARRAY_SIZE; i++) {
+ reg = rd32(E1000_VLVF(i));
+ if ((reg & E1000_VLVF_VLANID_ENABLE) &&
+ vid == (reg & E1000_VLVF_VLANID_MASK))
+ break;
+ }
+ if (i == E1000_VLVF_ARRAY_SIZE) {
+ /* oops, not found. send nack */
+ msgbuf[0] |= E1000_VT_MSGTYPE_NACK;
+ } else {
+ u32 pool_sel;
+ /* Check to see if the entry belongs to more than one
+ * pool. If so just reset this VF's pool select bit
+ */
+ /* mask off the pool select bits */
+ pool_sel = (reg & E1000_VLVF_POOLSEL_MASK) >>
+ E1000_VLVF_POOLSEL_SHIFT;
+ /* reset this VF's pool select bit */
+ pool_sel &= ~(1 << vf);
+ /* check if other pools are set */
+ if (pool_sel != 0) {
+ reg &= ~(E1000_VLVF_POOLSEL_MASK);
+ reg |= pool_sel << E1000_VLVF_POOLSEL_SHIFT;
+ } else {
+ /* just disable the whole entry */
+ reg = 0;
+ /* remove VID from filter table *IF AND
+ * ONLY IF!!!* this entry was enabled for
+ * VFs only through a write to the VFTA
+ * table a few lines above here in this
+ * function. If this VFTA entry was added
+ * through the rx_add_vid function then
+ * we can't delete it here. */
+ index = (vid >> 5) & 0x7F;
+ if (adapter->vfta_tracking_entry[index] == 0) {
+ vfta = array_rd32(E1000_VFTA, index);
+ vfta &= ~(1 << (vid & 0x1F));
+ igb_write_vfta(hw, index, vfta);
+ }
+ }
+ wr32(E1000_VLVF(i), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ }
+ }
+}
+
+static void igb_msg_task(struct work_struct *work)
+{
+ struct igb_adapter *adapter;
+ struct e1000_hw *hw;
+ u32 bit, vf, vfr;
+ u32 vflre;
+ u32 vf_icr;
+
+ adapter = container_of(work, struct igb_adapter, msg_task);
+ hw = &adapter->hw;
+
+ vflre = adapter->vflre;
+ vf_icr = adapter->vf_icr;
+
+ /* Now that we have salted away local values of these events
+ * for processing we can enable the interrupt so more events
+ * can be captured
+ */
+
+ wr32(E1000_IMS, E1000_IMS_VMMB);
+
+ if (vflre & 0xFF) {
+ printk("VFLR Event %2.2X\n", vflre);
+ vfr = rd32(E1000_VFRE);
+ wr32(E1000_VFRE, vfr | vflre);
+ E1000_WRITE_FLUSH(hw);
+ vfr = rd32(E1000_VFTE);
+ wr32(E1000_VFTE, vfr | vflre);
+ E1000_WRITE_FLUSH(hw);
+ }
+
+ if (!vf_icr)
+ return;
+
+ /* Check for message acks from VF first as that may affect
+ * pending messages to the VF
+ */
+ for (bit = 1, vf = 0; bit < 0x100; bit <<= 1, vf++) {
+ if ((bit << 16) & vf_icr)
+ igb_get_vf_msg_ack(adapter, vf);
+ }
+
+ /* Check for message sent from a VF */
+ for (bit = 1, vf = 0; bit < 0x100; bit <<= 1, vf++) {
+ if (bit & vf_icr)
+ igb_rcv_msg_from_vf(adapter, vf);
+ }
+}
+
+int igb_send_msg_to_vf(struct igb_adapter *adapter, u32 *msg, u32 vfn)
+{
+ struct e1000_hw *hw = &adapter->hw;
+
+ return e1000_send_mail_to_vf(hw, msg, vfn, 16);
+}
+
+static int igb_get_vf_msg_ack(struct igb_adapter *adapter, u32 vf)
+{
+ return 0;
+}
+
+static int igb_rcv_msg_from_vf(struct igb_adapter *adapter, u32 vf)
+{
+ u32 msgbuf[E1000_VFMAILBOX_SIZE];
+ struct net_device *netdev = adapter->netdev;
+ struct e1000_hw *hw = &adapter->hw;
+ u32 reg;
+ s32 retval;
+ int err = 0;
+
+ retval = e1000_receive_mail_from_vf(hw, msgbuf, vf, 16);
+
+ switch ((msgbuf[0] & 0xFFFF)) {
+ case E1000_VF_MSGTYPE_REQ_MAC:
+ {
+ unsigned char *p;
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ p = (unsigned char *)&msgbuf[1];
+ memcpy(p, adapter->vf_mac_addresses[vf], ETH_ALEN);
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf))
+ == 0) {
+ printk(KERN_INFO "Sending MAC Address %2.2x:%2.2x:"
+ "%2.2x:%2.2x:%2.2x:%2.2x to VF %d\n",
+ p[0], p[1], p[2], p[3], p[4], p[5], vf);
+ igb_set_vf_mac(netdev,
+ vf,
+ adapter->vf_mac_addresses[vf]);
+ igb_set_vf_vmolr(adapter, vf);
+ }
+ else {
+ printk(KERN_ERR "Error %d Sending MAC Address to VF\n",
+ err);
+ }
+ }
+ break;
+ case E1000_VF_MSGTYPE_VFLR:
+ {
+ u32 vfe = rd32(E1000_VFTE);
+ vfe |= (1 << vf);
+ wr32(E1000_VFTE, vfe);
+ vfe = rd32(E1000_VFRE);
+ vfe |= (1 << vf);
+ wr32(E1000_VFRE, vfe);
+ printk(KERN_INFO "Enabling VFTE and VFRE for vf %d\n",
+ vf);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf))
+ != 0)
+ printk(KERN_ERR "Error %d Sending VFLR Ack "
+ "to VF\n", err);
+ }
+ break;
+ case E1000_VF_SET_MULTICAST:
+ igb_set_vf_multicasts(adapter, msgbuf, vf);
+ break;
+ case E1000_VF_SET_LPE:
+ /* Make sure global LPE is set */
+ reg = rd32(E1000_RCTL);
+ reg |= E1000_RCTL_LPE;
+ wr32(E1000_RCTL, reg);
+ /* Set per VM LPE */
+ reg = rd32(E1000_VMOLR(vf));
+ reg |= E1000_VMOLR_LPE;
+ wr32(E1000_VMOLR(vf), reg);
+ msgbuf[0] |= E1000_VT_MSGTYPE_ACK;
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf)) != 0)
+ printk(KERN_ERR "Error %d Sending set VMOLR LPE Ack "
+ "to VF\n", err);
+ break;
+ case E1000_VF_SET_VLAN:
+ igb_set_vf_vlan(adapter, msgbuf, vf);
+ if ((err = igb_send_msg_to_vf(adapter, msgbuf, vf)) != 0)
+ printk(KERN_ERR "Error %d Sending set VLAN ID Ack "
+ "to VF\n", err);
+ break;
+ default:
+ if ((msgbuf[0] & 0xFF000000) != E1000_VT_MSGTYPE_ACK &&
+ (msgbuf[0] & 0xFF000000) != E1000_VT_MSGTYPE_NACK)
+ printk(KERN_ERR "Unhandled Msg %8.8x\n", msgbuf[0]);
+ break;
+ }
+
+ return retval;
+}
+
+static void igb_mbox_handler(struct igb_adapter *adapter)
+{
+ schedule_work(&adapter->msg_task);
+}
+
+#define E1000_RAH(_i) (((_i) <= 15) ? (0x05404 + ((_i) * 8)) : (0x054E4 + ((_i - 16) * 8)))
+
+static int igb_set_pf_mac(struct net_device *netdev, int queue, u8*mac_addr)
+{
+ struct igb_adapter *adapter;
+ struct e1000_hw *hw;
+ u32 reg_data;
+
+ adapter = netdev_priv(netdev);
+ hw = &adapter->hw;
+
+ /* point the pool selector for our default MAC entry to
+ * the right pool, which is equal to the number of vfs enabled.
+ */
+ reg_data = rd32(E1000_RAH(0));
+ reg_data |= (1 << (18 + queue));
+ wr32(E1000_RAH(0), reg_data);
+
+ return 0;
+}
+
+static void igb_set_vf_vmolr(struct igb_adapter *adapter, int vfn)
+{
+ struct e1000_hw *hw = &adapter->hw;
+ u32 reg_data;
+
+ reg_data = rd32(E1000_VMOLR(vfn));
+ reg_data |= 0xF << 24; /* aupe, rompe, rope, bam */
+ reg_data |= E1000_VMOLR_STRVLAN; /* Strip vlan tags */
+ wr32(E1000_VMOLR(vfn), reg_data);
+}
+
+static int igb_set_vf_mac(struct net_device *netdev,
+ int vf,
+ unsigned char *mac_addr)
+{
+ struct igb_adapter *adapter;
+ struct e1000_hw *hw;
+ u32 reg_data;
+ int rar_entry = vf + 1; /* VF MAC addresses start at entry 1 */
+
+ adapter = netdev_priv(netdev);
+ hw = &adapter->hw;
+
+ igb_rar_set(hw, mac_addr, rar_entry);
+
+ memcpy(adapter->vf_mac_addresses[vf], mac_addr, 6);
+
+ reg_data = rd32(E1000_RAH(rar_entry));
+ reg_data |= (1 << (18 + vf));
+ wr32(E1000_RAH(rar_entry), reg_data);
+
+ return 0;
+}
+
+static int igb_vmm_control(struct igb_adapter *adapter, bool enable)
+{
+ struct e1000_hw *hw;
+ u32 reg_data;
+
+ hw = &adapter->hw;
+
+ if (enable) {
+ /* Enable multi-queue */
+ reg_data = rd32(E1000_MRQC);
+ reg_data &= E1000_MRQC_ENABLE_MASK;
+ reg_data |= E1000_MRQC_ENABLE_VMDQ;
+ wr32(E1000_MRQC, reg_data);
+ /* VF's need PF reset indication before they
+ * can send/receive mail */
+ reg_data = rd32(E1000_CTRL_EXT);
+ reg_data |= E1000_CTRL_EXT_PFRSTD;
+ wr32(E1000_CTRL_EXT, reg_data);
+
+ /* Set the default pool for the PF's first queue */
+ reg_data = rd32(E1000_VMD_CTL);
+ reg_data &= ~(E1000_VMD_CTL | E1000_VT_CTL_DISABLE_DEF_POOL);
+ reg_data |= adapter->vfs_allocated_count <<
+ E1000_VT_CTL_DEFAULT_POOL_SHIFT;
+ wr32(E1000_VMD_CTL, reg_data);
+
+ e1000_vmdq_loopback_enable_vf(hw);
+ e1000_vmdq_replication_enable_vf(hw, 0xFF);
+ } else {
+ e1000_vmdq_loopback_disable_vf(hw);
+ e1000_vmdq_replication_disable_vf(hw);
+ }
+
+ return 0;
+}
+
+static void igb_enable_pf_queues(struct igb_adapter *adapter)
+{
+ u64 rdba;
+ int i;
+ u32 rbase_offset = adapter->vfs_allocated_count;
+ struct e1000_hw *hw = &adapter->hw;
+ u32 rxdctl;
+
+ for (i = rbase_offset;
+ i < (adapter->num_rx_queues + rbase_offset); i++) {
+ struct igb_ring *ring = &adapter->rx_ring[i - rbase_offset];
+ rdba = ring->dma;
+
+ rxdctl = rd32(E1000_RXDCTL(i));
+ rxdctl |= E1000_RXDCTL_QUEUE_ENABLE;
+ rxdctl &= 0xFFF00000;
+ rxdctl |= IGB_RX_PTHRESH;
+ rxdctl |= IGB_RX_HTHRESH << 8;
+ rxdctl |= IGB_RX_WTHRESH << 16;
+ wr32(E1000_RXDCTL(i), rxdctl);
+ printk("RXDCTL%d == %8.8x\n", i, rxdctl);
+
+ wr32(E1000_RDBAL(i),
+ rdba & 0x00000000ffffffffULL);
+ wr32(E1000_RDBAH(i), rdba >> 32);
+ wr32(E1000_RDLEN(i),
+ ring->count * sizeof(union e1000_adv_rx_desc));
+
+ writel(ring->next_to_use, adapter->hw.hw_addr + ring->tail);
+ writel(ring->next_to_clean, adapter->hw.hw_addr + ring->head);
+ }
+}
+
+void igb_set_mc_list_pools(struct igb_adapter *adapter,
+ struct e1000_hw *hw,
+ int entry_count, u16 total_rar_filters)
+{
+ u32 reg_data;
+ int i;
+ int pool = adapter->vfs_allocated_count;
+
+ for (i = adapter->vfs_allocated_count + 1; i < total_rar_filters; i++) {
+ reg_data = rd32(E1000_RAH(i));
+ reg_data |= (1 << (18 + pool));
+ wr32(E1000_RAH(i), reg_data);
+ entry_count--;
+ if (!entry_count)
+ break;
+ }
+
+ reg_data = rd32(E1000_VMOLR(pool));
+ /* Set bit 25 for this pool in the VM Offload register so that
+ * it can accept packets that match the MTA table */
+ reg_data |= (1 << 25);
+ wr32(E1000_VMOLR(pool), reg_data);
+}
+
/* igb_main.c */
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b4c1b5a..79b49e5 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -487,9 +487,11 @@ void pci_iov_unregister(struct pci_dev *dev)

sysfs_remove_group(&dev->dev.kobj, &iov_attr_group);

- mutex_lock(&pdev->iov->physfn->iov->lock);
+ mutex_lock(&dev->iov->physfn->iov->lock);
+
iov_disable(dev);
- mutex_unlock(&pdev->iov->physfn->iov->lock);
+
+ mutex_unlock(&dev->iov->physfn->iov->lock);

kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
}
--
1.5.6.4
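
Reading aid for the mailbox path in igb_msg_task() above: judging only from the two loops in the patch, bits 0-7 of the saved MBVFICR value flag a message from VF n, bits 16-23 flag an ack from VF n, and acks are handled before messages. The following is a minimal user-space sketch of that decode order, not driver code. The sketch_* names and SKETCH_NUM_VFS are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Hypothetical constant: the loops in the patch walk 8 bits per group. */
#define SKETCH_NUM_VFS 8

enum sketch_event { SKETCH_ACK, SKETCH_MSG };

struct sketch_dispatch {
	enum sketch_event kind;	/* ack from a VF, or message from a VF */
	unsigned int vf;	/* VF number the event belongs to */
};

/*
 * Fill out[] in the order the patch would handle the events: all acks
 * (bits 16-23) first, then all messages (bits 0-7). out[] must have room
 * for 2 * SKETCH_NUM_VFS entries. Returns the number of entries written.
 */
static unsigned int sketch_decode_mbvficr(uint32_t vf_icr,
					  struct sketch_dispatch *out)
{
	unsigned int vf, n = 0;

	for (vf = 0; vf < SKETCH_NUM_VFS; vf++) {
		if (vf_icr & (UINT32_C(1) << (16 + vf))) {
			out[n].kind = SKETCH_ACK;
			out[n].vf = vf;
			n++;
		}
	}

	for (vf = 0; vf < SKETCH_NUM_VFS; vf++) {
		if (vf_icr & (UINT32_C(1) << vf)) {
			out[n].kind = SKETCH_MSG;
			out[n].vf = vf;
			n++;
		}
	}

	return n;
}
```

For example, a value of 0x00020005 would decode to an ack from VF 1 followed by messages from VF 0 and VF 2.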

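The VLAN paths in the patch (igb_vlan_rx_add_vid(), igb_vlan_rx_kill_vid() and igb_set_vf_vlan()) all address the VLAN filter table the same way: the word index is (vid >> 5) & 0x7F and the bit within the word is vid & 0x1F. A small sketch of that arithmetic against a plain array follows, assuming a 128-word table as implied by the 0x7F mask. The sketch_* names are hypothetical and the array stands in for the hardware registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 128 words of 32 bits cover all 4096 VLAN IDs. */
#define SKETCH_VFTA_WORDS 128

/* Word index in the table for a VLAN ID, as computed in the patch. */
static unsigned int sketch_vfta_index(uint16_t vid)
{
	return (vid >> 5) & 0x7F;
}

/* Bit mask inside that word for a VLAN ID. */
static uint32_t sketch_vfta_mask(uint16_t vid)
{
	return UINT32_C(1) << (vid & 0x1F);
}

/* Set or clear the filter bit for one VLAN ID in an in-memory table. */
static void sketch_vfta_set(uint32_t *vfta, uint16_t vid, bool on)
{
	unsigned int index = sketch_vfta_index(vid);

	if (on)
		vfta[index] |= sketch_vfta_mask(vid);
	else
		vfta[index] &= ~sketch_vfta_mask(vid);
}
```

With this layout VLAN ID 100 lands in word 3, bit 4, since 100 = 3 * 32 + 4.
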
2008-12-02 05:38:54

by Zhao, Yu

Subject: [SR-IOV driver example 2/3 resend] PF driver: integrate with SR-IOV core

This patch integrates the IGB driver with the SR-IOV core. It shows how
the SR-IOV API is used to support the capability. Integrating a PF driver
with the SR-IOV core takes little effort: the core handles everything the
SR-IOV standard defines, and the PF driver only deals with device-specific
resource allocation and deallocation once the callback function has given
it the necessary information (i.e. the number of Virtual Functions).

From: Intel Corporation, LAN Access Division <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/net/igb/igb_main.c | 46 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/drivers/net/igb/igb_main.c b/drivers/net/igb/igb_main.c
index f0361ef..78bda11 100644
--- a/drivers/net/igb/igb_main.c
+++ b/drivers/net/igb/igb_main.c
@@ -138,6 +138,7 @@ void igb_set_mc_list_pools(struct igb_adapter *, struct e1000_hw *, int, u16);
static int igb_vmm_control(struct igb_adapter *, bool);
static int igb_set_vf_mac(struct net_device *, int, u8*);
static void igb_mbox_handler(struct igb_adapter *);
+static int igb_virtual(struct pci_dev *, int);

static int igb_suspend(struct pci_dev *, pm_message_t);
#ifdef CONFIG_PM
@@ -182,6 +183,7 @@ static struct pci_driver igb_driver = {
#endif
.shutdown = igb_shutdown,
.err_handler = &igb_err_handler,
+ .virtual = igb_virtual
};

static int global_quad_port_a; /* global quad port a indication */
@@ -5066,4 +5068,48 @@ void igb_set_mc_list_pools(struct igb_adapter *adapter,
wr32(E1000_VMOLR(pool), reg_data);
}

+static int
+igb_virtual(struct pci_dev *pdev, int nr_virtfn)
+{
+ int i;
+ struct net_device *netdev = pci_get_drvdata(pdev);
+ struct igb_adapter *adapter = netdev_priv(netdev);
+ /* the VFs' MAC addresses are hard-coded */
+ unsigned char my_mac_addr[6] = {0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0xFF};
+
+ /*
+ * The 82576 NIC supports a 1-PF + 7-VF mode and an 8-VF mode.
+ * In the 8-VF mode the PF can't tx/rx packets; it only acts as
+ * a 'VF supervisor'. For now we use the 1-PF + 7-VF mode to
+ * preserve the PF's tx/rx capability for debugging
+ * purposes.
+ */
+ if (nr_virtfn > (MAX_NUM_VFS - 1))
+ return -EINVAL;
+
+ if (nr_virtfn) {
+ dev_info(&pdev->dev, "SR-IOV is enabled\n");
+ /*
+ * Currently VF resources are pre-allocated, so just set
+ * the MAC address of each VF here.
+ */
+ for (i = 0; i < nr_virtfn; i++) {
+ my_mac_addr[5] = (unsigned char)i;
+ igb_set_vf_mac(netdev, i, my_mac_addr);
+ igb_set_vf_vmolr(adapter, i);
+ }
+ } else {
+ /*
+ * Because tx/rx queues are statically allocated for the PF
+ * and the VFs, there are no VF related resources to free
+ * here.
+ */
+ dev_info(&pdev->dev, "SR-IOV is disabled\n");
+ }
+
+ adapter->vfs_allocated_count = nr_virtfn;
+
+ return 0;
+}
+
/* igb_main.c */
--
1.5.6.4
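
Reading aid for the igb_virtual() callback above: stripped of the hardware writes, it validates the requested VF count, derives one MAC address per VF by varying the last byte of a fixed prefix, and records the count. The following is a user-space sketch of that flow under stated assumptions. SKETCH_MAX_NUM_VFS is assumed to be 8 (the patch does not show the value of MAX_NUM_VFS), the sketch_* names are hypothetical, and the negative-count check is an addition not present in the patch.

```c
#include <errno.h>
#include <string.h>

#define SKETCH_MAX_NUM_VFS 8	/* assumption; MAX_NUM_VFS is not shown */
#define SKETCH_ETH_ALEN 6

/* Stand-in for the few adapter fields the callback touches. */
struct sketch_adapter {
	unsigned int vfs_allocated_count;
	unsigned char vf_mac[SKETCH_MAX_NUM_VFS][SKETCH_ETH_ALEN];
};

/*
 * Model of the callback: nr_virtfn == 0 means "SR-IOV disabled",
 * 1..SKETCH_MAX_NUM_VFS-1 means "enable that many VFs" (one pool is
 * kept for the PF), anything else is rejected with -EINVAL.
 */
static int sketch_virtual(struct sketch_adapter *adapter, int nr_virtfn)
{
	/* Same hard-coded prefix as the patch; byte 5 is set per VF. */
	static const unsigned char base[SKETCH_ETH_ALEN] = {
		0x00, 0xDE, 0xAD, 0xBE, 0xEF, 0x00
	};
	int i;

	if (nr_virtfn < 0 || nr_virtfn > SKETCH_MAX_NUM_VFS - 1)
		return -EINVAL;

	for (i = 0; i < nr_virtfn; i++) {
		memcpy(adapter->vf_mac[i], base, SKETCH_ETH_ALEN);
		adapter->vf_mac[i][5] = (unsigned char)i;
	}

	adapter->vfs_allocated_count = (unsigned int)nr_virtfn;
	return 0;
}
```

Under these assumptions a request for 8 VFs is rejected, while a request for 3 VFs succeeds and gives the VFs addresses ending in 00, 01 and 02.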

2008-12-03 03:12:34

by Jeff Kirsher

Subject: Re: [SR-IOV driver example 0/3 resend] introduction

On Tue, Dec 2, 2008 at 1:27 AM, Yu Zhao <[email protected]> wrote:
> SR-IOV drivers for the Intel 82576 NIC are available. There are two
> parts: the Physical Function driver and the Virtual Function driver.
> The PF driver is based on the IGB driver; it controls the PF to allocate
> hardware-specific resources and to interface with the SR-IOV core.
> The VF driver is a new NIC driver that works like a traditional PCI
> device driver. It works in both host and guest (Xen and KVM)
> environments.
>
> These two drivers are testing versions and they are *only* intended to
> show how to use SR-IOV API.
>
> Intel 82576 NIC specification can be found at:
> http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf
>
> [SR-IOV driver example 0/3 resend] introduction
> [SR-IOV driver example 1/3 resend] PF driver: hardware specific operations
> [SR-IOV driver example 2/3 resend] PF driver: integrate with SR-IOV core
> [SR-IOV driver example 3/3 resend] VF driver: an independent PCI NIC driver
> --
>

First of all, we (e1000-devel) do support the SR-IOV API.

With that said, NAK on the driver changes. We were not involved in
these changes and are currently working on a version of the drivers
that will make them acceptable for kernel inclusion.

--
Cheers,
Jeff

2008-12-18 22:42:58

by Rose, Gregory V

Subject: RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Jesse Barnes wrote:
>
> Hm, that's not the answer I was hoping for. :) (Was looking for,
> "Yeah we just need these bits queued and we'll send an update for
> e1000 right away." :)
>
> I really don't want the SR-IOV stuff to sit out another merge cycle
> though... Arg.

We will have drivers that support these APIs posted to the
lists within two or three days. These drivers are RFC only
and not to be pushed upstream. More non-Xen testing needs to
happen with the 82576 HW.

- Greg

2008-12-16 23:24:16

by Jesse Barnes

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Friday, November 21, 2008 10:36 am Yu Zhao wrote:
> Greetings,
>
> Following patches are intended to support SR-IOV capability in the
> Linux kernel. With these patches, people can turn a PCI device with
> the capability into multiple ones from software perspective, which
> will benefit KVM and achieve other purposes such as QoS, security,
> and etc.
>
> The Physical Function and Virtual Function drivers using the SR-IOV
> APIs will come soon!
>
> Major changes from v6 to v7:
> 1, remove boot-time resource rebalancing support. (Greg KH)
> 2, emit uevent upon the PF driver is loaded. (Greg KH)
> 3, put SR-IOV callback function into the 'pci_driver'. (Matthew Wilcox)
> 4, register SR-IOV service at the PF loading stage.
> 5, remove unnecessary APIs (pci_iov_enable/disable).

Thanks for your patience with this, Yu, I know it's been a long haul. :)

I applied 1-9 to my linux-next branch; and at least patch #10 needs a respin,
so can you re-do 10-13 as a new patch set?

On re-reading the last thread, there was a lot of smoke, but very little fire
afaict. The main questions I saw were:

1) do we need SR-IOV at all? why not just make each subsystem export
devices to guests?
This is a bit of a red herring. Nothing about SR-IOV prevents us from
making subsystems more v12n friendly. And since SR-IOV is a hardware
feature supported by devices these days, we should make Linux support it.

2) should the PF/VF drivers be the same or not?
Again, the SR-IOV patchset and PCI spec don't dictate this. We're free to
do what we want here.

3) should VF devices be represented by pci_dev structs?
Yes. (This is an easy one :)

4) can VF devices be used on the host?
Yet again, SR-IOV doesn't dictate this. Developers can make PF/VF combo
drivers or split them, and export the resulting devices however they want.
Some subsystem work may be needed to make this efficient, but SR-IOV
itself is agnostic about it.

So overall I didn't see many objections to the actual code in the last post,
and the issues above certainly don't merit a NAK IMO...

Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
I'd be much happier about it if we got some driver code along with it, so as
not to have an unused interface sitting around for who knows how many
releases. Is that reasonable? Do you know if any of the corresponding PF/VF
driver bits are ready yet?

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2008-12-17 02:38:16

by Jike Song

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Jesse Barnes wrote:
> Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
> I'd be much happier about it if we got some driver code along with it, so as
> not to have an unused interface sitting around for who knows how many
> releases. Is that reasonable? Do you know if any of the corresponding PF/VF
> driver bits are ready yet?

Hi Jesse,

Yu Zhao posted a patch set with the subject "SR-IOV driver example"
on November 26, which illustrates the usage of the SR-IOV API in the Intel
82576 VF/PF drivers ;-)

--
Thanks,
Jike

2008-12-17 06:06:27

by Greg KH

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
> Jesse Barnes wrote:
> > Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
> > I'd be much happier about it if we got some driver code along with it, so as
> > not to have an unused interface sitting around for who knows how many
> > releases. Is that reasonable? Do you know if any of the corresponding PF/VF
> > driver bits are ready yet?
>
> Hi Jesse,
>
> Yu Zhao posted a patch set with the subject "SR-IOV driver example"
> on November 26, which illustrates the usage of the SR-IOV API in the Intel
> 82576 VF/PF drivers ;-)

Yes, but that driver was soundly rejected by the network driver
maintainers, so I wouldn't go around showing that as your primary
example of how to use this interface :)

The point is valid, I don't think these apis should go into the tree
without a driver or some other code using them. Otherwise they make no
sense at all to have in-tree.

thanks,

greg k-h

2008-12-17 07:07:42

by Zhao, Yu

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
>> Jesse Barnes wrote:
>>> Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
>>> I'd be much happier about it if we got some driver code along with it, so as
>>> not to have an unused interface sitting around for who knows how many
>>> releases. Is that reasonable? Do you know if any of the corresponding PF/VF
>>> driver bits are ready yet?
>> Hi Jesse,
>>
>> Yu Zhao posted a patch set with the subject "SR-IOV driver example"
>> on November 26, which illustrates the usage of the SR-IOV API in the Intel
>> 82576 VF/PF drivers ;-)
>
> Yes, but that driver was soundly rejected by the network driver
> maintainers, so I wouldn't go around showing that as your primary
> example of how to use this interface :)
>
> The point is valid, I don't think these apis should go into the tree
> without a driver or some other code using them. Otherwise they make no
> sense at all to have in-tree.

I agree the point is valid, but on the other hand this is a
chicken-and-egg problem: if we don't have the SR-IOV base, people who are
developing PF drivers cannot get their changes in-tree. Maybe they are
holding the patches and waiting on the infrastructure... :-)

2008-12-17 07:22:18

by Greg KH

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wed, Dec 17, 2008 at 03:07:23PM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
>>> Jesse Barnes wrote:
>>>> Given a respin of 10-13 I think it's reasonable to merge this into
>>>> 2.6.29, but I'd be much happier about it if we got some driver code
>>>> along with it, so as not to have an unused interface sitting around for
>>>> who knows how many releases. Is that reasonable? Do you know if any of
>>>> the corresponding PF/VF driver bits are ready yet?
>>> Hi Jesse,
>>> Yu Zhao posted a patch set with the subject "SR-IOV driver example" on
>>> November 26, which illustrates the usage of the SR-IOV API in the Intel
>>> 82576 VF/PF drivers ;-)
>> Yes, but that driver was soundly rejected by the network driver
>> maintainers, so I wouldn't go around showing that as your primary
>> example of how to use this interface :)
>> The point is valid, I don't think these apis should go into the tree
>> without a driver or some other code using them. Otherwise they make no
>> sense at all to have in-tree.
>
> I agree the point is valid, but on another hand this is a 'the chicken &
> the egg' problem -- if we don't have the SR-IOV base, people who are
> developing PF drivers can not get their changes in-tree. Maybe they are
> holding the patches and waiting on the infrastructure... :-)

Are they? They can both go in at the same time, like almost every other
api addition to the kernel, right?

thanks,

greg k-h

2008-12-17 11:44:53

by Fischer, Anna

Subject: RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

> From: [email protected] [mailto:linux-pci-
> [email protected]] On Behalf Of Jesse Barnes
> Sent: 16 December 2008 23:24
> To: Yu Zhao
> Cc: [email protected]; Chiang, Alexander; Helgaas, Bjorn;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
>
> On Friday, November 21, 2008 10:36 am Yu Zhao wrote:
> > Greetings,
> >
> > Following patches are intended to support SR-IOV capability in the
> > Linux kernel. With these patches, people can turn a PCI device with
> > the capability into multiple ones from software perspective, which
> > will benefit KVM and achieve other purposes such as QoS, security,
> > and etc.
> >
> > The Physical Function and Virtual Function drivers using the SR-IOV
> > APIs will come soon!
> >
> > Major changes from v6 to v7:
> > 1, remove boot-time resource rebalancing support. (Greg KH)
> > 2, emit uevent upon the PF driver is loaded. (Greg KH)
> > 3, put SR-IOV callback function into the 'pci_driver'. (Matthew
> Wilcox)
> > 4, register SR-IOV service at the PF loading stage.
> > 5, remove unnecessary APIs (pci_iov_enable/disable).
>
> Thanks for your patience with this, Yu, I know it's been a long haul.
> :)
>
> I applied 1-9 to my linux-next branch; and at least patch #10 needs a
> respin,
> so can you re-do 10-13 as a new patch set?
>
> On re-reading the last thread, there was a lot of smoke, but very
> little fire
> afaict. The main questions I saw were:
>
> 1) do we need SR-IOV at all? why not just make each subsystem export
> devices to guests?
> This is a bit of a red herring. Nothing about SR-IOV prevents us
> from
> making subsystems more v12n friendly. And since SR-IOV is a
> hardware
> feature supported by devices these days, we should make Linux
> support it.
>
> 2) should the PF/VF drivers be the same or not?
> Again, the SR-IOV patchset and PCI spec don't dictate this. We're
> free to
> do what we want here.
>
> 3) should VF devices be represented by pci_dev structs?
> Yes. (This is an easy one :)
>
> 4) can VF devices be used on the host?
> Yet again, SR-IOV doesn't dictate this. Developers can make PF/VF
> combo
> drivers or split them, and export the resulting devices however
> they want.
> Some subsystem work may be needed to make this efficient, but SR-
> IOV
> itself is agnostic about it.
>
> So overall I didn't see many objections to the actual code in the last
> post,
> and the issues above certainly don't merit a NAK IMO...

I have two minor comments on this topic.

1) Currently the PF driver is called before the kernel initializes VFs and
their resources, and the current API does not allow the PF driver to
easily detect whether the allocation of the VFs and their resources
has succeeded. It would be quite useful if the PF driver got
notified when the VFs have been created successfully, as it might have
to do further device-specific work *after* IOV has been enabled.

2) Configuration of SR-IOV: the current API allows VFs to be enabled/disabled
from userspace via sysfs. At the moment I am not quite clear what
exactly is supposed to control these capabilities. This could be
Linux tools or, on a virtualized system, hypervisor control tools.
One thing I am missing though is an in-kernel API for this which I
think might be useful. After all the PF driver controls the device,
and, for example, when a device error occurs (e.g. a hardware failure
which only the PF driver will be able to detect, not Linux), then the
PF driver might have to de-allocate all resources, shut down VFs and
reset the device, or something like that. In that case the PF driver
needs to have a way to notify the Linux SR-IOV code about this and
initiate cleaning up of VFs and their resources. At the moment, this
would have to go through userspace, I believe, and I think that is not
an optimal solution. Yu, do you have an opinion on how this would be
realized?

Anna

2008-12-17 14:16:20

by Matthew Wilcox

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Tue, Dec 16, 2008 at 03:23:53PM -0800, Jesse Barnes wrote:
> I applied 1-9 to my linux-next branch; and at least patch #10 needs a respin,

I still object to #2. We should have the flexibility to have 'struct
resource's that are not in this array in the pci_dev. I would like to
see the SR-IOV resources _not_ in this array (and indeed, I'd like to
see PCI bridges keep their producer resources somewhere other than in
this array). I accept that there are still some problems with this, but
patch #2 moves us further from being able to achieve this goal, not
closer.
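
The objection above can be illustrated with a small userspace sketch. It assumes that the argument removed by patch #2 is the resource pointer, so that the function is left with only an index into the pci_dev array; the thread does not spell this out. All names (mock_pci_dev, mock_resource, mock_update_by_index, mock_update_by_pointer) are hypothetical stand-ins and not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NUM_RESOURCES 4   /* hypothetical size of the per-device array */

/* Hypothetical stand-in for 'struct resource'. */
struct mock_resource {
    unsigned long start;
    unsigned long end;
    int updated;               /* set when the "BAR" has been rewritten */
};

/* Hypothetical stand-in for 'struct pci_dev' with its fixed array. */
struct mock_pci_dev {
    struct mock_resource resource[MOCK_NUM_RESOURCES];
};

/* Index-only form: can only reach entries stored in dev->resource[]. */
static int mock_update_by_index(struct mock_pci_dev *dev, int resno)
{
    assert(dev != NULL);
    if (resno < 0 || resno >= MOCK_NUM_RESOURCES)
        return -1;
    dev->resource[resno].updated = 1;
    return 0;
}

/* Pointer form: works for a resource kept anywhere, inside the array or not. */
static int mock_update_by_pointer(struct mock_pci_dev *dev,
                                  struct mock_resource *res)
{
    assert(dev != NULL);
    if (res == NULL)
        return -1;
    res->updated = 1;
    return 0;
}
```

With the index-only form, a resource that lives outside the array (for example a separately stored SR-IOV BAR) has no valid index and cannot be passed in at all, which is the loss of flexibility being described.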

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-12-17 16:46:20

by Rose, Gregory V

Subject: RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

As noted in the attached email to the netdev list, we (e1000_devel) will support the API.

- Greg Rose

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Greg KH
Sent: Tuesday, December 16, 2008 10:06 PM
To: Jike Song
Cc: Jesse Barnes; Zhao, Yu; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wed, Dec 17, 2008 at 10:37:54AM +0800, Jike Song wrote:
> Jesse Barnes wrote:
> > Given a respin of 10-13 I think it's reasonable to merge this into 2.6.29, but
> > I'd be much happier about it if we got some driver code along with it, so as
> > not to have an unused interface sitting around for who knows how many
> > releases. Is that reasonable? Do you know if any of the corresponding PF/VF
> > driver bits are ready yet?
>
> Hi Jesse,
>
> Yu Zhao has posted a patch set with subject "SR-IOV driver example"
> at November 26, which illustrated the usage of SR-IOV API in Intel 82576 VF/PF
> drivers;-)

Yes, but that driver was soundly rejected by the network driver
maintainers, so I wouldn't go around showing that as your primary
example of how to use this interface :)

The point is valid, I don't think these apis should go into the tree
without a driver or some other code using them. Otherwise they make no
sense at all to have in-tree.

thanks,

greg k-h


Attachments:
(No filename) (5.47 kB)

2008-12-17 17:38:17

by Jesse Barnes

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wednesday, December 17, 2008 6:15 am Matthew Wilcox wrote:
> On Tue, Dec 16, 2008 at 03:23:53PM -0800, Jesse Barnes wrote:
> > I applied 1-9 to my linux-next branch; and at least patch #10 needs a
> > respin,
>
> I still object to #2. We should have the flexibility to have 'struct
> resource's that are not in this array in the pci_dev. I would like to
> see the SR-IOV resources _not_ in this array (and indeed, I'd like to
> see PCI bridges keep their producer resources somewhere other than in
> this array). I accept that there are still some problems with this, but
> patch #2 moves us further from being able to achieve this goal, not
> closer.

Yeah, I can see what you mean here... but on the other hand it makes the
existing code a bit clearer (no extra args), and really it doesn't push us
*that* much further from non-pci_dev tied resources. Any patches in that
direction will just get a few lines bigger, that's all.

But I agree that eventually we may want to have non-pci_dev resource lists,
especially if we start adding advanced host bridge drivers or something.

--
Jesse Barnes, Intel Open Source Technology Center

2008-12-17 18:18:24

by Greg KH

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support


A: No.
Q: Should I include quotations after my reply?

On Wed, Dec 17, 2008 at 08:44:03AM -0800, Rose, Gregory V wrote:
> As noted in the attached email to the netdev list, we (e1000_devel) will support the API.

Great, will you have patches for the existing e1000 drivers soon to use
it? Or will it be a while before they are available?

As it is, the one posted user of this api is a driver that has been
rejected, so as there are no users of the api, I feel it should be
deferred until there is a user to make sure it all works and feels
proper.

thanks,

greg k-h

2008-12-17 18:51:52

by Jesse Barnes

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wednesday, December 17, 2008 8:44 am Rose, Gregory V wrote:
> As noted in the attached email to the netdev list, we (e1000_devel) will
> support the API.

Do you think you'll have those changes ready for 2.6.29? Would merging core
SR-IOV support now make that any more likely?

Thanks,
Jesse

2008-12-17 18:59:54

by Jesse Barnes

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wednesday, December 17, 2008 3:42 am Fischer, Anna wrote:
> I have two minor comments on this topic.
>
> 1) Currently the PF driver is called before the kernel initializes VFs and
> their resources, and the current API does not allow the PF driver to
> detect that easily if the allocation of the VFs and their resources
> has succeeded or not. It would be quite useful if the PF driver gets
> notified when the VFs have been created successfully as it might have
> to do further device-specific work *after* IOV has been enabled.

You're thinking that after the VFs are created, the VF drivers (which may or
may not be part of the PF driver) may not be able to communicate back to the
PF driver that something else needs to be done (I remember seeing this in the
earlier thread, should have included it in my post, sorry)? I'm not sure if
it makes sense to add an interface like that to the core until we have a feel
for what the PF/VF drivers are going to want... Or do you have something
specific in mind right now? If/until we have something in the core, it seems
like this could be done on a per PF/VF driver basis for now.

> 2) Configuration of SR-IOV: the current API allows to enable/disable
> VFs from userspace via SYSFS. At the moment I am not quite clear what
> exactly is supposed to control these capabilities. This could be
> Linux tools or, on a virtualized system, hypervisor control tools.
> One thing I am missing though is an in-kernel API for this which I
> think might be useful. After all the PF driver controls the device,
> and, for example, when a device error occurs (e.g. a hardware failure
> which only the PF driver will be able to detect, not Linux), then the
> PF driver might have to de-allocate all resources, shut down VFs and
> reset the device, or something like that. In that case the PF driver
> needs to have a way to notify the Linux SR-IOV code about this and
> initiate cleaning up of VFs and their resources. At the moment, this
> would have to go through userspace, I believe, and I think that is not
> an optimal solution. Yu, do you have an opinion on how this would be
> realized?

That's a good point, Yu?

--
Jesse Barnes, Intel Open Source Technology Center

2008-12-17 19:06:27

by Rose, Gregory V

Subject: RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support


-----Original Message-----
From: Jesse Barnes [mailto:[email protected]]

On Wednesday, December 17, 2008 8:44 am Rose, Gregory V wrote:
> As noted in the attached email to the netdev list, we (e1000_devel) will
> support the API.

Do you think you'll have those changes ready for 2.6.29? Would merging core
SR-IOV support now make that any more likely?

>>>>>>>>>

I'm not sure about readiness for 2.6.29. I can tell you that as soon as I get a Xen Dom0 kernel with these APIs included it will take me less than a day to convert over to them from the current drivers I have that are using an older API from back in August. The drivers are mostly functional, though they have a few bugs. I could do some quick regression testing to make sure that the API changes haven't broken anything and then some bug fixes to get everything ready for release. Maybe two or three weeks for the major bugs. I'll be out over the Christmas holidays so that puts us into middle or late January if I got the Xen Dom0 kernel today. That seems unlikely but it gives you an idea of the time required.

- Greg

2008-12-17 19:34:52

by Jeremy Fitzhardinge

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Rose, Gregory V wrote:
> -----Original Message-----
> From: Jesse Barnes [mailto:[email protected]]
>
> On Wednesday, December 17, 2008 8:44 am Rose, Gregory V wrote:
>
>> As noted in the attached email to the netdev list, we (e1000_devel) will
>> support the API.
>>
>
> Do you think you'll have those changes ready for 2.6.29? Would merging core
> SR-IOV support now make that any more likely?
>
>
>
> I'm not sure about readiness for 2.6.29. I can tell you that as soon as I get a Xen Dom0 kernel with these API's included it will take me less than a day to convert over to them from the current drivers I have that are using an older API from back in August.

Which dom0 kernel are you using? Is it based on my pvops-based dom0 work?

J

2008-12-17 19:43:15

by Rose, Gregory V

Subject: RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support



Jeremy Fitzhardinge wrote:

> Which dom0 kernel are you using? Is it based on my pvops-based dom0 work?

The kernel I'm currently using is an ad-hoc patchwork of changes to the 2.6.18 Xen Dom0 kernel that was available back in August. The folks from OTC in Intel (Zhao Yu and his team) would be able to provide you with more background on it, as they did the work to enable MSI-X, SR-IOV and VT-d in that kernel so that my drivers would function. I don't see Zhao Yu on the distro list for this email so I'll add him.

- Greg

2008-12-17 19:43:33

by Jesse Barnes

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wednesday, December 17, 2008 11:05 am Rose, Gregory V wrote:
> -----Original Message-----
> From: Jesse Barnes [mailto:[email protected]]
>
> On Wednesday, December 17, 2008 8:44 am Rose, Gregory V wrote:
> > As noted in the attached email to the netdev list, we (e1000_devel) will
> > support the API.
>
> Do you think you'll have those changes ready for 2.6.29? Would merging
> core SR-IOV support now make that any more likely?
>
>
>
> I'm not sure about readiness for 2.6.29. I can tell you that as soon as I
> get a Xen Dom0 kernel with these API's included it will take me less than a
> day to convert over to them from the current drivers I have that are using
> an older API from back in August. The drivers are mostly functional, they
> have a few bugs. I could do some quick regression testing to make sure
> that the API changes haven't broken anything and then some bug fixes to get
> everything ready for release. Maybe two or three weeks for the major bugs.
> I'll be out over the Christmas holidays so that puts us into middle or
> late January if I got the Xen Dom0 kernel today. That seems unlikely but
> it gives you an idea of the time required.

Hm, that's not the answer I was hoping for. :) (Was looking for, "Yeah we
just need these bits queued and we'll send an update for e1000 right away." :)

I really don't want the SR-IOV stuff to sit out another merge cycle though...
Arg.

--
Jesse Barnes, Intel Open Source Technology Center

2008-12-17 19:58:59

by Greg KH

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wed, Dec 17, 2008 at 11:42:54AM -0800, Jesse Barnes wrote:
>
> I really don't want the SR-IOV stuff to sit out another merge cycle though...
> Arg.

Why, is there some rush to get it in? As there are no in-kernel users of
it, I don't see the problem with postponing it until someone actually
needs it.

thanks,

greg k-h

2008-12-17 20:07:52

by Jesse Barnes

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

On Wednesday, December 17, 2008 11:51 am Greg KH wrote:
> On Wed, Dec 17, 2008 at 11:42:54AM -0800, Jesse Barnes wrote:
> > I really don't want the SR-IOV stuff to sit out another merge cycle
> > though... Arg.
>
> Why, is there some rush to get it in? As there is no in-kernel users of
> it, I don't see the problem with postponing it until someone actually
> needs it.

Well it *does* make development of SR-IOV drivers that much harder. As you
know, out of tree development is a pain. OTOH if any changes end up being
required, they can be done before the code is merged.

Anyway, hopefully we won't have to worry about it because some driver will
come along soon that uses Yu's code. :) If not, Yu might have to maintain a
separate git tree or something until the drivers are ready to be merged.

--
Jesse Barnes, Intel Open Source Technology Center

2008-12-18 02:14:20

by Zhao, Yu

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Fischer, Anna wrote:
> I have two minor comments on this topic.
>
> 1) Currently the PF driver is called before the kernel initializes VFs and
> their resources, and the current API does not allow the PF driver to
> detect that easily if the allocation of the VFs and their resources
> has succeeded or not. It would be quite useful if the PF driver gets
> notified when the VFs have been created successfully as it might have
> to do further device-specific work *after* IOV has been enabled.

If the VF allocation fails in the PCI layer, then the SR-IOV core will
invoke the callback again to notify the PF driver with a zero VF count.
The PF driver does not have to worry about this even if the PCI layer
code fails (and actually that's very rare).
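
The sequence described above can be sketched in userspace C roughly as follows. This is only a mock under stated assumptions: the callback name, its signature, and the helper mock_core_enable_vfs() are hypothetical and are not taken from the patch set; only the behaviour (callback first, then a second callback with a zero VF count if allocation fails) comes from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PF driver's private state. */
struct mock_pf {
    int vf_slots_ready;   /* per-VF resources the PF driver has prepared */
    int notify_calls;     /* how many times the callback has been invoked */
};

/*
 * Hypothetical PF driver callback. nr_virtfn > 0 asks the driver to
 * prepare for that many VFs; nr_virtfn == 0 means there will be no VFs,
 * so the driver releases whatever it prepared earlier.
 */
static int mock_pf_notify(struct mock_pf *pf, int nr_virtfn)
{
    assert(pf != NULL);
    pf->notify_calls++;
    if (nr_virtfn < 0)
        return -1;
    pf->vf_slots_ready = nr_virtfn;   /* set up, or tear down when 0 */
    return 0;
}

/*
 * Mock of the core-side sequence: call the PF driver first, then try to
 * create the VFs. alloc_ok simulates whether the PCI layer succeeded.
 * Returns the number of VFs actually enabled.
 */
static int mock_core_enable_vfs(struct mock_pf *pf, int nr_virtfn, int alloc_ok)
{
    if (mock_pf_notify(pf, nr_virtfn) != 0)
        return 0;
    if (!alloc_ok) {
        mock_pf_notify(pf, 0);   /* second call: zero VF count on failure */
        return 0;
    }
    return nr_virtfn;
}
```

In this model the PF driver needs no separate "VFs created" notification: it prepares everything in the first call and only has to undo that work if it is called again with zero.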

And I'm not sure why the PF driver wants to do further work *after* the
VF is allocated. Does this mean the PF driver has to set up some internal
resources related to SR-IOV/VF? If yes, I suggest the PF driver do it
before VF allocation. The design philosophy of SR-IOV/VF is that a VF is
treated as a hot-plug device, which means it should be immediately usable
by the VF driver (e.g. the VF driver is pre-loaded) after it appears in the
PCI subsystem. If that is not the purpose, then the PF driver should handle
it without depending on SR-IOV, right?

If you could elaborate on your SR-IOV PF/VF h/w specific requirement, it
would help me answer this question :-)

> 2) Configuration of SR-IOV: the current API allows to enable/disable
> VFs from userspace via SYSFS. At the moment I am not quite clear what
> exactly is supposed to control these capabilities. This could be
> Linux tools or, on a virtualized system, hypervisor control tools.

This depends on the user application, which in turn depends on the usage
environment (i.e. native, KVM or Xen).

> One thing I am missing though is an in-kernel API for this which I
> think might be useful. After all the PF driver controls the device,
> and, for example, when a device error occurs (e.g. a hardware failure
> which only the PF driver will be able to detect, not Linux), then the
> PF driver might have to de-allocate all resources, shut down VFs and
> reset the device, or something like that. In that case the PF driver
> needs to have a way to notify the Linux SR-IOV code about this and
> initiate cleaning up of VFs and their resources. At the moment, this
> would have to go through userspace, I believe, and I think that is not
> an optimal solution. Yu, do you have an opinion on how this would be
> realized?

Yes, the PF driver can use pci_iov_unregister to disable SR-IOV in case
a fatal error occurs. This function also sends a notification to user
level through 'uevent' so the user application can be aware of the change.
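
A rough userspace sketch of that error path is shown below. The real signature of pci_iov_unregister is not given in this thread, so mock_pci_iov_unregister() is only a stand-in for it; the struct, the handler name, and the "change" event string are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the PF's device state. */
struct mock_pci_dev {
    int sriov_registered;   /* non-zero while SR-IOV is active on the PF */
    int nr_virtfn;          /* number of VFs currently enabled */
    char last_uevent[32];   /* last event that would be sent to userspace */
};

/*
 * Stand-in for pci_iov_unregister: disable SR-IOV, drop all VFs and
 * record the userspace notification. The event name is an assumption.
 */
static void mock_pci_iov_unregister(struct mock_pci_dev *dev)
{
    assert(dev != NULL);
    dev->nr_virtfn = 0;
    dev->sriov_registered = 0;
    snprintf(dev->last_uevent, sizeof(dev->last_uevent), "%s", "change");
}

/*
 * Hypothetical PF driver handler for a fatal hardware error: the VFs are
 * torn down from inside the kernel, with no round trip through userspace.
 */
static void mock_pf_handle_fatal_error(struct mock_pci_dev *dev)
{
    assert(dev != NULL);
    if (dev->sriov_registered)
        mock_pci_iov_unregister(dev);
    /* a device-specific reset would follow here */
}
```

The point of the sketch is the direction of control: the PF driver starts the cleanup itself, and userspace only learns about it afterwards through the event.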

Thanks,
Yu

2008-12-18 02:26:57

by Zhao, Yu

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Matthew Wilcox wrote:
> On Tue, Dec 16, 2008 at 03:23:53PM -0800, Jesse Barnes wrote:
>> I applied 1-9 to my linux-next branch; and at least patch #10 needs a respin,
>
> I still object to #2. We should have the flexibility to have 'struct
> resource's that are not in this array in the pci_dev. I would like to
> see the SR-IOV resources _not_ in this array (and indeed, I'd like to
> see PCI bridges keep their producer resources somewhere other than in
> this array). I accept that there are still some problems with this, but

I understand your concern, and agree that using the array as a resource
manager is not the best way. But as you know, an alternative is not possible
for now. We need a better resource manager for the PCI subsystem to manage
the various resources (traditional, device specific, bus related), which is
work independent of the SR-IOV change.

> patch #2 moves us further from being able to achieve this goal, not
> closer.

The array is obviously straightforward and can be easily replaced with a
more advanced resource manager in the future. So I don't think we are going
further from or closer to the goal.

Thanks,
Yu

2008-12-18 02:39:26

by Zhao, Yu

Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

Jesse Barnes wrote:
> On Wednesday, December 17, 2008 11:51 am Greg KH wrote:
>> On Wed, Dec 17, 2008 at 11:42:54AM -0800, Jesse Barnes wrote:
>>> I really don't want the SR-IOV stuff to sit out another merge cycle
>>> though... Arg.
>> Why, is there some rush to get it in? As there is no in-kernel users of
>> it, I don't see the problem with postponing it until someone actually
>> needs it.
>
> Well it *does* make development of SR-IOV drivers that much harder. As you
> know, out of tree development is a pain. OTOH if any changes end up being
> required, they can be done before the code is merged.

Yes, people write to me asking for the SR-IOV patch or an update every day
-- I guess they don't want to let their competitors know they are
working on it, so they can't bring their questions up on the mailing list.

And I personally also have a dozen other patches related to the PCI and KVM
subsystems which depend on the SR-IOV change.

> Anyway, hopefully we won't have to worry about it because some driver will
> come along soon that uses Yu's code. :) If not, Yu might have to maintain a
> separate git tree or something until the drivers are ready to be merged.

2008-12-18 06:39:23

by Fischer, Anna

Subject: RE: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support

> From: Zhao, Yu [mailto:[email protected]]
> Sent: 18 December 2008 02:14
> To: Fischer, Anna
> Cc: Jesse Barnes; [email protected]; Chiang, Alexander;
> Helgaas, Bjorn; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 0/13 v7] PCI: Linux kernel SR-IOV support
>
> Fischer, Anna wrote:
> > I have two minor comments on this topic.
> >
> > 1) Currently the PF driver is called before the kernel initializes
> VFs and
> > their resources, and the current API does not allow the PF driver to
> > detect that easily if the allocation of the VFs and their resources
> > has succeeded or not. It would be quite useful if the PF driver gets
> > notified when the VFs have been created successfully as it might have
> > to do further device-specific work *after* IOV has been enabled.
>
> If the VF allocation fails in the PCI layer, then the SR-IOV core will
> invokes the callback again to notify the PF driver with zero VF count.
> The PF driver does not have to concern about this even the PCI layer
> code fails (and actually it's very rare).

Yes, this is good.


> And I'm not sure why the PF driver wants to do further work *after* the
> VF is allocated. Does this mean PF driver have to set up some internal
> resources related to SR-IOV/VF? If yes, I suggest the PF driver do it
> before VF allocation. The design philosophy of SR-IOV/VF is that VF is
> treated as hot-plug device, which means it should be immediately usable
> by VF driver (e.g. VF driver is pre-loaded) after it appears in the PCI
> subsystem. If that is not the purpose, then PF driver should handle it
> not depending on the SR-IOV, right?

Yes, you are right. In fact I was assuming in this case that the PF driver
might have to allocate VF specific resources before a PF <-> VF
communication can be established, but this can be done before the VF PCI
device appears, so I was wrong about this. The current API is sufficient
to handle all of this, so I am withdrawing my concern here ;-)


> If you could elaborate your SR-IOV PF/VF h/w specific requirement, it
> would be help for me to answer this question :-)
>
> > 2) Configuration of SR-IOV: the current API allows to enable/disable
> > VFs from userspace via SYSFS. At the moment I am not quite clear what
> > exactly is supposed to control these capabilities. This could be
> > Linux tools or, on a virtualized system, hypervisor control tools.
>
> This depends on user application, you know, which depends on the usage
> environment (i.e. native, KVM or Xen).
>
> > One thing I am missing though is an in-kernel API for this which I
> > think might be useful. After all the PF driver controls the device,
> > and, for example, when a device error occurs (e.g. a hardware failure
> > which only the PF driver will be able to detect, not Linux), then the
> > PF driver might have to de-allocate all resources, shut down VFs and
> > reset the device, or something like that. In that case the PF driver
> > needs to have a way to notify the Linux SR-IOV code about this and
> > initiate cleaning up of VFs and their resources. At the moment, this
> > would have to go through userspace, I believe, and I think that is
> not
> > an optimal solution. Yu, do you have an opinion on how this would be
> > realized?
>
> Yes, the PF driver can use pci_iov_unregister to disable SR-IOV in case
> the fatal error occurs. This function also sends notification to user
> level through 'uevent' so user application can aware the change.

If pci_iov_unregister is accessible to kernel drivers then this is in fact
all we need. Thanks for the clarification.


I think the patchset looks very good.

Acked-by: Anna Fischer <[email protected]>