2009-03-20 03:24:55

by Zhao, Yu

Subject: [PATCH v12 0/8] PCI: Linux kernel SR-IOV support

Greetings,

The following patches add support for the SR-IOV capability to the
Linux kernel. With these patches, a PCI device that has the capability
can be turned into multiple devices from a software perspective, which
benefits KVM and serves other purposes such as QoS and security.

The SR-IOV specification can be found at (membership required):
http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf

Devices that support SR-IOV are available from the following vendors:
http://download.intel.com/design/network/ProdBrf/320025.pdf
http://www.myri.com/vlsi/Lanai_Z8ES_Datasheet.pdf
http://www.neterion.com/products/pdfs/X3100ProductBrief.pdf

The patches that enable the SR-IOV capability of the Intel 82576 NIC
(a.k.a. the Physical Function driver) are available at:
http://patchwork.kernel.org/patch/8063/
http://patchwork.kernel.org/patch/8064/
http://patchwork.kernel.org/patch/8065/
http://patchwork.kernel.org/patch/8066/
The driver for the Intel 82576 Virtual Function is available at:
http://patchwork.kernel.org/patch/11029/
http://patchwork.kernel.org/patch/11028/


Major changes from v11 to v12:
1, fix using garbage entry pointer after the list_for_each (Matthew Wilcox)
2, use #ifdef around SR-IOV structure in the pci_dev (Matthew Wilcox)
3, enhance the Kconfig help text for the SR-IOV (Matthew Wilcox)

v10 -> v11:
1, use pci_setup_device() to setup Virtual Function (Matthew Wilcox)
2, various coding style fixes (Matthew Wilcox)
3, wording and grammar fixes (Randy Dunlap)

v9 -> v10:
1, minor fix in pci_restore_iov_state().
2, respin against the latest tree.

v8 -> v9:
1, put a might_sleep() into SR-IOV API which sleeps (Andi Kleen)
2, block user config accesses before clearing VF Enable bit (Matthew Wilcox)

Yu Zhao (8):
PCI: initialize and release SR-IOV capability
PCI: restore saved SR-IOV state
PCI: reserve bus range for SR-IOV device
PCI: centralize device setup code
PCI: add SR-IOV API for Physical Function driver
PCI: handle SR-IOV Virtual Function Migration
PCI: document SR-IOV sysfs entries
PCI: manual for SR-IOV user and driver developer

Documentation/ABI/testing/sysfs-bus-pci | 27 ++
Documentation/DocBook/kernel-api.tmpl | 1 +
Documentation/PCI/pci-iov-howto.txt | 99 +++++
drivers/pci/Kconfig | 10 +
drivers/pci/Makefile | 2 +
drivers/pci/iov.c | 680 +++++++++++++++++++++++++++++++
drivers/pci/pci.c | 8 +
drivers/pci/pci.h | 53 +++
drivers/pci/probe.c | 86 +++--
include/linux/pci.h | 34 ++
include/linux/pci_regs.h | 33 ++
11 files changed, 994 insertions(+), 39 deletions(-)
create mode 100644 Documentation/PCI/pci-iov-howto.txt
create mode 100644 drivers/pci/iov.c


2009-03-20 03:25:25

by Zhao, Yu

Subject: [PATCH v12 1/8] PCI: initialize and release SR-IOV capability

If a device has the SR-IOV capability, initialize it: set the ARI
Capable Hierarchy bit in the lowest-numbered PF if necessary, calculate
the System Page Size for the VF MMIO, and probe the VF Offset, Stride
and BARs. A lock for the VF bus allocation is also initialized if
the PF is the lowest-numbered PF.

Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
drivers/pci/Kconfig | 10 +++
drivers/pci/Makefile | 2 +
drivers/pci/iov.c | 182 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 7 ++
drivers/pci/pci.h | 37 +++++++++
drivers/pci/probe.c | 4 +
include/linux/pci.h | 11 +++
include/linux/pci_regs.h | 33 ++++++++
8 files changed, 286 insertions(+), 0 deletions(-)
create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 2a4501d..fdc864f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -59,3 +59,13 @@ config HT_IRQ
This allows native hypertransport devices to use interrupts.

If unsure say Y.
+
+config PCI_IOV
+ bool "PCI IOV support"
+ depends on PCI
+ help
+ I/O Virtualization is a PCI feature supported by some devices
+ which allows them to create virtual devices which share their
+ physical resources.
+
+ If unsure, say N.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 3d07ce2..ba6af16 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -29,6 +29,8 @@ obj-$(CONFIG_DMAR) += dmar.o iova.o intel-iommu.o

obj-$(CONFIG_INTR_REMAP) += dmar.o intr_remapping.o

+obj-$(CONFIG_PCI_IOV) += iov.o
+
#
# Some architectures use the generic PCI setup functions
#
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 0000000..66cc414
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,182 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2009 Intel Corporation, Yu Zhao <[email protected]>
+ *
+ * PCI Express I/O Virtualization (IOV) support.
+ * Single Root IOV 1.0
+ */
+
+#include <linux/pci.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/delay.h>
+#include "pci.h"
+
+
+static int sriov_init(struct pci_dev *dev, int pos)
+{
+ int i;
+ int rc;
+ int nres;
+ u32 pgsz;
+ u16 ctrl, total, offset, stride;
+ struct pci_sriov *iov;
+ struct resource *res;
+ struct pci_dev *pdev;
+
+ if (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+ dev->pcie_type != PCI_EXP_TYPE_ENDPOINT)
+ return -ENODEV;
+
+ pci_read_config_word(dev, pos + PCI_SRIOV_CTRL, &ctrl);
+ if (ctrl & PCI_SRIOV_CTRL_VFE) {
+ pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, 0);
+ ssleep(1);
+ }
+
+ pci_read_config_word(dev, pos + PCI_SRIOV_TOTAL_VF, &total);
+ if (!total)
+ return 0;
+
+ ctrl = 0;
+ list_for_each_entry(pdev, &dev->bus->devices, bus_list)
+ if (pdev->is_physfn)
+ goto found;
+
+ pdev = NULL;
+ if (pci_ari_enabled(dev->bus))
+ ctrl |= PCI_SRIOV_CTRL_ARI;
+
+found:
+ pci_write_config_word(dev, pos + PCI_SRIOV_CTRL, ctrl);
+ pci_write_config_word(dev, pos + PCI_SRIOV_NUM_VF, total);
+ pci_read_config_word(dev, pos + PCI_SRIOV_VF_OFFSET, &offset);
+ pci_read_config_word(dev, pos + PCI_SRIOV_VF_STRIDE, &stride);
+ if (!offset || (total > 1 && !stride))
+ return -EIO;
+
+ pci_read_config_dword(dev, pos + PCI_SRIOV_SUP_PGSIZE, &pgsz);
+ i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+ pgsz &= ~((1 << i) - 1);
+ if (!pgsz)
+ return -EIO;
+
+ pgsz &= ~(pgsz - 1);
+ pci_write_config_dword(dev, pos + PCI_SRIOV_SYS_PGSIZE, pgsz);
+
+ nres = 0;
+ for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ i += __pci_read_base(dev, pci_bar_unknown, res,
+ pos + PCI_SRIOV_BAR + i * 4);
+ if (!res->flags)
+ continue;
+ if (resource_size(res) & (PAGE_SIZE - 1)) {
+ rc = -EIO;
+ goto failed;
+ }
+ res->end = res->start + resource_size(res) * total - 1;
+ nres++;
+ }
+
+ iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+ if (!iov) {
+ rc = -ENOMEM;
+ goto failed;
+ }
+
+ iov->pos = pos;
+ iov->nres = nres;
+ iov->ctrl = ctrl;
+ iov->total = total;
+ iov->offset = offset;
+ iov->stride = stride;
+ iov->pgsz = pgsz;
+ iov->self = dev;
+ pci_read_config_dword(dev, pos + PCI_SRIOV_CAP, &iov->cap);
+ pci_read_config_byte(dev, pos + PCI_SRIOV_FUNC_LINK, &iov->link);
+
+ if (pdev)
+ iov->dev = pci_dev_get(pdev);
+ else {
+ iov->dev = dev;
+ mutex_init(&iov->lock);
+ }
+
+ dev->sriov = iov;
+ dev->is_physfn = 1;
+
+ return 0;
+
+failed:
+ for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ res->flags = 0;
+ }
+
+ return rc;
+}
+
+static void sriov_release(struct pci_dev *dev)
+{
+ if (dev == dev->sriov->dev)
+ mutex_destroy(&dev->sriov->lock);
+ else
+ pci_dev_put(dev->sriov->dev);
+
+ kfree(dev->sriov);
+ dev->sriov = NULL;
+}
+
+/**
+ * pci_iov_init - initialize the IOV capability
+ * @dev: the PCI device
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_iov_init(struct pci_dev *dev)
+{
+ int pos;
+
+ if (!dev->is_pcie)
+ return -ENODEV;
+
+ pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_SRIOV);
+ if (pos)
+ return sriov_init(dev, pos);
+
+ return -ENODEV;
+}
+
+/**
+ * pci_iov_release - release resources used by the IOV capability
+ * @dev: the PCI device
+ */
+void pci_iov_release(struct pci_dev *dev)
+{
+ if (dev->is_physfn)
+ sriov_release(dev);
+}
+
+/**
+ * pci_iov_resource_bar - get position of the SR-IOV BAR
+ * @dev: the PCI device
+ * @resno: the resource number
+ * @type: the BAR type to be filled in
+ *
+ * Returns position of the BAR encapsulated in the SR-IOV capability.
+ */
+int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type)
+{
+ if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCE_END)
+ return 0;
+
+ BUG_ON(!dev->is_physfn);
+
+ *type = pci_bar_unknown;
+
+ return dev->sriov->pos + PCI_SRIOV_BAR +
+ 4 * (resno - PCI_IOV_RESOURCES);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 6d61200..2eba2a5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2346,12 +2346,19 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags)
*/
int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type)
{
+ int reg;
+
if (resno < PCI_ROM_RESOURCE) {
*type = pci_bar_unknown;
return PCI_BASE_ADDRESS_0 + 4 * resno;
} else if (resno == PCI_ROM_RESOURCE) {
*type = pci_bar_mem32;
return dev->rom_base_reg;
+ } else if (resno < PCI_BRIDGE_RESOURCES) {
+ /* device specific resource */
+ reg = pci_iov_resource_bar(dev, resno, type);
+ if (reg)
+ return reg;
}

dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 07c0aa5..196be5e 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -195,4 +195,41 @@ static inline int pci_ari_enabled(struct pci_bus *bus)
return bus->self && bus->self->ari_enabled;
}

+/* Single Root I/O Virtualization */
+struct pci_sriov {
+ int pos; /* capability position */
+ int nres; /* number of resources */
+ u32 cap; /* SR-IOV Capabilities */
+ u16 ctrl; /* SR-IOV Control */
+ u16 total; /* total VFs associated with the PF */
+ u16 offset; /* first VF Routing ID offset */
+ u16 stride; /* following VF stride */
+ u32 pgsz; /* page size for BAR alignment */
+ u8 link; /* Function Dependency Link */
+ struct pci_dev *dev; /* lowest numbered PF */
+ struct pci_dev *self; /* this PF */
+ struct mutex lock; /* lock for VF bus */
+};
+
+#ifdef CONFIG_PCI_IOV
+extern int pci_iov_init(struct pci_dev *dev);
+extern void pci_iov_release(struct pci_dev *dev);
+extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type);
+#else
+static inline int pci_iov_init(struct pci_dev *dev)
+{
+ return -ENODEV;
+}
+static inline void pci_iov_release(struct pci_dev *dev)
+
+{
+}
+static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type)
+{
+ return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 55ec44a..03b6f29 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -785,6 +785,7 @@ static int pci_setup_device(struct pci_dev * dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
pci_vpd_release(dev);
+ pci_iov_release(dev);
}

/**
@@ -972,6 +973,9 @@ static void pci_init_capabilities(struct pci_dev *dev)

/* Alternative Routing-ID Forwarding */
pci_enable_ari(dev);
+
+ /* Single Root I/O Virtualization */
+ pci_iov_init(dev);
}

void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7bd624b..01eed8f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -93,6 +93,12 @@ enum {
/* #6: expansion ROM resource */
PCI_ROM_RESOURCE,

+ /* device specific resources */
+#ifdef CONFIG_PCI_IOV
+ PCI_IOV_RESOURCES,
+ PCI_IOV_RESOURCE_END = PCI_IOV_RESOURCES + PCI_SRIOV_NUM_BARS - 1,
+#endif
+
/* resources assigned to buses behind the bridge */
#define PCI_BRIDGE_RESOURCE_NUM 4

@@ -180,6 +186,7 @@ struct pci_cap_saved_state {

struct pcie_link_state;
struct pci_vpd;
+struct pci_sriov;

/*
* The pci_dev structure is used to describe PCI devices.
@@ -257,6 +264,7 @@ struct pci_dev {
unsigned int is_managed:1;
unsigned int is_pcie:1;
unsigned int state_saved:1;
+ unsigned int is_physfn:1;
pci_dev_flags_t dev_flags;
atomic_t enable_cnt; /* pci_enable_device has been called */

@@ -270,6 +278,9 @@ struct pci_dev {
struct list_head msi_list;
#endif
struct pci_vpd *vpd;
+#ifdef CONFIG_PCI_IOV
+ struct pci_sriov *sriov; /* SR-IOV capability related */
+#endif
};

extern struct pci_dev *alloc_pci_dev(void);
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 027815b..4ce5eb0 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -375,6 +375,7 @@
#define PCI_EXP_TYPE_UPSTREAM 0x5 /* Upstream Port */
#define PCI_EXP_TYPE_DOWNSTREAM 0x6 /* Downstream Port */
#define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCI/PCI-X Bridge */
+#define PCI_EXP_TYPE_RC_END 0x9 /* Root Complex Integrated Endpoint */
#define PCI_EXP_FLAGS_SLOT 0x0100 /* Slot implemented */
#define PCI_EXP_FLAGS_IRQ 0x3e00 /* Interrupt message number */
#define PCI_EXP_DEVCAP 4 /* Device capabilities */
@@ -498,6 +499,7 @@
#define PCI_EXT_CAP_ID_DSN 3
#define PCI_EXT_CAP_ID_PWR 4
#define PCI_EXT_CAP_ID_ARI 14
+#define PCI_EXT_CAP_ID_SRIOV 16

/* Advanced Error Reporting */
#define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */
@@ -615,4 +617,35 @@
#define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */
#define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */

+/* Single Root I/O Virtualization */
+#define PCI_SRIOV_CAP 0x04 /* SR-IOV Capabilities */
+#define PCI_SRIOV_CAP_VFM 0x01 /* VF Migration Capable */
+#define PCI_SRIOV_CAP_INTR(x) ((x) >> 21) /* Interrupt Message Number */
+#define PCI_SRIOV_CTRL 0x08 /* SR-IOV Control */
+#define PCI_SRIOV_CTRL_VFE 0x01 /* VF Enable */
+#define PCI_SRIOV_CTRL_VFM 0x02 /* VF Migration Enable */
+#define PCI_SRIOV_CTRL_INTR 0x04 /* VF Migration Interrupt Enable */
+#define PCI_SRIOV_CTRL_MSE 0x08 /* VF Memory Space Enable */
+#define PCI_SRIOV_CTRL_ARI 0x10 /* ARI Capable Hierarchy */
+#define PCI_SRIOV_STATUS 0x0a /* SR-IOV Status */
+#define PCI_SRIOV_STATUS_VFM 0x01 /* VF Migration Status */
+#define PCI_SRIOV_INITIAL_VF 0x0c /* Initial VFs */
+#define PCI_SRIOV_TOTAL_VF 0x0e /* Total VFs */
+#define PCI_SRIOV_NUM_VF 0x10 /* Number of VFs */
+#define PCI_SRIOV_FUNC_LINK 0x12 /* Function Dependency Link */
+#define PCI_SRIOV_VF_OFFSET 0x14 /* First VF Offset */
+#define PCI_SRIOV_VF_STRIDE 0x16 /* Following VF Stride */
+#define PCI_SRIOV_VF_DID 0x1a /* VF Device ID */
+#define PCI_SRIOV_SUP_PGSIZE 0x1c /* Supported Page Sizes */
+#define PCI_SRIOV_SYS_PGSIZE 0x20 /* System Page Size */
+#define PCI_SRIOV_BAR 0x24 /* VF BAR0 */
+#define PCI_SRIOV_NUM_BARS 6 /* Number of VF BARs */
+#define PCI_SRIOV_VFM 0x3c /* VF Migration State Array Offset*/
+#define PCI_SRIOV_VFM_BIR(x) ((x) & 7) /* State BIR */
+#define PCI_SRIOV_VFM_OFFSET(x) ((x) & ~7) /* State Offset */
+#define PCI_SRIOV_VFM_UA 0x0 /* Inactive.Unavailable */
+#define PCI_SRIOV_VFM_MI 0x1 /* Dormant.MigrateIn */
+#define PCI_SRIOV_VFM_MO 0x2 /* Active.MigrateOut */
+#define PCI_SRIOV_VFM_AV 0x3 /* Active.Available */
+
#endif /* LINUX_PCI_REGS_H */
--
1.5.6.4
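
The System Page Size selection in sriov_init() above can be tried out in isolation. What follows is a minimal user-space sketch, not the kernel code itself: the function name sriov_pick_pgsz() and the <stdint.h> types are assumptions made for illustration. It relies on the SR-IOV register encoding in which bit n of Supported Page Sizes stands for a page size of 2^(n+12) bytes.

```c
#include <stdint.h>

/*
 * Hypothetical helper mirroring the page size logic in sriov_init().
 * 'supported' is the Supported Page Sizes register value and
 * 'page_shift' plays the role of the kernel's PAGE_SHIFT.
 * Returns the value to write to System Page Size, or 0 if the device
 * supports no page size at least as large as the system page.
 */
static uint32_t sriov_pick_pgsz(uint32_t supported, unsigned int page_shift)
{
	unsigned int i = page_shift > 12 ? page_shift - 12 : 0;

	/* Drop page sizes smaller than the system page size. */
	supported &= ~((UINT32_C(1) << i) - 1);
	if (!supported)
		return 0;

	/* Keep only the lowest remaining bit: the smallest usable size. */
	return supported & ~(supported - 1);
}
```

With a register value of 0x553, a 4 KiB system page (page_shift 12) selects 0x1 and a 64 KiB system page (page_shift 16) selects 0x10. The patch returns -EIO where this sketch returns 0.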

2009-03-20 03:25:51

by Zhao, Yu

Subject: [PATCH v12 2/8] PCI: restore saved SR-IOV state

Restore the volatile registers in the SR-IOV capability after the
D3->D0 transition.

Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
drivers/pci/iov.c | 29 +++++++++++++++++++++++++++++
drivers/pci/pci.c | 1 +
drivers/pci/pci.h | 4 ++++
3 files changed, 34 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 66cc414..b121e47 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -129,6 +129,25 @@ static void sriov_release(struct pci_dev *dev)
dev->sriov = NULL;
}

+static void sriov_restore_state(struct pci_dev *dev)
+{
+ int i;
+ u16 ctrl;
+ struct pci_sriov *iov = dev->sriov;
+
+ pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &ctrl);
+ if (ctrl & PCI_SRIOV_CTRL_VFE)
+ return;
+
+ for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++)
+ pci_update_resource(dev, i);
+
+ pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+ if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
+ msleep(100);
+}
+
/**
* pci_iov_init - initialize the IOV capability
* @dev: the PCI device
@@ -180,3 +199,13 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
return dev->sriov->pos + PCI_SRIOV_BAR +
4 * (resno - PCI_IOV_RESOURCES);
}
+
+/**
+ * pci_restore_iov_state - restore the state of the IOV capability
+ * @dev: the PCI device
+ */
+void pci_restore_iov_state(struct pci_dev *dev)
+{
+ if (dev->is_physfn)
+ sriov_restore_state(dev);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 2eba2a5..8e21912 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -773,6 +773,7 @@ pci_restore_state(struct pci_dev *dev)
}
pci_restore_pcix_state(dev);
pci_restore_msi_state(dev);
+ pci_restore_iov_state(dev);

return 0;
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 196be5e..efd79a2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -216,6 +216,7 @@ extern int pci_iov_init(struct pci_dev *dev);
extern void pci_iov_release(struct pci_dev *dev);
extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
+extern void pci_restore_iov_state(struct pci_dev *dev);
#else
static inline int pci_iov_init(struct pci_dev *dev)
{
@@ -230,6 +231,9 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
{
return 0;
}
+static inline void pci_restore_iov_state(struct pci_dev *dev)
+{
+}
#endif /* CONFIG_PCI_IOV */

#endif /* DRIVERS_PCI_H */
--
1.5.6.4

2009-03-20 03:26:17

by Zhao, Yu

Subject: [PATCH v12 3/8] PCI: reserve bus range for SR-IOV device

Reserve the bus number range used by the Virtual Function when
pcibios_assign_all_busses() returns true.

Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
drivers/pci/iov.c | 36 ++++++++++++++++++++++++++++++++++++
drivers/pci/pci.h | 5 +++++
drivers/pci/probe.c | 3 +++
3 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b121e47..5ddfc09 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -14,6 +14,18 @@
#include "pci.h"


+static inline u8 virtfn_bus(struct pci_dev *dev, int id)
+{
+ return dev->bus->number + ((dev->devfn + dev->sriov->offset +
+ dev->sriov->stride * id) >> 8);
+}
+
+static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
+{
+ return (dev->devfn + dev->sriov->offset +
+ dev->sriov->stride * id) & 0xff;
+}
+
static int sriov_init(struct pci_dev *dev, int pos)
{
int i;
@@ -209,3 +221,27 @@ void pci_restore_iov_state(struct pci_dev *dev)
if (dev->is_physfn)
sriov_restore_state(dev);
}
+
+/**
+ * pci_iov_bus_range - find bus range used by Virtual Function
+ * @bus: the PCI bus
+ *
+ * Returns max number of buses (exclude current one) used by Virtual
+ * Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+ int max = 0;
+ u8 busnr;
+ struct pci_dev *dev;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ if (!dev->is_physfn)
+ continue;
+ busnr = virtfn_bus(dev, dev->sriov->total - 1);
+ if (busnr > max)
+ max = busnr;
+ }
+
+ return max ? max - bus->number : 0;
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index efd79a2..7abdef6 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -217,6 +217,7 @@ extern void pci_iov_release(struct pci_dev *dev);
extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
extern void pci_restore_iov_state(struct pci_dev *dev);
+extern int pci_iov_bus_range(struct pci_bus *bus);
#else
static inline int pci_iov_init(struct pci_dev *dev)
{
@@ -234,6 +235,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
static inline void pci_restore_iov_state(struct pci_dev *dev)
{
}
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+ return 0;
+}
#endif /* CONFIG_PCI_IOV */

#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 03b6f29..4c8abd0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1078,6 +1078,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
for (devfn = 0; devfn < 0x100; devfn += 8)
pci_scan_slot(bus, devfn);

+ /* Reserve buses for SR-IOV capability. */
+ max += pci_iov_bus_range(bus);
+
/*
* After performing arch-dependent fixup of the bus, look behind
* all PCI-to-PCI bridges on this bus.
--
1.5.6.4
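
The arithmetic behind virtfn_bus(), virtfn_devfn() and the bus reservation above can be checked with a small user-space sketch. The struct vf_addr type and the helper names vf_routing_id() and vf_extra_buses() are hypothetical and exist only for this illustration. Unlike pci_iov_bus_range(), the second helper looks at a single PF rather than at every PF on the bus.

```c
#include <stdint.h>

/* Hypothetical container for the bus and devfn of one Virtual Function. */
struct vf_addr {
	uint8_t bus;
	uint8_t devfn;
};

/*
 * Routing ID of VF number 'id': the PF's devfn plus First VF Offset
 * plus id times VF Stride. Bits above the low 8 spill into the bus number.
 */
static struct vf_addr vf_routing_id(uint8_t pf_bus, uint8_t pf_devfn,
				    uint16_t offset, uint16_t stride, int id)
{
	unsigned int rid = pf_devfn + offset + stride * id;
	struct vf_addr a;

	a.bus = (uint8_t)(pf_bus + (rid >> 8));
	a.devfn = (uint8_t)(rid & 0xff);
	return a;
}

/* Number of buses beyond the PF's own bus needed for 'total' VFs. */
static int vf_extra_buses(uint8_t pf_bus, uint8_t pf_devfn,
			  uint16_t offset, uint16_t stride, int total)
{
	return vf_routing_id(pf_bus, pf_devfn, offset, stride, total - 1).bus
		- pf_bus;
}
```

For a PF at bus 1, devfn 0 with an offset of 0x80 and a stride of 2, VF 64 lands on bus 2, devfn 0, so 128 VFs need one extra bus while 64 VFs need none.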

2009-03-20 03:26:52

by Zhao, Yu

Subject: [PATCH v12 6/8] PCI: handle SR-IOV Virtual Function Migration

Add or remove a Virtual Function after receiving a Migrate In or Out
Request.

Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
drivers/pci/iov.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.h | 4 ++
include/linux/pci.h | 6 +++
3 files changed, 129 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index d0ff8ad..7227efc 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -179,6 +179,97 @@ static void virtfn_remove(struct pci_dev *dev, int id, int reset)
pci_dev_put(dev);
}

+static int sriov_migration(struct pci_dev *dev)
+{
+ u16 status;
+ struct pci_sriov *iov = dev->sriov;
+
+ if (!iov->nr_virtfn)
+ return 0;
+
+ if (!(iov->cap & PCI_SRIOV_CAP_VFM))
+ return 0;
+
+ pci_read_config_word(dev, iov->pos + PCI_SRIOV_STATUS, &status);
+ if (!(status & PCI_SRIOV_STATUS_VFM))
+ return 0;
+
+ schedule_work(&iov->mtask);
+
+ return 1;
+}
+
+static void sriov_migration_task(struct work_struct *work)
+{
+ int i;
+ u8 state;
+ u16 status;
+ struct pci_sriov *iov = container_of(work, struct pci_sriov, mtask);
+
+ for (i = iov->initial; i < iov->nr_virtfn; i++) {
+ state = readb(iov->mstate + i);
+ if (state == PCI_SRIOV_VFM_MI) {
+ writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
+ state = readb(iov->mstate + i);
+ if (state == PCI_SRIOV_VFM_AV)
+ virtfn_add(iov->self, i, 1);
+ } else if (state == PCI_SRIOV_VFM_MO) {
+ virtfn_remove(iov->self, i, 1);
+ writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
+ state = readb(iov->mstate + i);
+ if (state == PCI_SRIOV_VFM_AV)
+ virtfn_add(iov->self, i, 0);
+ }
+ }
+
+ pci_read_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, &status);
+ status &= ~PCI_SRIOV_STATUS_VFM;
+ pci_write_config_word(iov->self, iov->pos + PCI_SRIOV_STATUS, status);
+}
+
+static int sriov_enable_migration(struct pci_dev *dev, int nr_virtfn)
+{
+ int bir;
+ u32 table;
+ resource_size_t pa;
+ struct pci_sriov *iov = dev->sriov;
+
+ if (nr_virtfn <= iov->initial)
+ return 0;
+
+ pci_read_config_dword(dev, iov->pos + PCI_SRIOV_VFM, &table);
+ bir = PCI_SRIOV_VFM_BIR(table);
+ if (bir > PCI_STD_RESOURCE_END)
+ return -EIO;
+
+ table = PCI_SRIOV_VFM_OFFSET(table);
+ if (table + nr_virtfn > pci_resource_len(dev, bir))
+ return -EIO;
+
+ pa = pci_resource_start(dev, bir) + table;
+ iov->mstate = ioremap(pa, nr_virtfn);
+ if (!iov->mstate)
+ return -ENOMEM;
+
+ INIT_WORK(&iov->mtask, sriov_migration_task);
+
+ iov->ctrl |= PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR;
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+
+ return 0;
+}
+
+static void sriov_disable_migration(struct pci_dev *dev)
+{
+ struct pci_sriov *iov = dev->sriov;
+
+ iov->ctrl &= ~(PCI_SRIOV_CTRL_VFM | PCI_SRIOV_CTRL_INTR);
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+
+ cancel_work_sync(&iov->mtask);
+ iounmap(iov->mstate);
+}
+
static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
{
int rc;
@@ -261,6 +352,12 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
goto failed;
}

+ if (iov->cap & PCI_SRIOV_CAP_VFM) {
+ rc = sriov_enable_migration(dev, nr_virtfn);
+ if (rc)
+ goto failed;
+ }
+
kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
iov->nr_virtfn = nr_virtfn;

@@ -290,6 +387,9 @@ static void sriov_disable(struct pci_dev *dev)
if (!iov->nr_virtfn)
return;

+ if (iov->cap & PCI_SRIOV_CAP_VFM)
+ sriov_disable_migration(dev);
+
for (i = 0; i < iov->nr_virtfn; i++)
virtfn_remove(dev, i, 0);

@@ -559,3 +659,22 @@ void pci_disable_sriov(struct pci_dev *dev)
sriov_disable(dev);
}
EXPORT_SYMBOL_GPL(pci_disable_sriov);
+
+/**
+ * pci_sriov_migration - notify SR-IOV core of Virtual Function Migration
+ * @dev: the PCI device
+ *
+ * Returns IRQ_HANDLED if the IRQ is handled, or IRQ_NONE if not.
+ *
+ * Physical Function driver is responsible to register IRQ handler using
+ * VF Migration Interrupt Message Number, and call this function when the
+ * interrupt is generated by the hardware.
+ */
+irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+ if (!dev->is_physfn)
+ return IRQ_NONE;
+
+ return sriov_migration(dev) ? IRQ_HANDLED : IRQ_NONE;
+}
+EXPORT_SYMBOL_GPL(pci_sriov_migration);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1bdace3..dd7c63f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1,6 +1,8 @@
#ifndef DRIVERS_PCI_H
#define DRIVERS_PCI_H

+#include <linux/workqueue.h>
+
#define PCI_CFG_SPACE_SIZE 256
#define PCI_CFG_SPACE_EXP_SIZE 4096

@@ -212,6 +214,8 @@ struct pci_sriov {
struct pci_dev *dev; /* lowest numbered PF */
struct pci_dev *self; /* this PF */
struct mutex lock; /* lock for VF bus */
+ struct work_struct mtask; /* VF Migration task */
+ u8 __iomem *mstate; /* VF Migration State Array */
};

#ifdef CONFIG_PCI_IOV
diff --git a/include/linux/pci.h b/include/linux/pci.h
index a83f662..df78327 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -52,6 +52,7 @@
#include <asm/atomic.h>
#include <linux/device.h>
#include <linux/io.h>
+#include <linux/irqreturn.h>

/* Include the ID list */
#include <linux/pci_ids.h>
@@ -1212,6 +1213,7 @@ void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);
#ifdef CONFIG_PCI_IOV
extern int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
extern void pci_disable_sriov(struct pci_dev *dev);
+extern irqreturn_t pci_sriov_migration(struct pci_dev *dev);
#else
static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
{
@@ -1220,6 +1222,10 @@ static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
static inline void pci_disable_sriov(struct pci_dev *dev)
{
}
+static inline irqreturn_t pci_sriov_migration(struct pci_dev *dev)
+{
+ return IRQ_NONE;
+}
#endif

#endif /* __KERNEL__ */
--
1.5.6.4
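
The state handling in sriov_migration_task() above can be modelled in user space as a rough sketch. Here a plain byte array stands in for the memory-mapped VF Migration State Array, and counters replace the calls to virtfn_add() and virtfn_remove(). The names struct vf_counts and process_migration() are assumptions made for this illustration. On real hardware the value read back after a write may differ from the value written, which is why the patch re-reads the state; with ordinary memory the read-back always matches.

```c
#include <stdint.h>

/* State encodings, matching the PCI_SRIOV_VFM_* values in the patch. */
enum {
	VFM_UA = 0x0,	/* Inactive.Unavailable */
	VFM_MI = 0x1,	/* Dormant.MigrateIn */
	VFM_MO = 0x2,	/* Active.MigrateOut */
	VFM_AV = 0x3	/* Active.Available */
};

/* Hypothetical counters standing in for virtfn_add()/virtfn_remove(). */
struct vf_counts {
	int added;
	int removed;
};

static struct vf_counts process_migration(uint8_t *mstate, int initial,
					  int nr_virtfn)
{
	struct vf_counts c = { 0, 0 };
	int i;

	/* Only VFs at index 'initial' and above take part in migration. */
	for (i = initial; i < nr_virtfn; i++) {
		if (mstate[i] == VFM_MI) {
			/* Migrate In: activate the VF, then add it. */
			mstate[i] = VFM_AV;
			if (mstate[i] == VFM_AV)
				c.added++;
		} else if (mstate[i] == VFM_MO) {
			/* Migrate Out: remove the VF, then release it. */
			c.removed++;
			mstate[i] = VFM_UA;
			if (mstate[i] == VFM_AV)
				c.added++;
		}
	}
	return c;
}
```

Entries below 'initial' are left untouched, as are entries that are already in the Active.Available or Inactive.Unavailable state.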

2009-03-20 03:26:35

by Zhao, Yu

Subject: [PATCH v12 4/8] PCI: centralize device setup code

Move the device setup code into pci_setup_device(), which will later be
used to set up the Virtual Function.

Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
drivers/pci/pci.h | 1 +
drivers/pci/probe.c | 79 ++++++++++++++++++++++++++-------------------------
2 files changed, 41 insertions(+), 39 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 7abdef6..80ad848 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -178,6 +178,7 @@ enum pci_bar_type {
pci_bar_mem64, /* A 64-bit memory BAR */
};

+extern int pci_setup_device(struct pci_dev *dev);
extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int reg);
extern int pci_resource_bar(struct pci_dev *dev, int resno,
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4c8abd0..f4ca550 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -674,6 +674,19 @@ static void pci_read_irq(struct pci_dev *dev)
dev->irq = irq;
}

+static void set_pcie_port_type(struct pci_dev *pdev)
+{
+ int pos;
+ u16 reg16;
+
+ pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
+ if (!pos)
+ return;
+ pdev->is_pcie = 1;
+ pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16);
+ pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4;
+}
+
#define LEGACY_IO_RESOURCE (IORESOURCE_IO | IORESOURCE_PCI_FIXED)

/**
@@ -683,12 +696,34 @@ static void pci_read_irq(struct pci_dev *dev)
* Initialize the device structure with information about the device's
* vendor,class,memory and IO-space addresses,IRQ lines etc.
* Called at initialisation of the PCI subsystem and by CardBus services.
- * Returns 0 on success and -1 if unknown type of device (not normal, bridge
- * or CardBus).
+ * Returns 0 on success and negative if unknown type of device (not normal,
+ * bridge or CardBus).
*/
-static int pci_setup_device(struct pci_dev * dev)
+int pci_setup_device(struct pci_dev *dev)
{
u32 class;
+ u8 hdr_type;
+ struct pci_slot *slot;
+
+ if (pci_read_config_byte(dev, PCI_HEADER_TYPE, &hdr_type))
+ return -EIO;
+
+ dev->sysdata = dev->bus->sysdata;
+ dev->dev.parent = dev->bus->bridge;
+ dev->dev.bus = &pci_bus_type;
+ dev->hdr_type = hdr_type & 0x7f;
+ dev->multifunction = !!(hdr_type & 0x80);
+ dev->cfg_size = pci_cfg_space_size(dev);
+ dev->error_state = pci_channel_io_normal;
+ set_pcie_port_type(dev);
+
+ list_for_each_entry(slot, &dev->bus->slots, list)
+ if (PCI_SLOT(dev->devfn) == slot->number)
+ dev->slot = slot;
+
+ /* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
+ set this higher, assuming the system even supports it. */
+ dev->dma_mask = 0xffffffff;

dev_set_name(&dev->dev, "%04x:%02x:%02x.%d", pci_domain_nr(dev->bus),
dev->bus->number, PCI_SLOT(dev->devfn),
@@ -708,7 +743,6 @@ static int pci_setup_device(struct pci_dev * dev)

/* Early fixups, before probing the BARs */
pci_fixup_device(pci_fixup_early, dev);
- class = dev->class >> 8;

switch (dev->hdr_type) { /* header type */
case PCI_HEADER_TYPE_NORMAL: /* standard header */
@@ -770,7 +804,7 @@ static int pci_setup_device(struct pci_dev * dev)
default: /* unknown header */
dev_err(&dev->dev, "unknown header type %02x, "
"ignoring device\n", dev->hdr_type);
- return -1;
+ return -EIO;

bad:
dev_err(&dev->dev, "ignoring class %02x (doesn't match header "
@@ -804,19 +838,6 @@ static void pci_release_dev(struct device *dev)
kfree(pci_dev);
}

-static void set_pcie_port_type(struct pci_dev *pdev)
-{
- int pos;
- u16 reg16;
-
- pos = pci_find_capability(pdev, PCI_CAP_ID_EXP);
- if (!pos)
- return;
- pdev->is_pcie = 1;
- pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16);
- pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4;
-}
-
/**
* pci_cfg_space_size - get the configuration space size of the PCI device.
* @dev: PCI device
@@ -892,9 +913,7 @@ EXPORT_SYMBOL(alloc_pci_dev);
static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
{
struct pci_dev *dev;
- struct pci_slot *slot;
u32 l;
- u8 hdr_type;
int delay = 1;

if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, &l))
@@ -921,34 +940,16 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
}
}

- if (pci_bus_read_config_byte(bus, devfn, PCI_HEADER_TYPE, &hdr_type))
- return NULL;
-
dev = alloc_pci_dev();
if (!dev)
return NULL;

dev->bus = bus;
- dev->sysdata = bus->sysdata;
- dev->dev.parent = bus->bridge;
- dev->dev.bus = &pci_bus_type;
dev->devfn = devfn;
- dev->hdr_type = hdr_type & 0x7f;
- dev->multifunction = !!(hdr_type & 0x80);
dev->vendor = l & 0xffff;
dev->device = (l >> 16) & 0xffff;
- dev->cfg_size = pci_cfg_space_size(dev);
- dev->error_state = pci_channel_io_normal;
- set_pcie_port_type(dev);
-
- list_for_each_entry(slot, &bus->slots, list)
- if (PCI_SLOT(devfn) == slot->number)
- dev->slot = slot;

- /* Assume 32-bit PCI; let 64-bit PCI cards (which are far rarer)
- set this higher, assuming the system even supports it. */
- dev->dma_mask = 0xffffffff;
- if (pci_setup_device(dev) < 0) {
+ if (pci_setup_device(dev)) {
kfree(dev);
return NULL;
}
--
1.5.6.4

2009-03-20 03:27:20

by Zhao, Yu

Subject: [PATCH v12 7/8] PCI: document SR-IOV sysfs entries

Reviewed-by: Randy Dunlap <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-pci | 27 +++++++++++++++++++++++++++
1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index e638e15..36edf03 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -52,3 +52,30 @@ Description:
that some devices may have malformatted data. If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.
+
+What: /sys/bus/pci/devices/.../virtfnN
+Date: March 2009
+Contact: Yu Zhao <[email protected]>
+Description:
+ This symbolic link appears when hardware supports the SR-IOV
+ capability and the Physical Function driver has enabled it.
+ The symbolic link points to the PCI device sysfs entry of the
+ Virtual Function whose index is N (0...MaxVFs-1).
+
+What: /sys/bus/pci/devices/.../dep_link
+Date: March 2009
+Contact: Yu Zhao <[email protected]>
+Description:
+ This symbolic link appears when hardware supports the SR-IOV
+ capability and the Physical Function driver has enabled it,
+ and this device has vendor-specific dependencies on others.
+ The symbolic link points to the PCI device sysfs entry of
+ the Physical Function this device depends on.
+
+What: /sys/bus/pci/devices/.../physfn
+Date: March 2009
+Contact: Yu Zhao <[email protected]>
+Description:
+ This symbolic link appears when a device is a Virtual Function.
+ The symbolic link points to the PCI device sysfs entry of the
+ Physical Function this device associates with.
--
1.5.6.4

2009-03-20 03:27:52

by Zhao, Yu

Subject: [PATCH v12 5/8] PCI: add SR-IOV API for Physical Function driver

Add or remove the Virtual Functions when SR-IOV is enabled or
disabled by the device driver. This can happen at any time, not
only at the device probe stage.

Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
drivers/pci/iov.c | 314 +++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.h | 2 +
include/linux/pci.h | 19 +++-
3 files changed, 334 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 5ddfc09..d0ff8ad 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -13,6 +13,7 @@
#include <linux/delay.h>
#include "pci.h"

+#define VIRTFN_ID_LEN 16

static inline u8 virtfn_bus(struct pci_dev *dev, int id)
{
@@ -26,6 +27,284 @@ static inline u8 virtfn_devfn(struct pci_dev *dev, int id)
dev->sriov->stride * id) & 0xff;
}

+static struct pci_bus *virtfn_add_bus(struct pci_bus *bus, int busnr)
+{
+ int rc;
+ struct pci_bus *child;
+
+ if (bus->number == busnr)
+ return bus;
+
+ child = pci_find_bus(pci_domain_nr(bus), busnr);
+ if (child)
+ return child;
+
+ child = pci_add_new_bus(bus, NULL, busnr);
+ if (!child)
+ return NULL;
+
+ child->subordinate = busnr;
+ child->dev.parent = bus->bridge;
+ rc = pci_bus_add_child(child);
+ if (rc) {
+ pci_remove_bus(child);
+ return NULL;
+ }
+
+ return child;
+}
+
+static void virtfn_remove_bus(struct pci_bus *bus, int busnr)
+{
+ struct pci_bus *child;
+
+ if (bus->number == busnr)
+ return;
+
+ child = pci_find_bus(pci_domain_nr(bus), busnr);
+ BUG_ON(!child);
+
+ if (list_empty(&child->devices))
+ pci_remove_bus(child);
+}
+
+static int virtfn_add(struct pci_dev *dev, int id, int reset)
+{
+ int i;
+ int rc;
+ u64 size;
+ char buf[VIRTFN_ID_LEN];
+ struct pci_dev *virtfn;
+ struct resource *res;
+ struct pci_sriov *iov = dev->sriov;
+
+ virtfn = alloc_pci_dev();
+ if (!virtfn)
+ return -ENOMEM;
+
+ mutex_lock(&iov->dev->sriov->lock);
+ virtfn->bus = virtfn_add_bus(dev->bus, virtfn_bus(dev, id));
+ if (!virtfn->bus) {
+ kfree(virtfn);
+ mutex_unlock(&iov->dev->sriov->lock);
+ return -ENOMEM;
+ }
+ virtfn->devfn = virtfn_devfn(dev, id);
+ virtfn->vendor = dev->vendor;
+ pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_DID, &virtfn->device);
+ pci_setup_device(virtfn);
+ virtfn->dev.parent = dev->dev.parent;
+
+ for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ if (!res->parent)
+ continue;
+ virtfn->resource[i].name = pci_name(virtfn);
+ virtfn->resource[i].flags = res->flags;
+ size = resource_size(res);
+ do_div(size, iov->total);
+ virtfn->resource[i].start = res->start + size * id;
+ virtfn->resource[i].end = virtfn->resource[i].start + size - 1;
+ rc = request_resource(res, &virtfn->resource[i]);
+ BUG_ON(rc);
+ }
+
+ if (reset)
+ pci_execute_reset_function(virtfn);
+
+ pci_device_add(virtfn, virtfn->bus);
+ mutex_unlock(&iov->dev->sriov->lock);
+
+ virtfn->physfn = pci_dev_get(dev);
+ virtfn->is_virtfn = 1;
+
+ rc = pci_bus_add_device(virtfn);
+ if (rc)
+ goto failed1;
+ sprintf(buf, "virtfn%u", id);
+ rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
+ if (rc)
+ goto failed1;
+ rc = sysfs_create_link(&virtfn->dev.kobj, &dev->dev.kobj, "physfn");
+ if (rc)
+ goto failed2;
+
+ kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
+
+ return 0;
+
+failed2:
+ sysfs_remove_link(&dev->dev.kobj, buf);
+failed1:
+ pci_dev_put(dev);
+ mutex_lock(&iov->dev->sriov->lock);
+ pci_remove_bus_device(virtfn);
+ virtfn_remove_bus(dev->bus, virtfn_bus(dev, id));
+ mutex_unlock(&iov->dev->sriov->lock);
+
+ return rc;
+}
+
+static void virtfn_remove(struct pci_dev *dev, int id, int reset)
+{
+ char buf[VIRTFN_ID_LEN];
+ struct pci_bus *bus;
+ struct pci_dev *virtfn;
+ struct pci_sriov *iov = dev->sriov;
+
+ bus = pci_find_bus(pci_domain_nr(dev->bus), virtfn_bus(dev, id));
+ if (!bus)
+ return;
+
+ virtfn = pci_get_slot(bus, virtfn_devfn(dev, id));
+ if (!virtfn)
+ return;
+
+ pci_dev_put(virtfn);
+
+ if (reset) {
+ device_release_driver(&virtfn->dev);
+ pci_execute_reset_function(virtfn);
+ }
+
+ sprintf(buf, "virtfn%u", id);
+ sysfs_remove_link(&dev->dev.kobj, buf);
+ sysfs_remove_link(&virtfn->dev.kobj, "physfn");
+
+ mutex_lock(&iov->dev->sriov->lock);
+ pci_remove_bus_device(virtfn);
+ virtfn_remove_bus(dev->bus, virtfn_bus(dev, id));
+ mutex_unlock(&iov->dev->sriov->lock);
+
+ pci_dev_put(dev);
+}
+
+static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
+{
+ int rc;
+ int i, j;
+ int nres;
+ u16 offset, stride, initial;
+ struct resource *res;
+ struct pci_dev *pdev;
+ struct pci_sriov *iov = dev->sriov;
+
+ if (!nr_virtfn)
+ return 0;
+
+ if (iov->nr_virtfn)
+ return -EINVAL;
+
+ pci_read_config_word(dev, iov->pos + PCI_SRIOV_INITIAL_VF, &initial);
+ if (initial > iov->total ||
+ (!(iov->cap & PCI_SRIOV_CAP_VFM) && (initial != iov->total)))
+ return -EIO;
+
+ if (nr_virtfn < 0 || nr_virtfn > iov->total ||
+ (!(iov->cap & PCI_SRIOV_CAP_VFM) && (nr_virtfn > initial)))
+ return -EINVAL;
+
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, nr_virtfn);
+ pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_OFFSET, &offset);
+ pci_read_config_word(dev, iov->pos + PCI_SRIOV_VF_STRIDE, &stride);
+ if (!offset || (nr_virtfn > 1 && !stride))
+ return -EIO;
+
+ nres = 0;
+ for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ if (res->parent)
+ nres++;
+ }
+ if (nres != iov->nres) {
+ dev_err(&dev->dev, "not enough MMIO resources for SR-IOV\n");
+ return -ENOMEM;
+ }
+
+ iov->offset = offset;
+ iov->stride = stride;
+
+ if (virtfn_bus(dev, nr_virtfn - 1) > dev->bus->subordinate) {
+ dev_err(&dev->dev, "SR-IOV: bus number out of range\n");
+ return -ENOMEM;
+ }
+
+ if (iov->link != dev->devfn) {
+ pdev = pci_get_slot(dev->bus, iov->link);
+ if (!pdev)
+ return -ENODEV;
+
+ pci_dev_put(pdev);
+
+ if (!pdev->is_physfn)
+ return -ENODEV;
+
+ rc = sysfs_create_link(&dev->dev.kobj,
+ &pdev->dev.kobj, "dep_link");
+ if (rc)
+ return rc;
+ }
+
+ iov->ctrl |= PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE;
+ pci_block_user_cfg_access(dev);
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+ msleep(100);
+ pci_unblock_user_cfg_access(dev);
+
+ iov->initial = initial;
+ if (nr_virtfn < initial)
+ initial = nr_virtfn;
+
+ for (i = 0; i < initial; i++) {
+ rc = virtfn_add(dev, i, 0);
+ if (rc)
+ goto failed;
+ }
+
+ kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
+ iov->nr_virtfn = nr_virtfn;
+
+ return 0;
+
+failed:
+ for (j = 0; j < i; j++)
+ virtfn_remove(dev, j, 0);
+
+ iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
+ pci_block_user_cfg_access(dev);
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+ ssleep(1);
+ pci_unblock_user_cfg_access(dev);
+
+ if (iov->link != dev->devfn)
+ sysfs_remove_link(&dev->dev.kobj, "dep_link");
+
+ return rc;
+}
+
+static void sriov_disable(struct pci_dev *dev)
+{
+ int i;
+ struct pci_sriov *iov = dev->sriov;
+
+ if (!iov->nr_virtfn)
+ return;
+
+ for (i = 0; i < iov->nr_virtfn; i++)
+ virtfn_remove(dev, i, 0);
+
+ iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
+ pci_block_user_cfg_access(dev);
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
+ ssleep(1);
+ pci_unblock_user_cfg_access(dev);
+
+ if (iov->link != dev->devfn)
+ sysfs_remove_link(&dev->dev.kobj, "dep_link");
+
+ iov->nr_virtfn = 0;
+}
+
static int sriov_init(struct pci_dev *dev, int pos)
{
int i;
@@ -132,6 +411,8 @@ failed:

static void sriov_release(struct pci_dev *dev)
{
+ BUG_ON(dev->sriov->nr_virtfn);
+
if (dev == dev->sriov->dev)
mutex_destroy(&dev->sriov->lock);
else
@@ -155,6 +436,7 @@ static void sriov_restore_state(struct pci_dev *dev)
pci_update_resource(dev, i);

pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
+ pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, iov->nr_virtfn);
pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
msleep(100);
@@ -245,3 +527,35 @@ int pci_iov_bus_range(struct pci_bus *bus)

return max ? max - bus->number : 0;
}
+
+/**
+ * pci_enable_sriov - enable the SR-IOV capability
+ * @dev: the PCI device
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
+{
+ might_sleep();
+
+ if (!dev->is_physfn)
+ return -ENODEV;
+
+ return sriov_enable(dev, nr_virtfn);
+}
+EXPORT_SYMBOL_GPL(pci_enable_sriov);
+
+/**
+ * pci_disable_sriov - disable the SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_disable_sriov(struct pci_dev *dev)
+{
+ might_sleep();
+
+ if (!dev->is_physfn)
+ return;
+
+ sriov_disable(dev);
+}
+EXPORT_SYMBOL_GPL(pci_disable_sriov);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 80ad848..1bdace3 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -203,6 +203,8 @@ struct pci_sriov {
u32 cap; /* SR-IOV Capabilities */
u16 ctrl; /* SR-IOV Control */
u16 total; /* total VFs associated with the PF */
+ u16 initial; /* initial VFs associated with the PF */
+ u16 nr_virtfn; /* number of VFs available */
u16 offset; /* first VF Routing ID offset */
u16 stride; /* following VF stride */
u32 pgsz; /* page size for BAR alignment */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 01eed8f..a83f662 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -265,6 +265,7 @@ struct pci_dev {
unsigned int is_pcie:1;
unsigned int state_saved:1;
unsigned int is_physfn:1;
+ unsigned int is_virtfn:1;
pci_dev_flags_t dev_flags;
atomic_t enable_cnt; /* pci_enable_device has been called */

@@ -279,7 +280,10 @@ struct pci_dev {
#endif
struct pci_vpd *vpd;
#ifdef CONFIG_PCI_IOV
- struct pci_sriov *sriov; /* SR-IOV capability related */
+ union {
+ struct pci_sriov *sriov; /* SR-IOV capability related */
+ struct pci_dev *physfn; /* the PF this VF is associated with */
+ };
#endif
};

@@ -1205,5 +1209,18 @@ int pci_ext_cfg_avail(struct pci_dev *dev);

void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar);

+#ifdef CONFIG_PCI_IOV
+extern int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+extern void pci_disable_sriov(struct pci_dev *dev);
+#else
+static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
+{
+ return -ENODEV;
+}
+static inline void pci_disable_sriov(struct pci_dev *dev)
+{
+}
+#endif
+
#endif /* __KERNEL__ */
#endif /* LINUX_PCI_H */
--
1.5.6.4
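
As a reading aid for virtfn_add(), here is a user-space sketch of the two pieces of arithmetic it relies on: the Routing ID of VF number `id`, and the slice of the PF's VF BAR window handed to that VF. The struct `pf_info` and the helpers `vf_bus()`, `vf_devfn()` and `vf_bar_slice()` are hypothetical stand-ins, not kernel code. The devfn formula follows the visible tail of virtfn_devfn(); the bus formula is an assumption based on the SR-IOV Routing ID rule (PF Routing ID + First VF Offset + VF Stride * id), since the body of virtfn_bus() is not shown in this patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the PF fields the kernel helpers read. */
struct pf_info {
    uint8_t  bus;     /* bus number the PF sits on */
    uint8_t  devfn;   /* PF device/function number */
    uint16_t offset;  /* First VF Offset */
    uint16_t stride;  /* VF Stride */
    uint16_t total;   /* TotalVFs, used to split each VF BAR window */
};

/* Assumed: anything above the low 8 bits carries into the bus number. */
static uint8_t vf_bus(const struct pf_info *pf, int id)
{
    return (uint8_t)(pf->bus +
                     ((pf->devfn + pf->offset + pf->stride * id) >> 8));
}

/* The low 8 bits of the sum are the VF's devfn. */
static uint8_t vf_devfn(const struct pf_info *pf, int id)
{
    return (uint8_t)((pf->devfn + pf->offset + pf->stride * id) & 0xff);
}

/*
 * Split the inclusive window [start, end] into pf->total equal slices
 * and return the slice for VF 'id', as the resource loop in virtfn_add()
 * does with resource_size() and do_div().
 */
static void vf_bar_slice(const struct pf_info *pf, uint64_t start,
                         uint64_t end, int id,
                         uint64_t *vf_start, uint64_t *vf_end)
{
    uint64_t size = (end - start + 1) / pf->total;

    *vf_start = start + size * (uint64_t)id;
    *vf_end = *vf_start + size - 1;
}
```

For a PF at bus 1, devfn 0x00 with offset 0x80 and stride 2, VF 3 lands on devfn 0x86 of the same bus. The sketch also shows why sriov_enable() compares virtfn_bus(dev, nr_virtfn - 1) against the subordinate bus number: a large enough id carries into the next bus.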

2009-03-20 03:27:36

by Zhao, Yu

Subject: [PATCH v12 8/8] PCI: manual for SR-IOV user and driver developer

Reviewed-by: Randy Dunlap <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
---
Documentation/DocBook/kernel-api.tmpl | 1 +
Documentation/PCI/pci-iov-howto.txt | 99 +++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+), 0 deletions(-)
create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index bc962cd..58c1945 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -199,6 +199,7 @@ X!Edrivers/pci/hotplug.c
-->
!Edrivers/pci/probe.c
!Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
</sect1>
<sect1><title>PCI Hotplug Support Library</title>
!Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 0000000..fc73ef5
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,99 @@
+ PCI Express I/O Virtualization Howto
+ Copyright (C) 2009 Intel Corporation
+ Yu Zhao <[email protected]>
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as the Physical Function
+(PF) while the virtual devices are referred to as Virtual Functions
+(VFs). Allocation of the VFs can be dynamically controlled by the PF
+via registers encapsulated in the capability. By default, this feature
+is not enabled and the PF behaves as a traditional PCIe device. Once
+it's turned on, each VF's PCI configuration space can be accessed by
+its own Bus, Device and Function Number (Routing ID). Each VF also has
+PCI Memory Space, which is used to map its register set. The VF device
+driver operates on the register set, so the VF is functional and
+appears as a real PCI device.
+
+2. User Guide
+
+2.1 How can I enable SR-IOV capability
+
+The device driver (PF driver) controls the enabling and disabling of
+the capability via the API provided by the SR-IOV core. If the hardware
+has the SR-IOV capability, loading its PF driver enables it and all
+VFs associated with the PF.
+
+2.2 How can I use the Virtual Functions
+
+VFs are treated as hot-plugged PCI devices in the kernel, so they
+should work in the same way as real PCI devices. A VF requires a
+device driver, just as a normal PCI device does.
+
+3. Developer Guide
+
+3.1 SR-IOV API
+
+To enable SR-IOV capability:
+ int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+ 'nr_virtfn' is the number of VFs to be enabled.
+
+To disable SR-IOV capability:
+ void pci_disable_sriov(struct pci_dev *dev);
+
+To notify SR-IOV core of Virtual Function Migration:
+ irqreturn_t pci_sriov_migration(struct pci_dev *dev);
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the SR-IOV API.
+
+static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
+{
+ pci_enable_sriov(dev, NR_VIRTFN);
+
+ ...
+
+ return 0;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+ pci_disable_sriov(dev);
+
+ ...
+}
+
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+ ...
+
+ return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+ ...
+
+ return 0;
+}
+
+static void dev_shutdown(struct pci_dev *dev)
+{
+ ...
+}
+
+static struct pci_driver dev_driver = {
+ .name = "SR-IOV Physical Function driver",
+ .id_table = dev_id_table,
+ .probe = dev_probe,
+ .remove = __devexit_p(dev_remove),
+ .suspend = dev_suspend,
+ .resume = dev_resume,
+ .shutdown = dev_shutdown,
+};
--
1.5.6.4
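
The probe example in the howto ignores the return value of pci_enable_sriov(), so it may help to see which 'nr_virtfn' values the call rejects. Below is a user-space sketch of the range checks that sriov_enable() in patch 5/8 applies before touching the hardware. The function name `check_nr_virtfn()` and its flat parameter list are hypothetical; the real code reads InitialVFs from config space and TotalVFs and the VF Migration capability bit from struct pci_sriov, and performs further checks (offset/stride, resources, bus range) that are not modelled here.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Mirror of the early checks in sriov_enable():
 *   total   - TotalVFs of the PF
 *   initial - InitialVFs read back from the device
 *   vfm     - true if the device reports the VF Migration capability
 * Returns 0 if nr_virtfn would pass these checks, or a negative errno.
 */
static int check_nr_virtfn(int nr_virtfn, int total, int initial, bool vfm)
{
    /* Enabling zero VFs is a no-op that succeeds. */
    if (nr_virtfn == 0)
        return 0;

    /* Without VF Migration, InitialVFs must equal TotalVFs. */
    if (initial > total || (!vfm && initial != total))
        return -EIO;

    /* The request must fit within what the device offers. */
    if (nr_virtfn < 0 || nr_virtfn > total ||
        (!vfm && nr_virtfn > initial))
        return -EINVAL;

    return 0;
}
```

A PF driver built on this series would typically propagate a negative return from pci_enable_sriov() out of its probe routine, or continue without VFs, rather than discarding it as the short howto example does.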

2009-03-20 16:32:22

by Chetan.Loke

Subject: RE: [PATCH v12 7/8] PCI: document SR-IOV sysfs entries

[email protected] wrote:
> Reviewed-by: Randy Dunlap <[email protected]>
> Reviewed-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-bus-pci | 27
> +++++++++++++++++++++++++++
> 1 files changed, 27 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci
> b/Documentation/ABI/testing/sysfs-bus-pci
> index e638e15..36edf03 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci


A very handy feature would be the ability to send resets (BME etc.) via a sysfs entry to all the VFs attached to a PF. This way one can test the drivers/firmware really easily.

BME
|
|->PF
|->VF0
|->VF1
..
|->VFn


Thanks
Chetan-

2009-03-20 17:54:28

by Jesse Barnes

Subject: Re: [PATCH v12 1/8] PCI: initialize and release SR-IOV capability

On Fri, 20 Mar 2009 11:25:11 +0800
Yu Zhao <[email protected]> wrote:

> If a device has the SR-IOV capability, initialize it (set the ARI
> Capable Hierarchy in the lowest numbered PF if necessary; calculate
> the System Page Size for the VF MMIO, probe the VF Offset, Stride
> and BARs). A lock for the VF bus allocation is also initialized if
> a PF is the lowest numbered PF.
>
> Reviewed-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>

I applied this series to my linux-next branch, but there were a few
conflicts here and there, so please check it out. Looks like from
start to finish this took about 6 months to get banged into shape,
thanks for staying on it, Yu!

--
Jesse Barnes, Intel Open Source Technology Center

2009-03-21 14:04:42

by Zhao, Yu

Subject: Re: [PATCH v12 1/8] PCI: initialize and release SR-IOV capability

On Sat, Mar 21, 2009 at 01:54:09AM +0800, Jesse Barnes wrote:
> On Fri, 20 Mar 2009 11:25:11 +0800
> Yu Zhao <[email protected]> wrote:
>
> > If a device has the SR-IOV capability, initialize it (set the ARI
> > Capable Hierarchy in the lowest numbered PF if necessary; calculate
> > the System Page Size for the VF MMIO, probe the VF Offset, Stride
> > and BARs). A lock for the VF bus allocation is also initialized if
> > a PF is the lowest numbered PF.
> >
> > Reviewed-by: Matthew Wilcox <[email protected]>
> > Signed-off-by: Yu Zhao <[email protected]>
>
> I applied this series to my linux-next branch, but there were a few
> conflicts here and there, so please check it out. Looks like from
> start to finish this took about 6 months to get banged into shape,
> thanks for staying on it, Yu!

Yes, I checked them and found there is a conflict between the SR-IOV
changes and Yinghai's 'PCI/x86: detect host bridge config space size
w/o using quirks'. The fix follows, thanks!


The new pci_cfg_space_size() needs a valid pdev->class, so call it at
the right place in pci_setup_device().

Signed-off-by: Yu Zhao <[email protected]>
---
drivers/pci/probe.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 56c71e5..e2f3dd0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -713,7 +713,6 @@ int pci_setup_device(struct pci_dev *dev)
dev->dev.bus = &pci_bus_type;
dev->hdr_type = hdr_type & 0x7f;
dev->multifunction = !!(hdr_type & 0x80);
- dev->cfg_size = pci_cfg_space_size(dev);
dev->error_state = pci_channel_io_normal;
set_pcie_port_type(dev);

@@ -738,6 +737,9 @@ int pci_setup_device(struct pci_dev *dev)
dev_dbg(&dev->dev, "found [%04x:%04x] class %06x header type %02x\n",
dev->vendor, dev->device, class, dev->hdr_type);

+ /* need to have dev->class ready */
+ dev->cfg_size = pci_cfg_space_size(dev);
+
/* "Unknown power state" */
dev->current_state = PCI_UNKNOWN;

@@ -959,9 +961,6 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
return NULL;
}

- /* need to have dev->class ready */
- dev->cfg_size = pci_cfg_space_size(dev);
-
return dev;
}

--
1.5.6.4

2009-03-26 22:51:05

by Jesse Barnes

Subject: Re: [PATCH v12 1/8] PCI: initialize and release SR-IOV capability

On Sat, 21 Mar 2009 22:05:11 +0800
Yu Zhao <[email protected]> wrote:

> On Sat, Mar 21, 2009 at 01:54:09AM +0800, Jesse Barnes wrote:
> > On Fri, 20 Mar 2009 11:25:11 +0800
> > Yu Zhao <[email protected]> wrote:
> >
> > > If a device has the SR-IOV capability, initialize it (set the ARI
> > > Capable Hierarchy in the lowest numbered PF if necessary;
> > > calculate the System Page Size for the VF MMIO, probe the VF
> > > Offset, Stride and BARs). A lock for the VF bus allocation is
> > > also initialized if a PF is the lowest numbered PF.
> > >
> > > Reviewed-by: Matthew Wilcox <[email protected]>
> > > Signed-off-by: Yu Zhao <[email protected]>
> >
> > I applied this series to my linux-next branch, but there were a few
> > conflicts here and there, so please check it out. Looks like from
> > start to finish this took about 6 months to get banged into shape,
> > thanks for staying on it, Yu!
>
> Yes, I checked them and found there is conflict between the SR-IOV
> changes and Yinghai's 'PCI/x86: detect host bridge config space size
> w/o using quirks'. Following is the fix, thanks!
>
>
> The new pci_cfg_space_size() needs a valid pdev->class, so call it at
> the right place in pci_setup_device().
>
> Signed-off-by: Yu Zhao <[email protected]>

Applied, thanks Yu.

--
Jesse Barnes, Intel Open Source Technology Center