2008-10-22 09:33:22

by Zhao, Yu

Subject: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greetings,

The following patches add support for the SR-IOV capability to the
Linux kernel. With these patches, a PCI device that supports the
capability can be presented as multiple devices from the software
perspective, which benefits KVM and serves other purposes such as
QoS and security.

Changes from v5 to v6:
1, update ABI document to include SR-IOV sysfs entries (Greg KH)
2, fix two coding style problems (Ingo Molnar)

---

[PATCH 1/16 v6] PCI: remove unnecessary arg of pci_update_resource()
[PATCH 2/16 v6] PCI: define PCI resource names in an 'enum'
[PATCH 3/16 v6] PCI: export __pci_read_base
[PATCH 4/16 v6] PCI: make pci_alloc_child_bus() be able to handle NULL bridge
[PATCH 5/16 v6] PCI: add a wrapper for resource_alignment()
[PATCH 6/16 v6] PCI: add a new function to map BAR offset
[PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources()
[PATCH 8/16 v6] PCI: add boot options to reassign resources
[PATCH 9/16 v6] PCI: add boot option to align MMIO resources
[PATCH 10/16 v6] PCI: cleanup pci_bus_add_devices()
[PATCH 11/16 v6] PCI: split a new function from pci_bus_add_devices()
[PATCH 12/16 v6] PCI: support the SR-IOV capability
[PATCH 13/16 v6] PCI: reserve bus range for SR-IOV device
[PATCH 14/16 v6] PCI: document for SR-IOV user and developer
[PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries
[PATCH 16/16 v6] PCI: document the new PCI boot parameters

---

The Single Root I/O Virtualization (SR-IOV) capability, defined by the
PCI-SIG, is intended to enable multiple system software instances to
share PCI hardware resources. A PCI device that supports this capability
can be extended to one Physical Function plus multiple Virtual Functions.
The Physical Function, which can be considered the "real" PCI device,
reflects the hardware instance and manages all physical resources.
Virtual Functions are associated with a Physical Function and share
physical resources with it. Software controls the allocation of Virtual
Functions via registers encapsulated in the capability structure.
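
As a rough illustration of how a Physical Function driver is expected to
use the interface added later in this series (patch 12), a minimal
sketch might look like the following. The "foo" driver name and the
fixed VF count are made up for the example; device setup and error
handling are omitted:

	static int foo_iov_callback(struct pci_dev *dev, u32 event)
	{
		/* accept PCI_IOV_ENABLE/DISABLE/NUMVFS requests */
		return 0;
	}

	static int __devinit foo_probe(struct pci_dev *dev,
				       const struct pci_device_id *id)
	{
		int rc;

		rc = pci_iov_register(dev, foo_iov_callback);
		if (rc)
			return rc;

		/* make two VFs available (also controllable via sysfs) */
		return pci_iov_enable(dev, 2);
	}

	static void __devexit foo_remove(struct pci_dev *dev)
	{
		pci_iov_disable(dev);
		pci_iov_unregister(dev);
	}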

The SR-IOV specification can be found at:
http://www.pcisig.com/members/downloads/specifications/iov/sr-iov1.0_11Sep07.pdf

Devices that support SR-IOV are available from the following vendors:
http://download.intel.com/design/network/ProdBrf/320025.pdf
http://www.netxen.com/products/chipsolutions/NX3031.html
http://www.neterion.com/products/x3100.html


2008-10-22 09:35:27

by Zhao, Yu

Subject: [PATCH 1/16 v6] PCI: remove unnecessary arg of pci_update_resource()

This cleanup removes the unnecessary argument 'struct resource *res'
from pci_update_resource(), so it takes the same arguments as its
companion functions (pci_assign_resource(), etc.).
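
For illustration, a call that previously read

	pci_update_resource(dev, &dev->resource[i], i);

now becomes

	pci_update_resource(dev, i);

as the hunks below show.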

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.c | 4 ++--
drivers/pci/setup-res.c | 7 ++++---
include/linux/pci.h | 2 +-
3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4db261e..ae62f01 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -376,8 +376,8 @@ pci_restore_bars(struct pci_dev *dev)
return;
}

- for (i = 0; i < numres; i ++)
- pci_update_resource(dev, &dev->resource[i], i);
+ for (i = 0; i < numres; i++)
+ pci_update_resource(dev, i);
}

static struct pci_platform_pm_ops *pci_platform_pm;
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index 2dbd96c..b7ca679 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -26,11 +26,12 @@
#include "pci.h"


-void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno)
+void pci_update_resource(struct pci_dev *dev, int resno)
{
struct pci_bus_region region;
u32 new, check, mask;
int reg;
+ struct resource *res = dev->resource + resno;

/*
* Ignore resources for unimplemented BARs and unused resource slots
@@ -162,7 +163,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno)
} else {
res->flags &= ~IORESOURCE_STARTALIGN;
if (resno < PCI_BRIDGE_RESOURCES)
- pci_update_resource(dev, res, resno);
+ pci_update_resource(dev, resno);
}

return ret;
@@ -197,7 +198,7 @@ int pci_assign_resource_fixed(struct pci_dev *dev, int resno)
dev_err(&dev->dev, "BAR %d: can't allocate %s resource %pR\n",
resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res);
} else if (resno < PCI_BRIDGE_RESOURCES) {
- pci_update_resource(dev, res, resno);
+ pci_update_resource(dev, resno);
}

return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 085187b..43e1fc1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -626,7 +626,7 @@ int pcix_get_mmrbc(struct pci_dev *dev);
int pcix_set_mmrbc(struct pci_dev *dev, int mmrbc);
int pcie_get_readrq(struct pci_dev *dev);
int pcie_set_readrq(struct pci_dev *dev, int rq);
-void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno);
+void pci_update_resource(struct pci_dev *dev, int resno);
int __must_check pci_assign_resource(struct pci_dev *dev, int i);
int pci_select_bars(struct pci_dev *dev, unsigned long flags);

--
1.5.6.4

2008-10-22 09:36:21

by Zhao, Yu

Subject: [PATCH 3/16 v6] PCI: export __pci_read_base

Export __pci_read_base() so it can be used by the whole PCI subsystem.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.h | 9 +++++++++
drivers/pci/probe.c | 20 +++++++++-----------
2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b205ab8..fbbc6ad 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -157,6 +157,15 @@ struct pci_slot_attribute {
};
#define to_pci_slot_attr(s) container_of(s, struct pci_slot_attribute, attr)

+enum pci_bar_type {
+ pci_bar_unknown, /* Standard PCI BAR probe */
+ pci_bar_io, /* An io port BAR */
+ pci_bar_mem32, /* A 32-bit memory BAR */
+ pci_bar_mem64, /* A 64-bit memory BAR */
+};
+
+extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
+ struct resource *res, unsigned int reg);
extern void pci_enable_ari(struct pci_dev *dev);
/**
* pci_ari_enabled - query ARI forwarding status
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index a52784c..db3e5a7 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -135,13 +135,6 @@ static u64 pci_size(u64 base, u64 maxbase, u64 mask)
return size;
}

-enum pci_bar_type {
- pci_bar_unknown, /* Standard PCI BAR probe */
- pci_bar_io, /* An io port BAR */
- pci_bar_mem32, /* A 32-bit memory BAR */
- pci_bar_mem64, /* A 64-bit memory BAR */
-};
-
static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar)
{
if ((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO) {
@@ -156,11 +149,16 @@ static inline enum pci_bar_type decode_bar(struct resource *res, u32 bar)
return pci_bar_mem32;
}

-/*
- * If the type is not unknown, we assume that the lowest bit is 'enable'.
- * Returns 1 if the BAR was 64-bit and 0 if it was 32-bit.
+/**
+ * pci_read_base - read a PCI BAR
+ * @dev: the PCI device
+ * @type: type of the BAR
+ * @res: resource buffer to be filled in
+ * @pos: BAR position in the config space
+ *
+ * Returns 1 if the BAR is 64-bit, or 0 if 32-bit.
*/
-static int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
+int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int pos)
{
u32 l, sz, mask;
--
1.5.6.4

2008-10-22 09:37:01

by Zhao, Yu

Subject: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum'

This patch moves all definitions of the PCI resource names into an
'enum' and replaces some hard-coded resource numbers with symbolic
names. This change eases the introduction of device-specific resources.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci-sysfs.c | 4 +++-
drivers/pci/pci.c | 19 ++-----------------
drivers/pci/probe.c | 2 +-
drivers/pci/proc.c | 7 ++++---
include/linux/pci.h | 37 ++++++++++++++++++++++++-------------
5 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 110022d..5c456ab 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -101,11 +101,13 @@ resource_show(struct device * dev, struct device_attribute *attr, char * buf)
struct pci_dev * pci_dev = to_pci_dev(dev);
char * str = buf;
int i;
- int max = 7;
+ int max;
resource_size_t start, end;

if (pci_dev->subordinate)
max = DEVICE_COUNT_RESOURCE;
+ else
+ max = PCI_BRIDGE_RESOURCES;

for (i = 0; i < max; i++) {
struct resource *res = &pci_dev->resource[i];
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index ae62f01..40284dc 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -359,24 +359,9 @@ pci_find_parent_resource(const struct pci_dev *dev, struct resource *res)
static void
pci_restore_bars(struct pci_dev *dev)
{
- int i, numres;
-
- switch (dev->hdr_type) {
- case PCI_HEADER_TYPE_NORMAL:
- numres = 6;
- break;
- case PCI_HEADER_TYPE_BRIDGE:
- numres = 2;
- break;
- case PCI_HEADER_TYPE_CARDBUS:
- numres = 1;
- break;
- default:
- /* Should never get here, but just in case... */
- return;
- }
+ int i;

- for (i = 0; i < numres; i++)
+ for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
pci_update_resource(dev, i);
}

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index aaaf0a1..a52784c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -426,7 +426,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
child->subordinate = 0xff;

/* Set up default resource pointers and names.. */
- for (i = 0; i < 4; i++) {
+ for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) {
child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i];
child->resource[i]->name = child->name;
}
diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
index e1098c3..f6f2a59 100644
--- a/drivers/pci/proc.c
+++ b/drivers/pci/proc.c
@@ -352,15 +352,16 @@ static int show_device(struct seq_file *m, void *v)
dev->vendor,
dev->device,
dev->irq);
- /* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */
- for (i=0; i<7; i++) {
+
+ /* only print standard and ROM resources to preserve compatibility */
+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
resource_size_t start, end;
pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, "\t%16llx",
(unsigned long long)(start |
(dev->resource[i].flags & PCI_REGION_FLAG_MASK)));
}
- for (i=0; i<7; i++) {
+ for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
resource_size_t start, end;
pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, "\t%16llx",
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 43e1fc1..2ada2b6 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -76,7 +76,30 @@ enum pci_mmap_state {
#define PCI_DMA_FROMDEVICE 2
#define PCI_DMA_NONE 3

-#define DEVICE_COUNT_RESOURCE 12
+/*
+ * For PCI devices, the region numbers are assigned this way:
+ */
+enum {
+ /* #0-5: standard PCI regions */
+ PCI_STD_RESOURCES,
+ PCI_STD_RESOURCES_END = 5,
+
+ /* #6: expansion ROM */
+ PCI_ROM_RESOURCE,
+
+ /* address space assigned to buses behind the bridge */
+#ifndef PCI_BRIDGE_RES_NUM
+#define PCI_BRIDGE_RES_NUM 4
+#endif
+ PCI_BRIDGE_RESOURCES,
+ PCI_BRIDGE_RES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1,
+
+ /* total resources associated with a PCI device */
+ PCI_NUM_RESOURCES,
+
+ /* preserve this for compatibility */
+ DEVICE_COUNT_RESOURCE
+};

typedef int __bitwise pci_power_t;

@@ -262,18 +285,6 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev,
hlist_add_head(&new_cap->next, &pci_dev->saved_cap_space);
}

-/*
- * For PCI devices, the region numbers are assigned this way:
- *
- * 0-5 standard PCI regions
- * 6 expansion ROM
- * 7-10 bridges: address space assigned to buses behind the bridge
- */
-
-#define PCI_ROM_RESOURCE 6
-#define PCI_BRIDGE_RESOURCES 7
-#define PCI_NUM_RESOURCES 11
-
#ifndef PCI_BUS_NUM_RESOURCES
#define PCI_BUS_NUM_RESOURCES 16
#endif
--
1.5.6.4

2008-10-22 09:37:51

by Zhao, Yu

Subject: [PATCH 5/16 v6] PCI: add a wrapper for resource_alignment()

Add a wrapper for resource_alignment() so it can also handle
device-specific resource alignment.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.c | 20 ++++++++++++++++++++
drivers/pci/pci.h | 1 +
drivers/pci/setup-bus.c | 4 ++--
drivers/pci/setup-res.c | 7 ++++---
4 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 40284dc..a9b554e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1904,6 +1904,26 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags)
return bars;
}

+/**
+ * pci_resource_alignment - get a PCI BAR resource alignment
+ * @dev: the PCI device
+ * @resno: the resource number
+ *
+ * Returns alignment size on success, or 0 on error.
+ */
+int pci_resource_alignment(struct pci_dev *dev, int resno)
+{
+ resource_size_t align;
+ struct resource *res = dev->resource + resno;
+
+ align = resource_alignment(res);
+ if (align)
+ return align;
+
+ dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno);
+ return 0;
+}
+
static void __devinit pci_no_domains(void)
{
#ifdef CONFIG_PCI_DOMAINS
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fbbc6ad..baa3d23 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -166,6 +166,7 @@ enum pci_bar_type {

extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int reg);
+extern int pci_resource_alignment(struct pci_dev *dev, int resno);
extern void pci_enable_ari(struct pci_dev *dev);
/**
* pci_ari_enabled - query ARI forwarding status
diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index ea979f2..90a9c0a 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -25,6 +25,7 @@
#include <linux/ioport.h>
#include <linux/cache.h>
#include <linux/slab.h>
+#include "pci.h"


static void pbus_assign_resources_sorted(struct pci_bus *bus)
@@ -351,8 +352,7 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask, unsigned long
if (r->parent || (r->flags & mask) != type)
continue;
r_size = resource_size(r);
- /* For bridges size != alignment */
- align = resource_alignment(r);
+ align = pci_resource_alignment(dev, i);
order = __ffs(align) - 20;
if (order > 11) {
dev_warn(&dev->dev, "BAR %d bad alignment %llx: "
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index b7ca679..88a9c70 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -133,7 +133,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno)
size = resource_size(res);
min = (res->flags & IORESOURCE_IO) ? PCIBIOS_MIN_IO : PCIBIOS_MIN_MEM;

- align = resource_alignment(res);
+ align = pci_resource_alignment(dev, resno);
if (!align) {
dev_err(&dev->dev, "BAR %d: can't allocate resource (bogus "
"alignment) %pR flags %#lx\n",
@@ -224,7 +224,7 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head)
if (!(r->flags) || r->parent)
continue;

- r_align = resource_alignment(r);
+ r_align = pci_resource_alignment(dev, i);
if (!r_align) {
dev_warn(&dev->dev, "BAR %d: bogus alignment "
"%pR flags %#lx\n",
@@ -236,7 +236,8 @@ void pdev_sort_resources(struct pci_dev *dev, struct resource_list *head)
struct resource_list *ln = list->next;

if (ln)
- align = resource_alignment(ln->res);
+ align = pci_resource_alignment(ln->dev,
+ ln->res - ln->dev->resource);

if (r_align > align) {
tmp = kmalloc(sizeof(*tmp), GFP_KERNEL);
--
1.5.6.4

2008-10-22 09:38:52

by Zhao, Yu

Subject: [PATCH 4/16 v6] PCI: make pci_alloc_child_bus() be able to handle NULL bridge

Make pci_alloc_child_bus() able to allocate buses that have no bridge
device. Some SR-IOV devices can occupy more than one bus number, but
there are no explicit bridges because these devices use an internal
routing mechanism.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/probe.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index db3e5a7..4b12b58 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -401,12 +401,10 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
if (!child)
return NULL;

- child->self = bridge;
child->parent = parent;
child->ops = parent->ops;
child->sysdata = parent->sysdata;
child->bus_flags = parent->bus_flags;
- child->bridge = get_device(&bridge->dev);

/* initialize some portions of the bus device, but don't register it
* now as the parent is not properly set up yet. This device will get
@@ -423,6 +421,11 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
child->primary = parent->secondary;
child->subordinate = 0xff;

+ if (!bridge)
+ return child;
+
+ child->self = bridge;
+ child->bridge = get_device(&bridge->dev);
/* Set up default resource pointers and names.. */
for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) {
child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i];
--
1.5.6.4

2008-10-22 09:39:23

by Zhao, Yu

Subject: [PATCH 6/16 v6] PCI: add a new function to map BAR offset

Add a function that maps a resource number to its corresponding
register, so callers can get the offset and type of device-specific
BARs.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/pci.c | 22 ++++++++++++++++++++++
drivers/pci/pci.h | 2 ++
drivers/pci/setup-res.c | 13 +++++--------
3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a9b554e..b02167a 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1924,6 +1924,28 @@ int pci_resource_alignment(struct pci_dev *dev, int resno)
return 0;
}

+/**
+ * pci_resource_bar - get position of the BAR associated with a resource
+ * @dev: the PCI device
+ * @resno: the resource number
+ * @type: the BAR type to be filled in
+ *
+ * Returns BAR position in config space, or 0 if the BAR is invalid.
+ */
+int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type)
+{
+ if (resno < PCI_ROM_RESOURCE) {
+ *type = pci_bar_unknown;
+ return PCI_BASE_ADDRESS_0 + 4 * resno;
+ } else if (resno == PCI_ROM_RESOURCE) {
+ *type = pci_bar_mem32;
+ return dev->rom_base_reg;
+ }
+
+ dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno);
+ return 0;
+}
+
static void __devinit pci_no_domains(void)
{
#ifdef CONFIG_PCI_DOMAINS
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index baa3d23..d707477 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -167,6 +167,8 @@ enum pci_bar_type {
extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
struct resource *res, unsigned int reg);
extern int pci_resource_alignment(struct pci_dev *dev, int resno);
+extern int pci_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type);
extern void pci_enable_ari(struct pci_dev *dev);
/**
* pci_ari_enabled - query ARI forwarding status
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index 88a9c70..5812f4b 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -31,6 +31,7 @@ void pci_update_resource(struct pci_dev *dev, int resno)
struct pci_bus_region region;
u32 new, check, mask;
int reg;
+ enum pci_bar_type type;
struct resource *res = dev->resource + resno;

/*
@@ -62,17 +63,13 @@ void pci_update_resource(struct pci_dev *dev, int resno)
else
mask = (u32)PCI_BASE_ADDRESS_MEM_MASK;

- if (resno < 6) {
- reg = PCI_BASE_ADDRESS_0 + 4 * resno;
- } else if (resno == PCI_ROM_RESOURCE) {
+ reg = pci_resource_bar(dev, resno, &type);
+ if (!reg)
+ return;
+ if (type != pci_bar_unknown) {
if (!(res->flags & IORESOURCE_ROM_ENABLE))
return;
new |= PCI_ROM_ADDRESS_ENABLE;
- reg = dev->rom_base_reg;
- } else {
- /* Hmm, non-standard resource. */
-
- return; /* kill uninitialised var warning */
}

pci_write_config_dword(dev, reg, new);
--
1.5.6.4

2008-10-22 09:40:03

by Zhao, Yu

Subject: [PATCH 8/16 v6] PCI: add boot options to reassign resources

This patch adds boot options so the user can reassign the resources of
all devices under a bus.

The boot options can be used as:
pci=assign-mmio=0000:01,assign-pio=0000:02
where '[dddd:]bb' is the domain and bus number.
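
Judging from the ';'-separated parsing in
pcibios_bus_resource_needs_fixup() below, more than one bus can be given
per option, for example (hypothetical addresses):

	pci=assign-mmio=0000:01;0000:02,assign-pio=0000:03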

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
arch/x86/pci/common.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/pci/i386.c | 10 ++++---
arch/x86/pci/pci.h | 3 ++
3 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index b67732b..06e1ce0 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -137,6 +137,72 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev)
}
}

+static char *pci_assign_pio;
+static char *pci_assign_mmio;
+
+static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus)
+{
+ int i;
+ int type = 0;
+ int domain, busnr;
+
+ if (!bus->self)
+ return 0;
+
+ for (i = 0; i < 2; i++) {
+ char *str = i ? pci_assign_pio : pci_assign_mmio;
+
+ while (str && *str) {
+ if (sscanf(str, "%04x:%02x", &domain, &busnr) != 2) {
+ if (sscanf(str, "%02x", &busnr) != 1)
+ break;
+ domain = 0;
+ }
+
+ if (pci_domain_nr(bus) == domain &&
+ bus->number == busnr) {
+ type |= i ? IORESOURCE_IO : IORESOURCE_MEM;
+ break;
+ }
+
+ str = strchr(str, ';');
+ if (str)
+ str++;
+ }
+ }
+
+ return type;
+}
+
+static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus)
+{
+ int i;
+ int type = pcibios_bus_resource_needs_fixup(bus);
+
+ if (!type)
+ return;
+
+ for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) {
+ struct resource *res = bus->resource[i];
+
+ if (!res)
+ continue;
+ if (res->flags & type)
+ res->flags = 0;
+ }
+}
+
+int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno)
+{
+ struct pci_bus *bus;
+
+ for (bus = dev->bus; bus && bus != pci_root_bus; bus = bus->parent)
+ if (pcibios_bus_resource_needs_fixup(bus))
+ return 1;
+
+ return 0;
+}
+
/*
* Called after each bus is probed, but before its children
* are examined.
@@ -147,6 +213,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *b)
struct pci_dev *dev;

pci_read_bridge_bases(b);
+ pcibios_fixup_bus_resources(b);
list_for_each_entry(dev, &b->devices, bus_list)
pcibios_fixup_device_resources(dev);
}
@@ -519,6 +586,12 @@ char * __devinit pcibios_setup(char *str)
} else if (!strcmp(str, "skip_isa_align")) {
pci_probe |= PCI_CAN_SKIP_ISA_ALIGN;
return NULL;
+ } else if (!strncmp(str, "assign-pio=", 11)) {
+ pci_assign_pio = str + 11;
+ return NULL;
+ } else if (!strncmp(str, "assign-mmio=", 12)) {
+ pci_assign_mmio = str + 12;
+ return NULL;
}
return str;
}
diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 8729bde..ea82a5b 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -169,10 +169,12 @@ static void __init pcibios_allocate_resources(int pass)
(unsigned long long) r->start,
(unsigned long long) r->end,
r->flags, enabled, pass);
- pr = pci_find_parent_resource(dev, r);
- if (pr && !request_resource(pr, r))
- continue;
- dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx);
+ if (!pcibios_resource_needs_fixup(dev, idx)) {
+ pr = pci_find_parent_resource(dev, r);
+ if (pr && !request_resource(pr, r))
+ continue;
+ dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx);
+ }
/* We'll assign a new address later */
r->end -= r->start;
r->start = 0;
diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h
index 15b9cf6..f22737d 100644
--- a/arch/x86/pci/pci.h
+++ b/arch/x86/pci/pci.h
@@ -117,6 +117,9 @@ extern int __init pcibios_init(void);
extern int __init pci_mmcfg_arch_init(void);
extern void __init pci_mmcfg_arch_free(void);

+/* pci-common.c */
+extern int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno);
+
/*
* AMD Fam10h CPUs are buggy, and cannot access MMIO config space
* on their northbrige except through the * %eax register. As such, you MUST
--
1.5.6.4

2008-10-22 09:39:42

by Zhao, Yu

Subject: [PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources()

This cleanup makes pcibios_allocate_resources() easier to read.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
arch/x86/pci/i386.c | 28 ++++++++++++++--------------
1 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
index 844df0c..8729bde 100644
--- a/arch/x86/pci/i386.c
+++ b/arch/x86/pci/i386.c
@@ -147,7 +147,7 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
static void __init pcibios_allocate_resources(int pass)
{
struct pci_dev *dev = NULL;
- int idx, disabled;
+ int idx, enabled;
u16 command;
struct resource *r, *pr;

@@ -160,22 +160,22 @@ static void __init pcibios_allocate_resources(int pass)
if (!r->start) /* Address not assigned at all */
continue;
if (r->flags & IORESOURCE_IO)
- disabled = !(command & PCI_COMMAND_IO);
+ enabled = command & PCI_COMMAND_IO;
else
- disabled = !(command & PCI_COMMAND_MEMORY);
- if (pass == disabled) {
- dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n",
+ enabled = command & PCI_COMMAND_MEMORY;
+ if (pass == enabled)
+ continue;
+ dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n",
(unsigned long long) r->start,
(unsigned long long) r->end,
- r->flags, disabled, pass);
- pr = pci_find_parent_resource(dev, r);
- if (!pr || request_resource(pr, r) < 0) {
- dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx);
- /* We'll assign a new address later */
- r->end -= r->start;
- r->start = 0;
- }
- }
+ r->flags, enabled, pass);
+ pr = pci_find_parent_resource(dev, r);
+ if (pr && !request_resource(pr, r))
+ continue;
+ dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx);
+ /* We'll assign a new address later */
+ r->end -= r->start;
+ r->start = 0;
}
if (!pass) {
r = &dev->resource[PCI_ROM_RESOURCE];
--
1.5.6.4

2008-10-22 09:40:42

by Zhao, Yu

Subject: [PATCH 9/16 v6] PCI: add boot option to align MMIO resources

This patch adds a boot option to align the MMIO resources of a device.
The alignment used is the larger of PAGE_SIZE and the resource size.

The boot option can be used as:
pci=align-mmio=0000:01:02.3
where '[0000:]01:02.3' is the domain, bus, device and function number
of the device.
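
Based on the ';'-separated parsing in pcibios_resource_alignment()
below, several devices can be listed in one option, for example
(hypothetical addresses):

	pci=align-mmio=0000:01:02.3;0000:04:00.0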

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
arch/x86/pci/common.c | 37 +++++++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 20 ++++++++++++++++++--
include/linux/pci.h | 1 +
3 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 06e1ce0..3c5d230 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -139,6 +139,7 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev)

static char *pci_assign_pio;
static char *pci_assign_mmio;
+static char *pci_align_mmio;

static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus)
{
@@ -192,6 +193,36 @@ static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus)
}
}

+int pcibios_resource_alignment(struct pci_dev *dev, int resno)
+{
+ int domain, busnr, slot, func;
+ char *str = pci_align_mmio;
+
+ if (dev->resource[resno].flags & IORESOURCE_IO)
+ return 0;
+
+ while (str && *str) {
+ if (sscanf(str, "%04x:%02x:%02x.%d",
+ &domain, &busnr, &slot, &func) != 4) {
+ if (sscanf(str, "%02x:%02x.%d",
+ &busnr, &slot, &func) != 3)
+ break;
+ domain = 0;
+ }
+
+ if (pci_domain_nr(dev->bus) == domain &&
+ dev->bus->number == busnr &&
+ dev->devfn == PCI_DEVFN(slot, func))
+ return PAGE_SIZE;
+
+ str = strchr(str, ';');
+ if (str)
+ str++;
+ }
+
+ return 0;
+}
+
int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno)
{
struct pci_bus *bus;
@@ -200,6 +231,9 @@ int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno)
if (pcibios_bus_resource_needs_fixup(bus))
return 1;

+ if (pcibios_resource_alignment(dev, resno))
+ return 1;
+
return 0;
}

@@ -592,6 +626,9 @@ char * __devinit pcibios_setup(char *str)
} else if (!strncmp(str, "assign-mmio=", 12)) {
pci_assign_mmio = str + 12;
return NULL;
+ } else if (!strncmp(str, "align-mmio=", 11)) {
+ pci_align_mmio = str + 11;
+ return NULL;
}
return str;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b02167a..11ecd6f 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1015,6 +1015,20 @@ int __attribute__ ((weak)) pcibios_set_pcie_reset_state(struct pci_dev *dev,
}

/**
+ * pcibios_resource_alignment - get resource alignment requirement
+ * @dev: the PCI device
+ * @resno: resource number
+ *
+ * Queries the resource alignment from PCI low level code. Returns positive
+ * if there is alignment requirement of the resource, or 0 otherwise.
+ */
+int __attribute__ ((weak)) pcibios_resource_alignment(struct pci_dev *dev,
+ int resno)
+{
+ return 0;
+}
+
+/**
* pci_set_pcie_reset_state - set reset state for device dev
* @dev: the PCI-E device reset
* @state: Reset state to enter into
@@ -1913,12 +1927,14 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags)
*/
int pci_resource_alignment(struct pci_dev *dev, int resno)
{
- resource_size_t align;
+ resource_size_t align, bios_align;
struct resource *res = dev->resource + resno;

+ bios_align = pcibios_resource_alignment(dev, resno);
+
align = resource_alignment(res);
if (align)
- return align;
+ return align > bios_align ? align : bios_align;

dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno);
return 0;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2ada2b6..6ac69af 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1121,6 +1121,7 @@ int pcibios_add_platform_entries(struct pci_dev *dev);
void pcibios_disable_device(struct pci_dev *dev);
int pcibios_set_pcie_reset_state(struct pci_dev *dev,
enum pcie_reset_state state);
+int pcibios_resource_alignment(struct pci_dev *dev, int resno);

#ifdef CONFIG_PCI_MMCONFIG
extern void __init pci_mmcfg_early_init(void);
--
1.5.6.4

2008-10-22 09:41:26

by Zhao, Yu

Subject: [PATCH 10/16 v6] PCI: cleanup pci_bus_add_devices()

This cleanup makes pci_bus_add_devices() easier to read.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/bus.c | 56 +++++++++++++++++++++++++------------------------
drivers/pci/remove.c | 2 +
2 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 999cc40..7a21602 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -71,7 +71,7 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
}

/**
- * add a single device
+ * pci_bus_add_device - add a single device
* @dev: device to add
*
* This adds a single pci device to the global
@@ -105,7 +105,7 @@ int pci_bus_add_device(struct pci_dev *dev)
void pci_bus_add_devices(struct pci_bus *bus)
{
struct pci_dev *dev;
- struct pci_bus *child_bus;
+ struct pci_bus *child;
int retval;

list_for_each_entry(dev, &bus->devices, bus_list) {
@@ -120,39 +120,41 @@ void pci_bus_add_devices(struct pci_bus *bus)
list_for_each_entry(dev, &bus->devices, bus_list) {
BUG_ON(!dev->is_added);

+ child = dev->subordinate;
/*
* If there is an unattached subordinate bus, attach
* it and then scan for unattached PCI devices.
*/
- if (dev->subordinate) {
- if (list_empty(&dev->subordinate->node)) {
- down_write(&pci_bus_sem);
- list_add_tail(&dev->subordinate->node,
- &dev->bus->children);
- up_write(&pci_bus_sem);
- }
- pci_bus_add_devices(dev->subordinate);
-
- /* register the bus with sysfs as the parent is now
- * properly registered. */
- child_bus = dev->subordinate;
- if (child_bus->is_added)
- continue;
- child_bus->dev.parent = child_bus->bridge;
- retval = device_register(&child_bus->dev);
- if (retval)
- dev_err(&dev->dev, "Error registering pci_bus,"
- " continuing...\n");
- else {
- child_bus->is_added = 1;
- retval = device_create_file(&child_bus->dev,
- &dev_attr_cpuaffinity);
- }
+ if (!child)
+ continue;
+ if (list_empty(&child->node)) {
+ down_write(&pci_bus_sem);
+ list_add_tail(&child->node,
+ &dev->bus->children);
+ up_write(&pci_bus_sem);
+ }
+ pci_bus_add_devices(child);
+
+ /*
+ * register the bus with sysfs as the parent is now
+ * properly registered.
+ */
+ if (child->is_added)
+ continue;
+ child->dev.parent = child->bridge;
+ retval = device_register(&child->dev);
+ if (retval)
+ dev_err(&dev->dev, "Error registering pci_bus,"
+ " continuing...\n");
+ else {
+ child->is_added = 1;
+ retval = device_create_file(&child->dev,
+ &dev_attr_cpuaffinity);
if (retval)
dev_err(&dev->dev, "Error creating cpuaffinity"
" file, continuing...\n");

- retval = device_create_file(&child_bus->dev,
+ retval = device_create_file(&child->dev,
&dev_attr_cpulistaffinity);
if (retval)
dev_err(&dev->dev,
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 042e089..bfa0869 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -72,6 +72,8 @@ void pci_remove_bus(struct pci_bus *pci_bus)
list_del(&pci_bus->node);
up_write(&pci_bus_sem);
pci_remove_legacy_files(pci_bus);
+ if (!pci_bus->is_added)
+ return;
device_remove_file(&pci_bus->dev, &dev_attr_cpuaffinity);
device_remove_file(&pci_bus->dev, &dev_attr_cpulistaffinity);
device_unregister(&pci_bus->dev);
--
1.5.6.4

2008-10-22 09:42:47

by Zhao, Yu

Subject: [PATCH 12/16 v6] PCI: support the SR-IOV capability

Support the Single Root I/O Virtualization (SR-IOV) capability.
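
Once a Physical Function driver has registered a callback with
pci_iov_register(), the sysfs attributes added below (the "iov" group
under the device) can be used to configure and enable Virtual Functions.
A hypothetical session, with a made-up device address, might look like:

	# echo 2 > /sys/bus/pci/devices/0000:01:00.0/iov/numvfs
	# echo 1 > /sys/bus/pci/devices/0000:01:00.0/iov/enable
	# echo 0 > /sys/bus/pci/devices/0000:01:00.0/iov/enable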

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/Kconfig | 12 +
drivers/pci/Makefile | 2 +
drivers/pci/iov.c | 592 ++++++++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci-sysfs.c | 4 +
drivers/pci/pci.c | 14 +
drivers/pci/pci.h | 48 ++++
drivers/pci/probe.c | 4 +
include/linux/pci.h | 39 +++
include/linux/pci_regs.h | 21 ++
9 files changed, 736 insertions(+), 0 deletions(-)
create mode 100644 drivers/pci/iov.c

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index e1ca425..e7c0836 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -50,3 +50,15 @@ config HT_IRQ
This allows native hypertransport devices to use interrupts.

If unsure say Y.
+
+config PCI_IOV
+ bool "PCI SR-IOV support"
+ depends on PCI
+ select PCI_MSI
+ default n
+ help
+ This option allows device drivers to enable Single Root I/O
+ Virtualization. Each Virtual Function's PCI configuration
+ space can be accessed using its own Bus, Device and Function
+ Number (Routing ID). Each Virtual Function also has PCI Memory
+ Space, which is used to map its own register set.
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 4b47f4e..abbfcfa 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -55,3 +55,5 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
ifeq ($(CONFIG_PCI_DEBUG),y)
EXTRA_CFLAGS += -DDEBUG
endif
+
+obj-$(CONFIG_PCI_IOV) += iov.o
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
new file mode 100644
index 0000000..dd299aa
--- /dev/null
+++ b/drivers/pci/iov.c
@@ -0,0 +1,592 @@
+/*
+ * drivers/pci/iov.c
+ *
+ * Copyright (C) 2008 Intel Corporation
+ *
+ * PCI Express Single Root I/O Virtualization capability support.
+ */
+
+#include <linux/ctype.h>
+#include <linux/string.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <asm/page.h>
+#include "pci.h"
+
+
+#define iov_config_attr(field) \
+static ssize_t field##_show(struct device *dev, \
+ struct device_attribute *attr, char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sprintf(buf, "%d\n", pdev->iov->field); \
+}
+
+iov_config_attr(status);
+iov_config_attr(totalvfs);
+iov_config_attr(initialvfs);
+iov_config_attr(numvfs);
+
+static inline void vf_rid(struct pci_dev *dev, int vfn, u8 *busnr, u8 *devfn)
+{
+ u16 rid;
+
+ rid = (dev->bus->number << 8) + dev->devfn +
+ dev->iov->offset + dev->iov->stride * vfn;
+ *busnr = rid >> 8;
+ *devfn = rid & 0xff;
+}
+
+static int vf_add(struct pci_dev *dev, int vfn)
+{
+ int i;
+ int rc;
+ u8 busnr, devfn;
+ struct pci_dev *vf;
+ struct pci_bus *bus;
+ struct resource *res;
+ resource_size_t size;
+
+ vf_rid(dev, vfn, &busnr, &devfn);
+
+ vf = alloc_pci_dev();
+ if (!vf)
+ return -ENOMEM;
+
+ if (dev->bus->number == busnr)
+ vf->bus = bus = dev->bus;
+ else {
+ list_for_each_entry(bus, &dev->bus->children, node)
+ if (bus->number == busnr) {
+ vf->bus = bus;
+ break;
+ }
+ BUG_ON(!vf->bus);
+ }
+
+ vf->sysdata = bus->sysdata;
+ vf->dev.parent = dev->dev.parent;
+ vf->dev.bus = dev->dev.bus;
+ vf->devfn = devfn;
+ vf->hdr_type = PCI_HEADER_TYPE_NORMAL;
+ vf->multifunction = 0;
+ vf->vendor = dev->vendor;
+ pci_read_config_word(dev, dev->iov->cap + PCI_IOV_VF_DID, &vf->device);
+ vf->cfg_size = PCI_CFG_SPACE_EXP_SIZE;
+ vf->error_state = pci_channel_io_normal;
+ vf->is_pcie = 1;
+ vf->pcie_type = PCI_EXP_TYPE_ENDPOINT;
+ vf->dma_mask = 0xffffffff;
+
+ dev_set_name(&vf->dev, "%04x:%02x:%02x.%d", pci_domain_nr(bus),
+ busnr, PCI_SLOT(devfn), PCI_FUNC(devfn));
+
+ pci_read_config_byte(vf, PCI_REVISION_ID, &vf->revision);
+ vf->class = dev->class;
+ vf->current_state = PCI_UNKNOWN;
+ vf->irq = 0;
+
+ for (i = 0; i < PCI_IOV_NUM_BAR; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ if (!res->parent)
+ continue;
+ vf->resource[i].name = pci_name(vf);
+ vf->resource[i].flags = res->flags;
+ size = resource_size(res);
+ do_div(size, dev->iov->totalvfs);
+ vf->resource[i].start = res->start + size * vfn;
+ vf->resource[i].end = vf->resource[i].start + size - 1;
+ rc = request_resource(res, &vf->resource[i]);
+ BUG_ON(rc);
+ }
+
+ vf->subsystem_vendor = dev->subsystem_vendor;
+ pci_read_config_word(vf, PCI_SUBSYSTEM_ID, &vf->subsystem_device);
+
+ pci_device_add(vf, bus);
+ return pci_bus_add_device(vf);
+}
+
+static void vf_remove(struct pci_dev *dev, int vfn)
+{
+ u8 busnr, devfn;
+ struct pci_dev *vf;
+
+ vf_rid(dev, vfn, &busnr, &devfn);
+
+ vf = pci_get_bus_and_slot(busnr, devfn);
+ if (!vf)
+ return;
+
+ pci_dev_put(vf);
+ pci_remove_bus_device(vf);
+}
+
+static int iov_enable(struct pci_dev *dev)
+{
+ int rc;
+ int i, j;
+ u16 ctrl;
+ struct pci_iov *iov = dev->iov;
+
+ if (!iov->callback)
+ return -ENODEV;
+
+ if (!iov->numvfs)
+ return -EINVAL;
+
+ if (iov->status)
+ return 0;
+
+ rc = iov->callback(dev, PCI_IOV_ENABLE);
+ if (rc)
+ return rc;
+
+ pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl);
+ ctrl |= (PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE);
+ pci_write_config_word(dev, iov->cap + PCI_IOV_CTRL, ctrl);
+ ssleep(1);
+
+ for (i = 0; i < iov->numvfs; i++) {
+ rc = vf_add(dev, i);
+ if (rc)
+ goto failed;
+ }
+
+ iov->status = 1;
+ return 0;
+
+failed:
+ for (j = 0; j < i; j++)
+ vf_remove(dev, j);
+
+ pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl);
+ ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE);
+ pci_write_config_word(dev, iov->cap + PCI_IOV_CTRL, ctrl);
+ ssleep(1);
+
+ return rc;
+}
+
+static int iov_disable(struct pci_dev *dev)
+{
+ int i;
+ int rc;
+ u16 ctrl;
+ struct pci_iov *iov = dev->iov;
+
+ if (!iov->callback)
+ return -ENODEV;
+
+ if (!iov->status)
+ return 0;
+
+ rc = iov->callback(dev, PCI_IOV_DISABLE);
+ if (rc)
+ return rc;
+
+ for (i = 0; i < iov->numvfs; i++)
+ vf_remove(dev, i);
+
+ pci_read_config_word(dev, iov->cap + PCI_IOV_CTRL, &ctrl);
+ ctrl &= ~(PCI_IOV_CTRL_VFE | PCI_IOV_CTRL_MSE);
+ pci_write_config_word(dev, iov->cap + PCI_IOV_CTRL, ctrl);
+ ssleep(1);
+
+ iov->status = 0;
+ return 0;
+}
+
+static int iov_set_numvfs(struct pci_dev *dev, int numvfs)
+{
+ int rc;
+ u16 offset, stride;
+ struct pci_iov *iov = dev->iov;
+
+ if (!iov->callback)
+ return -ENODEV;
+
+ if (numvfs < 0 || numvfs > iov->initialvfs || iov->status)
+ return -EINVAL;
+
+ if (numvfs == iov->numvfs)
+ return 0;
+
+ rc = iov->callback(dev, PCI_IOV_NUMVFS | iov->numvfs);
+ if (rc)
+ return rc;
+
+ pci_write_config_word(dev, iov->cap + PCI_IOV_NUM_VF, numvfs);
+ pci_read_config_word(dev, iov->cap + PCI_IOV_VF_OFFSET, &offset);
+ pci_read_config_word(dev, iov->cap + PCI_IOV_VF_STRIDE, &stride);
+ if ((numvfs && !offset) || (numvfs > 1 && !stride))
+ return -EIO;
+
+ iov->offset = offset;
+ iov->stride = stride;
+ iov->numvfs = numvfs;
+ return 0;
+}
+
+static ssize_t status_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int rc;
+ long enable;
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ rc = strict_strtol(buf, 0, &enable);
+ if (rc)
+ return rc;
+
+ mutex_lock(&pdev->iov->ops_lock);
+ switch (enable) {
+ case 0:
+ rc = iov_disable(pdev);
+ break;
+ case 1:
+ rc = iov_enable(pdev);
+ break;
+ default:
+ rc = -EINVAL;
+ }
+ mutex_unlock(&pdev->iov->ops_lock);
+
+ return rc ? rc : count;
+}
+
+static ssize_t numvfs_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int rc;
+ long numvfs;
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ rc = strict_strtol(buf, 0, &numvfs);
+ if (rc)
+ return rc;
+
+ mutex_lock(&pdev->iov->ops_lock);
+ rc = iov_set_numvfs(pdev, numvfs);
+ mutex_unlock(&pdev->iov->ops_lock);
+
+ return rc ? rc : count;
+}
+
+static DEVICE_ATTR(totalvfs, S_IRUGO, totalvfs_show, NULL);
+static DEVICE_ATTR(initialvfs, S_IRUGO, initialvfs_show, NULL);
+static DEVICE_ATTR(numvfs, S_IWUSR | S_IRUGO, numvfs_show, numvfs_store);
+static DEVICE_ATTR(enable, S_IWUSR | S_IRUGO, status_show, status_store);
+
+static struct attribute *iov_attrs[] = {
+ &dev_attr_totalvfs.attr,
+ &dev_attr_initialvfs.attr,
+ &dev_attr_numvfs.attr,
+ &dev_attr_enable.attr,
+ NULL
+};
+
+static struct attribute_group iov_attr_group = {
+ .attrs = iov_attrs,
+ .name = "iov",
+};
+
+static int iov_alloc_bus(struct pci_bus *bus, int busnr)
+{
+ int i;
+ int rc;
+ struct pci_dev *dev;
+ struct pci_bus *child;
+
+ list_for_each_entry(dev, &bus->devices, bus_list)
+ if (dev->iov)
+ break;
+
+ BUG_ON(!dev->iov);
+ pci_dev_get(dev);
+ mutex_lock(&dev->iov->bus_lock);
+
+ for (i = bus->number + 1; i <= busnr; i++) {
+ list_for_each_entry(child, &bus->children, node)
+ if (child->number == i)
+ break;
+ if (child->number == i)
+ continue;
+ child = pci_add_new_bus(bus, NULL, i);
+ if (!child)
+ return -ENOMEM;
+
+ child->subordinate = i;
+ child->dev.parent = bus->bridge;
+ rc = pci_bus_add_child(child);
+ if (rc)
+ return rc;
+ }
+
+ mutex_unlock(&dev->iov->bus_lock);
+
+ return 0;
+}
+
+static void iov_release_bus(struct pci_bus *bus)
+{
+ struct pci_dev *dev, *tmp;
+ struct pci_bus *child, *next;
+
+ list_for_each_entry(dev, &bus->devices, bus_list)
+ if (dev->iov)
+ break;
+
+ BUG_ON(!dev->iov);
+ mutex_lock(&dev->iov->bus_lock);
+
+ list_for_each_entry(tmp, &bus->devices, bus_list)
+ if (tmp->iov && tmp->iov->callback)
+ goto done;
+
+ list_for_each_entry_safe(child, next, &bus->children, node)
+ if (!child->bridge)
+ pci_remove_bus(child);
+done:
+ mutex_unlock(&dev->iov->bus_lock);
+ pci_dev_put(dev);
+}
+
+/**
+ * pci_iov_init - initialize device's SR-IOV capability
+ * @dev: the PCI device
+ *
+ * Returns 0 on success, or negative on failure.
+ *
+ * The major differences between Virtual Function and PCI device are:
+ * 1) the device with multiple bus numbers uses internal routing, so
+ * there is no explicit bridge device in this case.
+ * 2) Virtual Function memory spaces are designated by BARs encapsulated
+ * in the capability structure, and the BARs in Virtual Function PCI
+ * configuration space are read-only zero.
+ */
+int pci_iov_init(struct pci_dev *dev)
+{
+ int i;
+ int pos;
+ u32 pgsz;
+ u16 ctrl, total, initial, offset, stride;
+ struct pci_iov *iov;
+ struct resource *res;
+
+ if (!dev->is_pcie || (dev->pcie_type != PCI_EXP_TYPE_RC_END &&
+ dev->pcie_type != PCI_EXP_TYPE_ENDPOINT))
+ return -ENODEV;
+
+ pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_IOV);
+ if (!pos)
+ return -ENODEV;
+
+ ctrl = pci_ari_enabled(dev) ? PCI_IOV_CTRL_ARI : 0;
+ pci_write_config_word(dev, pos + PCI_IOV_CTRL, ctrl);
+ ssleep(1);
+
+ pci_read_config_word(dev, pos + PCI_IOV_TOTAL_VF, &total);
+ pci_read_config_word(dev, pos + PCI_IOV_INITIAL_VF, &initial);
+ pci_write_config_word(dev, pos + PCI_IOV_NUM_VF, initial);
+ pci_read_config_word(dev, pos + PCI_IOV_VF_OFFSET, &offset);
+ pci_read_config_word(dev, pos + PCI_IOV_VF_STRIDE, &stride);
+ if (!total || initial > total || (initial && !offset) ||
+ (initial > 1 && !stride))
+ return -EIO;
+
+ pci_read_config_dword(dev, pos + PCI_IOV_SUP_PGSIZE, &pgsz);
+ i = PAGE_SHIFT > 12 ? PAGE_SHIFT - 12 : 0;
+ pgsz &= ~((1 << i) - 1);
+ if (!pgsz)
+ return -EIO;
+
+ pgsz &= ~(pgsz - 1);
+ pci_write_config_dword(dev, pos + PCI_IOV_SYS_PGSIZE, pgsz);
+
+ iov = kzalloc(sizeof(*iov), GFP_KERNEL);
+ if (!iov)
+ return -ENOMEM;
+
+ iov->cap = pos;
+ iov->totalvfs = total;
+ iov->initialvfs = initial;
+ iov->offset = offset;
+ iov->stride = stride;
+ iov->align = pgsz << 12;
+
+ for (i = 0; i < PCI_IOV_NUM_BAR; i++) {
+ res = dev->resource + PCI_IOV_RESOURCES + i;
+ pos = iov->cap + PCI_IOV_BAR_0 + i * 4;
+ i += __pci_read_base(dev, pci_bar_unknown, res, pos);
+ if (!res->flags)
+ continue;
+ res->flags &= ~IORESOURCE_SIZEALIGN;
+ res->end = res->start + resource_size(res) * total - 1;
+ }
+
+ mutex_init(&iov->ops_lock);
+ mutex_init(&iov->bus_lock);
+
+ dev->iov = iov;
+
+ return 0;
+}
+
+/**
+ * pci_iov_release - release resources used by SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_iov_release(struct pci_dev *dev)
+{
+ if (!dev->iov)
+ return;
+
+ mutex_destroy(&dev->iov->ops_lock);
+ mutex_destroy(&dev->iov->bus_lock);
+ kfree(dev->iov);
+ dev->iov = NULL;
+}
+
+/**
+ * pci_iov_create_sysfs - create sysfs for SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_iov_create_sysfs(struct pci_dev *dev)
+{
+ if (!dev->iov)
+ return;
+
+ sysfs_create_group(&dev->dev.kobj, &iov_attr_group);
+}
+
+/**
+ * pci_iov_remove_sysfs - remove sysfs of SR-IOV capability
+ * @dev: the PCI device
+ */
+void pci_iov_remove_sysfs(struct pci_dev *dev)
+{
+ if (!dev->iov)
+ return;
+
+ sysfs_remove_group(&dev->dev.kobj, &iov_attr_group);
+}
+
+int pci_iov_resource_align(struct pci_dev *dev, int resno)
+{
+ if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END)
+ return 0;
+
+ BUG_ON(!dev->iov);
+
+ return dev->iov->align;
+}
+
+int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type)
+{
+ if (resno < PCI_IOV_RESOURCES || resno > PCI_IOV_RESOURCES_END)
+ return 0;
+
+ BUG_ON(!dev->iov);
+
+ *type = pci_bar_unknown;
+ return dev->iov->cap + PCI_IOV_BAR_0 +
+ 4 * (resno - PCI_IOV_RESOURCES);
+}
+
+/**
+ * pci_iov_register - register SR-IOV service
+ * @dev: the PCI device
+ * @callback: callback function for SR-IOV events
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_iov_register(struct pci_dev *dev,
+ int (*callback)(struct pci_dev *, u32))
+{
+ u8 busnr, devfn;
+ struct pci_iov *iov = dev->iov;
+
+ if (!iov)
+ return -ENODEV;
+
+ if (!callback || iov->callback)
+ return -EINVAL;
+
+ vf_rid(dev, iov->totalvfs - 1, &busnr, &devfn);
+ if (busnr > dev->bus->subordinate)
+ return -EIO;
+
+ iov->callback = callback;
+ return iov_alloc_bus(dev->bus, busnr);
+}
+EXPORT_SYMBOL_GPL(pci_iov_register);
+
+/**
+ * pci_iov_unregister - unregister SR-IOV service
+ * @dev: the PCI device
+ */
+void pci_iov_unregister(struct pci_dev *dev)
+{
+ struct pci_iov *iov = dev->iov;
+
+ if (!iov || !iov->callback)
+ return;
+
+ iov->callback = NULL;
+ iov_release_bus(dev->bus);
+}
+EXPORT_SYMBOL_GPL(pci_iov_unregister);
+
+/**
+ * pci_iov_enable - enable SR-IOV capability
+ * @dev: the PCI device
+ * @numvfs: number of VFs to be available
+ *
+ * Returns 0 on success, or negative on failure.
+ */
+int pci_iov_enable(struct pci_dev *dev, int numvfs)
+{
+ int rc;
+ struct pci_iov *iov = dev->iov;
+
+ if (!iov)
+ return -ENODEV;
+
+ if (!iov->callback)
+ return -EINVAL;
+
+ mutex_lock(&iov->ops_lock);
+ rc = iov_set_numvfs(dev, numvfs);
+ if (rc)
+ goto done;
+ rc = iov_enable(dev);
+done:
+ mutex_unlock(&iov->ops_lock);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(pci_iov_enable);
+
+/**
+ * pci_iov_disable - disable SR-IOV capability
+ * @dev: the PCI device
+ *
+ * Should be called upon Physical Function driver removal, and power
+ * state change. All previous allocated Virtual Functions are reclaimed.
+ */
+void pci_iov_disable(struct pci_dev *dev)
+{
+ struct pci_iov *iov = dev->iov;
+
+ if (!iov || !iov->callback)
+ return;
+
+ mutex_lock(&iov->ops_lock);
+ iov_disable(dev);
+ mutex_unlock(&iov->ops_lock);
+}
+EXPORT_SYMBOL_GPL(pci_iov_disable);
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 5c456ab..18881f2 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -847,6 +847,9 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev)
/* Active State Power Management */
pcie_aspm_create_sysfs_dev_files(dev);

+ /* Single Root I/O Virtualization */
+ pci_iov_create_sysfs(dev);
+
return 0;
}

@@ -932,6 +935,7 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev)
}

pcie_aspm_remove_sysfs_dev_files(dev);
+ pci_iov_remove_sysfs(dev);
}

/**
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 11ecd6f..10a43b2 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1936,6 +1936,13 @@ int pci_resource_alignment(struct pci_dev *dev, int resno)
if (align)
return align > bios_align ? align : bios_align;

+ if (resno > PCI_ROM_RESOURCE && resno < PCI_BRIDGE_RESOURCES) {
+ /* device specific resource */
+ align = pci_iov_resource_align(dev, resno);
+ if (align)
+ return align > bios_align ? align : bios_align;
+ }
+
dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno);
return 0;
}
@@ -1950,12 +1957,19 @@ int pci_resource_alignment(struct pci_dev *dev, int resno)
*/
int pci_resource_bar(struct pci_dev *dev, int resno, enum pci_bar_type *type)
{
+ int reg;
+
if (resno < PCI_ROM_RESOURCE) {
*type = pci_bar_unknown;
return PCI_BASE_ADDRESS_0 + 4 * resno;
} else if (resno == PCI_ROM_RESOURCE) {
*type = pci_bar_mem32;
return dev->rom_base_reg;
+ } else if (resno < PCI_BRIDGE_RESOURCES) {
+ /* device specific resource */
+ reg = pci_iov_resource_bar(dev, resno, type);
+ if (reg)
+ return reg;
}

dev_err(&dev->dev, "BAR: invalid resource #%d\n", resno);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d707477..7735d92 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -181,4 +181,52 @@ static inline int pci_ari_enabled(struct pci_dev *dev)
return dev->ari_enabled;
}

+/* Single Root I/O Virtualization */
+struct pci_iov {
+ int cap; /* capability position */
+ int align; /* page size used to map memory space */
+ int status; /* status of SR-IOV */
+ u16 totalvfs; /* total VFs associated with the PF */
+ u16 initialvfs; /* initial VFs associated with the PF */
+ u16 numvfs; /* number of VFs available */
+ u16 offset; /* first VF Routing ID offset */
+ u16 stride; /* following VF stride */
+ struct mutex ops_lock; /* lock for SR-IOV operations */
+ struct mutex bus_lock; /* lock for VF bus */
+ int (*callback)(struct pci_dev *, u32); /* event callback function */
+};
+
+#ifdef CONFIG_PCI_IOV
+extern int pci_iov_init(struct pci_dev *dev);
+extern void pci_iov_release(struct pci_dev *dev);
+void pci_iov_create_sysfs(struct pci_dev *dev);
+void pci_iov_remove_sysfs(struct pci_dev *dev);
+extern int pci_iov_resource_align(struct pci_dev *dev, int resno);
+extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type);
+#else
+static inline int pci_iov_init(struct pci_dev *dev)
+{
+ return -EIO;
+}
+static inline void pci_iov_release(struct pci_dev *dev)
+{
+}
+static inline void pci_iov_create_sysfs(struct pci_dev *dev)
+{
+}
+static inline void pci_iov_remove_sysfs(struct pci_dev *dev)
+{
+}
+static inline int pci_iov_resource_align(struct pci_dev *dev, int resno)
+{
+ return 0;
+}
+static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
+ enum pci_bar_type *type)
+{
+ return 0;
+}
+#endif /* CONFIG_PCI_IOV */
+
#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4b12b58..18ce9c0 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -779,6 +779,7 @@ static int pci_setup_device(struct pci_dev * dev)
static void pci_release_capabilities(struct pci_dev *dev)
{
pci_vpd_release(dev);
+ pci_iov_release(dev);
}

/**
@@ -962,6 +963,9 @@ static void pci_init_capabilities(struct pci_dev *dev)

/* Alternative Routing-ID Forwarding */
pci_enable_ari(dev);
+
+ /* Single Root I/O Virtualization */
+ pci_iov_init(dev);
}

void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 80d88f8..77af7e0 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -87,6 +87,12 @@ enum {
/* #6: expansion ROM */
PCI_ROM_RESOURCE,

+ /* device specific resources */
+#ifdef CONFIG_PCI_IOV
+ PCI_IOV_RESOURCES,
+ PCI_IOV_RESOURCES_END = PCI_IOV_RESOURCES + PCI_IOV_NUM_BAR - 1,
+#endif
+
/* address space assigned to buses behind the bridge */
#ifndef PCI_BRIDGE_RES_NUM
#define PCI_BRIDGE_RES_NUM 4
@@ -165,6 +171,7 @@ struct pci_cap_saved_state {

struct pcie_link_state;
struct pci_vpd;
+struct pci_iov;

/*
* The pci_dev structure is used to describe PCI devices.
@@ -253,6 +260,7 @@ struct pci_dev {
struct list_head msi_list;
#endif
struct pci_vpd *vpd;
+ struct pci_iov *iov;
};

extern struct pci_dev *alloc_pci_dev(void);
@@ -1147,5 +1155,36 @@ static inline void * pci_ioremap_bar(struct pci_dev *pdev, int bar)
}
#endif

+/* SR-IOV events masks */
+#define PCI_IOV_NUM_VIRTFN 0x0000FFFFU /* NumVFs to be set */
+/* SR-IOV events values */
+#define PCI_IOV_ENABLE 0x00010000U /* SR-IOV enable request */
+#define PCI_IOV_DISABLE 0x00020000U /* SR-IOV disable request */
+#define PCI_IOV_NUMVFS 0x00040000U /* SR-IOV NumVFs change request */
+
+#ifdef CONFIG_PCI_IOV
+extern int pci_iov_enable(struct pci_dev *dev, int numvfs);
+extern void pci_iov_disable(struct pci_dev *dev);
+extern int pci_iov_register(struct pci_dev *dev,
+ int (*callback)(struct pci_dev *dev, u32 event));
+extern void pci_iov_unregister(struct pci_dev *dev);
+#else
+static inline int pci_iov_enable(struct pci_dev *dev, int numvfs)
+{
+ return -EIO;
+}
+static inline void pci_iov_disable(struct pci_dev *dev)
+{
+}
+static inline int pci_iov_register(struct pci_dev *dev,
+ int (*callback)(struct pci_dev *dev, u32 event))
+{
+ return -EIO;
+}
+static inline void pci_iov_unregister(struct pci_dev *dev)
+{
+}
+#endif /* CONFIG_PCI_IOV */
+
#endif /* __KERNEL__ */
#endif /* LINUX_PCI_H */
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index eb6686b..1b28b3f 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -363,6 +363,7 @@
#define PCI_EXP_TYPE_UPSTREAM 0x5 /* Upstream Port */
#define PCI_EXP_TYPE_DOWNSTREAM 0x6 /* Downstream Port */
#define PCI_EXP_TYPE_PCI_BRIDGE 0x7 /* PCI/PCI-X Bridge */
+#define PCI_EXP_TYPE_RC_END 0x9 /* Root Complex Integrated Endpoint */
#define PCI_EXP_FLAGS_SLOT 0x0100 /* Slot implemented */
#define PCI_EXP_FLAGS_IRQ 0x3e00 /* Interrupt message number */
#define PCI_EXP_DEVCAP 4 /* Device capabilities */
@@ -434,6 +435,7 @@
#define PCI_EXT_CAP_ID_DSN 3
#define PCI_EXT_CAP_ID_PWR 4
#define PCI_EXT_CAP_ID_ARI 14
+#define PCI_EXT_CAP_ID_IOV 16

/* Advanced Error Reporting */
#define PCI_ERR_UNCOR_STATUS 4 /* Uncorrectable Error Status */
@@ -551,4 +553,23 @@
#define PCI_ARI_CTRL_ACS 0x0002 /* ACS Function Groups Enable */
#define PCI_ARI_CTRL_FG(x) (((x) >> 4) & 7) /* Function Group */

+/* Single Root I/O Virtualization */
+#define PCI_IOV_CAP 0x04 /* SR-IOV Capabilities */
+#define PCI_IOV_CTRL 0x08 /* SR-IOV Control */
+#define PCI_IOV_CTRL_VFE 0x01 /* VF Enable */
+#define PCI_IOV_CTRL_MSE 0x08 /* VF Memory Space Enable */
+#define PCI_IOV_CTRL_ARI 0x10 /* ARI Capable Hierarchy */
+#define PCI_IOV_STATUS 0x0a /* SR-IOV Status */
+#define PCI_IOV_INITIAL_VF 0x0c /* Initial VFs */
+#define PCI_IOV_TOTAL_VF 0x0e /* Total VFs */
+#define PCI_IOV_NUM_VF 0x10 /* Number of VFs */
+#define PCI_IOV_FUNC_LINK 0x12 /* Function Dependency Link */
+#define PCI_IOV_VF_OFFSET 0x14 /* First VF Offset */
+#define PCI_IOV_VF_STRIDE 0x16 /* Following VF Stride */
+#define PCI_IOV_VF_DID 0x1a /* VF Device ID */
+#define PCI_IOV_SUP_PGSIZE 0x1c /* Supported Page Sizes */
+#define PCI_IOV_SYS_PGSIZE 0x20 /* System Page Size */
+#define PCI_IOV_BAR_0 0x24 /* VF BAR0 */
+#define PCI_IOV_NUM_BAR 6 /* Number of VF BARs */
+
#endif /* LINUX_PCI_REGS_H */
--
1.5.6.4
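
As a quick illustration of the register offsets added above, a PF driver could
locate the SR-IOV extended capability and read a few of its fields like this
(a minimal sketch using only standard config accessors, not code from this
series):

/*
 * Minimal sketch (not part of this series): locate the SR-IOV extended
 * capability and read TotalVFs, First VF Offset and VF Stride using the
 * register offsets defined in pci_regs.h above.
 */
static void sriov_dump_cap(struct pci_dev *dev)
{
	int pos;
	u16 total, offset, stride;

	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_IOV);
	if (!pos)
		return;	/* no SR-IOV capability on this device */

	pci_read_config_word(dev, pos + PCI_IOV_TOTAL_VF, &total);
	pci_read_config_word(dev, pos + PCI_IOV_VF_OFFSET, &offset);
	pci_read_config_word(dev, pos + PCI_IOV_VF_STRIDE, &stride);

	dev_info(&dev->dev, "SR-IOV: TotalVFs %d, VF offset %d, VF stride %d\n",
		 total, offset, stride);
}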

2008-10-22 09:42:27

by Zhao, Yu

[permalink] [raw]
Subject: [PATCH 11/16 v6] PCI: split a new function from pci_bus_add_devices()

This patch splits a new function from pci_bus_add_devices(). The new
function can be used to register a PCI bus with the device core and
create its sysfs entries.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/bus.c | 47 ++++++++++++++++++++++++++++-------------------
include/linux/pci.h | 1 +
2 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 7a21602..1713c35 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -91,6 +91,32 @@ int pci_bus_add_device(struct pci_dev *dev)
}

/**
+ * pci_bus_add_child - add a child bus
+ * @bus: bus to add
+ *
+ * This adds sysfs entries for a single bus
+ */
+int pci_bus_add_child(struct pci_bus *bus)
+{
+ int retval;
+
+ if (bus->bridge)
+ bus->dev.parent = bus->bridge;
+
+ retval = device_register(&bus->dev);
+ if (retval)
+ return retval;
+
+ bus->is_added = 1;
+
+ retval = device_create_file(&bus->dev, &dev_attr_cpuaffinity);
+ if (retval)
+ return retval;
+
+ return device_create_file(&bus->dev, &dev_attr_cpulistaffinity);
+}
+
+/**
* pci_bus_add_devices - insert newly discovered PCI devices
* @bus: bus to check for new devices
*
@@ -141,26 +167,9 @@ void pci_bus_add_devices(struct pci_bus *bus)
*/
if (child->is_added)
continue;
- child->dev.parent = child->bridge;
- retval = device_register(&child->dev);
+ retval = pci_bus_add_child(child);
if (retval)
- dev_err(&dev->dev, "Error registering pci_bus,"
- " continuing...\n");
- else {
- child->is_added = 1;
- retval = device_create_file(&child->dev,
- &dev_attr_cpuaffinity);
- if (retval)
- dev_err(&dev->dev, "Error creating cpuaffinity"
- " file, continuing...\n");
-
- retval = device_create_file(&child->dev,
- &dev_attr_cpulistaffinity);
- if (retval)
- dev_err(&dev->dev,
- "Error creating cpulistaffinity"
- " file, continuing...\n");
- }
+ dev_err(&dev->dev, "Error adding bus, continuing\n");
}
}

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6ac69af..80d88f8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -528,6 +528,7 @@ struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn);
void pci_device_add(struct pci_dev *dev, struct pci_bus *bus);
unsigned int pci_scan_child_bus(struct pci_bus *bus);
int __must_check pci_bus_add_device(struct pci_dev *dev);
+int pci_bus_add_child(struct pci_bus *bus);
void pci_read_bridge_bases(struct pci_bus *child);
struct resource *pci_find_parent_resource(const struct pci_dev *dev,
struct resource *res);
--
1.5.6.4
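
As a rough usage sketch (a hypothetical caller, not taken from this series),
the new helper pairs naturally with pci_add_new_bus():

/*
 * Hypothetical usage sketch: allocate a child bus behind 'bridge' and
 * register it with the device core via the new pci_bus_add_child().
 */
static struct pci_bus *add_and_register_child(struct pci_bus *parent,
					      struct pci_dev *bridge,
					      int busnr)
{
	struct pci_bus *child;

	child = pci_add_new_bus(parent, bridge, busnr);
	if (!child)
		return NULL;

	if (pci_bus_add_child(child))
		dev_err(&bridge->dev, "failed to register child bus\n");

	return child;
}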

2008-10-22 09:43:09

by Zhao, Yu

[permalink] [raw]
Subject: [PATCH 13/16 v6] PCI: reserve bus range for SR-IOV device

Reserve bus range for SR-IOV at device scanning stage.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
drivers/pci/iov.c | 24 ++++++++++++++++++++++++
drivers/pci/pci.h | 5 +++++
drivers/pci/probe.c | 3 +++
3 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index dd299aa..c86bd54 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -498,6 +498,30 @@ int pci_iov_resource_bar(struct pci_dev *dev, int resno,
}

/**
+ * pci_iov_bus_range - find bus range used by SR-IOV capability
+ * @bus: the PCI bus
+ *
+ * Returns the max number of buses (excluding the current one) used by
+ * Virtual Functions.
+ */
+int pci_iov_bus_range(struct pci_bus *bus)
+{
+ int max = 0;
+ u8 busnr, devfn;
+ struct pci_dev *dev;
+
+ list_for_each_entry(dev, &bus->devices, bus_list) {
+ if (!dev->iov)
+ continue;
+ vf_rid(dev, dev->iov->totalvfs - 1, &busnr, &devfn);
+ if (busnr > max)
+ max = busnr;
+ }
+
+ return max ? max - bus->number : 0;
+}
+
+/**
* pci_iov_register - register SR-IOV service
* @dev: the PCI device
* @callback: callback function for SR-IOV events
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 7735d92..5206ae7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -204,6 +204,7 @@ void pci_iov_remove_sysfs(struct pci_dev *dev);
extern int pci_iov_resource_align(struct pci_dev *dev, int resno);
extern int pci_iov_resource_bar(struct pci_dev *dev, int resno,
enum pci_bar_type *type);
+extern int pci_iov_bus_range(struct pci_bus *bus);
#else
static inline int pci_iov_init(struct pci_dev *dev)
{
@@ -227,6 +228,10 @@ static inline int pci_iov_resource_bar(struct pci_dev *dev, int resno,
{
return 0;
}
+static inline int pci_iov_bus_range(struct pci_bus *bus)
+{
+ return 0;
+}
#endif /* CONFIG_PCI_IOV */

#endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 18ce9c0..50a1380 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1068,6 +1068,9 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
for (devfn = 0; devfn < 0x100; devfn += 8)
pci_scan_slot(bus, devfn);

+ /* Reserve buses for SR-IOV capability. */
+ max += pci_iov_bus_range(bus);
+
/*
* After performing arch-dependent fixup of the bus, look behind
* all PCI-to-PCI bridges on this bus.
--
1.5.6.4
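
The vf_rid() helper used above is introduced earlier in the series;
conceptually it follows the SR-IOV spec's Routing ID arithmetic, roughly as
sketched below (illustrative only, not the actual implementation), which is
why a large First VF Offset or VF Stride can spill VFs onto buses beyond the
PF's bus and bus numbers must be reserved:

/*
 * Sketch of the Routing ID arithmetic behind vf_rid() (per the SR-IOV
 * spec, not necessarily the exact implementation in this series).
 * 'vfn' is zero-based; a Routing ID is (bus << 8) | devfn.
 */
static void vf_rid_sketch(struct pci_dev *dev, int vfn, u8 *busnr, u8 *devfn)
{
	u16 rid;

	rid = (dev->bus->number << 8) + dev->devfn +
	      dev->iov->offset + dev->iov->stride * vfn;

	*busnr = rid >> 8;	/* may exceed the PF's bus number */
	*devfn = rid & 0xff;
}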

2008-10-22 09:43:28

by Zhao, Yu

[permalink] [raw]
Subject: [PATCH 14/16 v6] PCI: document for SR-IOV user and developer

Create a HOWTO for SR-IOV users and driver developers.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
Documentation/DocBook/kernel-api.tmpl | 1 +
Documentation/PCI/pci-iov-howto.txt | 181 +++++++++++++++++++++++++++++++++
2 files changed, 182 insertions(+), 0 deletions(-)
create mode 100644 Documentation/PCI/pci-iov-howto.txt

diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl
index 9d0058e..9a15c50 100644
--- a/Documentation/DocBook/kernel-api.tmpl
+++ b/Documentation/DocBook/kernel-api.tmpl
@@ -251,6 +251,7 @@ X!Edrivers/pci/hotplug.c
-->
!Edrivers/pci/probe.c
!Edrivers/pci/rom.c
+!Edrivers/pci/iov.c
</sect1>
<sect1><title>PCI Hotplug Support Library</title>
!Edrivers/pci/hotplug/pci_hotplug_core.c
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt
new file mode 100644
index 0000000..5632723
--- /dev/null
+++ b/Documentation/PCI/pci-iov-howto.txt
@@ -0,0 +1,181 @@
+ PCI Express Single Root I/O Virtualization HOWTO
+ Copyright (C) 2008 Intel Corporation
+
+
+1. Overview
+
+1.1 What is SR-IOV
+
+Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
+capability which makes one physical device appear as multiple virtual
+devices. The physical device is referred to as the Physical Function,
+while the virtual devices are referred to as Virtual Functions. The
+allocation of Virtual Functions can be dynamically controlled by the
+Physical Function via registers encapsulated in the capability. By
+default, this feature is not enabled and the Physical Function behaves
+as a traditional PCIe device. Once it is turned on, each Virtual
+Function's PCI configuration space can be accessed using its own Bus,
+Device and Function number (Routing ID). Each Virtual Function also has
+its own PCI Memory Space, which is used to map its register set. The
+Virtual Function device driver operates on this register set so that the
+Virtual Function can be functional and appear as a real PCI device.
+
+2. User Guide
+
+2.1 How can I manage SR-IOV
+
+If a device supports SR-IOV, then there should be some entries under the
+Physical Function's PCI device directory. These entries are in directory:
+ - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/
+ (XXXX:BB:DD.F is the domain, bus, device and function number)
+
+To enable or disable SR-IOV:
+ - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/enable
+ (writing 1/0 enables/disables the VFs; the state change will
+ notify the PF driver)
+
+To change number of Virtual Functions:
+ - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/numvfs
+ (writing a positive integer to this file changes NumVFs)
+
+The total and initial numbers of VFs can be read from:
+ - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/totalvfs
+ - /sys/bus/pci/devices/XXXX:BB:DD.F/iov/initialvfs
+
+2.2 How can I use Virtual Functions
+
+Virtual Functions are treated as hot-plugged PCI devices in the kernel,
+so they should be able to work in the same way as real PCI devices.
+NOTE: a Virtual Function device driver must be loaded for the VF to work.
+
+
+3. Developer Guide
+
+3.1 SR-IOV APIs
+
+To register SR-IOV service, Physical Function device driver needs to call:
+ int pci_iov_register(struct pci_dev *dev,
+ int (*callback)(struct pci_dev *, u32))
+ The 'callback' is a callback function that the SR-IOV code will invoke
+ when events related to VFs happen (e.g. the user enables/disables VFs).
+ The first argument is the PF itself; the second argument carries the
+ event type and value. For now, the following event types are supported:
+ - PCI_IOV_ENABLE: SR-IOV enable request
+ - PCI_IOV_DISABLE: SR-IOV disable request
+ - PCI_IOV_NUMVFS: changing Number of VFs request
+ Event values can be extracted using the following masks:
+ - PCI_IOV_NUM_VIRTFN: Number of Virtual Functions
+
+To unregister SR-IOV service, Physical Function device driver needs to call:
+ void pci_iov_unregister(struct pci_dev *dev)
+
+To enable SR-IOV, Physical Function device driver needs to call:
+ int pci_iov_enable(struct pci_dev *dev, int numvfs)
+ 'numvfs' is the number of VFs that the PF wants to enable.
+
+To disable SR-IOV, Physical Function device driver needs to call:
+ void pci_iov_disable(struct pci_dev *dev)
+
+Note: the above two functions sleep for 1 second to wait for hardware
+transaction completion, as required by the SR-IOV specification.
+
+3.2 Usage example
+
+The following piece of code illustrates the usage of the APIs above.
+
+static int callback(struct pci_dev *dev, u32 event)
+{
+ int numvfs;
+
+ if (event & PCI_IOV_ENABLE) {
+ /*
+ * request to enable SR-IOV.
+ * Note: if the PF driver wants to support PM, it has
+ * to check the device power state here to see if this
+ * request is allowed or not.
+ */
+ ...
+
+ } else if (event & PCI_IOV_DISABLE) {
+ /*
+ * request to disable SR-IOV.
+ */
+ ...
+
+ } else if (event & PCI_IOV_NUMVFS) {
+ /*
+ * request to change NumVFs.
+ */
+ numvfs = event & PCI_IOV_NUM_VIRTFN;
+ ...
+
+ } else
+ return -EINVAL;
+
+ return 0;
+}
+
+static int __devinit dev_probe(struct pci_dev *dev,
+ const struct pci_device_id *id)
+{
+ int err;
+ int numvfs;
+
+ ...
+ err = pci_iov_register(dev, callback);
+ ...
+ err = pci_iov_enable(dev, numvfs);
+ ...
+
+ return err;
+}
+
+static void __devexit dev_remove(struct pci_dev *dev)
+{
+ ...
+ pci_iov_disable(dev);
+ ...
+ pci_iov_unregister(dev);
+ ...
+}
+
+#ifdef CONFIG_PM
+/*
+ * If Physical Function supports the power management, then the
+ * SR-IOV needs to be disabled before the adapter goes to sleep,
+ * because Virtual Functions will not work when the adapter is in
+ * the power-saving mode.
+ * The SR-IOV can be enabled again after the adapter wakes up.
+ */
+static int dev_suspend(struct pci_dev *dev, pm_message_t state)
+{
+ ...
+ pci_iov_disable(dev);
+ ...
+
+ return 0;
+}
+
+static int dev_resume(struct pci_dev *dev)
+{
+ int err;
+ int numvfs;
+
+ ...
+ err = pci_iov_enable(dev, numvfs);
+ ...
+
+ return 0;
+}
+#endif
+
+static struct pci_driver dev_driver = {
+ .name = "SR-IOV Physical Function driver",
+ .id_table = dev_id_table,
+ .probe = dev_probe,
+ .remove = __devexit_p(dev_remove),
+#ifdef CONFIG_PM
+ .suspend = dev_suspend,
+ .resume = dev_resume,
+#endif
+};
--
1.5.6.4
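
For comparison with the PF example above, the VF side needs no SR-IOV-specific
calls at all: a VF driver is an ordinary PCI driver that binds to the VF
Device ID reported in the SR-IOV capability. A minimal sketch (the device ID
below is made up):

/* Sketch of a VF driver: an ordinary PCI driver, nothing SR-IOV specific. */

static struct pci_device_id vf_id_table[] = {
	{ PCI_DEVICE(0x8086, 0xffff) },	/* hypothetical VF Device ID */
	{ 0, }
};

static int __devinit vf_probe(struct pci_dev *dev,
			      const struct pci_device_id *id)
{
	int err;

	err = pci_enable_device(dev);
	if (err)
		return err;

	/* map BARs, set up rings/queues, etc. as with any PCI device */
	return 0;
}

static void __devexit vf_remove(struct pci_dev *dev)
{
	pci_disable_device(dev);
}

static struct pci_driver vf_driver = {
	.name		= "SR-IOV Virtual Function driver",
	.id_table	= vf_id_table,
	.probe		= vf_probe,
	.remove		= __devexit_p(vf_remove),
};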

2008-10-22 09:43:45

by Zhao, Yu

[permalink] [raw]
Subject: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Document the new PCI[x86] boot parameters.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
Documentation/kernel-parameters.txt | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 53ba7c7..5482ae0 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file
cbmemsize=nn[KMG] The fixed amount of bus space which is
reserved for the CardBus bridge's memory
window. The default value is 64 megabytes.
+ assign-mmio=[dddd:]bb [X86] reassign memory resources of all
+ devices under bus [dddd:]bb (dddd is the domain
+ number and bb is the bus number).
+ assign-pio=[dddd:]bb [X86] reassign io port resources of all
+ devices under bus [dddd:]bb (dddd is the domain
+ number and bb is the bus number).
+ align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
+ device to minimum PAGE_SIZE alignment (dddd is
+ the domain number and bb, dd and f is the bus,
+ device and function number).

pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
Management.
--
1.5.6.4

2008-10-22 09:44:53

by Zhao, Yu

[permalink] [raw]
Subject: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

Document the SR-IOV sysfs entries.

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>

---
Documentation/ABI/testing/sysfs-bus-pci | 33 +++++++++++++++++++++++++++++++
1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index ceddcff..41cce8f 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -9,3 +9,36 @@ Description:
that some devices may have malformatted data. If the
underlying VPD has a writable section then the
corresponding section of this file will be writable.
+
+What: /sys/bus/pci/devices/.../iov/enable
+Date: October 2008
+Contact: Yu Zhao <[email protected]>
+Description:
+ This file appears when a device has the SR-IOV capability.
+ It holds the status of the capability, and can be written
+ (0/1) to disable or enable the capability if the PF driver
+ supports this operation.
+
+What: /sys/bus/pci/devices/.../iov/initialvfs
+Date: October 2008
+Contact: Yu Zhao <[email protected]>
+Description:
+ This file appears when a device has the SR-IOV capability.
+ It holds the number of initial Virtual Functions (read-only).
+
+What: /sys/bus/pci/devices/.../iov/totalvfs
+Date: October 2008
+Contact: Yu Zhao <[email protected]>
+Description:
+ This file appears when a device has the SR-IOV capability.
+ It holds the number of total Virtual Functions (read-only).
+
+
+What: /sys/bus/pci/devices/.../iov/numvfs
+Date: October 2008
+Contact: Yu Zhao <[email protected]>
+Description:
+ This file appears when a device has the SR-IOV capability.
+ It holds the number of available Virtual Functions, and
+ can be written (1 ~ InitialVFs) to change the number if
+ the PF driver supports this operation.
--
1.5.6.4
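
Since these are plain sysfs attributes, no special tooling is needed; a
userspace helper can simply write to them. A rough sketch (the device address
is only an example, and whether numvfs is set before or after enabling is up
to the PF driver):

/*
 * Userspace sketch: drive the SR-IOV sysfs entries documented above.
 * The device address is only an example.
 */
#include <stdio.h>

static int write_attr(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s", val);
	return fclose(f);
}

int main(void)
{
	const char *base = "/sys/bus/pci/devices/0000:01:00.0/iov";
	char path[256];

	/* enable SR-IOV, then request 4 VFs (ordering depends on the PF driver) */
	snprintf(path, sizeof(path), "%s/enable", base);
	if (write_attr(path, "1"))
		return 1;

	snprintf(path, sizeof(path), "%s/numvfs", base);
	return write_attr(path, "4") ? 1 : 0;
}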

2008-10-22 14:24:37

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum'

On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote:
> This patch moves all definitions of the PCI resource names to an 'enum',
> and also replaces some hard-coded resource variables with symbol
> names. This change eases introduction of device specific resources.

Thanks for removing a bunch of magic numbers from the code.

> static void
> pci_restore_bars(struct pci_dev *dev)
> {
> - int i, numres;
> -
> - switch (dev->hdr_type) {
> - case PCI_HEADER_TYPE_NORMAL:
> - numres = 6;
> - break;
> - case PCI_HEADER_TYPE_BRIDGE:
> - numres = 2;
> - break;
> - case PCI_HEADER_TYPE_CARDBUS:
> - numres = 1;
> - break;
> - default:
> - /* Should never get here, but just in case... */
> - return;
> - }
> + int i;
>
> - for (i = 0; i < numres; i++)
> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
> pci_update_resource(dev, i);
> }

The behavior of this function used to depend on dev->hdr_type. Now
we don't look at hdr_type at all, so we do the same thing for all
devices.

For example, for a CardBus device, we used to call pci_update_resource()
only for BAR 0; now we call it for BARs 0-6.

Maybe this is safe, but I can't tell from the patch, so I think you
should explain *why* it's safe in the changelog.

> +/*
> + * For PCI devices, the region numbers are assigned this way:
> + */
> +enum {
> + /* #0-5: standard PCI regions */
> + PCI_STD_RESOURCES,
> + PCI_STD_RESOURCES_END = 5,
> +
> + /* #6: expansion ROM */
> + PCI_ROM_RESOURCE,
> +
> + /* address space assigned to buses behind the bridge */
> +#ifndef PCI_BRIDGE_RES_NUM
> +#define PCI_BRIDGE_RES_NUM 4
> +#endif
> + PCI_BRIDGE_RESOURCES,
> + PCI_BRIDGE_RES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1,

Since you used "PCI_STD_RESOURCES_END" above, maybe you should use
"PCI_BRIDGE_RESOURCES_END" instead of "PCI_BRIDGE_RES_END".

Bjorn

2008-10-22 14:27:51

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Wednesday 22 October 2008 02:45:31 am Yu Zhao wrote:
> Document the new PCI[x86] boot parameters.
>
> Cc: Alex Chiang <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jesse Barnes <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
>
> ---
> Documentation/kernel-parameters.txt | 10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 53ba7c7..5482ae0 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file
> cbmemsize=nn[KMG] The fixed amount of bus space which is
> reserved for the CardBus bridge's memory
> window. The default value is 64 megabytes.
> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
> + devices under bus [dddd:]bb (dddd is the domain
> + number and bb is the bus number).
> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
> + devices under bus [dddd:]bb (dddd is the domain
> + number and bb is the bus number).
> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
> + device to minimum PAGE_SIZE alignment (dddd is
> + the domain number and bb, dd and f is the bus,
> + device and function number).
>
> pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
> Management.

I think it's nicer to have the documentation change included in the
patch that implements the change. For example, I think this and
patch 9/16 "add boot option to align ..." should be folded into
a single patch. And similarly for the other documentation patches.

Bjorn

2008-10-22 14:35:45

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 9/16 v6] PCI: add boot option to align MMIO resources

On Wednesday 22 October 2008 02:43:24 am Yu Zhao wrote:
> This patch adds boot option to align MMIO resource for a device.
> The alignment is a bigger value between the PAGE_SIZE and the
> resource size.

It looks like this forces alignment on PAGE_SIZE, not "a bigger
value between the PAGE_SIZE and the resource size." Can you
clarify the changelog to specify exactly what alignment this
option forces?

> The boot option can be used as:
> pci=align-mmio=0000:01:02.3
> '[0000:]01:02.3' is the domain, bus, device and function number
> of the device.

I think you also support using multiple "align-mmio=DDDD:BB:dd.f"
options separated by ";", but I had to read the code to figure that
out. Can you give an example of this in the changelog and the
kernel-parameters.txt patch?

Bjorn

> Cc: Alex Chiang <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jesse Barnes <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
>
> ---
> arch/x86/pci/common.c | 37 +++++++++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 20 ++++++++++++++++++--
> include/linux/pci.h | 1 +
> 3 files changed, 56 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index 06e1ce0..3c5d230 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -139,6 +139,7 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev)
>
> static char *pci_assign_pio;
> static char *pci_assign_mmio;
> +static char *pci_align_mmio;
>
> static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus)
> {
> @@ -192,6 +193,36 @@ static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus)
> }
> }
>
> +int pcibios_resource_alignment(struct pci_dev *dev, int resno)
> +{
> + int domain, busnr, slot, func;
> + char *str = pci_align_mmio;
> +
> + if (dev->resource[resno].flags & IORESOURCE_IO)
> + return 0;
> +
> + while (str && *str) {
> + if (sscanf(str, "%04x:%02x:%02x.%d",
> + &domain, &busnr, &slot, &func) != 4) {
> + if (sscanf(str, "%02x:%02x.%d",
> + &busnr, &slot, &func) != 3)
> + break;
> + domain = 0;
> + }
> +
> + if (pci_domain_nr(dev->bus) == domain &&
> + dev->bus->number == busnr &&
> + dev->devfn == PCI_DEVFN(slot, func))
> + return PAGE_SIZE;
> +
> + str = strchr(str, ';');
> + if (str)
> + str++;
> + }
> +
> + return 0;
> +}
> +
> int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno)
> {
> struct pci_bus *bus;
> @@ -200,6 +231,9 @@ int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno)
> if (pcibios_bus_resource_needs_fixup(bus))
> return 1;
>
> + if (pcibios_resource_alignment(dev, resno))
> + return 1;
> +
> return 0;
> }
>
> @@ -592,6 +626,9 @@ char * __devinit pcibios_setup(char *str)
> } else if (!strncmp(str, "assign-mmio=", 12)) {
> pci_assign_mmio = str + 12;
> return NULL;
> + } else if (!strncmp(str, "align-mmio=", 11)) {
> + pci_align_mmio = str + 11;
> + return NULL;
> }
> return str;
> }
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b02167a..11ecd6f 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1015,6 +1015,20 @@ int __attribute__ ((weak)) pcibios_set_pcie_reset_state(struct pci_dev *dev,
> }
>
> /**
> + * pcibios_resource_alignment - get resource alignment requirement
> + * @dev: the PCI device
> + * @resno: resource number
> + *
> + * Queries the resource alignment from PCI low level code. Returns positive
> + * if there is alignment requirement of the resource, or 0 otherwise.
> + */
> +int __attribute__ ((weak)) pcibios_resource_alignment(struct pci_dev *dev,
> + int resno)
> +{
> + return 0;
> +}
> +
> +/**
> * pci_set_pcie_reset_state - set reset state for device dev
> * @dev: the PCI-E device reset
> * @state: Reset state to enter into
> @@ -1913,12 +1927,14 @@ int pci_select_bars(struct pci_dev *dev, unsigned long flags)
> */
> int pci_resource_alignment(struct pci_dev *dev, int resno)
> {
> - resource_size_t align;
> + resource_size_t align, bios_align;
> struct resource *res = dev->resource + resno;
>
> + bios_align = pcibios_resource_alignment(dev, resno);
> +
> align = resource_alignment(res);
> if (align)
> - return align;
> + return align > bios_align ? align : bios_align;
>
> dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno);
> return 0;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 2ada2b6..6ac69af 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1121,6 +1121,7 @@ int pcibios_add_platform_entries(struct pci_dev *dev);
> void pcibios_disable_device(struct pci_dev *dev);
> int pcibios_set_pcie_reset_state(struct pci_dev *dev,
> enum pcie_reset_state state);
> +int pcibios_resource_alignment(struct pci_dev *dev, int resno);
>
> #ifdef CONFIG_PCI_MMCONFIG
> extern void __init pci_mmcfg_early_init(void);

2008-10-22 14:36:01

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 8/16 v6] PCI: add boot options to reassign resources

On Wednesday 22 October 2008 02:43:03 am Yu Zhao wrote:
> This patch adds boot options so user can reassign device resources
> of all devices under a bus.
>
> The boot options can be used as:
> pci=assign-mmio=0000:01,assign-pio=0000:02
> '[dddd:]bb' is the domain and bus number.

I think this example is incorrect because you look for ";" to
separate options, not ",".

Bjorn

> Cc: Alex Chiang <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jesse Barnes <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
>
> ---
> arch/x86/pci/common.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/pci/i386.c | 10 ++++---
> arch/x86/pci/pci.h | 3 ++
> 3 files changed, 82 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
> index b67732b..06e1ce0 100644
> --- a/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -137,6 +137,72 @@ static void __devinit pcibios_fixup_device_resources(struct pci_dev *dev)
> }
> }
>
> +static char *pci_assign_pio;
> +static char *pci_assign_mmio;
> +
> +static int pcibios_bus_resource_needs_fixup(struct pci_bus *bus)
> +{
> + int i;
> + int type = 0;
> + int domain, busnr;
> +
> + if (!bus->self)
> + return 0;
> +
> + for (i = 0; i < 2; i++) {
> + char *str = i ? pci_assign_pio : pci_assign_mmio;
> +
> + while (str && *str) {
> + if (sscanf(str, "%04x:%02x", &domain, &busnr) != 2) {
> + if (sscanf(str, "%02x", &busnr) != 1)
> + break;
> + domain = 0;
> + }
> +
> + if (pci_domain_nr(bus) == domain &&
> + bus->number == busnr) {
> + type |= i ? IORESOURCE_IO : IORESOURCE_MEM;
> + break;
> + }
> +
> + str = strchr(str, ';');
> + if (str)
> + str++;
> + }
> + }
> +
> + return type;
> +}
> +
> +static void __devinit pcibios_fixup_bus_resources(struct pci_bus *bus)
> +{
> + int i;
> + int type = pcibios_bus_resource_needs_fixup(bus);
> +
> + if (!type)
> + return;
> +
> + for (i = 0; i < PCI_BUS_NUM_RESOURCES; i++) {
> + struct resource *res = bus->resource[i];
> +
> + if (!res)
> + continue;
> + if (res->flags & type)
> + res->flags = 0;
> + }
> +}
> +
> +int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno)
> +{
> + struct pci_bus *bus;
> +
> + for (bus = dev->bus; bus && bus != pci_root_bus; bus = bus->parent)
> + if (pcibios_bus_resource_needs_fixup(bus))
> + return 1;
> +
> + return 0;
> +}
> +
> /*
> * Called after each bus is probed, but before its children
> * are examined.
> @@ -147,6 +213,7 @@ void __devinit pcibios_fixup_bus(struct pci_bus *b)
> struct pci_dev *dev;
>
> pci_read_bridge_bases(b);
> + pcibios_fixup_bus_resources(b);
> list_for_each_entry(dev, &b->devices, bus_list)
> pcibios_fixup_device_resources(dev);
> }
> @@ -519,6 +586,12 @@ char * __devinit pcibios_setup(char *str)
> } else if (!strcmp(str, "skip_isa_align")) {
> pci_probe |= PCI_CAN_SKIP_ISA_ALIGN;
> return NULL;
> + } else if (!strncmp(str, "assign-pio=", 11)) {
> + pci_assign_pio = str + 11;
> + return NULL;
> + } else if (!strncmp(str, "assign-mmio=", 12)) {
> + pci_assign_mmio = str + 12;
> + return NULL;
> }
> return str;
> }
> diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
> index 8729bde..ea82a5b 100644
> --- a/arch/x86/pci/i386.c
> +++ b/arch/x86/pci/i386.c
> @@ -169,10 +169,12 @@ static void __init pcibios_allocate_resources(int pass)
> (unsigned long long) r->start,
> (unsigned long long) r->end,
> r->flags, enabled, pass);
> - pr = pci_find_parent_resource(dev, r);
> - if (pr && !request_resource(pr, r))
> - continue;
> - dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx);
> + if (!pcibios_resource_needs_fixup(dev, idx)) {
> + pr = pci_find_parent_resource(dev, r);
> + if (pr && !request_resource(pr, r))
> + continue;
> + dev_err(&dev->dev, "BAR %d: can't allocate resource\n", idx);
> + }
> /* We'll assign a new address later */
> r->end -= r->start;
> r->start = 0;
> diff --git a/arch/x86/pci/pci.h b/arch/x86/pci/pci.h
> index 15b9cf6..f22737d 100644
> --- a/arch/x86/pci/pci.h
> +++ b/arch/x86/pci/pci.h
> @@ -117,6 +117,9 @@ extern int __init pcibios_init(void);
> extern int __init pci_mmcfg_arch_init(void);
> extern void __init pci_mmcfg_arch_free(void);
>
> +/* pci-common.c */
> +extern int pcibios_resource_needs_fixup(struct pci_dev *dev, int resno);
> +
> /*
> * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
> * on their northbrige except through the * %eax register. As such, you MUST

2008-10-22 14:51:27

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 8/16 v6] PCI: add boot options to reassign resources

Bjorn Helgaas wrote:
> On Wednesday 22 October 2008 02:43:03 am Yu Zhao wrote:
>> This patch adds boot options so user can reassign device resources
>> of all devices under a bus.
>>
>> The boot options can be used as:
>> pci=assign-mmio=0000:01,assign-pio=0000:02
>> '[dddd:]bb' is the domain and bus number.
>
> I think this example is incorrect because you look for ";" to
> separate options, not ",".

The semicolon is used to separate multiple parameters for assign-mmio
and assign-pio. E.g., 'pci=assign-mmio=0000:01;0001:02;0004:03'. And the
comma separates different parameters for 'pci='.

2008-10-22 14:51:41

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum'

On Wednesday 22 October 2008 08:44:24 am Yu Zhao wrote:
> Bjorn Helgaas wrote:
> > On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote:
> >> This patch moves all definitions of the PCI resource names to an 'enum',
> >> and also replaces some hard-coded resource variables with symbol
> >> names. This change eases introduction of device specific resources.
> >
> > Thanks for removing a bunch of magic numbers from the code.
> >
> >> static void
> >> pci_restore_bars(struct pci_dev *dev)
> >> {
> >> - int i, numres;
> >> -
> >> - switch (dev->hdr_type) {
> >> - case PCI_HEADER_TYPE_NORMAL:
> >> - numres = 6;
> >> - break;
> >> - case PCI_HEADER_TYPE_BRIDGE:
> >> - numres = 2;
> >> - break;
> >> - case PCI_HEADER_TYPE_CARDBUS:
> >> - numres = 1;
> >> - break;
> >> - default:
> >> - /* Should never get here, but just in case... */
> >> - return;
> >> - }
> >> + int i;
> >>
> >> - for (i = 0; i < numres; i++)
> >> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
> >> pci_update_resource(dev, i);
> >> }
> >
> > The behavior of this function used to depend on dev->hdr_type. Now
> > we don't look at hdr_type at all, so we do the same thing for all
> > devices.
> >
> > For example, for a CardBus device, we used to call pci_update_resource()
> > only for BAR 0; now we call it for BARs 0-6.
> >
> > Maybe this is safe, but I can't tell from the patch, so I think you
> > should explain *why* it's safe in the changelog.
>
> It's safe because pci_update_resource() will ignore unused resources.
> E.g., for a Cardbus, only BAR 0 is used and its 'flags' is set, then
> pci_update_resource() only updates it. BAR 1-6 are ignored since their
> 'flags' are 0.
>
> I'll put more explanation in the changelog.

This is a logically separate change from merely substituting enum
names for magic numbers, so you might even consider splitting it
into a separate patch. Better bisection and all that, you know :-)

Bjorn

2008-10-22 14:52:23

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum'

Bjorn Helgaas wrote:
> On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote:
>> This patch moves all definitions of the PCI resource names to an 'enum',
>> and also replaces some hard-coded resource variables with symbol
>> names. This change eases introduction of device specific resources.
>
> Thanks for removing a bunch of magic numbers from the code.
>
>> static void
>> pci_restore_bars(struct pci_dev *dev)
>> {
>> - int i, numres;
>> -
>> - switch (dev->hdr_type) {
>> - case PCI_HEADER_TYPE_NORMAL:
>> - numres = 6;
>> - break;
>> - case PCI_HEADER_TYPE_BRIDGE:
>> - numres = 2;
>> - break;
>> - case PCI_HEADER_TYPE_CARDBUS:
>> - numres = 1;
>> - break;
>> - default:
>> - /* Should never get here, but just in case... */
>> - return;
>> - }
>> + int i;
>>
>> - for (i = 0; i < numres; i++)
>> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
>> pci_update_resource(dev, i);
>> }
>
> The behavior of this function used to depend on dev->hdr_type. Now
> we don't look at hdr_type at all, so we do the same thing for all
> devices.
>
> For example, for a CardBus device, we used to call pci_update_resource()
> only for BAR 0; now we call it for BARs 0-6.
>
> Maybe this is safe, but I can't tell from the patch, so I think you
> should explain *why* it's safe in the changelog.

It's safe because pci_update_resource() will ignore unused resources.
E.g., for a Cardbus, only BAR 0 is used and its 'flags' is set, then
pci_update_resource() only updates it. BAR 1-6 are ignored since their
'flags' are 0.

I'll put more explanation in the changelog.

Thanks,
Yu

2008-10-22 14:53:57

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 9/16 v6] PCI: add boot option to align MMIO resources

Bjorn Helgaas wrote:
> On Wednesday 22 October 2008 02:43:24 am Yu Zhao wrote:
>> This patch adds boot option to align MMIO resource for a device.
>> The alignment is a bigger value between the PAGE_SIZE and the
>> resource size.
>
> It looks like this forces alignment on PAGE_SIZE, not "a bigger
> value between the PAGE_SIZE and the resource size." Can you
> clarify the changelog to specify exactly what alignment this
> option forces?

I guess the following would answer your question.

>> int pci_resource_alignment(struct pci_dev *dev, int resno)
>> {
>> - resource_size_t align;
>> + resource_size_t align, bios_align;
>> struct resource *res = dev->resource + resno;
>>
>> + bios_align = pcibios_resource_alignment(dev, resno);
>> +
>> align = resource_alignment(res);
>> if (align)
>> - return align;
>> + return align > bios_align ? align : bios_align;
>>
>> dev_err(&dev->dev, "alignment: invalid resource #%d\n", resno);
>> return 0;

2008-10-22 15:01:58

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum'

Bjorn Helgaas wrote:
> On Wednesday 22 October 2008 08:44:24 am Yu Zhao wrote:
>> Bjorn Helgaas wrote:
>>> On Wednesday 22 October 2008 02:40:41 am Yu Zhao wrote:
>>>> This patch moves all definitions of the PCI resource names to an 'enum',
>>>> and also replaces some hard-coded resource variables with symbol
>>>> names. This change eases introduction of device specific resources.
>>> Thanks for removing a bunch of magic numbers from the code.
>>>
>>>> static void
>>>> pci_restore_bars(struct pci_dev *dev)
>>>> {
>>>> - int i, numres;
>>>> -
>>>> - switch (dev->hdr_type) {
>>>> - case PCI_HEADER_TYPE_NORMAL:
>>>> - numres = 6;
>>>> - break;
>>>> - case PCI_HEADER_TYPE_BRIDGE:
>>>> - numres = 2;
>>>> - break;
>>>> - case PCI_HEADER_TYPE_CARDBUS:
>>>> - numres = 1;
>>>> - break;
>>>> - default:
>>>> - /* Should never get here, but just in case... */
>>>> - return;
>>>> - }
>>>> + int i;
>>>>
>>>> - for (i = 0; i < numres; i++)
>>>> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
>>>> pci_update_resource(dev, i);
>>>> }
>>> The behavior of this function used to depend on dev->hdr_type. Now
>>> we don't look at hdr_type at all, so we do the same thing for all
>>> devices.
>>>
>>> For example, for a CardBus device, we used to call pci_update_resource()
>>> only for BAR 0; now we call it for BARs 0-6.
>>>
>>> Maybe this is safe, but I can't tell from the patch, so I think you
>>> should explain *why* it's safe in the changelog.
>> It's safe because pci_update_resource() will ignore unused resources.
>> E.g., for a Cardbus, only BAR 0 is used and its 'flags' is set, then
>> pci_update_resource() only updates it. BAR 1-6 are ignored since their
>> 'flags' are 0.
>>
>> I'll put more explanation in the changelog.
>
> This is a logically separate change from merely substituting enum
> names for magic numbers, so you might even consider splitting it
> into a separate patch. Better bisection and all that, you know :-)

Will do.

Thanks,
Yu

2008-10-22 17:04:37

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Bjorn Helgaas wrote:
> On Wednesday 22 October 2008 02:45:31 am Yu Zhao wrote:
>> Document the new PCI[x86] boot parameters.
>>
>> Cc: Alex Chiang <[email protected]>
>> Cc: Grant Grundler <[email protected]>
>> Cc: Greg KH <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Jesse Barnes <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Randy Dunlap <[email protected]>
>> Cc: Roland Dreier <[email protected]>
>> Signed-off-by: Yu Zhao <[email protected]>
>>
>> ---
>> Documentation/kernel-parameters.txt | 10 ++++++++++
>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index 53ba7c7..5482ae0 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file
>> cbmemsize=nn[KMG] The fixed amount of bus space which is
>> reserved for the CardBus bridge's memory
>> window. The default value is 64 megabytes.
>> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
>> + devices under bus [dddd:]bb (dddd is the domain
>> + number and bb is the bus number).
>> + assign-pio=[dddd:]bb [X86] reassign io port resources of all

"io" in text should be "IO" or "I/O". (Small "io" is OK as a parameter placeholder.)

>> + devices under bus [dddd:]bb (dddd is the domain
>> + number and bb is the bus number).
>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
>> + device to minimum PAGE_SIZE alignment (dddd is
>> + the domain number and bb, dd and f is the bus,

are the bus,

>> + device and function number).
>>
>> pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
>> Management.
>
> I think it's nicer to have the documentation change included in the
> patch that implements the change. For example, I think this and
> patch 9/16 "add boot option to align ..." should be folded into
> a single patch. And similarly for the other documentation patches.
>
> Bjorn

2008-10-23 07:10:45

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources()

On Wed, Oct 22, 2008 at 1:42 AM, Yu Zhao <[email protected]> wrote:
> This cleanup makes pcibios_allocate_resources() easier to read.
>
> Cc: Alex Chiang <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jesse Barnes <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
>
> ---
> arch/x86/pci/i386.c | 28 ++++++++++++++--------------
> 1 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
> index 844df0c..8729bde 100644
> --- a/arch/x86/pci/i386.c
> +++ b/arch/x86/pci/i386.c
> @@ -147,7 +147,7 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
> static void __init pcibios_allocate_resources(int pass)
> {
> struct pci_dev *dev = NULL;
> - int idx, disabled;
> + int idx, enabled;
> u16 command;
> struct resource *r, *pr;
>
> @@ -160,22 +160,22 @@ static void __init pcibios_allocate_resources(int pass)
> if (!r->start) /* Address not assigned at all */
> continue;
> if (r->flags & IORESOURCE_IO)
> - disabled = !(command & PCI_COMMAND_IO);
> + enabled = command & PCI_COMMAND_IO;
> else
> - disabled = !(command & PCI_COMMAND_MEMORY);
> - if (pass == disabled) {
> - dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n",
> + enabled = command & PCI_COMMAND_MEMORY;
> + if (pass == enabled)
> + continue;

it seems you change the flow here for MMIO
because PCI_COMMAND_MEMORY is 2.

YH

2008-10-23 07:46:17

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 7/16 v6] PCI: cleanup pcibios_allocate_resources()

On Thu, Oct 23, 2008 at 03:10:26PM +0800, Yinghai Lu wrote:
> On Wed, Oct 22, 2008 at 1:42 AM, Yu Zhao <[email protected]> wrote:
> > diff --git a/arch/x86/pci/i386.c b/arch/x86/pci/i386.c
> > index 844df0c..8729bde 100644
> > --- a/arch/x86/pci/i386.c
> > +++ b/arch/x86/pci/i386.c
> > @@ -147,7 +147,7 @@ static void __init pcibios_allocate_bus_resources(struct list_head *bus_list)
> > static void __init pcibios_allocate_resources(int pass)
> > {
> > struct pci_dev *dev = NULL;
> > - int idx, disabled;
> > + int idx, enabled;
> > u16 command;
> > struct resource *r, *pr;
> >
> > @@ -160,22 +160,22 @@ static void __init pcibios_allocate_resources(int pass)
> > if (!r->start) /* Address not assigned at all */
> > continue;
> > if (r->flags & IORESOURCE_IO)
> > - disabled = !(command & PCI_COMMAND_IO);
> > + enabled = command & PCI_COMMAND_IO;
> > else
> > - disabled = !(command & PCI_COMMAND_MEMORY);
> > - if (pass == disabled) {
> > - dev_dbg(&dev->dev, "resource %#08llx-%#08llx (f=%lx, d=%d, p=%d)\n",
> > + enabled = command & PCI_COMMAND_MEMORY;
> > + if (pass == enabled)
> > + continue;
>
> it seems you change the flow here for MMIO
> because PCI_COMMAND_MEMORY is 2.
>
> YH

Nice finding! Will change it back to 'disable' next version.

Thanks,
Yu
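
To make the problem concrete, a tiny stand-alone illustration (not kernel
code) of why the 'enabled' form misbehaves: PCI_COMMAND_MEMORY is bit 1, so
the masked value is 0 or 2 and never equals pass == 1, which means an enabled
MMIO BAR is handled in both passes, while the original 'disabled' form
normalizes the result to 0 or 1:

#include <stdio.h>

#define PCI_COMMAND_IO		0x1	/* same values as in pci_regs.h */
#define PCI_COMMAND_MEMORY	0x2

int main(void)
{
	unsigned short command = PCI_COMMAND_IO | PCI_COMMAND_MEMORY;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		int enabled = command & PCI_COMMAND_MEMORY;	/* 0 or 2 */
		int disabled = !(command & PCI_COMMAND_MEMORY);	/* 0 or 1 */

		printf("pass %d: skip-if-enabled %d, skip-if-disabled %d\n",
		       pass, pass == enabled, pass == disabled);
	}
	return 0;
}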

2008-11-06 04:34:44

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
> Document the new PCI[x86] boot parameters.
>
> Cc: Alex Chiang <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jesse Barnes <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
>
> ---
> Documentation/kernel-parameters.txt | 10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 53ba7c7..5482ae0 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file
> cbmemsize=nn[KMG] The fixed amount of bus space which is
> reserved for the CardBus bridge's memory
> window. The default value is 64 megabytes.
> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
> + devices under bus [dddd:]bb (dddd is the domain
> + number and bb is the bus number).
> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
> + devices under bus [dddd:]bb (dddd is the domain
> + number and bb is the bus number).
> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
> + device to minimum PAGE_SIZE alignment (dddd is
> + the domain number and bb, dd and f is the bus,
> + device and function number).

This seems like a big problem. How are we going to know to add these
command line options for devices we haven't even seen/known about yet?

How do we know the bus ids aren't going to change between boots (hint,
they are, pci bus ids change all the time...)

We need to be able to do this kind of thing dynamically, not fixed at
boot time, which seems way too early to even know about this, right?

thanks,

greg k-h

2008-11-06 04:35:27

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Wed, Oct 22, 2008 at 04:45:15PM +0800, Yu Zhao wrote:
> Document the SR-IOV sysfs entries.
>
> Cc: Alex Chiang <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jesse Barnes <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
>
> ---
> Documentation/ABI/testing/sysfs-bus-pci | 33 +++++++++++++++++++++++++++++++
> 1 files changed, 33 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index ceddcff..41cce8f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -9,3 +9,36 @@ Description:
> that some devices may have malformatted data. If the
> underlying VPD has a writable section then the
> corresponding section of this file will be writable.
> +
> +What: /sys/bus/pci/devices/.../iov/enable

Are you sure this is still the correct location with your change to
struct device?

thanks,

greg k-h

2008-11-06 04:48:22

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Wed, Nov 05, 2008 at 08:33:18PM -0800, Greg KH wrote:
> On Wed, Oct 22, 2008 at 04:45:15PM +0800, Yu Zhao wrote:
> > Document the SR-IOV sysfs entries.
> >
> > Cc: Alex Chiang <[email protected]>
> > Cc: Grant Grundler <[email protected]>
> > Cc: Greg KH <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Jesse Barnes <[email protected]>
> > Cc: Matthew Wilcox <[email protected]>
> > Cc: Randy Dunlap <[email protected]>
> > Cc: Roland Dreier <[email protected]>
> > Signed-off-by: Yu Zhao <[email protected]>
> >
> > ---
> > Documentation/ABI/testing/sysfs-bus-pci | 33 +++++++++++++++++++++++++++++++
> > 1 files changed, 33 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> > index ceddcff..41cce8f 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > @@ -9,3 +9,36 @@ Description:
> > that some devices may have malformatted data. If the
> > underlying VPD has a writable section then the
> > corresponding section of this file will be writable.
> > +
> > +What: /sys/bus/pci/devices/.../iov/enable
>
> Are you sure this is still the correct location with your change to
> struct device?

Nevermind, this is correct.

But the bigger problem is that userspace doesn't know when these
attributes show up. So tools like udev and HAL and others can't look
for them as they never get notified, and they don't even know if they
should be looking for them or not.

Is there any way to tie these attributes to the "main" pci device so
that they get created before the device is announced to the world?
Doing that would solve this issue.

thanks,

greg k-h

2008-11-06 04:50:33

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Wed, Oct 22, 2008 at 04:38:09PM +0800, Yu Zhao wrote:
> Greetings,
>
> Following patches are intended to support SR-IOV capability in the
> Linux kernel. With these patches, people can turn a PCI device with
> the capability into multiple ones from software perspective, which
> will benefit KVM and achieve other purposes such as QoS, security,
> and etc.

Is there any actual users of this API around yet? How was it tested as
there is no hardware to test on? Which drivers are going to have to be
rewritten to take advantage of this new interface?

thanks,

greg k-h

2008-11-06 15:40:29

by H L

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support


Greetings (from a new lurker to the list),

To your question Greg, "yes" and "sort of" ;-). I have started taking a look at these patches with a strong interest in understanding how they work. I've built a kernel with them and tried out a few things with real SR-IOV hardware.

--
Lance Hartmann




--- On Wed, 11/5/08, Greg KH <[email protected]> wrote:
>
> Is there any actual users of this API around yet? How was
> it tested as
> there is no hardware to test on? Which drivers are going
> to have to be
> rewritten to take advantage of this new interface?
>
> thanks,
>
> greg k-h
> _______________________________________________
> Virtualization mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/virtualization



2008-11-06 15:47:40

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 07:40:12AM -0800, H L wrote:
>
> Greetings (from a new lurker to the list),

Welcome!

> To your question Greg, "yes" and "sort of" ;-). I have started taking
> a look at these patches with a strong interest in understanding how
> they work. I've built a kernel with them and tried out a few things
> with real SR-IOV hardware.

Did you have to modify individual drivers to take advantage of this
code? It looks like the core code will run on this type of hardware,
but there seems to be no real advantage until a driver is modified to
use it, right?

Or am I missing some great advantage to having this code without
modified drivers?

thanks,

greg k-h

2008-11-06 16:48:56

by H L

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

I have not modified any existing drivers, but instead I threw together a bare-bones module enabling me to make a call to pci_iov_register() and then poke at an SR-IOV adapter's /sys entries for which no driver was loaded.

It appears from my perusal thus far that drivers using these new SR-IOV patches will require modification; i.e. the driver associated with the Physical Function (PF) will be required to make the pci_iov_register() call along with the requisite notify() function. Essentially this suggests to me a model for the PF driver to perform any "global actions" or setup on behalf of VFs before enabling them after which VF drivers could be associated.

I have so far only seen Yu Zhao's "7-patch" set. I've not yet looked at his subsequently tendered "15-patch" set so I don't know what has changed. The hardware/firmware implementation for any given SR-IOV compatible device, will determine the extent of differences required between a PF driver and a VF driver.

--
Lance Hartmann


--- On Thu, 11/6/08, Greg KH <[email protected]> wrote:


> Date: Thursday, November 6, 2008, 9:43 AM
> On Thu, Nov 06, 2008 at 07:40:12AM -0800, H L wrote:
> >
> > Greetings (from a new lurker to the list),
>
> Welcome!
>
> > To your question Greg, "yes" and "sort
> of" ;-). I have started taking
> > a look at these patches with a strong interest in
> understanding how
> > they work. I've built a kernel with them and
> tried out a few things
> > with real SR-IOV hardware.
>
> Did you have to modify individual drivers to take advantage
> of this
> code? It looks like the core code will run on this type of
> hardware,
> but there seems to be no real advantage until a driver is
> modified to
> use it, right?
>
> Or am I missing some great advantage to having this code
> without
> modified drivers?
>
> thanks,
>
> greg k-h



2008-11-06 16:51:20

by H L

[permalink] [raw]
Subject: git repository for SR-IOV development?


Has anyone initiated or given consideration to the creation of a git repository (say, on kernel.org) for SR-IOV development?

--
Lance Hartmann




2008-11-06 16:59:06

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support


A: No.
Q: Should I include quotations after my reply?

On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> I have not modified any existing drivers, but instead I threw together
> a bare-bones module enabling me to make a call to pci_iov_register()
> and then poke at an SR-IOV adapter's /sys entries for which no driver
> was loaded.
>
> It appears from my perusal thus far that drivers using these new
> SR-IOV patches will require modification; i.e. the driver associated
> with the Physical Function (PF) will be required to make the
> pci_iov_register() call along with the requisite notify() function.
> Essentially this suggests to me a model for the PF driver to perform
> any "global actions" or setup on behalf of VFs before enabling them
> after which VF drivers could be associated.

Where would the VF drivers have to be associated? On the "pci_dev"
level or on a higher one?

Will all drivers that want to bind to a "VF" device need to be
rewritten?

> I have so far only seen Yu Zhao's "7-patch" set. I've not yet looked
> at his subsequently tendered "15-patch" set so I don't know what has
> changed. The hardware/firmware implementation for any given SR-IOV
> compatible device, will determine the extent of differences required
> between a PF driver and a VF driver.

Yeah, that's what I'm worried/curious about. Without seeing the code
for such a driver, how can we properly evaluate if this infrastructure
is the correct one and the proper way to do all of this?

thanks,

greg k-h

2008-11-06 16:59:43

by Greg KH

[permalink] [raw]
Subject: Re: git repository for SR-IOV development?

On Thu, Nov 06, 2008 at 08:51:09AM -0800, H L wrote:
>
> Has anyone initiated or given consideration to the creation of a git
> repository (say, on kernel.org) for SR-IOV development?

Why? It's only a few patches, right? Why would it need a whole new git
tree?

thanks,

greg k-h

2008-11-06 17:39:25

by Fischer, Anna

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

> On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > I have not modified any existing drivers, but instead I threw
> together
> > a bare-bones module enabling me to make a call to pci_iov_register()
> > and then poke at an SR-IOV adapter's /sys entries for which no driver
> > was loaded.
> >
> > It appears from my perusal thus far that drivers using these new
> > SR-IOV patches will require modification; i.e. the driver associated
> > with the Physical Function (PF) will be required to make the
> > pci_iov_register() call along with the requisite notify() function.
> > Essentially this suggests to me a model for the PF driver to perform
> > any "global actions" or setup on behalf of VFs before enabling them
> > after which VF drivers could be associated.
>
> Where would the VF drivers have to be associated? On the "pci_dev"
> level or on a higher one?

A VF appears to the Linux OS as a standard (full, additional) PCI device. The driver is associated in the same way as for a normal PCI device. Ideally, you would use SR-IOV devices on a virtualized system, for example, using Xen. A VF can then be assigned to a guest domain as a full PCI device.

> Will all drivers that want to bind to a "VF" device need to be
> rewritten?

Currently, any vendor providing a SR-IOV device needs to provide a PF driver and a VF driver that runs on their hardware. A VF driver does not necessarily need to know much about SR-IOV but just run on the presented PCI device. You might want to have a communication channel between PF and VF driver though, for various reasons, if such a channel is not provided in hardware.
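
To make that concrete: a VF driver can in principle be a completely ordinary pci_driver, bound by vendor/device ID like any other PCI device. A minimal sketch, with placeholder IDs:

#include <linux/module.h>
#include <linux/pci.h>

/* Placeholder IDs: a VF normally exposes its own (vendor, device) pair. */
static const struct pci_device_id foo_vf_ids[] = {
	{ PCI_DEVICE(0x8086, 0x10cb) },
	{ }
};
MODULE_DEVICE_TABLE(pci, foo_vf_ids);

static int foo_vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* Nothing SR-IOV specific: enable the device and use its BARs as usual. */
	return pci_enable_device(pdev);
}

static void foo_vf_remove(struct pci_dev *pdev)
{
	pci_disable_device(pdev);
}

static struct pci_driver foo_vf_driver = {
	.name		= "foo_vf",
	.id_table	= foo_vf_ids,
	.probe		= foo_vf_probe,
	.remove		= foo_vf_remove,
};

static int __init foo_vf_init(void)
{
	return pci_register_driver(&foo_vf_driver);
}

static void __exit foo_vf_exit(void)
{
	pci_unregister_driver(&foo_vf_driver);
}

module_init(foo_vf_init);
module_exit(foo_vf_exit);
MODULE_LICENSE("GPL");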

> > I have so far only seen Yu Zhao's "7-patch" set. I've not yet looked
> > at his subsequently tendered "15-patch" set so I don't know what has
> > changed. The hardware/firmware implementation for any given SR-IOV
> > compatible device, will determine the extent of differences required
> > between a PF driver and a VF driver.
>
> Yeah, that's what I'm worried/curious about. Without seeing the code
> for such a driver, how can we properly evaluate if this infrastructure
> is the correct one and the proper way to do all of this?

Yu's API allows a PF driver to register with the Linux PCI code and use it to activate VFs and allocate their resources. The PF driver needs to be modified to work with that API. While you can argue about what that API is supposed to look like, it is clear that such an API is required in some form. The PF driver needs to know when VFs are active as it might want to allocate further (device-specific) resources to VFs or initiate further (device-specific) configurations. While probably a lot of SR-IOV specific code has to be in the PF driver, there is also support required from the Linux PCI subsystem, which is to some extent provided by Yu's patches.

Anna

2008-11-06 17:48:17

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
> On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > I have not modified any existing drivers, but instead I threw together
> > a bare-bones module enabling me to make a call to pci_iov_register()
> > and then poke at an SR-IOV adapter's /sys entries for which no driver
> > was loaded.
> >
> > It appears from my perusal thus far that drivers using these new
> > SR-IOV patches will require modification; i.e. the driver associated
> > with the Physical Function (PF) will be required to make the
> > pci_iov_register() call along with the requisite notify() function.
> > Essentially this suggests to me a model for the PF driver to perform
> > any "global actions" or setup on behalf of VFs before enabling them
> > after which VF drivers could be associated.
>
> Where would the VF drivers have to be associated? On the "pci_dev"
> level or on a higher one?
>
> Will all drivers that want to bind to a "VF" device need to be
> rewritten?

The current model being implemented by my colleagues has separate
drivers for the PF (aka native) and VF devices. I don't personally
believe this is the correct path, but I'm reserving judgement until I
see some code.

I don't think we really know what the One True Usage model is for VF
devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
some ideas. I bet there's other people who have other ideas too.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-06 17:53:44

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
> On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
> > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > I have not modified any existing drivers, but instead I threw together
> > > a bare-bones module enabling me to make a call to pci_iov_register()
> > > and then poke at an SR-IOV adapter's /sys entries for which no driver
> > > was loaded.
> > >
> > > It appears from my perusal thus far that drivers using these new
> > > SR-IOV patches will require modification; i.e. the driver associated
> > > with the Physical Function (PF) will be required to make the
> > > pci_iov_register() call along with the requisite notify() function.
> > > Essentially this suggests to me a model for the PF driver to perform
> > > any "global actions" or setup on behalf of VFs before enabling them
> > > after which VF drivers could be associated.
> >
> > Where would the VF drivers have to be associated? On the "pci_dev"
> > level or on a higher one?
> >
> > Will all drivers that want to bind to a "VF" device need to be
> > rewritten?
>
> The current model being implemented by my colleagues has separate
> drivers for the PF (aka native) and VF devices. I don't personally
> believe this is the correct path, but I'm reserving judgement until I
> see some code.

Hm, I would like to see that code before we can properly evaluate this
interface. Especially as they are all tightly tied together.

> I don't think we really know what the One True Usage model is for VF
> devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
> some ideas. I bet there's other people who have other ideas too.

I'd love to hear those ideas.

Rumor has it, there is some Xen code floating around to support this
already, is that true?

thanks,

greg k-h

2008-11-06 18:04:20

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 05:38:16PM +0000, Fischer, Anna wrote:
> > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > I have not modified any existing drivers, but instead I threw
> > together
> > > a bare-bones module enabling me to make a call to pci_iov_register()
> > > and then poke at an SR-IOV adapter's /sys entries for which no driver
> > > was loaded.
> > >
> > > It appears from my perusal thus far that drivers using these new
> > > SR-IOV patches will require modification; i.e. the driver associated
> > > with the Physical Function (PF) will be required to make the
> > > pci_iov_register() call along with the requisite notify() function.
> > > Essentially this suggests to me a model for the PF driver to perform
> > > any "global actions" or setup on behalf of VFs before enabling them
> > > after which VF drivers could be associated.
> >
> > Where would the VF drivers have to be associated? On the "pci_dev"
> > level or on a higher one?
>
> A VF appears to the Linux OS as a standard (full, additional) PCI
> device. The driver is associated in the same way as for a normal PCI
> device. Ideally, you would use SR-IOV devices on a virtualized system,
> for example, using Xen. A VF can then be assigned to a guest domain as
> a full PCI device.

It's that "second" part that I'm worried about. How is that going to
happen? Do you have any patches that show this kind of "assignment"?

> > Will all drivers that want to bind to a "VF" device need to be
> > rewritten?
>
> Currently, any vendor providing a SR-IOV device needs to provide a PF
> driver and a VF driver that runs on their hardware.

Are there any such drivers available yet?

> A VF driver does not necessarily need to know much about SR-IOV but
> just run on the presented PCI device. You might want to have a
> communication channel between PF and VF driver though, for various
> reasons, if such a channel is not provided in hardware.

Agreed, but what does that channel look like in Linux?

I have some ideas of what I think it should look like, but if people
already have code, I'd love to see that as well.

> > > I have so far only seen Yu Zhao's "7-patch" set. I've not yet looked
> > > at his subsequently tendered "15-patch" set so I don't know what has
> > > changed. The hardware/firmware implementation for any given SR-IOV
> > > compatible device, will determine the extent of differences required
> > > between a PF driver and a VF driver.
> >
> > Yeah, that's what I'm worried/curious about. Without seeing the code
> > for such a driver, how can we properly evaluate if this infrastructure
> > is the correct one and the proper way to do all of this?
>
> Yu's API allows a PF driver to register with the Linux PCI code and
> use it to activate VFs and allocate their resources. The PF driver
> needs to be modified to work with that API. While you can argue about
> how that API is supposed to look like, it is clear that such an API is
> required in some form.

I totally agree, I'm arguing about what that API looks like :)

I want to see some code...

thanks,

greg k-h

2008-11-06 18:05:51

by H L

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support


--- On Thu, 11/6/08, Greg KH <[email protected]> wrote:

> On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > I have not modified any existing drivers, but instead
> I threw together
> > a bare-bones module enabling me to make a call to
> pci_iov_register()
> > and then poke at an SR-IOV adapter's /sys entries
> for which no driver
> > was loaded.
> >
> > It appears from my perusal thus far that drivers using
> these new
> > SR-IOV patches will require modification; i.e. the
> driver associated
> > with the Physical Function (PF) will be required to
> make the
> > pci_iov_register() call along with the requisite
> notify() function.
> > Essentially this suggests to me a model for the PF
> driver to perform
> > any "global actions" or setup on behalf of
> VFs before enabling them
> > after which VF drivers could be associated.
>
> Where would the VF drivers have to be associated? On the
> "pci_dev"
> level or on a higher one?


I have not yet fully grokked Yu Zhao's model to answer this. That said, I would *hope* to find it on the "pci_dev" level.


> Will all drivers that want to bind to a "VF"
> device need to be
> rewritten?

Not necessarily, or perhaps minimally; depends on hardware/firmware and actions the driver wants to take. An example here might assist. Let's just say someone has created, oh, I don't know, maybe an SR-IOV NIC. Now, for 'general' I/O operations to pass network traffic back and forth there would ideally be no difference in the actions and therefore behavior of a PF driver and a VF driver. But, what do you do in the instance a VF wants to change link-speed? As that physical characteristic affects all VFs, how do you handle that? This is where the hardware/firmware implementation part comes to play. If a VF driver performs some actions to initiate the change in link speed, the logic in the adapter could be anything like:

1. Acknowledge the request as if it were really done, but effectively ignore it. The Independent Hardware Vendor (IHV) might dictate that if you want to change any "global" characteristics of an adapter, you may only do so via the PF driver. Granted, this, depending on the device class, may just not be acceptable.

2. Acknowledge the request and then trigger an interrupt to the PF driver to have it assist. The PF driver might then just set the new link-speed, or it could result in a PF driver communicating by some mechanism to all of the VF driver instances that this change of link-speed was requested.

3. Acknowledge the request and perform inner PF and VF communication of this event within the logic of the card (e.g. to "vote" on whether or not to perform this action) with interrupts and associated status delivered to all PF and VF drivers.

The list goes on.

>
> > I have so far only seen Yu Zhao's
> "7-patch" set. I've not yet looked
> > at his subsequently tendered "15-patch" set
> so I don't know what has
> > changed. The hardware/firmware implementation for
> any given SR-IOV
> > compatible device, will determine the extent of
> differences required
> > between a PF driver and a VF driver.
>
> Yeah, that's what I'm worried/curious about.
> Without seeing the code
> for such a driver, how can we properly evaluate if this
> infrastructure
> is the correct one and the proper way to do all of this?


As the example above demonstrates, that's a tough question to answer. Ideally, in my view, there would only be one driver written per SR-IOV device, and it would contain the logic to "do the right things" based on whether it's running as a PF or VF, with that determination easily accomplished by testing for the existence of the SR-IOV extended capability. Then, in an effort to minimize (if not eliminate) the complexities of driver-to-driver actions for fielding "global events", contain as much of the logic as possible within the adapter. Minimizing the effort required of device driver writers, in my opinion, paves the way to greater adoption of this technology.
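
To make the single-driver idea concrete, the PF/VF determination could look roughly like this. The capability ID define is my own placeholder (0x0010 is the SR-IOV extended capability ID from the spec); I don't know what symbol the patches actually use:

#include <linux/pci.h>

#define FOO_EXT_CAP_ID_SRIOV	0x0010	/* placeholder name for the spec's ID */

static int foo_setup_pf(struct pci_dev *pdev)
{
	/* "Global" setup done once, on behalf of all VFs. */
	return 0;
}

static int foo_setup_vf(struct pci_dev *pdev)
{
	/* Per-function setup only; no global side effects. */
	return 0;
}

static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err = pci_enable_device(pdev);

	if (err)
		return err;

	/* Only the Physical Function carries the SR-IOV extended capability. */
	if (pci_find_ext_capability(pdev, FOO_EXT_CAP_ID_SRIOV))
		return foo_setup_pf(pdev);

	return foo_setup_vf(pdev);
}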




2008-11-06 18:37:00

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support


[Anna, can you fix your word-wrapping please? Your lines appear to be
infinitely long which is most unpleasant to reply to]

On Thu, Nov 06, 2008 at 05:38:16PM +0000, Fischer, Anna wrote:
> > Where would the VF drivers have to be associated? On the "pci_dev"
> > level or on a higher one?
>
> A VF appears to the Linux OS as a standard (full, additional) PCI
> device. The driver is associated in the same way as for a normal PCI
> device. Ideally, you would use SR-IOV devices on a virtualized system,
> for example, using Xen. A VF can then be assigned to a guest domain as
> a full PCI device.

It's not clear that's the right solution. If the VF devices are _only_
going to be used by the guest, then arguably, we don't want to create
pci_devs for them in the host. (I think it _is_ the right answer, but I
want to make it clear there's multiple opinions on this).

> > Will all drivers that want to bind to a "VF" device need to be
> > rewritten?
>
> Currently, any vendor providing a SR-IOV device needs to provide a PF
> driver and a VF driver that runs on their hardware. A VF driver does not
> necessarily need to know much about SR-IOV but just run on the presented
> PCI device. You might want to have a communication channel between PF
> and VF driver though, for various reasons, if such a channel is not
> provided in hardware.

That is one model. Another model is to provide one driver that can
handle both PF and VF devices. A third model is to provide, say, a
Windows VF driver and a Xen PF driver and only support Windows-on-Xen.
(This last would probably be an exercise in foot-shooting, but
nevertheless, I've heard it mooted).

> > Yeah, that's what I'm worried/curious about. Without seeing the code
> > for such a driver, how can we properly evaluate if this infrastructure
> > is the correct one and the proper way to do all of this?
>
> Yu's API allows a PF driver to register with the Linux PCI code and use
> it to activate VFs and allocate their resources. The PF driver needs to
> be modified to work with that API. While you can argue about how that API
> is supposed to look like, it is clear that such an API is required in some
> form. The PF driver needs to know when VFs are active as it might want to
> allocate further (device-specific) resources to VFs or initiate further
> (device-specific) configurations. While probably a lot of SR-IOV specific
> code has to be in the PF driver, there is also support required from
> the Linux PCI subsystem, which is to some extend provided by Yu's patches.

Everyone agrees that some support is necessary. The question is exactly
what it looks like. I must confess to not having reviewed this latest
patch series yet -- I'm a little burned out on patch review.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-06 18:37:37

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 10:05:39AM -0800, H L wrote:
>
> --- On Thu, 11/6/08, Greg KH <[email protected]> wrote:
>
> > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > I have not modified any existing drivers, but instead
> > I threw together
> > > a bare-bones module enabling me to make a call to
> > pci_iov_register()
> > > and then poke at an SR-IOV adapter's /sys entries
> > for which no driver
> > > was loaded.
> > >
> > > It appears from my perusal thus far that drivers using
> > these new
> > > SR-IOV patches will require modification; i.e. the
> > driver associated
> > > with the Physical Function (PF) will be required to
> > make the
> > > pci_iov_register() call along with the requisite
> > notify() function.
> > > Essentially this suggests to me a model for the PF
> > driver to perform
> > > any "global actions" or setup on behalf of
> > VFs before enabling them
> > > after which VF drivers could be associated.
> >
> > Where would the VF drivers have to be associated? On the
> > "pci_dev"
> > level or on a higher one?
>
>
> I have not yet fully grocked Yu Zhao's model to answer this. That
> said, I would *hope* to find it on the "pci_dev" level.

Me too.

> > Will all drivers that want to bind to a "VF"
> > device need to be
> > rewritten?
>
> Not necessarily, or perhaps minimally; depends on hardware/firmware
> and actions the driver wants to take. An example here might assist.
> Let's just say someone has created, oh, I don't know, maybe an SR-IOV
> NIC. Now, for 'general' I/O operations to pass network traffic back
> and forth there would ideally be no difference in the actions and
> therefore behavior of a PF driver and a VF driver. But, what do you
> do in the instance a VF wants to change link-speed? As that physical
> characteristic affects all VFs, how do you handle that? This is where
> the hardware/firmware implementation part comes to play. If a VF
> driver performs some actions to initiate the change in link speed, the
> logic in the adapter could be anything like:

<snip>

Yes, I agree that all of this needs to be done, somehow.

It's that "somehow" that I am interested in trying to see how it works
out.

> >
> > > I have so far only seen Yu Zhao's
> > "7-patch" set. I've not yet looked
> > > at his subsequently tendered "15-patch" set
> > so I don't know what has
> > > changed. The hardware/firmware implementation for
> > any given SR-IOV
> > > compatible device, will determine the extent of
> > differences required
> > > between a PF driver and a VF driver.
> >
> > Yeah, that's what I'm worried/curious about.
> > Without seeing the code
> > for such a driver, how can we properly evaluate if this
> > infrastructure
> > is the correct one and the proper way to do all of this?
>
>
> As the example above demonstrates, that's a tough question to answer.
> Ideally, in my view, there would only be one driver written per SR-IOV
> device and it would contain the logic to "do the right things" based
> on whether its running as a PF or VF with that determination easily
> accomplished by testing the existence of the SR-IOV extended
> capability. Then, in an effort to minimize (if not eliminate) the
> complexities of driver-to-driver actions for fielding "global events",
> contain as much of the logic as is possible within the adapter.
> Minimizing the efforts required for the device driver writers in my
> opinion paves the way to greater adoption of this technology.

Yes, making things easier is the key here.

Perhaps some of this could be hidden with a new bus type for these kinds
of devices? Or a "virtual" bus of pci devices that the original SR-IOV
device creates that correspond to the individual virtual PCI devices?
If that were the case, then it might be a lot easier in the end.
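
Purely as a thought experiment (this is not from any existing patch), the driver-core side of such a bus would be almost no code at all; the real work would be in how the PF driver creates and names the devices it puts on it:

#include <linux/device.h>
#include <linux/module.h>

/* A bus on which the SR-IOV PF driver could register its VF "devices". */
static struct bus_type sriov_vf_bus = {
	.name = "sriov-vf",
};

static int __init sriov_vf_bus_init(void)
{
	return bus_register(&sriov_vf_bus);
}

static void __exit sriov_vf_bus_exit(void)
{
	bus_unregister(&sriov_vf_bus);
}

module_init(sriov_vf_bus_init);
module_exit(sriov_vf_bus_exit);
MODULE_LICENSE("GPL");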

thanks,

greg k-h

2008-11-06 19:58:38

by H L

[permalink] [raw]
Subject: Re: git repository for SR-IOV development?

--- On Thu, 11/6/08, Greg KH <[email protected]> wrote:

> On Thu, Nov 06, 2008 at 08:51:09AM -0800, H L wrote:
> >
> > Has anyone initiated or given consideration to the
> creation of a git
> > repository (say, on kernel.org) for SR-IOV
> development?
>
> Why? It's only a few patches, right? Why would it
> need a whole new git
> tree?


So as to minimize the time and effort patching a kernel, especially if the tree (and/or hash level) against which the patches were created fails to be specified on a mailing-list. Plus, there appear to be questions raised on how, precisely, the implementation should ultimately be modeled, and, given that, who knows at this point how many patches will ultimately be submitted? I know I've built the "7-patch" one (painfully, by the way), and I'm aware there's another 15-patch set out there which I've not yet examined.




2008-11-06 20:05:19

by Fischer, Anna

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

> Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
>
> On Thu, Nov 06, 2008 at 05:38:16PM +0000, Fischer, Anna wrote:
> > > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > > I have not modified any existing drivers, but instead I threw
> > > together
> > > > a bare-bones module enabling me to make a call to
> pci_iov_register()
> > > > and then poke at an SR-IOV adapter's /sys entries for which no
> driver
> > > > was loaded.
> > > >
> > > > It appears from my perusal thus far that drivers using these new
> > > > SR-IOV patches will require modification; i.e. the driver
> associated
> > > > with the Physical Function (PF) will be required to make the
> > > > pci_iov_register() call along with the requisite notify()
> function.
> > > > Essentially this suggests to me a model for the PF driver to
> perform
> > > > any "global actions" or setup on behalf of VFs before enabling
> them
> > > > after which VF drivers could be associated.
> > >
> > > Where would the VF drivers have to be associated? On the "pci_dev"
> > > level or on a higher one?
> >
> > A VF appears to the Linux OS as a standard (full, additional) PCI
> > device. The driver is associated in the same way as for a normal PCI
> > device. Ideally, you would use SR-IOV devices on a virtualized
> system,
> > for example, using Xen. A VF can then be assigned to a guest domain
> as
> > a full PCI device.
>
> It's that "second" part that I'm worried about. How is that going to
> happen? Do you have any patches that show this kind of "assignment"?

That depends on your setup. Using Xen, you could assign the VF to a guest domain like any other PCI device, e.g. using PCI pass-through. For VMware, KVM, there are standard ways to do that, too. I currently don't see why SR-IOV devices would need any specific, non-standard mechanism for device assignment.


> > > Will all drivers that want to bind to a "VF" device need to be
> > > rewritten?
> >
> > Currently, any vendor providing a SR-IOV device needs to provide a PF
> > driver and a VF driver that runs on their hardware.
>
> Are there any such drivers available yet?

I don't know.


> > A VF driver does not necessarily need to know much about SR-IOV but
> > just run on the presented PCI device. You might want to have a
> > communication channel between PF and VF driver though, for various
> > reasons, if such a channel is not provided in hardware.
>
> Agreed, but what does that channel look like in Linux?
>
> I have some ideas of what I think it should look like, but if people
> already have code, I'd love to see that as well.

At this point I would guess that this code is vendor specific, as are the drivers. The issue I see is that most likely drivers will run in different environments, for example, in Xen the PF driver runs in a driver domain while a VF driver runs in a guest VM. So a communication channel would need to be either Xen specific, or vendor specific. Also, a guest using the VF might run Windows while the PF might be controlled under Linux.

Anna

2008-11-06 21:37:31

by Fischer, Anna

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

> Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
>
> On Thu, Nov 06, 2008 at 10:05:39AM -0800, H L wrote:
> >
> > --- On Thu, 11/6/08, Greg KH <[email protected]> wrote:
> >
> > > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > > I have not modified any existing drivers, but instead
> > > I threw together
> > > > a bare-bones module enabling me to make a call to
> > > pci_iov_register()
> > > > and then poke at an SR-IOV adapter's /sys entries
> > > for which no driver
> > > > was loaded.
> > > >
> > > > It appears from my perusal thus far that drivers using
> > > these new
> > > > SR-IOV patches will require modification; i.e. the
> > > driver associated
> > > > with the Physical Function (PF) will be required to
> > > make the
> > > > pci_iov_register() call along with the requisite
> > > notify() function.
> > > > Essentially this suggests to me a model for the PF
> > > driver to perform
> > > > any "global actions" or setup on behalf of
> > > VFs before enabling them
> > > > after which VF drivers could be associated.
> > >
> > > Where would the VF drivers have to be associated? On the
> > > "pci_dev"
> > > level or on a higher one?
> >
> >
> > I have not yet fully grocked Yu Zhao's model to answer this. That
> > said, I would *hope* to find it on the "pci_dev" level.
>
> Me too.
>
> > > Will all drivers that want to bind to a "VF"
> > > device need to be
> > > rewritten?
> >
> > Not necessarily, or perhaps minimally; depends on hardware/firmware
> > and actions the driver wants to take. An example here might assist.
> > Let's just say someone has created, oh, I don't know, maybe an SR-IOV
> > NIC. Now, for 'general' I/O operations to pass network traffic back
> > and forth there would ideally be no difference in the actions and
> > therefore behavior of a PF driver and a VF driver. But, what do you
> > do in the instance a VF wants to change link-speed? As that physical
> > characteristic affects all VFs, how do you handle that? This is
> where
> > the hardware/firmware implementation part comes to play. If a VF
> > driver performs some actions to initiate the change in link speed,
> the
> > logic in the adapter could be anything like:
>
> <snip>
>
> Yes, I agree that all of this needs to be done, somehow.
>
> It's that "somehow" that I am interested in trying to see how it works
> out.
>
> > >
> > > > I have so far only seen Yu Zhao's
> > > "7-patch" set. I've not yet looked
> > > > at his subsequently tendered "15-patch" set
> > > so I don't know what has
> > > > changed. The hardware/firmware implementation for
> > > any given SR-IOV
> > > > compatible device, will determine the extent of
> > > differences required
> > > > between a PF driver and a VF driver.
> > >
> > > Yeah, that's what I'm worried/curious about.
> > > Without seeing the code
> > > for such a driver, how can we properly evaluate if this
> > > infrastructure
> > > is the correct one and the proper way to do all of this?
> >
> >
> > As the example above demonstrates, that's a tough question to answer.
> > Ideally, in my view, there would only be one driver written per SR-
> IOV
> > device and it would contain the logic to "do the right things" based
> > on whether its running as a PF or VF with that determination easily
> > accomplished by testing the existence of the SR-IOV extended
> > capability. Then, in an effort to minimize (if not eliminate) the
> > complexities of driver-to-driver actions for fielding "global
> events",
> > contain as much of the logic as is possible within the adapter.
> > Minimizing the efforts required for the device driver writers in my
> > opinion paves the way to greater adoption of this technology.
>
> Yes, making things easier is the key here.
>
> Perhaps some of this could be hidden with a new bus type for these
> kinds
> of devices? Or a "virtual" bus of pci devices that the original SR-IOV
> device creates that corrispond to the individual virtual PCI devices?
> If that were the case, then it might be a lot easier in the end.

I think a standard communication channel in Linux for SR-IOV devices
would be a good start, and help to adopt the technology. Something
like the virtual bus you are describing. It means that vendors do
not need to write their own communication channel in the drivers.
It would need to have well defined APIs though, as I guess that
devices will have very different capabilities and hardware
implementations for PFs and VFs, and so they might have very
different events and information to propagate.
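
Just to illustrate the kind of thing I mean, a generic PF<->VF event channel might expose something like the following; none of these names exist anywhere today, they are purely placeholders:

#include <linux/types.h>
#include <linux/pci.h>

/* Hypothetical, vendor-neutral PF<->VF channel operations. */
struct sriov_channel_ops {
	/* PF side: push an event (and optional payload) to one VF. */
	int (*send_event)(struct pci_dev *pf, int vf_index, u32 event,
			  const void *data, size_t len);
	/* VF side: retrieve the next pending event from the PF. */
	int (*recv_event)(struct pci_dev *vf, u32 *event,
			  void *data, size_t len);
};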

Anna

2008-11-06 22:24:23

by Simon Horman

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 09:53:08AM -0800, Greg KH wrote:
> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
> > On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
> > > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > > I have not modified any existing drivers, but instead I threw together
> > > > a bare-bones module enabling me to make a call to pci_iov_register()
> > > > and then poke at an SR-IOV adapter's /sys entries for which no driver
> > > > was loaded.
> > > >
> > > > It appears from my perusal thus far that drivers using these new
> > > > SR-IOV patches will require modification; i.e. the driver associated
> > > > with the Physical Function (PF) will be required to make the
> > > > pci_iov_register() call along with the requisite notify() function.
> > > > Essentially this suggests to me a model for the PF driver to perform
> > > > any "global actions" or setup on behalf of VFs before enabling them
> > > > after which VF drivers could be associated.
> > >
> > > Where would the VF drivers have to be associated? On the "pci_dev"
> > > level or on a higher one?
> > >
> > > Will all drivers that want to bind to a "VF" device need to be
> > > rewritten?
> >
> > The current model being implemented by my colleagues has separate
> > drivers for the PF (aka native) and VF devices. I don't personally
> > believe this is the correct path, but I'm reserving judgement until I
> > see some code.
>
> Hm, I would like to see that code before we can properly evaluate this
> interface. Especially as they are all tightly tied together.
>
> > I don't think we really know what the One True Usage model is for VF
> > devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
> > some ideas. I bet there's other people who have other ideas too.
>
> I'd love to hear those ideas.
>
> Rumor has it, there is some Xen code floating around to support this
> already, is that true?

Xen patches were posted to xen-devel by Yu Zhao on the 29th of September [1].
Unfortunately the only responses that I can find are a) that the patches
were mangled and b) they seem to include changes (by others) that have
been merged into Linux. I have confirmed that both of these concerns
are valid.

I have not yet examined the difference, if any, in the approach taken by Yu
to SR-IOV in Linux and Xen. Unfortunately comparison is less than trivial
due to the gaping gap in kernel versions between Linux-Xen (2.6.18.8) and
Linux itself.

One approach that I was considering in order to familiarise myself with the
code was to backport the v6 Linux patches (this thread) to Linux-Xen. I made a
start on that, but again due to kernel version differences it is non-trivial.

[1] http://lists.xensource.com/archives/html/xen-devel/2008-09/msg00923.html

--
Simon Horman
VA Linux Systems Japan K.K., Sydney, Australia Satellite Office
H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en

2008-11-06 22:39:00

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Matthew Wilcox wrote:
> [Anna, can you fix your word-wrapping please? Your lines appear to be
> infinitely long which is most unpleasant to reply to]
>
> On Thu, Nov 06, 2008 at 05:38:16PM +0000, Fischer, Anna wrote:
>
>>> Where would the VF drivers have to be associated? On the "pci_dev"
>>> level or on a higher one?
>>>
>> A VF appears to the Linux OS as a standard (full, additional) PCI
>> device. The driver is associated in the same way as for a normal PCI
>> device. Ideally, you would use SR-IOV devices on a virtualized system,
>> for example, using Xen. A VF can then be assigned to a guest domain as
>> a full PCI device.
>>
>
> It's not clear thats the right solution. If the VF devices are _only_
> going to be used by the guest, then arguably, we don't want to create
> pci_devs for them in the host. (I think it _is_ the right answer, but I
> want to make it clear there's multiple opinions on this).
>

The VFs shouldn't be limited to being used by the guest.

SR-IOV is actually an incredibly painful thing. You need to have a VF
driver in the guest, do hardware pass through, have a PV driver stub in
the guest that's hypervisor specific (a VF is not usable on its own),
have a device specific backend in the VMM, and if you want to do live
migration, have another PV driver in the guest that you can do teaming
with. Just a mess.

What we would rather do in KVM, is have the VFs appear in the host as
standard network devices. We would then like to back our existing PV
driver to this VF directly bypassing the host networking stack. A key
feature here is being able to fill the VF's receive queue with guest
memory instead of host kernel memory so that you can get zero-copy
receive traffic. This will perform just as well as doing passthrough
(at least) and avoid all that ugliness of dealing with SR-IOV in the guest.

This eliminates all of the mess of various drivers in the guest and all
the associated baggage of doing hardware passthrough.

So IMHO, having VFs be usable in the host is absolutely critical because
I think it's the only reasonable usage model.

Regards,

Anthony Liguori

2008-11-06 22:40:46

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
>
>> I don't think we really know what the One True Usage model is for VF
>> devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
>> some ideas. I bet there's other people who have other ideas too.
>>
>
> I'd love to hear those ideas.
>

We've been talking about avoiding hardware passthrough entirely and just
backing a virtio-net backend driver by a dedicated VF in the host. That
avoids a huge amount of guest-facing complexity, lets migration Just
Work, and should give the same level of performance.

Regards,

Anthony Liguori

> Rumor has it, there is some Xen code floating around to support this
> already, is that true?
>
> thanks,
>
> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2008-11-06 22:57:14

by Simon Horman

[permalink] [raw]
Subject: Re: git repository for SR-IOV development?

On Thu, Nov 06, 2008 at 11:58:25AM -0800, H L wrote:
> --- On Thu, 11/6/08, Greg KH <[email protected]> wrote:
>
> > On Thu, Nov 06, 2008 at 08:51:09AM -0800, H L wrote:
> > >
> > > Has anyone initiated or given consideration to the
> > creation of a git
> > > repository (say, on kernel.org) for SR-IOV
> > development?
> >
> > Why? It's only a few patches, right? Why would it
> > need a whole new git
> > tree?
>
>
> So as to minimize the time and effort patching a kernel, especially if
> the tree (and/or hash level) against which the patches were created fails
> to be specified on a mailing-list. Plus, there appears to be questions
> raised on how, precisely, the implementation should ultimately be modeled
> and especially given that, who knows at this point what number of patches
> will ultimately be submitted? I know I've built the "7-patch" one
> (painfully, by the way), and I'm aware there's another 15-patch set out
> there which I've not yet examined.

FWIW, the v6 patch series (this thread) applied to both 2.6.28-rc3
and the current Linus tree after a minor tweak to the first patch, as below.

--
Simon Horman
VA Linux Systems Japan K.K., Sydney, Australia Satellite Office
H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en

From: Yu Zhao <[email protected]>

[PATCH 1/16 v6] PCI: remove unnecessary arg of pci_update_resource()

This cleanup removes unnecessary argument 'struct resource *res' in
pci_update_resource(), so it takes same arguments as other companion
functions (pci_assign_resource(), etc.).

Cc: Alex Chiang <[email protected]>
Cc: Grant Grundler <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jesse Barnes <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Yu Zhao <[email protected]>
Upported-by: Simon Horman <[email protected]>

---
drivers/pci/pci.c | 4 ++--
drivers/pci/setup-res.c | 7 ++++---
include/linux/pci.h | 2 +-
3 files changed, 7 insertions(+), 6 deletions(-)

* Fri, 07 Nov 2008 09:05:18 +1100, Simon Horman
- Minor rediff of include/linux/pci.h section to apply to 2.6.28-rc3

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4db261e..ae62f01 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -376,8 +376,8 @@ pci_restore_bars(struct pci_dev *dev)
return;
}

- for (i = 0; i < numres; i ++)
- pci_update_resource(dev, &dev->resource[i], i);
+ for (i = 0; i < numres; i++)
+ pci_update_resource(dev, i);
}

static struct pci_platform_pm_ops *pci_platform_pm;
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index 2dbd96c..b7ca679 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -26,11 +26,12 @@
#include "pci.h"


-void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno)
+void pci_update_resource(struct pci_dev *dev, int resno)
{
struct pci_bus_region region;
u32 new, check, mask;
int reg;
+ struct resource *res = dev->resource + resno;

/*
* Ignore resources for unimplemented BARs and unused resource slots
@@ -162,7 +163,7 @@ int pci_assign_resource(struct pci_dev *dev, int resno)
} else {
res->flags &= ~IORESOURCE_STARTALIGN;
if (resno < PCI_BRIDGE_RESOURCES)
- pci_update_resource(dev, res, resno);
+ pci_update_resource(dev, resno);
}

return ret;
@@ -197,7 +198,7 @@ int pci_assign_resource_fixed(struct pci_dev *dev, int resno)
dev_err(&dev->dev, "BAR %d: can't allocate %s resource %pR\n",
resno, res->flags & IORESOURCE_IO ? "I/O" : "mem", res);
} else if (resno < PCI_BRIDGE_RESOURCES) {
- pci_update_resource(dev, res, resno);
+ pci_update_resource(dev, resno);
}

return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 085187b..43e1fc1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -626,7 +626,7 @@ int pcix_get_mmrbc(struct pci_dev *dev);
int pcie_set_readrq(struct pci_dev *dev, int rq);
int pci_reset_function(struct pci_dev *dev);
int pci_execute_reset_function(struct pci_dev *dev);
-void pci_update_resource(struct pci_dev *dev, struct resource *res, int resno);
+void pci_update_resource(struct pci_dev *dev, int resno);
int __must_check pci_assign_resource(struct pci_dev *dev, int i);
int pci_select_bars(struct pci_dev *dev, unsigned long flags);

--
1.5.6.4

2008-11-06 22:59:17

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 04:38:40PM -0600, Anthony Liguori wrote:
> >It's not clear thats the right solution. If the VF devices are _only_
> >going to be used by the guest, then arguably, we don't want to create
> >pci_devs for them in the host. (I think it _is_ the right answer, but I
> >want to make it clear there's multiple opinions on this).
>
> The VFs shouldn't be limited to being used by the guest.
>
> SR-IOV is actually an incredibly painful thing. You need to have a VF
> driver in the guest, do hardware pass through, have a PV driver stub in
> the guest that's hypervisor specific (a VF is not usable on it's own),
> have a device specific backend in the VMM, and if you want to do live
> migration, have another PV driver in the guest that you can do teaming
> with. Just a mess.

Not to mention that you basically have to statically allocate them up
front.

> What we would rather do in KVM, is have the VFs appear in the host as
> standard network devices. We would then like to back our existing PV
> driver to this VF directly bypassing the host networking stack. A key
> feature here is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get zero-copy
> receive traffic. This will perform just as well as doing passthrough
> (at least) and avoid all that ugliness of dealing with SR-IOV in the guest.

This argues for ignoring the SR-IOV mess completely. Just have the
host driver expose multiple 'ethN' devices.

> This eliminates all of the mess of various drivers in the guest and all
> the associated baggage of doing hardware passthrough.
>
> So IMHO, having VFs be usable in the host is absolutely critical because
> I think it's the only reasonable usage model.
>
> Regards,
>
> Anthony Liguori
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-11-06 23:57:00

by Chris Wright

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

* Greg KH ([email protected]) wrote:
> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
> > On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
> > > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > > I have not modified any existing drivers, but instead I threw together
> > > > a bare-bones module enabling me to make a call to pci_iov_register()
> > > > and then poke at an SR-IOV adapter's /sys entries for which no driver
> > > > was loaded.
> > > >
> > > > It appears from my perusal thus far that drivers using these new
> > > > SR-IOV patches will require modification; i.e. the driver associated
> > > > with the Physical Function (PF) will be required to make the
> > > > pci_iov_register() call along with the requisite notify() function.
> > > > Essentially this suggests to me a model for the PF driver to perform
> > > > any "global actions" or setup on behalf of VFs before enabling them
> > > > after which VF drivers could be associated.
> > >
> > > Where would the VF drivers have to be associated? On the "pci_dev"
> > > level or on a higher one?
> > >
> > > Will all drivers that want to bind to a "VF" device need to be
> > > rewritten?
> >
> > The current model being implemented by my colleagues has separate
> > drivers for the PF (aka native) and VF devices. I don't personally
> > believe this is the correct path, but I'm reserving judgement until I
> > see some code.
>
> Hm, I would like to see that code before we can properly evaluate this
> interface. Especially as they are all tightly tied together.
>
> > I don't think we really know what the One True Usage model is for VF
> > devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
> > some ideas. I bet there's other people who have other ideas too.
>
> I'd love to hear those ideas.

First there's the question of how to represent the VF on the host.
Ideally (IMO) this would show up as a normal interface so that normal tools
can configure the interface. This is not exactly how the first round of
patches were designed.

Second there's the question of reserving the BDF on the host such that
we don't have two drivers (one in the host and one in a guest) trying to
drive the same device (an issue that shows up for device assignment as
well as VF assignment).

Third there's the question of whether the VF can be used in the host at
all.

Fourth there's the question of whether the VF and PF drivers are the
same or separate.

The typical usecase is assigning the VF to the guest directly, so
there's only enough functionality in the host side to allocate a VF,
configure it, and assign it (and propagate AER). This is with separate
PF and VF driver.

As Anthony mentioned, we are interested in allowing the host to use the
VF. This could be useful for containers as well as dedicating a VF (a
set of device resources) to a guest w/out passing it through.

thanks,
-chris

2008-11-07 01:52:46

by Dong, Eddie

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support


> What we would rather do in KVM, is have the VFs appear in
> the host as standard network devices. We would then like
> to back our existing PV driver to this VF directly
> bypassing the host networking stack. A key feature here
> is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get
> zero-copy
> receive traffic. This will perform just as well as doing
> passthrough (at least) and avoid all that ugliness of
> dealing with SR-IOV in the guest.
>

Anthony:
This is already addressed by the VMDq solution (or so-called netchannel2), right? Qing He is debugging the KVM-side patch and is pretty close to the end.

For this single purpose, we don't need SR-IOV. BTW, at least the Intel SR-IOV NIC also supports VMDq, so you can achieve this by simply using a "native" VMDq-enabled driver here, plus the work we are debugging now.

Thx, eddie

2008-11-07 02:00:52

by Greg KH

[permalink] [raw]
Subject: Re: git repository for SR-IOV development?

On Thu, Nov 06, 2008 at 11:58:25AM -0800, H L wrote:
> --- On Thu, 11/6/08, Greg KH <[email protected]> wrote:
>
> > On Thu, Nov 06, 2008 at 08:51:09AM -0800, H L wrote:
> > >
> > > Has anyone initiated or given consideration to the
> > creation of a git
> > > repository (say, on kernel.org) for SR-IOV
> > development?
> >
> > Why? It's only a few patches, right? Why would it
> > need a whole new git
> > tree?
>
>
> So as to minimize the time and effort patching a kernel, especially if
> the tree (and/or hash level) against which the patches were created
> fails to be specified on a mailing-list. Plus, there appears to be
> questions raised on how, precisely, the implementation should
> ultimately be modeled and especially given that, who knows at this
> point what number of patches will ultimately be submitted? I know
> I've built the "7-patch" one (painfully, by the way), and I'm aware
> there's another 15-patch set out there which I've not yet examined.

It's a mere 7 or 15 patches, you don't need a whole git tree for
something small like that.

Especially as there only seems to be one developer doing real work...

thanks,

greg k-h

2008-11-07 02:08:50

by Nakajima, Jun

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On 11/6/2008 2:38:40 PM, Anthony Liguori wrote:
> Matthew Wilcox wrote:
> > [Anna, can you fix your word-wrapping please? Your lines appear to
> > be infinitely long which is most unpleasant to reply to]
> >
> > On Thu, Nov 06, 2008 at 05:38:16PM +0000, Fischer, Anna wrote:
> >
> > > > Where would the VF drivers have to be associated? On the "pci_dev"
> > > > level or on a higher one?
> > > >
> > > A VF appears to the Linux OS as a standard (full, additional) PCI
> > > device. The driver is associated in the same way as for a normal
> > > PCI device. Ideally, you would use SR-IOV devices on a virtualized
> > > system, for example, using Xen. A VF can then be assigned to a
> > > guest domain as a full PCI device.
> > >
> >
> > It's not clear thats the right solution. If the VF devices are
> > _only_ going to be used by the guest, then arguably, we don't want
> > to create pci_devs for them in the host. (I think it _is_ the right
> > answer, but I want to make it clear there's multiple opinions on this).
> >
>
> The VFs shouldn't be limited to being used by the guest.
>
> SR-IOV is actually an incredibly painful thing. You need to have a VF
> driver in the guest, do hardware pass through, have a PV driver stub
> in the guest that's hypervisor specific (a VF is not usable on it's
> own), have a device specific backend in the VMM, and if you want to do
> live migration, have another PV driver in the guest that you can do
> teaming with. Just a mess.

Actually "a PV driver stub in the guest" _was_ correct; I admit that I stated so at a virt mini summit more than a half year ago ;-). But the things have changed, and such a stub is no longer required (at least in our implementation). The major benefit of VF drivers now is that they are VMM-agnostic.

>
> What we would rather do in KVM, is have the VFs appear in the host as
> standard network devices. We would then like to back our existing PV
> driver to this VF directly bypassing the host networking stack. A key
> feature here is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get zero-copy
> receive traffic. This will perform just as well as doing passthrough
> (at
> least) and avoid all that ugliness of dealing with SR-IOV in the guest.
>
> This eliminates all of the mess of various drivers in the guest and
> all the associated baggage of doing hardware passthrough.
>
> So IMHO, having VFs be usable in the host is absolutely critical
> because I think it's the only reasonable usage model.

As Eddie said, VMDq is better for this model, and the feature is already available today. It is much simpler because it was designed for such purposes. It does not require hardware pass-through (e.g. VT-d) or VFs as a PCI device, either.

>
> Regards,
>
> Anthony Liguori
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in the
> body of a message to [email protected] More majordomo info at
> http://vger.kernel.org/majordomo-info.html
Jun Nakajima | Intel Open Source Technology Center

2008-11-07 02:38:24

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Greg KH wrote:
> On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
>> Documentation/kernel-parameters.txt | 10 ++++++++++
>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>> index 53ba7c7..5482ae0 100644
>> --- a/Documentation/kernel-parameters.txt
>> +++ b/Documentation/kernel-parameters.txt
>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is defined in the file
>> cbmemsize=nn[KMG] The fixed amount of bus space which is
>> reserved for the CardBus bridge's memory
>> window. The default value is 64 megabytes.
>> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
>> + devices under bus [dddd:]bb (dddd is the domain
>> + number and bb is the bus number).
>> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
>> + devices under bus [dddd:]bb (dddd is the domain
>> + number and bb is the bus number).
>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
>> + device to minimum PAGE_SIZE alignment (dddd is
>> + the domain number and bb, dd and f is the bus,
>> + device and function number).
>
> This seems like a big problem. How are we going to know to add these
> command line options for devices we haven't even seen/known about yet?
>
> How do we know the bus ids aren't going to change between boots (hint,
> they are, pci bus ids change all the time...)
>
> We need to be able to do this kind of thing dynamically, not fixed at
> boot time, which seems way to early to even know about this, right?
>
> thanks,
>
> greg k-h

Yes, I totally agree. Doing things dynamically is better.

The purpose of these parameters is to rebalance and align resources for
devices that have BARs encapsulated in various new capabilities (SR-IOV,
etc.), because most existing BIOSes don't take care of those BARs.

If we do resource rebalancing after the system is up, do you think there
is any side effect or impact on subsystems other than PCI (e.g. MTRR)?

I haven't given much thought to dynamic resource rebalancing. If you
have any ideas about this, could you please share them?

Regards,
Yu

2008-11-07 02:52:47

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
>>> Documentation/kernel-parameters.txt | 10 ++++++++++
>>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/Documentation/kernel-parameters.txt
>>> b/Documentation/kernel-parameters.txt
>>> index 53ba7c7..5482ae0 100644
>>> --- a/Documentation/kernel-parameters.txt
>>> +++ b/Documentation/kernel-parameters.txt
>>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is
>>> defined in the file
>>> cbmemsize=nn[KMG] The fixed amount of bus space which is
>>> reserved for the CardBus bridge's memory
>>> window. The default value is 64 megabytes.
>>> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
>>> + devices under bus [dddd:]bb (dddd is the domain
>>> + number and bb is the bus number).
>>> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
>>> + devices under bus [dddd:]bb (dddd is the domain
>>> + number and bb is the bus number).
>>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
>>> + device to minimum PAGE_SIZE alignment (dddd is
>>> + the domain number and bb, dd and f is the bus,
>>> + device and function number).
>> This seems like a big problem. How are we going to know to add these
>> command line options for devices we haven't even seen/known about yet?
>> How do we know the bus ids aren't going to change between boots (hint,
>> they are, pci bus ids change all the time...)
>> We need to be able to do this kind of thing dynamically, not fixed at
>> boot time, which seems way to early to even know about this, right?
>> thanks,
>> greg k-h
>
> Yes, I totally agree. Doing things dynamically is better.
>
> The purpose of these parameters is to rebalance and align resources for
> device that has BARs encapsulated in various new capabilities (SR-IOV,
> etc.), because most of existing BIOSes don't take care of those BARs.

But how are you going to know what the proper device ids are going to
be before the machine boots? I don't see how these options are ever
going to work properly for a "real" user.

> If we do resource rebalance after system is up, do you think there is any
> side effect or impact to other subsystem other than PCI (e.g. MTRR)?

I don't think so.

> I haven't had much thinking on the dynamical resource rebalance. If you
> have any idea about this, can you please suggest?

Yeah, it's going to be hard :)

We've thought about this in the past, and even Microsoft said it was
going to happen for Vista, but they realized in the end, like we did a
few years previously, that it would require full support of all PCI
drivers as well (if you rebalance stuff that is already bound to a
driver.) So they dropped it.

When would you want to do this kind of rebalancing? Before any PCI
driver is bound to any devices? Or afterwards?

thanks,

greg k-h

2008-11-07 03:01:48

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

Greg KH wrote:
> On Wed, Nov 05, 2008 at 08:33:18PM -0800, Greg KH wrote:
>> On Wed, Oct 22, 2008 at 04:45:15PM +0800, Yu Zhao wrote:
>>> Documentation/ABI/testing/sysfs-bus-pci | 33 +++++++++++++++++++++++++++++++
>>> 1 files changed, 33 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>>> index ceddcff..41cce8f 100644
>>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>>> @@ -9,3 +9,36 @@ Description:
>>> that some devices may have malformatted data. If the
>>> underlying VPD has a writable section then the
>>> corresponding section of this file will be writable.
>>> +
>>> +What: /sys/bus/pci/devices/.../iov/enable
>> Are you sure this is still the correct location with your change to
>> struct device?
>
> Nevermind, this is correct.
>
> But the bigger problem is that userspace doesn't know when these
> attributes show up. So tools like udev and HAL and others can't look
> for them as they never get notified, and they don't even know if they
> should be looking for them or not.
>
> Is there any way to tie these attributes to the "main" pci device so
> that they get created before the device is announced to the world?
> Doing that would solve this issue.
>
> thanks,
>
> greg k-h

Currently the PCI subsystem has /sys/.../{vendor,device,...} bundled with the
main PCI device (I suppose this means the entries are created by
'device_add').

And after the PCI device is announced,
/sys/.../{config,resourceX,rom,vpd,iov,...} are created depending on whether
these features are supported.

Making the dynamic entries tied to the main PCI device would require the PCI
subsystem to allocate a different 'bus_type' for the devices, right?

Regards,
Yu

2008-11-07 03:23:20

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Fri, Nov 07, 2008 at 11:01:29AM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Wed, Nov 05, 2008 at 08:33:18PM -0800, Greg KH wrote:
>>> On Wed, Oct 22, 2008 at 04:45:15PM +0800, Yu Zhao wrote:
>>>> Documentation/ABI/testing/sysfs-bus-pci | 33
>>>> +++++++++++++++++++++++++++++++
>>>> 1 files changed, 33 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci
>>>> b/Documentation/ABI/testing/sysfs-bus-pci
>>>> index ceddcff..41cce8f 100644
>>>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>>>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>>>> @@ -9,3 +9,36 @@ Description:
>>>> that some devices may have malformatted data. If the
>>>> underlying VPD has a writable section then the
>>>> corresponding section of this file will be writable.
>>>> +
>>>> +What: /sys/bus/pci/devices/.../iov/enable
>>> Are you sure this is still the correct location with your change to
>>> struct device?
>> Nevermind, this is correct.
>> But the bigger problem is that userspace doesn't know when these
>> attributes show up. So tools like udev and HAL and others can't look
>> for them as they never get notified, and they don't even know if they
>> should be looking for them or not.
>> Is there any way to tie these attributes to the "main" pci device so
>> that they get created before the device is announced to the world?
>> Doing that would solve this issue.
>> thanks,
>> greg k-h
>
> Currently PCI subsystem has /sys/.../{vendor,device,...} bundled to the
> main PCI device (I suppose this means the entries are created by
> 'device_add')
>
> And after the PCI device is announced,
> /sys/.../{config,resourceX,rom,vpd,iov,...} get created depending on if
> these features are supported.

And that's a bug. Let's not continue to make the same bug here as well.

> Making dynamic entries tie to the main PCI device would require PCI
> subsystem to allocate different 'bus_type' for the devices, right?

No, it would just mean they need to be all added before the device is
fully registered with the driver core.

thanks,

greg k-h
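
For illustration only, here is a minimal sketch of the mechanism Greg
describes: hang the attributes off an attribute group that is wired into
dev->groups before device_add() runs, so they already exist when the uevent is
emitted and udev/HAL can see them. The attribute, group and array names are
made up; this is not the code from the patch set, and a real "enable" file
would also have a store method.

#include <linux/kernel.h>
#include <linux/device.h>
#include <linux/sysfs.h>

static ssize_t iov_enable_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
{
        /* placeholder: a real implementation would report the IOV state */
        return sprintf(buf, "0\n");
}
static DEVICE_ATTR(enable, 0444, iov_enable_show, NULL);

static struct attribute *iov_attrs[] = {
        &dev_attr_enable.attr,
        NULL,
};

static struct attribute_group iov_attr_group = {
        .name  = "iov",                 /* creates the iov/ subdirectory */
        .attrs = iov_attrs,
};

static struct attribute_group *iov_groups[] = {
        &iov_attr_group,
        NULL,
};

/*
 * Somewhere in the PCI device setup path, before device_add() is called,
 * i.e. before the device is announced to userspace:
 *
 *      pdev->dev.groups = iov_groups;
 */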

2008-11-07 03:40:37

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Greg KH wrote:
> On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:
>> Greg KH wrote:
>>> On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
>>>> Documentation/kernel-parameters.txt | 10 ++++++++++
>>>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/Documentation/kernel-parameters.txt
>>>> b/Documentation/kernel-parameters.txt
>>>> index 53ba7c7..5482ae0 100644
>>>> --- a/Documentation/kernel-parameters.txt
>>>> +++ b/Documentation/kernel-parameters.txt
>>>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is
>>>> defined in the file
>>>> cbmemsize=nn[KMG] The fixed amount of bus space which is
>>>> reserved for the CardBus bridge's memory
>>>> window. The default value is 64 megabytes.
>>>> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>> + number and bb is the bus number).
>>>> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>> + number and bb is the bus number).
>>>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
>>>> + device to minimum PAGE_SIZE alignment (dddd is
>>>> + the domain number and bb, dd and f is the bus,
>>>> + device and function number).
>>> This seems like a big problem. How are we going to know to add these
>>> command line options for devices we haven't even seen/known about yet?
>>> How do we know the bus ids aren't going to change between boots (hint,
>>> they are, pci bus ids change all the time...)
>>> We need to be able to do this kind of thing dynamically, not fixed at
>>> boot time, which seems way to early to even know about this, right?
>>> thanks,
>>> greg k-h
>> Yes, I totally agree. Doing things dynamically is better.
>>
>> The purpose of these parameters is to rebalance and align resources for
>> device that has BARs encapsulated in various new capabilities (SR-IOV,
>> etc.), because most of existing BIOSes don't take care of those BARs.
>
> But how are you going to know what the proper device ids are going to
> be before the machine boots? I don't see how these options are ever
> going to work properly for a "real" user.
>
>> If we do resource rebalance after system is up, do you think there is any
>> side effect or impact to other subsystem other than PCI (e.g. MTRR)?
>
> I don't think so.
>
>> I haven't had much thinking on the dynamical resource rebalance. If you
>> have any idea about this, can you please suggest?
>
> Yeah, it's going to be hard :)
>
> We've thought about this in the past, and even Microsoft said it was
> going to happen for Vista, but they realized in the end, like we did a
> few years previously, that it would require full support of all PCI
> drivers as well (if you rebalance stuff that is already bound to a
> driver.) So they dropped it.
>
> When would you want to do this kind of rebalancing? Before any PCI
> driver is bound to any devices? Or afterwards?

I guess if we want the rebalancing to be dynamic, then we should have it in
full -- the rebalancing would be functional even after the driver is loaded.

But in most cases there will be problems when we unload the driver from a
hard disk controller, etc. We can mount root on a ramdisk and do the
rebalancing there, but that's complicated for a real user.

So it looks like doing the rebalancing before any driver is bound to any
device is also a nice idea, if the user can get a shell to do the rebalancing
before a built-in PCI driver grabs the device.

Regards,
Yu

2008-11-07 04:17:51

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:
> Greg KH wrote:
> >We've thought about this in the past, and even Microsoft said it was
> >going to happen for Vista, but they realized in the end, like we did a
> >few years previously, that it would require full support of all PCI
> >drivers as well (if you rebalance stuff that is already bound to a
> >driver.) So they dropped it.
> >
> >When would you want to do this kind of rebalancing? Before any PCI
> >driver is bound to any devices? Or afterwards?
>
> I guess if we want the rebalance dynamic, then we should have it full --
> the rebalance would be functional even after the driver is loaded.
>
> But in most cases, there will be problem when we unload driver from a
> hard disk controller, etc. We can mount root on a ramdisk and do the
> rebalance there, but it's complicated for a real user.
>
> So looks like doing rebalancing before any driver is bound to any device
> is also a nice idea, if user can get a shell to do rebalance before
> built-in PCI driver grabs device.

Can we use the suspend/resume code to do this? Some drivers (sym2 for
one) would definitely need to rerun some of their init code to cope with
a BAR address changing.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
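
A rough sketch of the kind of reinitialization Matthew mentions, using the
ordinary suspend/resume hooks of a pci_driver: the BAR 0 mapping is dropped on
suspend and re-created on resume, so a BAR that was moved by a rebalance would
be picked up. The 'foo' names are made up, and the actual quiesce and re-init
steps are device specific.

#include <linux/pci.h>
#include <linux/io.h>

struct foo_priv {
        void __iomem *regs;             /* mapping of BAR 0 */
};

static int foo_suspend(struct pci_dev *pdev, pm_message_t state)
{
        struct foo_priv *priv = pci_get_drvdata(pdev);

        /* quiesce the hardware here, then drop the old BAR mapping */
        pci_iounmap(pdev, priv->regs);
        pci_disable_device(pdev);
        return 0;
}

static int foo_resume(struct pci_dev *pdev)
{
        struct foo_priv *priv = pci_get_drvdata(pdev);
        int ret;

        ret = pci_enable_device(pdev);
        if (ret)
                return ret;

        /* BAR 0 may have been relocated; map whatever address it has now */
        priv->regs = pci_iomap(pdev, 0, 0);
        if (!priv->regs)
                return -ENOMEM;

        /* rerun the device-specific init against the new mapping here */
        return 0;
}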

2008-11-07 05:19:15

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> On Wed, Oct 22, 2008 at 04:38:09PM +0800, Yu Zhao wrote:
>> Greetings,
>>
>> Following patches are intended to support SR-IOV capability in the
>> Linux kernel. With these patches, people can turn a PCI device with
>> the capability into multiple ones from software perspective, which
>> will benefit KVM and achieve other purposes such as QoS, security,
>> and etc.
>
> Is there any actual users of this API around yet? How was it tested as
> there is no hardware to test on? Which drivers are going to have to be
> rewritten to take advantage of this new interface?

Yes, the API is used by Intel, HP, NextIO and some other anonymous
companies, as they raise questions and send me feedback. I haven't seen
their work, but I guess some drivers using the SR-IOV API are going to be
released soon.

My test was done with Intel 82576 Gigabit Ethernet Controller. The
product brief is at
http://download.intel.com/design/network/ProdBrf/320025.pdf and the spec
is available at
http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

Regards,
Yu

2008-11-07 06:03:44

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> On Thu, Nov 06, 2008 at 10:05:39AM -0800, H L wrote:
>> --- On Thu, 11/6/08, Greg KH <[email protected]> wrote:
>>
>>> On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
>>>> I have not modified any existing drivers, but instead
>>> I threw together
>>>> a bare-bones module enabling me to make a call to
>>> pci_iov_register()
>>>> and then poke at an SR-IOV adapter's /sys entries
>>> for which no driver
>>>> was loaded.
>>>>
>>>> It appears from my perusal thus far that drivers using
>>> these new
>>>> SR-IOV patches will require modification; i.e. the
>>> driver associated
>>>> with the Physical Function (PF) will be required to
>>> make the
>>>> pci_iov_register() call along with the requisite
>>> notify() function.
>>>> Essentially this suggests to me a model for the PF
>>> driver to perform
>>>> any "global actions" or setup on behalf of
>>> VFs before enabling them
>>>> after which VF drivers could be associated.
>>> Where would the VF drivers have to be associated? On the
>>> "pci_dev"
>>> level or on a higher one?
>>
>> I have not yet fully grocked Yu Zhao's model to answer this. That
>> said, I would *hope* to find it on the "pci_dev" level.
>
> Me too.

A VF is a kind of lightweight PCI device, and it's represented by a "struct
pci_dev". The VF driver binds to that "pci_dev" and works in the same way as
other drivers.
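
To illustrate that point, roughly all a VF driver needs at the PCI level is an
ordinary pci_driver whose ID table matches the VF's vendor/device ID. The IDs
and names below are made up for the sketch, and the device-specific setup is
elided.

#include <linux/module.h>
#include <linux/pci.h>

#define FOO_VENDOR_ID    0x1234         /* made-up example IDs */
#define FOO_VF_DEVICE_ID 0x5678

static struct pci_device_id foo_vf_ids[] = {
        { PCI_DEVICE(FOO_VENDOR_ID, FOO_VF_DEVICE_ID) },
        { 0, }
};
MODULE_DEVICE_TABLE(pci, foo_vf_ids);

static int foo_vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        int ret;

        ret = pci_enable_device(pdev);
        if (ret)
                return ret;

        /* map BARs, set up DMA, register a netdev, etc. -- nothing here
         * is SR-IOV specific; the VF is an ordinary pci_dev */
        return 0;
}

static void foo_vf_remove(struct pci_dev *pdev)
{
        pci_disable_device(pdev);
}

static struct pci_driver foo_vf_driver = {
        .name     = "foo_vf",
        .id_table = foo_vf_ids,
        .probe    = foo_vf_probe,
        .remove   = foo_vf_remove,
};

static int __init foo_vf_init(void)
{
        return pci_register_driver(&foo_vf_driver);
}
module_init(foo_vf_init);

static void __exit foo_vf_exit(void)
{
        pci_unregister_driver(&foo_vf_driver);
}
module_exit(foo_vf_exit);

MODULE_LICENSE("GPL");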

>
>>> Will all drivers that want to bind to a "VF"
>>> device need to be
>>> rewritten?
>> Not necessarily, or perhaps minimally; depends on hardware/firmware
>> and actions the driver wants to take. An example here might assist.
>> Let's just say someone has created, oh, I don't know, maybe an SR-IOV
>> NIC. Now, for 'general' I/O operations to pass network traffic back
>> and forth there would ideally be no difference in the actions and
>> therefore behavior of a PF driver and a VF driver. But, what do you
>> do in the instance a VF wants to change link-speed? As that physical
>> characteristic affects all VFs, how do you handle that? This is where
>> the hardware/firmware implementation part comes to play. If a VF
>> driver performs some actions to initiate the change in link speed, the
>> logic in the adapter could be anything like:
>
> <snip>
>
> Yes, I agree that all of this needs to be done, somehow.
>
> It's that "somehow" that I am interested in trying to see how it works
> out.

This is the device-specific part. The VF driver is free to do what it wants
with device-specific registers and resources, and it doesn't concern us as
long as it behaves as a PCI device driver.

>
>>>> I have so far only seen Yu Zhao's
>>> "7-patch" set. I've not yet looked
>>>> at his subsequently tendered "15-patch" set
>>> so I don't know what has
>>>> changed. The hardware/firmware implementation for
>>> any given SR-IOV
>>>> compatible device, will determine the extent of
>>> differences required
>>>> between a PF driver and a VF driver.
>>> Yeah, that's what I'm worried/curious about.
>>> Without seeing the code
>>> for such a driver, how can we properly evaluate if this
>>> infrastructure
>>> is the correct one and the proper way to do all of this?
>>
>> As the example above demonstrates, that's a tough question to answer.
>> Ideally, in my view, there would only be one driver written per SR-IOV
>> device and it would contain the logic to "do the right things" based
>> on whether its running as a PF or VF with that determination easily
>> accomplished by testing the existence of the SR-IOV extended
>> capability. Then, in an effort to minimize (if not eliminate) the
>> complexities of driver-to-driver actions for fielding "global events",
>> contain as much of the logic as is possible within the adapter.
>> Minimizing the efforts required for the device driver writers in my
>> opinion paves the way to greater adoption of this technology.
>
> Yes, making things easier is the key here.
>
> Perhaps some of this could be hidden with a new bus type for these kinds
> of devices? Or a "virtual" bus of pci devices that the original SR-IOV
> device creates that corrispond to the individual virtual PCI devices?
> If that were the case, then it might be a lot easier in the end.

The PCI SIG only defines SR-IOV at the PCI level; we can't predict what
hardware vendors will implement at the device-specific logic level.

An example of an SR-IOV NIC: the PF may not have network functionality at
all and only controls the VFs, because people only want to use the VFs in
virtual machines and don't need network functionality in the environment
(e.g. the hypervisor) where the PF resides.

Thanks,
Yu
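
On the "one driver that tests for the SR-IOV extended capability" idea quoted
above, the check itself is straightforward with the standard extended
capability walker; 0x10 is the SR-IOV capability ID from the spec (the kernel
constant for it is only introduced by the patch set under discussion, so a
local define is used in this sketch). Only the PF carries this capability, so
it distinguishes a PF from a VF.

#include <linux/pci.h>

#define FOO_EXT_CAP_ID_SRIOV 0x10       /* SR-IOV extended capability ID */

static int foo_is_pf(struct pci_dev *pdev)
{
        /* returns non-zero (the capability offset) only for a PF */
        return pci_find_ext_capability(pdev, FOO_EXT_CAP_ID_SRIOV) != 0;
}

A combined driver's probe() could branch on this and do the PF-only setup
(e.g. the pci_iov_register() call from the proposed API) only when it returns
true.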

2008-11-07 06:28:56

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Fri, Nov 07, 2008 at 01:18:52PM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Wed, Oct 22, 2008 at 04:38:09PM +0800, Yu Zhao wrote:
>>> Greetings,
>>>
>>> Following patches are intended to support SR-IOV capability in the
>>> Linux kernel. With these patches, people can turn a PCI device with
>>> the capability into multiple ones from software perspective, which
>>> will benefit KVM and achieve other purposes such as QoS, security,
>>> and etc.
>> Is there any actual users of this API around yet? How was it tested as
>> there is no hardware to test on? Which drivers are going to have to be
>> rewritten to take advantage of this new interface?
>
> Yes, the API is used by Intel, HP, NextIO and some other anonymous
> companies as they rise questions and send me feedback. I haven't seen their
> works but I guess some of drivers using SR-IOV API are going to be released
> soon.

Well, we can't merge infrastructure without seeing the users of that
infrastructure, right?

> My test was done with Intel 82576 Gigabit Ethernet Controller. The product
> brief is at http://download.intel.com/design/network/ProdBrf/320025.pdf and
> the spec is available at
> http://download.intel.com/design/network/datashts/82576_Datasheet_v2p1.pdf

Cool, do you have that driver we can see?

How does it interact and handle the kvm and xen issues that have been
posted?

thanks,

greg k-h

2008-11-07 06:29:18

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 03:54:06PM -0800, Chris Wright wrote:
> * Greg KH ([email protected]) wrote:
> > On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
> > > On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
> > > > On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> > > > > I have not modified any existing drivers, but instead I threw together
> > > > > a bare-bones module enabling me to make a call to pci_iov_register()
> > > > > and then poke at an SR-IOV adapter's /sys entries for which no driver
> > > > > was loaded.
> > > > >
> > > > > It appears from my perusal thus far that drivers using these new
> > > > > SR-IOV patches will require modification; i.e. the driver associated
> > > > > with the Physical Function (PF) will be required to make the
> > > > > pci_iov_register() call along with the requisite notify() function.
> > > > > Essentially this suggests to me a model for the PF driver to perform
> > > > > any "global actions" or setup on behalf of VFs before enabling them
> > > > > after which VF drivers could be associated.
> > > >
> > > > Where would the VF drivers have to be associated? On the "pci_dev"
> > > > level or on a higher one?
> > > >
> > > > Will all drivers that want to bind to a "VF" device need to be
> > > > rewritten?
> > >
> > > The current model being implemented by my colleagues has separate
> > > drivers for the PF (aka native) and VF devices. I don't personally
> > > believe this is the correct path, but I'm reserving judgement until I
> > > see some code.
> >
> > Hm, I would like to see that code before we can properly evaluate this
> > interface. Especially as they are all tightly tied together.
> >
> > > I don't think we really know what the One True Usage model is for VF
> > > devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
> > > some ideas. I bet there's other people who have other ideas too.
> >
> > I'd love to hear those ideas.
>
> First there's the question of how to represent the VF on the host.
> Ideally (IMO) this would show up as a normal interface so that normal tools
> can configure the interface. This is not exactly how the first round of
> patches were designed.
>
> Second there's the question of reserving the BDF on the host such that
> we don't have two drivers (one in the host and one in a guest) trying to
> drive the same device (an issue that shows up for device assignment as
> well as VF assignment).
>
> Third there's the question of whether the VF can be used in the host at
> all.
>
> Fourth there's the question of whether the VF and PF drivers are the
> same or separate.
>
> The typical usecase is assigning the VF to the guest directly, so
> there's only enough functionality in the host side to allocate a VF,
> configure it, and assign it (and propagate AER). This is with separate
> PF and VF driver.
>
> As Anthony mentioned, we are interested in allowing the host to use the
> VF. This could be useful for containers as well as dedicating a VF (a
> set of device resources) to a guest w/out passing it through.

All of this looks great. So, with all of these questions, how does the
current code pertain to these issues? It seems like we have a long way
to go...

thanks,

greg k-h

2008-11-07 06:29:51

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:
>>> Greg KH wrote:
>>>> On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
>>>>> Documentation/kernel-parameters.txt | 10 ++++++++++
>>>>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/kernel-parameters.txt
>>>>> b/Documentation/kernel-parameters.txt
>>>>> index 53ba7c7..5482ae0 100644
>>>>> --- a/Documentation/kernel-parameters.txt
>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is
>>>>> defined in the file
>>>>> cbmemsize=nn[KMG] The fixed amount of bus space which is
>>>>> reserved for the CardBus bridge's memory
>>>>> window. The default value is 64 megabytes.
>>>>> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
>>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>>> + number and bb is the bus number).
>>>>> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
>>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>>> + number and bb is the bus number).
>>>>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
>>>>> + device to minimum PAGE_SIZE alignment (dddd is
>>>>> + the domain number and bb, dd and f is the bus,
>>>>> + device and function number).
>>>> This seems like a big problem. How are we going to know to add these
>>>> command line options for devices we haven't even seen/known about yet?
>>>> How do we know the bus ids aren't going to change between boots (hint,
>>>> they are, pci bus ids change all the time...)
>>>> We need to be able to do this kind of thing dynamically, not fixed at
>>>> boot time, which seems way to early to even know about this, right?
>>>> thanks,
>>>> greg k-h
>>> Yes, I totally agree. Doing things dynamically is better.
>>>
>>> The purpose of these parameters is to rebalance and align resources for
>>> device that has BARs encapsulated in various new capabilities (SR-IOV,
>>> etc.), because most of existing BIOSes don't take care of those BARs.
>> But how are you going to know what the proper device ids are going to
>> be before the machine boots? I don't see how these options are ever
>> going to work properly for a "real" user.
>>> If we do resource rebalance after system is up, do you think there is any
>>> side effect or impact to other subsystem other than PCI (e.g. MTRR)?
>> I don't think so.
>>> I haven't had much thinking on the dynamical resource rebalance. If you
>>> have any idea about this, can you please suggest?
>> Yeah, it's going to be hard :)
>> We've thought about this in the past, and even Microsoft said it was
>> going to happen for Vista, but they realized in the end, like we did a
>> few years previously, that it would require full support of all PCI
>> drivers as well (if you rebalance stuff that is already bound to a
>> driver.) So they dropped it.
>> When would you want to do this kind of rebalancing? Before any PCI
>> driver is bound to any devices? Or afterwards?
>
> I guess if we want the rebalance dynamic, then we should have it full --
> the rebalance would be functional even after the driver is loaded.
>
> But in most cases, there will be problem when we unload driver from a hard
> disk controller, etc. We can mount root on a ramdisk and do the rebalance
> there, but it's complicated for a real user.
>
> So looks like doing rebalancing before any driver is bound to any device is
> also a nice idea, if user can get a shell to do rebalance before built-in
> PCI driver grabs device.

That's not going to work, it needs to happen before any PCI device is
bound, which is before init runs.

thanks,

greg k-h

2008-11-07 06:29:36

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 04:40:21PM -0600, Anthony Liguori wrote:
> Greg KH wrote:
>> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
>>
>>> I don't think we really know what the One True Usage model is for VF
>>> devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
>>> some ideas. I bet there's other people who have other ideas too.
>>>
>>
>> I'd love to hear those ideas.
>>
>
> We've been talking about avoiding hardware passthrough entirely and
> just backing a virtio-net backend driver by a dedicated VF in the
> host. That avoids a huge amount of guest-facing complexity, let's
> migration Just Work, and should give the same level of performance.

Does that involve this patch set? Or a different type of interface.

thanks,

greg k-h

2008-11-07 06:30:16

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 09:35:57PM +0000, Fischer, Anna wrote:
> > Perhaps some of this could be hidden with a new bus type for these
> > kinds
> > of devices? Or a "virtual" bus of pci devices that the original SR-IOV
> > device creates that corrispond to the individual virtual PCI devices?
> > If that were the case, then it might be a lot easier in the end.
>
> I think a standard communication channel in Linux for SR-IOV devices
> would be a good start, and help to adopt the technology. Something
> like the virtual bus you are describing. It means that vendors do
> not need to write their own communication channel in the drivers.
> It would need to have well defined APIs though, as I guess that
> devices will have very different capabilities and hardware
> implementations for PFs and VFs, and so they might have very
> different events and information to propagate.

That would be good to standardize on. Have patches?

thanks,

greg k-h

2008-11-07 06:30:40

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 03:58:54PM -0700, Matthew Wilcox wrote:
> > What we would rather do in KVM, is have the VFs appear in the host as
> > standard network devices. We would then like to back our existing PV
> > driver to this VF directly bypassing the host networking stack. A key
> > feature here is being able to fill the VF's receive queue with guest
> > memory instead of host kernel memory so that you can get zero-copy
> > receive traffic. This will perform just as well as doing passthrough
> > (at least) and avoid all that ugliness of dealing with SR-IOV in the guest.
>
> This argues for ignoring the SR-IOV mess completely. Just have the
> host driver expose multiple 'ethN' devices.

That would work, but do we want to do that for every different type of
driver?

thanks,

greg k-h

2008-11-07 07:06:52

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Chris Wright wrote:
> * Greg KH ([email protected]) wrote:
>> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
>>> On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
>>>> On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
>>>>> I have not modified any existing drivers, but instead I threw together
>>>>> a bare-bones module enabling me to make a call to pci_iov_register()
>>>>> and then poke at an SR-IOV adapter's /sys entries for which no driver
>>>>> was loaded.
>>>>>
>>>>> It appears from my perusal thus far that drivers using these new
>>>>> SR-IOV patches will require modification; i.e. the driver associated
>>>>> with the Physical Function (PF) will be required to make the
>>>>> pci_iov_register() call along with the requisite notify() function.
>>>>> Essentially this suggests to me a model for the PF driver to perform
>>>>> any "global actions" or setup on behalf of VFs before enabling them
>>>>> after which VF drivers could be associated.
>>>> Where would the VF drivers have to be associated? On the "pci_dev"
>>>> level or on a higher one?
>>>>
>>>> Will all drivers that want to bind to a "VF" device need to be
>>>> rewritten?
>>> The current model being implemented by my colleagues has separate
>>> drivers for the PF (aka native) and VF devices. I don't personally
>>> believe this is the correct path, but I'm reserving judgement until I
>>> see some code.
>> Hm, I would like to see that code before we can properly evaluate this
>> interface. Especially as they are all tightly tied together.
>>
>>> I don't think we really know what the One True Usage model is for VF
>>> devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
>>> some ideas. I bet there's other people who have other ideas too.
>> I'd love to hear those ideas.
>
> First there's the question of how to represent the VF on the host.
> Ideally (IMO) this would show up as a normal interface so that normal tools
> can configure the interface. This is not exactly how the first round of
> patches were designed.

Whether the VF shows up as a normal interface is decided by the VF driver.
The VF is represented by a 'pci_dev' at the PCI level, so the VF driver can
be loaded as a normal PCI device driver.

The software representation (eth, framebuffer, etc.) created by the VF
driver is not controlled by the SR-IOV framework.

So you can definitely use normal tools to configure the VF if its driver
supports that :-)

>
> Second there's the question of reserving the BDF on the host such that
> we don't have two drivers (one in the host and one in a guest) trying to
> drive the same device (an issue that shows up for device assignment as
> well as VF assignment).

If we don't reserve a BDF for the device, it can't work in either the host
or the guest.

Without a BDF we can't access the device's config space, and the device
can't do DMA either.

Did I miss your point?

>
> Third there's the question of whether the VF can be used in the host at
> all.

Why not? My VFs work well in the host as normal PCI devices :-)

>
> Fourth there's the question of whether the VF and PF drivers are the
> same or separate.

As I mentioned in another email in this thread, we can't predict how
hardware vendors will design their SR-IOV devices; the PCI SIG doesn't
define the device-specific logic.

So I think the answer to this question is up to the device driver
developers. If the PF and VF in an SR-IOV device have similar logic, then
they can share a driver. Otherwise, e.g. if the PF has no real functionality
at all -- it only has registers to control internal resource allocation for
the VFs -- then the drivers should be separate, right?

>
> The typical usecase is assigning the VF to the guest directly, so
> there's only enough functionality in the host side to allocate a VF,
> configure it, and assign it (and propagate AER). This is with separate
> PF and VF driver.
>
> As Anthony mentioned, we are interested in allowing the host to use the
> VF. This could be useful for containers as well as dedicating a VF (a
> set of device resources) to a guest w/out passing it through.

I've considered the container cases; we don't have a problem with running
the VF driver in the host.

Thanks,
Yu

2008-11-07 07:44:47

by Leonid Grossman

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf
Of
> Zhao, Yu
> Sent: Thursday, November 06, 2008 11:06 PM
> To: Chris Wright
> Cc: [email protected]; [email protected];
[email protected];
> Matthew Wilcox; Greg KH; [email protected];
[email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
>
> Chris Wright wrote:
> > * Greg KH ([email protected]) wrote:
> >> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
> >>> On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
> >>>> On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
> >>>>> I have not modified any existing drivers, but instead I threw
> together
> >>>>> a bare-bones module enabling me to make a call to
pci_iov_register()
> >>>>> and then poke at an SR-IOV adapter's /sys entries for which no
> driver
> >>>>> was loaded.
> >>>>>
> >>>>> It appears from my perusal thus far that drivers using these new
> >>>>> SR-IOV patches will require modification; i.e. the driver
associated
> >>>>> with the Physical Function (PF) will be required to make the
> >>>>> pci_iov_register() call along with the requisite notify()
function.
> >>>>> Essentially this suggests to me a model for the PF driver to
perform
> >>>>> any "global actions" or setup on behalf of VFs before enabling
them
> >>>>> after which VF drivers could be associated.
> >>>> Where would the VF drivers have to be associated? On the
"pci_dev"
> >>>> level or on a higher one?
> >>>>
> >>>> Will all drivers that want to bind to a "VF" device need to be
> >>>> rewritten?
> >>> The current model being implemented by my colleagues has separate
> >>> drivers for the PF (aka native) and VF devices. I don't
personally
> >>> believe this is the correct path, but I'm reserving judgement
until I
> >>> see some code.
> >> Hm, I would like to see that code before we can properly evaluate
this
> >> interface. Especially as they are all tightly tied together.
> >>
> >>> I don't think we really know what the One True Usage model is for
VF
> >>> devices. Chris Wright has some ideas, I have some ideas and Yu
Zhao
> has
> >>> some ideas. I bet there's other people who have other ideas too.
> >> I'd love to hear those ideas.
> >
> > First there's the question of how to represent the VF on the host.
> > Ideally (IMO) this would show up as a normal interface so that
normal
> tools
> > can configure the interface. This is not exactly how the first
round of
> > patches were designed.
>
> Whether the VF can show up as a normal interface is decided by VF
> driver. VF is represented by 'pci_dev' at PCI level, so VF driver can
be
> loaded as normal PCI device driver.
>
> What the software representation (eth, framebuffer, etc.) created by
VF
> driver is not controlled by SR-IOV framework.
>
> So you definitely can use normal tool to configure the VF if its
driver
> supports that :-)
>
> >
> > Second there's the question of reserving the BDF on the host such
that
> > we don't have two drivers (one in the host and one in a guest)
trying to
> > drive the same device (an issue that shows up for device assignment
as
> > well as VF assignment).
>
> If we don't reserve BDF for the device, they can't work neither in the
> host nor the guest.
>
> Without BDF, we can't access the config space of the device, the
device
> also can't do DMA.
>
> Did I miss your point?
>
> >
> > Third there's the question of whether the VF can be used in the host
at
> > all.
>
> Why can't? My VFs work well in the host as normal PCI devices :-)
>
> >
> > Fourth there's the question of whether the VF and PF drivers are the
> > same or separate.
>
> As I mentioned in another email of this thread. We can't predict how
> hardware vendor creates their SR-IOV device. PCI SIG doesn't define
> device specific logics.
>
> So I think the answer of this question is up to the device driver
> developers. If PF and VF in a SR-IOV device have similar logics, then
> they can combine the driver. Otherwise, e.g., if PF doesn't have real
> functionality at all -- it only has registers to control internal
> resource allocation for VFs, then the drivers should be separate,
right?


Right, this really depends upon the functionality behind a VF. If a VF is
done as a subset of the netdev interface (for example, a queue pair), then a
split VF/PF driver model and a proprietary communication channel are in
order.

If each VF is done as a complete netdev interface (as in our 10GbE IOV
controllers), then the PF and VF drivers could be the same. Each VF can be
independently driven by such a "native" netdev driver; this includes the
ability to run a native driver in a guest in passthrough mode.
A PF driver in a privileged domain doesn't even have to be present.

>
> >
> > The typical usecase is assigning the VF to the guest directly, so
> > there's only enough functionality in the host side to allocate a VF,
> > configure it, and assign it (and propagate AER). This is with
separate
> > PF and VF driver.
> >
> > As Anthony mentioned, we are interested in allowing the host to use
the
> > VF. This could be useful for containers as well as dedicating a VF
(a
> > set of device resources) to a guest w/out passing it through.
>
> I've considered the container cases, we don't have problem with
running
> VF driver in the host.
>
> Thanks,
> Yu
> _______________________________________________
> Virtualization mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/virtualization

2008-11-07 07:48:17

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> On Thu, Nov 06, 2008 at 04:40:21PM -0600, Anthony Liguori wrote:
>> Greg KH wrote:
>>> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
>>>
>>>> I don't think we really know what the One True Usage model is for VF
>>>> devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has
>>>> some ideas. I bet there's other people who have other ideas too.
>>>>
>>> I'd love to hear those ideas.
>>>
>> We've been talking about avoiding hardware passthrough entirely and
>> just backing a virtio-net backend driver by a dedicated VF in the
>> host. That avoids a huge amount of guest-facing complexity, let's
>> migration Just Work, and should give the same level of performance.

This can be used not only with VFs -- devices that have multiple
DMA queues (e.g., Intel VMDq, Neterion Xframe) and even traditional
devices can also take advantage of this.

CC'ing Rusty Russell in case he has more comments.

>
> Does that involve this patch set? Or a different type of interface.

I think that is a different type of interface. We need to hook the DMA
interface in the device driver up to the virtio-net backend so the hardware
(normal device, VF, VMDq, etc.) can DMA data to/from the virtio-net backend.

Regards,
Yu

2008-11-07 07:50:49

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Greg KH wrote:
> On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:
>> Greg KH wrote:
>>> On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:
>>>> Greg KH wrote:
>>>>> On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
>>>>>> Documentation/kernel-parameters.txt | 10 ++++++++++
>>>>>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/kernel-parameters.txt
>>>>>> b/Documentation/kernel-parameters.txt
>>>>>> index 53ba7c7..5482ae0 100644
>>>>>> --- a/Documentation/kernel-parameters.txt
>>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is
>>>>>> defined in the file
>>>>>> cbmemsize=nn[KMG] The fixed amount of bus space which is
>>>>>> reserved for the CardBus bridge's memory
>>>>>> window. The default value is 64 megabytes.
>>>>>> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
>>>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>>>> + number and bb is the bus number).
>>>>>> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
>>>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>>>> + number and bb is the bus number).
>>>>>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
>>>>>> + device to minimum PAGE_SIZE alignment (dddd is
>>>>>> + the domain number and bb, dd and f is the bus,
>>>>>> + device and function number).
>>>>> This seems like a big problem. How are we going to know to add these
>>>>> command line options for devices we haven't even seen/known about yet?
>>>>> How do we know the bus ids aren't going to change between boots (hint,
>>>>> they are, pci bus ids change all the time...)
>>>>> We need to be able to do this kind of thing dynamically, not fixed at
>>>>> boot time, which seems way to early to even know about this, right?
>>>>> thanks,
>>>>> greg k-h
>>>> Yes, I totally agree. Doing things dynamically is better.
>>>>
>>>> The purpose of these parameters is to rebalance and align resources for
>>>> device that has BARs encapsulated in various new capabilities (SR-IOV,
>>>> etc.), because most of existing BIOSes don't take care of those BARs.
>>> But how are you going to know what the proper device ids are going to
>>> be before the machine boots? I don't see how these options are ever
>>> going to work properly for a "real" user.
>>>> If we do resource rebalance after system is up, do you think there is any
>>>> side effect or impact to other subsystem other than PCI (e.g. MTRR)?
>>> I don't think so.
>>>> I haven't had much thinking on the dynamical resource rebalance. If you
>>>> have any idea about this, can you please suggest?
>>> Yeah, it's going to be hard :)
>>> We've thought about this in the past, and even Microsoft said it was
>>> going to happen for Vista, but they realized in the end, like we did a
>>> few years previously, that it would require full support of all PCI
>>> drivers as well (if you rebalance stuff that is already bound to a
>>> driver.) So they dropped it.
>>> When would you want to do this kind of rebalancing? Before any PCI
>>> driver is bound to any devices? Or afterwards?
>> I guess if we want the rebalance dynamic, then we should have it full --
>> the rebalance would be functional even after the driver is loaded.
>>
>> But in most cases, there will be problem when we unload driver from a hard
>> disk controller, etc. We can mount root on a ramdisk and do the rebalance
>> there, but it's complicated for a real user.
>>
>> So looks like doing rebalancing before any driver is bound to any device is
>> also a nice idea, if user can get a shell to do rebalance before built-in
>> PCI driver grabs device.
>
> That's not going to work, it needs to happen before any PCI device is
> bound, which is before init runs.

I don't think it can work either. Then we have to do the rebalancing after
the driver binding. But what should we do if we can't unload the driver
(hard disk controller, etc.)?

Thanks,
Yu

2008-11-07 08:05:46

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Fri, Nov 07, 2008 at 03:50:34PM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:
>>> Greg KH wrote:
>>>> On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:
>>>>> Greg KH wrote:
>>>>>> On Wed, Oct 22, 2008 at 04:45:31PM +0800, Yu Zhao wrote:
>>>>>>> Documentation/kernel-parameters.txt | 10 ++++++++++
>>>>>>> 1 files changed, 10 insertions(+), 0 deletions(-)
>>>>>>>
>>>>>>> diff --git a/Documentation/kernel-parameters.txt
>>>>>>> b/Documentation/kernel-parameters.txt
>>>>>>> index 53ba7c7..5482ae0 100644
>>>>>>> --- a/Documentation/kernel-parameters.txt
>>>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>>>> @@ -1677,6 +1677,16 @@ and is between 256 and 4096 characters. It is
>>>>>>> defined in the file
>>>>>>> cbmemsize=nn[KMG] The fixed amount of bus space which is
>>>>>>> reserved for the CardBus bridge's memory
>>>>>>> window. The default value is 64 megabytes.
>>>>>>> + assign-mmio=[dddd:]bb [X86] reassign memory resources of all
>>>>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>>>>> + number and bb is the bus number).
>>>>>>> + assign-pio=[dddd:]bb [X86] reassign io port resources of all
>>>>>>> + devices under bus [dddd:]bb (dddd is the domain
>>>>>>> + number and bb is the bus number).
>>>>>>> + align-mmio=[dddd:]bb:dd.f [X86] relocate memory resources of a
>>>>>>> + device to minimum PAGE_SIZE alignment (dddd is
>>>>>>> + the domain number and bb, dd and f is the bus,
>>>>>>> + device and function number).
>>>>>> This seems like a big problem. How are we going to know to add these
>>>>>> command line options for devices we haven't even seen/known about yet?
>>>>>> How do we know the bus ids aren't going to change between boots (hint,
>>>>>> they are, pci bus ids change all the time...)
>>>>>> We need to be able to do this kind of thing dynamically, not fixed at
>>>>>> boot time, which seems way to early to even know about this, right?
>>>>>> thanks,
>>>>>> greg k-h
>>>>> Yes, I totally agree. Doing things dynamically is better.
>>>>>
>>>>> The purpose of these parameters is to rebalance and align resources for
>>>>> device that has BARs encapsulated in various new capabilities (SR-IOV,
>>>>> etc.), because most of existing BIOSes don't take care of those BARs.
>>>> But how are you going to know what the proper device ids are going to
>>>> be before the machine boots? I don't see how these options are ever
>>>> going to work properly for a "real" user.
>>>>> If we do resource rebalance after system is up, do you think there is
>>>>> any side effect or impact to other subsystem other than PCI (e.g.
>>>>> MTRR)?
>>>> I don't think so.
>>>>> I haven't had much thinking on the dynamical resource rebalance. If you
>>>>> have any idea about this, can you please suggest?
>>>> Yeah, it's going to be hard :)
>>>> We've thought about this in the past, and even Microsoft said it was
>>>> going to happen for Vista, but they realized in the end, like we did a
>>>> few years previously, that it would require full support of all PCI
>>>> drivers as well (if you rebalance stuff that is already bound to a
>>>> driver.) So they dropped it.
>>>> When would you want to do this kind of rebalancing? Before any PCI
>>>> driver is bound to any devices? Or afterwards?
>>> I guess if we want the rebalance dynamic, then we should have it full --
>>> the rebalance would be functional even after the driver is loaded.
>>>
>>> But in most cases, there will be problem when we unload driver from a
>>> hard disk controller, etc. We can mount root on a ramdisk and do the
>>> rebalance there, but it's complicated for a real user.
>>>
>>> So looks like doing rebalancing before any driver is bound to any device
>>> is also a nice idea, if user can get a shell to do rebalance before
>>> built-in PCI driver grabs device.
>> That's not going to work, it needs to happen before any PCI device is
>> bound, which is before init runs.
>
> I don't think it can work either. Then we have to do rebalance after the
> driver bounding. But what should we do if we can't unload the driver (hard
> disk controller, etc.)?

Well, to do it "correctly" you are going to have to tell the driver to
shut itself down, and reinitialize itself.

Turns out, that doesn't really work for disk and network devices without
dropping the connection (well, network devices should be fine probably).

So you just can't do this, sorry. That's why the BIOS handles all of
these issues in a PCI hotplug system.

How do the hardware people think we are going to handle this in the
OS? It's not something that any operating system can do; is it part of
the IOV PCI spec somewhere?

thanks,

greg k-h

2008-11-07 08:17:21

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Greg KH wrote:
> On Fri, Nov 07, 2008 at 03:50:34PM +0800, Zhao, Yu wrote:
>> Greg KH wrote:
>>> On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:
>>>> Greg KH wrote:
>>>>> On Fri, Nov 07, 2008 at 10:37:55AM +0800, Zhao, Yu wrote:
>>>>>> Greg KH wrote:
>>>>>>> This seems like a big problem. How are we going to know to add these
>>>>>>> command line options for devices we haven't even seen/known about yet?
>>>>>>> How do we know the bus ids aren't going to change between boots (hint,
>>>>>>> they are, pci bus ids change all the time...)
>>>>>>> We need to be able to do this kind of thing dynamically, not fixed at
>>>>>>> boot time, which seems way to early to even know about this, right?
>>>>>>> thanks,
>>>>>>> greg k-h
>>>>>> Yes, I totally agree. Doing things dynamically is better.
>>>>>>
>>>>>> The purpose of these parameters is to rebalance and align resources for
>>>>>> device that has BARs encapsulated in various new capabilities (SR-IOV,
>>>>>> etc.), because most of existing BIOSes don't take care of those BARs.
>>>>> But how are you going to know what the proper device ids are going to
>>>>> be before the machine boots? I don't see how these options are ever
>>>>> going to work properly for a "real" user.
>>>>>> If we do resource rebalance after system is up, do you think there is
>>>>>> any side effect or impact to other subsystem other than PCI (e.g.
>>>>>> MTRR)?
>>>>> I don't think so.
>>>>>> I haven't had much thinking on the dynamical resource rebalance. If you
>>>>>> have any idea about this, can you please suggest?
>>>>> Yeah, it's going to be hard :)
>>>>> We've thought about this in the past, and even Microsoft said it was
>>>>> going to happen for Vista, but they realized in the end, like we did a
>>>>> few years previously, that it would require full support of all PCI
>>>>> drivers as well (if you rebalance stuff that is already bound to a
>>>>> driver.) So they dropped it.
>>>>> When would you want to do this kind of rebalancing? Before any PCI
>>>>> driver is bound to any devices? Or afterwards?
>>>> I guess if we want the rebalance dynamic, then we should have it full --
>>>> the rebalance would be functional even after the driver is loaded.
>>>>
>>>> But in most cases, there will be problem when we unload driver from a
>>>> hard disk controller, etc. We can mount root on a ramdisk and do the
>>>> rebalance there, but it's complicated for a real user.
>>>>
>>>> So looks like doing rebalancing before any driver is bound to any device
>>>> is also a nice idea, if user can get a shell to do rebalance before
>>>> built-in PCI driver grabs device.
>>> That's not going to work, it needs to happen before any PCI device is
>>> bound, which is before init runs.
>> I don't think it can work either. Then we have to do rebalance after the
>> driver bounding. But what should we do if we can't unload the driver (hard
>> disk controller, etc.)?
>
> Well, to do it "correctly" you are going to have to tell the driver to
> shut itself down, and reinitialize itself.
>
> Turns out, that doesn't really work for disk and network devices without
> dropping the connection (well, network devices should be fine probably).
>
> So you just can't do this, sorry. That's why the BIOS handles all of
> these issues in a PCI hotplug system.
>
> How does the hardware people think we are going to handle this in the
> OS? It's not something that any operating system can do, is it part of
> the IOV PCI spec somewhere?

No, it's not part of the PCI IOV spec.

I just want the IOV (and the whole PCI subsystem) to have more flexibility
with various BIOSes. So can we reconsider resource rebalancing as a boot
option, or should we forget about this idea?

Regards,
Yu

2008-11-07 08:30:16

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>> Well, to do it "correctly" you are going to have to tell the driver to
>> shut itself down, and reinitialize itself.
>> Turns out, that doesn't really work for disk and network devices without
>> dropping the connection (well, network devices should be fine probably).
>> So you just can't do this, sorry. That's why the BIOS handles all of
>> these issues in a PCI hotplug system.
>> How does the hardware people think we are going to handle this in the
>> OS? It's not something that any operating system can do, is it part of
>> the IOV PCI spec somewhere?
>
> No, it's not part of the PCI IOV spec.
>
> I just want the IOV (and whole PCI subsystem) have more flexibility on
> various BIOSes. So can we reconsider about resource rebalance as boot
> option, or should we forget about this idea?

As you have proposed it, the boot option will not work at all, so I
think we need to forget about it. Especially if it is not really
needed.

thanks,

greg k-h

2008-11-07 08:36:25

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Greg KH wrote:
> On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>>> Well, to do it "correctly" you are going to have to tell the driver to
>>> shut itself down, and reinitialize itself.
>>> Turns out, that doesn't really work for disk and network devices without
>>> dropping the connection (well, network devices should be fine probably).
>>> So you just can't do this, sorry. That's why the BIOS handles all of
>>> these issues in a PCI hotplug system.
>>> How does the hardware people think we are going to handle this in the
>>> OS? It's not something that any operating system can do, is it part of
>>> the IOV PCI spec somewhere?
>> No, it's not part of the PCI IOV spec.
>>
>> I just want the IOV (and whole PCI subsystem) have more flexibility on
>> various BIOSes. So can we reconsider about resource rebalance as boot
>> option, or should we forget about this idea?
>
> As you have proposed it, the boot option will not work at all, so I
> think we need to forget about it. Especially if it is not really
> needed.

I guess at least one thing would work if people don't want to boot
twice: give bus number 0 as the rebalancing starting point, and then all
system resources would be reshuffled :-)

Thanks,
Yu

2008-11-07 13:17:19

by Yu Zhao

[permalink] [raw]
Subject: Re: git repository for SR-IOV development?

Hello Lance,

Thanks for your interest in SR-IOV. As Greg said, we can't have a git
tree for the change, but you are welcome to ask any questions here, and I
will also keep you informed if there are any updates on the SR-IOV patches.

Thanks,
Yu

Greg KH wrote:
> On Thu, Nov 06, 2008 at 11:58:25AM -0800, H L wrote:
>> --- On Thu, 11/6/08, Greg KH <[email protected]> wrote:
>>
>>> On Thu, Nov 06, 2008 at 08:51:09AM -0800, H L wrote:
>>>> Has anyone initiated or given consideration to the
>>> creation of a git
>>>> repository (say, on kernel.org) for SR-IOV
>>> development?
>>>
>>> Why? It's only a few patches, right? Why would it
>>> need a whole new git
>>> tree?
>>
>> So as to minimize the time and effort patching a kernel, especially if
>> the tree (and/or hash level) against which the patches were created
>> fails to be specified on a mailing-list. Plus, there appears to be
>> questions raised on how, precisely, the implementation should
>> ultimately be modeled and especially given that, who knows at this
>> point what number of patches will ultimately be submitted? I know
>> I've built the "7-patch" one (painfully, by the way), and I'm aware
>> there's another 15-patch set out there which I've not yet examined.
>
> It's a mere 7 or 15 patches, you don't need a whole git tree for
> something small like that.
>
> Especially as there only seems to be one developer doing real work...
>
> thanks,
>
> greg k-h
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-11-07 15:16:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Anthony Liguori <[email protected]> writes:
>
> What we would rather do in KVM, is have the VFs appear in the host as
> standard network devices. We would then like to back our existing PV
> driver to this VF directly bypassing the host networking stack. A key
> feature here is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get zero-copy
> receive traffic. This will perform just as well as doing passthrough
> (at least) and avoid all that ugliness of dealing with SR-IOV in the
> guest.

But you shift a lot of ugliness into the host network stack again.
Not sure that is a good trade-off.

Also it would always require context switches, and I believe one of the
reasons for the PV/VF model is very low latency I/O; having heavyweight
switches to the host and back would work against that.

-Andi

--
[email protected]

2008-11-07 15:21:07

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

While we are arguing what the software model the SR-IOV should be, let
me ask two simple questions first:

1, What does the SR-IOV looks like?
2, Why do we need to support it?

I'm sure people have different understandings from their own view
points. No one is wrong, but, please don't make thing complicated and
don't ignore user requirements.

The PCI SIG and hardware vendors created this capability to let hardware
resources in one PCI device be shared by different software instances --
I guess all of us agree on this. No doubt the PF is a real function in
the PCI device, but is the VF any different? No, it also has its own
Bus, Device and Function numbers, its own PCI configuration space and
Memory Space (MMIO). To be more specific, it can respond to and initiate
PCI Transaction Layer Packets, which means it can do everything a PF can
at the PCI level. From these observable behaviors, we can conclude that
the PCI SIG models a VF as a normal PCI device function, even though it
is not standalone.
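As a rough illustration of the point above (this is not code from the
patch set; the helper name and macros are made up for the example), the
SR-IOV 1.0 spec derives a VF's Bus/Device/Function number from the PF's
Routing ID plus the "First VF Offset" and "VF Stride" fields of the
capability:

#include <linux/types.h>

/* Offsets of two SR-IOV capability registers, following the spec
 * layout (illustrative names only). */
#define EXAMPLE_SRIOV_VF_OFFSET	0x14	/* "First VF Offset" */
#define EXAMPLE_SRIOV_VF_STRIDE	0x16	/* "VF Stride" */

/*
 * VF n (counting from 1) has:
 *   Routing ID = PF Routing ID + First VF Offset + (n - 1) * VF Stride
 * where a Routing ID is (bus << 8) | devfn, so a VF may even land on a
 * higher bus number than its PF.
 */
static u16 example_vf_routing_id(u16 pf_rid, u16 first_offset,
				 u16 stride, int n)
{
	return pf_rid + first_offset + (n - 1) * stride;
}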

As you know, the Linux kernel is the base of various virtual machine
monitors such as KVM, Xen, OpenVZ and VServer. We need SR-IOV support in
the kernel mostly because it helps high-end users (IT departments, HPC,
etc.) share limited hardware resources among hundreds or even thousands
of virtual machines and hence reduce the cost. How can we make these
virtual machine monitors take advantage of SR-IOV without spending too
much effort, while remaining architecturally correct? I believe making a
VF look as close as possible to a normal PCI device (struct pci_dev) is
the best way in the current situation, because this is not only what the
hardware designers expect us to do but also the usage model that KVM,
Xen and other VMMs have already supported.

I agree that the API in the SR-IOV patch is arguable, and the concerns
such as the lack of a PF driver, etc. are also valid. But I personally
think these are not essential problems for me and other SR-IOV driver
developers. People can refine things, but they don't want to recreate
things in a totally different way, especially when that way doesn't
bring them obvious benefits.

I can see that we are now reaching a point where a decision must be
made. I know this is a difficult thing in an open and free community,
but fortunately we have a lot of talented and experienced people here.
So let's make it happen, and keep our loyal users happy!

Thanks,
Yu

2008-11-07 16:09:45

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Anthony Liguori wrote:
> Matthew Wilcox wrote:
>> [Anna, can you fix your word-wrapping please? Your lines appear to be
>> infinitely long which is most unpleasant to reply to]
>>
>> On Thu, Nov 06, 2008 at 05:38:16PM +0000, Fischer, Anna wrote:
>>
>>>> Where would the VF drivers have to be associated? On the "pci_dev"
>>>> level or on a higher one?
>>>>
>>> A VF appears to the Linux OS as a standard (full, additional) PCI
>>> device. The driver is associated in the same way as for a normal PCI
>>> device. Ideally, you would use SR-IOV devices on a virtualized system,
>>> for example, using Xen. A VF can then be assigned to a guest domain as
>>> a full PCI device.
>>>
>>
>> It's not clear thats the right solution. If the VF devices are _only_
>> going to be used by the guest, then arguably, we don't want to create
>> pci_devs for them in the host. (I think it _is_ the right answer, but I
>> want to make it clear there's multiple opinions on this).
>>
>
> The VFs shouldn't be limited to being used by the guest.

Yes, running a VF driver in the host is supported :-)

>
> SR-IOV is actually an incredibly painful thing. You need to have a VF
> driver in the guest, do hardware pass through, have a PV driver stub in
> the guest that's hypervisor specific (a VF is not usable on it's own),
> have a device specific backend in the VMM, and if you want to do live
> migration, have another PV driver in the guest that you can do teaming
> with. Just a mess.

Actually it's not such a mess. A VF driver can be a plain PCI device
driver that doesn't require any backend in the VMM, or any hypervisor
specific knowledge, if the hardware is properly designed. In this case
the PF driver controls hardware resource allocation for the VFs, and
the VF driver can work without any communication with the PF driver or
the VMM.
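To make the "plain PCI device driver" point concrete, here is a minimal
sketch of what a VF driver could look like -- not taken from any real
driver or from the patch set; the IDs and names are placeholders:

#include <linux/module.h>
#include <linux/pci.h>

static struct pci_device_id example_vf_ids[] = {
	{ PCI_DEVICE(0x1234, 0x5678) },		/* placeholder VF IDs */
	{ 0, }
};
MODULE_DEVICE_TABLE(pci, example_vf_ids);

static int __devinit example_vf_probe(struct pci_dev *dev,
				      const struct pci_device_id *id)
{
	/* Map BARs, set up rings, request IRQs, etc.  Nothing here needs
	 * to know about the VMM or about SR-IOV itself. */
	return pci_enable_device(dev);
}

static void __devexit example_vf_remove(struct pci_dev *dev)
{
	pci_disable_device(dev);
}

static struct pci_driver example_vf_driver = {
	.name		= "example_vf",
	.id_table	= example_vf_ids,
	.probe		= example_vf_probe,
	.remove		= __devexit_p(example_vf_remove),
};

static int __init example_vf_init(void)
{
	return pci_register_driver(&example_vf_driver);
}

static void __exit example_vf_exit(void)
{
	pci_unregister_driver(&example_vf_driver);
}

module_init(example_vf_init);
module_exit(example_vf_exit);
MODULE_LICENSE("GPL");

The guest (or the host) binds this like any other PCI driver; any
coordination with the PF driver, if the hardware needs it, is a device
specific detail.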

>
> What we would rather do in KVM, is have the VFs appear in the host as
> standard network devices. We would then like to back our existing PV
> driver to this VF directly bypassing the host networking stack. A key
> feature here is being able to fill the VF's receive queue with guest
> memory instead of host kernel memory so that you can get zero-copy
> receive traffic. This will perform just as well as doing passthrough
> (at least) and avoid all that ugliness of dealing with SR-IOV in the guest.

If the hardware supports both SR-IOV and an IOMMU, I wouldn't suggest
people do so, because they will get better performance by directly
assigning the VF to the guest.

However, lots of low-end machines don't have SR-IOV and IOMMU support.
They may have a multi-queue NIC, which uses a built-in L2 switch to
dispatch packets to different DMA queues according to the MAC address.
They can definitely benefit a lot from software support for hooking a
DMA queue up to the virtio-net backend as you suggested.

>
> This eliminates all of the mess of various drivers in the guest and all
> the associated baggage of doing hardware passthrough.
>
> So IMHO, having VFs be usable in the host is absolutely critical because
> I think it's the only reasonable usage model.

Please don't worry, we have taken this usage model as well as the
container model into account when designing the SR-IOV framework for
the kernel.

>
> Regards,
>
> Anthony Liguori

2008-11-07 20:13:15

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Fri, Nov 07, 2008 at 11:17:40PM +0800, Yu Zhao wrote:
> While we are arguing what the software model the SR-IOV should be, let me
> ask two simple questions first:
>
> 1, What does the SR-IOV looks like?
> 2, Why do we need to support it?

I don't think we need to worry about those questions, as we can see what
the SR-IOV interface looks like by looking at the PCI spec, and we know
Linux needs to support it, as Linux needs to support everything :)

(note, community members that can not see the PCI specs at this point in
time, please know that we are working on resolving these issues,
hopefully we will have some good news within a month or so.)

> As you know the Linux kernel is the base of various virtual machine
> monitors such as KVM, Xen, OpenVZ and VServer. We need SR-IOV support in
> the kernel because mostly it helps high-end users (IT departments, HPC,
> etc.) to share limited hardware resources among hundreds or even thousands
> virtual machines and hence reduce the cost. How can we make these virtual
> machine monitors utilize the advantage of SR-IOV without spending too much
> effort meanwhile remaining architectural correctness? I believe making VF
> represent as much closer as a normal PCI device (struct pci_dev) is the
> best way in current situation, because this is not only what the hardware
> designers expect us to do but also the usage model that KVM, Xen and other
> VMMs have already supported.

But would such an api really take advantage of the new IOV interfaces
that are exposed by the new device type?

> I agree that API in the SR-IOV pacth is arguable and the concerns such as
> lack of PF driver, etc. are also valid. But I personally think these stuff
> are not essential problems to me and other SR-IOV driver developers.

How can the lack of a PF driver not be a valid concern at this point in
time? Without such a driver written, how can we know that the SR-IOV
interface as created is sufficient, or that it even works properly?

Here's what I see we need to have before we can evaluate if the IOV core
PCI patches are acceptable:
- a driver that uses this interface
- a PF driver that uses this interface.

Without those, we can't determine if the infrastructure provided by the
IOV core even is sufficient, right?

Rumor has it that there is both of the above things floating around, can
someone please post them to the linux-pci list so that we can see how
this all works together?

thanks,

greg k-h

2008-11-07 20:13:51

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:
> Greg KH wrote:
>> On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>>>> Well, to do it "correctly" you are going to have to tell the driver to
>>>> shut itself down, and reinitialize itself.
>>>> Turns out, that doesn't really work for disk and network devices without
>>>> dropping the connection (well, network devices should be fine probably).
>>>> So you just can't do this, sorry. That's why the BIOS handles all of
>>>> these issues in a PCI hotplug system.
>>>> How does the hardware people think we are going to handle this in the
>>>> OS? It's not something that any operating system can do, is it part of
>>>> the IOV PCI spec somewhere?
>>> No, it's not part of the PCI IOV spec.
>>>
>>> I just want the IOV (and whole PCI subsystem) have more flexibility on
>>> various BIOSes. So can we reconsider about resource rebalance as boot
>>> option, or should we forget about this idea?
>> As you have proposed it, the boot option will not work at all, so I
>> think we need to forget about it. Especially if it is not really
>> needed.
>
> I guess at least one thing would work if people don't want to boot twice:
> give the bus number 0 as rebalance starting point, then all system
> resources would be reshuffled :-)

Hm, but don't we do that today with our basic resource reservation logic
at boot time? What would be different about this kind of proposal?

thanks,

greg k-h

2008-11-08 05:08:20

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Greg KH wrote:
> On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:
>> Greg KH wrote:
>>> On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>>>>> Well, to do it "correctly" you are going to have to tell the driver to
>>>>> shut itself down, and reinitialize itself.
>>>>> Turns out, that doesn't really work for disk and network devices without
>>>>> dropping the connection (well, network devices should be fine probably).
>>>>> So you just can't do this, sorry. That's why the BIOS handles all of
>>>>> these issues in a PCI hotplug system.
>>>>> How does the hardware people think we are going to handle this in the
>>>>> OS? It's not something that any operating system can do, is it part of
>>>>> the IOV PCI spec somewhere?
>>>> No, it's not part of the PCI IOV spec.
>>>>
>>>> I just want the IOV (and whole PCI subsystem) have more flexibility on
>>>> various BIOSes. So can we reconsider about resource rebalance as boot
>>>> option, or should we forget about this idea?
>>> As you have proposed it, the boot option will not work at all, so I
>>> think we need to forget about it. Especially if it is not really
>>> needed.
>> I guess at least one thing would work if people don't want to boot twice:
>> give the bus number 0 as rebalance starting point, then all system
>> resources would be reshuffled :-)
>
> Hm, but don't we do that today with our basic resource reservation logic
> at boot time? What would be different about this kind of proposal?

The generic PCI core can do this, but the feature is effectively
disabled by the low-level PCI code on x86. The low-level code tries to
reserve resources according to the configuration from the BIOS. If the
BIOS is wrong, the allocation fails and the generic PCI core can't
repair it, because the bridge resources may already have been allocated
by the low-level PCI code and the PCI core can't expand them to find
enough resources for the subordinates.

The proposal is to stop the x86 low-level PCI code from allocating
resources according to the BIOS, so the PCI core can fully control the
resource allocation. The PCI core would take all the BARs it knows about
into account and configure the resource windows on the bridges according
to its own calculation.

Regards,
Yu

2008-11-08 05:28:43

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Sat, Nov 08, 2008 at 01:00:29PM +0800, Yu Zhao wrote:
> Greg KH wrote:
>> On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:
>>> Greg KH wrote:
>>>> On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>>>>>> Well, to do it "correctly" you are going to have to tell the driver to
>>>>>> shut itself down, and reinitialize itself.
>>>>>> Turns out, that doesn't really work for disk and network devices
>>>>>> without
>>>>>> dropping the connection (well, network devices should be fine
>>>>>> probably).
>>>>>> So you just can't do this, sorry. That's why the BIOS handles all of
>>>>>> these issues in a PCI hotplug system.
>>>>>> How does the hardware people think we are going to handle this in the
>>>>>> OS? It's not something that any operating system can do, is it part
>>>>>> of
>>>>>> the IOV PCI spec somewhere?
>>>>> No, it's not part of the PCI IOV spec.
>>>>>
>>>>> I just want the IOV (and whole PCI subsystem) have more flexibility on
>>>>> various BIOSes. So can we reconsider about resource rebalance as boot
>>>>> option, or should we forget about this idea?
>>>> As you have proposed it, the boot option will not work at all, so I
>>>> think we need to forget about it. Especially if it is not really
>>>> needed.
>>> I guess at least one thing would work if people don't want to boot twice:
>>> give the bus number 0 as rebalance starting point, then all system
>>> resources would be reshuffled :-)
>> Hm, but don't we do that today with our basic resource reservation logic
>> at boot time? What would be different about this kind of proposal?
>
> The generic PCI core can do this but this feature is kind of disabled by
> low level PCI code in x86. The low level code tries to reserve resource
> according to configuration from BIOS. If the BIOS is wrong, the allocation
> would fail and the generic PCI core couldn't repair it because the bridge
> resources may have been allocated by the PCI low level and the PCI core
> can't expand them to find enough resource for the subordinates.

Yes, we do this on purpose.

> The proposal is to disable x86 PCI low level to allocation resources
> according to BIOS so PCI core can fully control the resource allocation.
> The PCI core takes all resources from BARs it knows into account and
> configure the resource windows on the bridges according to its own
> calculation.

Ah, so you mean we should revert to the way we used to do x86 PCI
resource allocation, from about eight years ago until about a year and
a half ago?

Hint, there was a reason why we switched over to using the BIOS instead
of doing it ourselves. Turns out we have to trust the BIOS here, as
that is exactly what other operating systems do. Trying to do it on our
own was too fragile and resulted in too many problems over time.

Go look at the archives for when this all was switched, you'll see the
reasons why.

So no, we will not be going back to the way we used to do things, we
changed for a reason :)

thanks,

greg k-h

2008-11-08 05:57:53

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Sat, Nov 08, 2008 at 01:50:20PM +0800, freevanx wrote:
> Dear all,
>
> I'm glad to hear this. In fact, I'm developing in the BIOS area. This
> feature is very useful when your system has one or more PCI/PCIe hotplug
> slots. Generally, the BIOS reserves an amount of resources for empty
> hotplug slots by default, but it is not always enough for all devices. We
> have many kinds of Express Modules which consume different amounts of
> resources; generally we reserve a small amount of resources for this, so
> sometimes an Express Module hotplugged into a slot that had no card
> installed at boot time will not be usable.

Then fix the BIOS :)

Seriously, that is what the PCI hotplug spec says to do, right?

> Then, Microsoft says they implemented PCI Multi-level Resource Rebalance
> in Vista and Server 2008; you can refer to
> http://www.microsoft.com/whdc/archive/multilevel-rebal.mspx
> http://www.microsoft.com/whdc/connect/pci/PCI-rsc.mspx

But they did not implement this for Vista, and pulled it before it
shipped, right? That is what the driver development documentation for
Vista said that I read.

Do you know if they are going to add it back for Windows 7? If so, then
we should probably look into this, otherwise, no need to, as the BIOSes
will be fixed properly.

> They use an ACPI method to tell the OS that it can ignore the resource
> allocation of PCI devices below the bridge. I think this is more useful
> than specifying the BUS number to ignore resource allocation, because the
> BUS number often changes due to some need of the BIOS or a new PCI/PCIe
> device added to the system. Users generally do not know the system
> architecture and cannot specify the BUS number of the root bridge, while
> if you specify the _DSM method like MS does on the root bridge of the
> hotplug slot, it is a much easier approach for BIOS writers to achieve.

Yes, push the burden of getting this right onto the OS developers,
instead of doing it properly in the BIOS, how fun :(

Seriously, it isn't that hard to reserve enough space on most machines
in the BIOS to get this correct. It only gets messy when you have
hundreds of hotplug PCI slots and bridges. Even then, the BIOS writers
have been able to resolve this for a while due to this kind of hardware
shipping successfully with Linux for many years now.

> PS:
> Since my mail address was blocked by the mailing list, this mail may not
> reach people who are only on the linux-kernel mailing list.

It is being blocked because you are sending out html email.

Please reconfigure your gmail client to not do that, and your mail will
go through just fine.

thanks,

greg k-h

2008-11-08 06:06:40

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Greg KH wrote:
> On Sat, Nov 08, 2008 at 01:00:29PM +0800, Yu Zhao wrote:
>> Greg KH wrote:
>>> On Fri, Nov 07, 2008 at 04:35:47PM +0800, Zhao, Yu wrote:
>>>> Greg KH wrote:
>>>>> On Fri, Nov 07, 2008 at 04:17:02PM +0800, Zhao, Yu wrote:
>>>>>>> Well, to do it "correctly" you are going to have to tell the driver to
>>>>>>> shut itself down, and reinitialize itself.
>>>>>>> Turns out, that doesn't really work for disk and network devices
>>>>>>> without
>>>>>>> dropping the connection (well, network devices should be fine
>>>>>>> probably).
>>>>>>> So you just can't do this, sorry. That's why the BIOS handles all of
>>>>>>> these issues in a PCI hotplug system.
>>>>>>> How does the hardware people think we are going to handle this in the
>>>>>>> OS? It's not something that any operating system can do, is it part
>>>>>>> of
>>>>>>> the IOV PCI spec somewhere?
>>>>>> No, it's not part of the PCI IOV spec.
>>>>>>
>>>>>> I just want the IOV (and whole PCI subsystem) have more flexibility on
>>>>>> various BIOSes. So can we reconsider about resource rebalance as boot
>>>>>> option, or should we forget about this idea?
>>>>> As you have proposed it, the boot option will not work at all, so I
>>>>> think we need to forget about it. Especially if it is not really
>>>>> needed.
>>>> I guess at least one thing would work if people don't want to boot twice:
>>>> give the bus number 0 as rebalance starting point, then all system
>>>> resources would be reshuffled :-)
>>> Hm, but don't we do that today with our basic resource reservation logic
>>> at boot time? What would be different about this kind of proposal?
>> The generic PCI core can do this but this feature is kind of disabled by
>> low level PCI code in x86. The low level code tries to reserve resource
>> according to configuration from BIOS. If the BIOS is wrong, the allocation
>> would fail and the generic PCI core couldn't repair it because the bridge
>> resources may have been allocated by the PCI low level and the PCI core
>> can't expand them to find enough resource for the subordinates.
>
> Yes, we do this on purpose.
>
>> The proposal is to disable x86 PCI low level to allocation resources
>> according to BIOS so PCI core can fully control the resource allocation.
>> The PCI core takes all resources from BARs it knows into account and
>> configure the resource windows on the bridges according to its own
>> calculation.
>
> Ah, so you mean we should revert back to the way we use to do x86 PCI
> resource allocation from about a year and a half ago to about 8 years
> ago?
>
> Hint, there was a reason why we switched over to using the BIOS instead
> of doing it ourselves. Turns out we have to trust the BIOS here, as
> that is exactly what other operating systems do. Trying to do it on our
> own was too fragile and resulted in too many problems over time.
>
> Go look at the archives for when this all was switched, you'll see the
> reasons why.
>
> So no, we will not be going back to the way we used to do things, we
> changed for a reason :)

So it's really a long story, and I'm glad to see the reason.

Actually there was no such thing in the early SR-IOV patches, but months
ago I heard some complaints that pushed me to do this kind of reversion.
Looks like I'll have to redirect those complaints to the BIOS people
from now on :-)

Regards,
Yu

2008-11-08 11:11:09

by Fischer, Anna

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

> Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
> Importance: High
>
> On Fri, Nov 07, 2008 at 11:17:40PM +0800, Yu Zhao wrote:
> > While we are arguing what the software model the SR-IOV should be,
> let me
> > ask two simple questions first:
> >
> > 1, What does the SR-IOV looks like?
> > 2, Why do we need to support it?
>
> I don't think we need to worry about those questions, as we can see
> what
> the SR-IOV interface looks like by looking at the PCI spec, and we know
> Linux needs to support it, as Linux needs to support everything :)
>
> (note, community members that can not see the PCI specs at this point
> in
> time, please know that we are working on resolving these issues,
> hopefully we will have some good news within a month or so.)
>
> > As you know the Linux kernel is the base of various virtual machine
> > monitors such as KVM, Xen, OpenVZ and VServer. We need SR-IOV support
> in
> > the kernel because mostly it helps high-end users (IT departments,
> HPC,
> > etc.) to share limited hardware resources among hundreds or even
> thousands
> > virtual machines and hence reduce the cost. How can we make these
> virtual
> > machine monitors utilize the advantage of SR-IOV without spending too
> much
> > effort meanwhile remaining architectural correctness? I believe
> making VF
> > represent as much closer as a normal PCI device (struct pci_dev) is
> the
> > best way in current situation, because this is not only what the
> hardware
> > designers expect us to do but also the usage model that KVM, Xen and
> other
> > VMMs have already supported.
>
> But would such an api really take advantage of the new IOV interfaces
> that are exposed by the new device type?

I agree with what Yu says. The idea is to have hardware capabilities to
virtualize a PCI device in a way that the virtual devices represent full
PCI devices. The advantage of that is that those virtual devices can
then be used like any other standard PCI device, meaning we can use
existing OS tools, configuration mechanisms, etc. to start working with
them. Also, when using a virtualization-based system, e.g. Xen or KVM,
we do not need to introduce new mechanisms to make use of SR-IOV,
because we can handle VFs as full PCI devices.

A virtual PCI device in hardware (a VF) can be as powerful or complex as
you like, or it can be very simple. But the big advantage of SR-IOV is
that hardware presents a complete PCI device to the OS - as opposed to
some resources, or queues, that need specific new configuration and
assignment mechanisms in order to use them with a guest OS (like, for
example, VMDq or similar technologies).

Anna

2008-11-08 15:38:20

by Leonid Grossman

[permalink] [raw]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support



> -----Original Message-----
> From: Fischer, Anna [mailto:[email protected]]
> Sent: Saturday, November 08, 2008 3:10 AM
> To: Greg KH; Yu Zhao
> Cc: Matthew Wilcox; Anthony Liguori; H L; [email protected];
> [email protected]; Chiang, Alexander;
[email protected];
> [email protected]; [email protected];
[email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Leonid Grossman;
> [email protected]; [email protected]; [email protected]
> Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
>


> > But would such an api really take advantage of the new IOV
interfaces
> > that are exposed by the new device type?
>
> I agree with what Yu says. The idea is to have hardware capabilities
to
> virtualize a PCI device in a way that those virtual devices can
represent
> full PCI devices. The advantage of that is that those virtual device
can
> then be used like any other standard PCI device, meaning we can use
> existing
> OS tools, configuration mechanism etc. to start working with them.
Also,
> when
> using a virtualization-based system, e.g. Xen or KVM, we do not need
> to introduce new mechanisms to make use of SR-IOV, because we can
handle
> VFs as full PCI devices.
>
> A virtual PCI device in hardware (a VF) can be as powerful or complex
as
> you like, or it can be very simple. But the big advantage of SR-IOV is
> that hardware presents a complete PCI device to the OS - as opposed to
> some resources, or queues, that need specific new configuration and
> assignment mechanisms in order to use them with a guest OS (like, for
> example, VMDq or similar technologies).
>
> Anna


Ditto.
Taking the netdev interface as an example - a queue pair is a great way
to scale across cpu cores in a single OS image, but it is just not a
good way to share a device across multiple OS images.
The best unit of virtualization is a VF that is implemented as a
complete netdev pci device (not a subset of a pci device).
This way, native netdev device drivers can work for direct hw access to
a VF "as is", and most/all Linux networking features (including VMQ)
will work in a guest.
Also, guest migration for netdev interfaces (both direct and virtual)
can be supported via a native Linux mechanism (the bonding driver),
while Dom0 can retain "veto power" over any guest direct interface
operation it deems privileged (vlan, mac address, promisc mode,
bandwidth allocation between VFs, etc.).

Leonid

2008-11-09 06:42:54

by Muli Ben-Yehuda

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Thu, Nov 06, 2008 at 04:40:21PM -0600, Anthony Liguori wrote:

> We've been talking about avoiding hardware passthrough entirely and
> just backing a virtio-net backend driver by a dedicated VF in the
> host. That avoids a huge amount of guest-facing complexity, let's
> migration Just Work, and should give the same level of performance.

I don't believe that it will, and every benchmark I've seen or have
done so far shows a significant performance gap between virtio and
direct assignment, even on 1G ethernet. I am willing however to
reserve judgement until someone implements your suggestion and
actually measures it, preferably on 10G ethernet.

No doubt device assignment---and SR-IOV in particular---are complex,
but I hardly think ignoring it as you seem to propose is the right
approach.

Cheers,
Muli
--
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
<->
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/

2008-11-09 12:46:33

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> It's that "second" part that I'm worried about. How is that going to
> happen? Do you have any patches that show this kind of "assignment"?
>
>

For kvm, this is in 2.6.28-rc.

Note there are two ways to assign a device to a guest:

- run the VF driver in the guest: this has the advantage of best
performance, but requires pinning all guest memory, makes live migration
a tricky proposition, and ties the guest to the underlying hardware.
- run the VF driver in the host, and use virtio to connect the guest to
the host: allows paging the guest and allows straightforward live
migration, but reduces performance, and hides any features not exposed
by virtio from the guest.


--
error compiling committee.c: too many arguments to function

2008-11-09 12:50:49

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Matthew Wilcox wrote:
>> What we would rather do in KVM, is have the VFs appear in the host as
>> standard network devices. We would then like to back our existing PV
>> driver to this VF directly bypassing the host networking stack. A key
>> feature here is being able to fill the VF's receive queue with guest
>> memory instead of host kernel memory so that you can get zero-copy
>> receive traffic. This will perform just as well as doing passthrough
>> (at least) and avoid all that ugliness of dealing with SR-IOV in the guest.
>>
>
> This argues for ignoring the SR-IOV mess completely.

It does, but VF-in-host is not the only model that we want to support.
It's just the most appealing.

There will definitely be people who want to run VF-in-guest.

--
error compiling committee.c: too many arguments to function

2008-11-09 12:53:50

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Andi Kleen wrote:
> Anthony Liguori <[email protected]> writes:
>
>> What we would rather do in KVM, is have the VFs appear in the host as
>> standard network devices. We would then like to back our existing PV
>> driver to this VF directly bypassing the host networking stack. A key
>> feature here is being able to fill the VF's receive queue with guest
>> memory instead of host kernel memory so that you can get zero-copy
>> receive traffic. This will perform just as well as doing passthrough
>> (at least) and avoid all that ugliness of dealing with SR-IOV in the
>> guest.
>>
>
> But you shift a lot of ugliness into the host network stack again.
> Not sure that is a good trade off.
>

The net effect will be positive. We will finally have aio networking
from userspace (can send process memory without resorting to
sendfile()), and we'll be able to assign a queue to a process (which
will enable all sorts of interesting high performance things; basically
VJ channels without kernel involvement).

> Also it would always require context switches and I believe one
> of the reasons for the PV/VF model is very low latency IO and having
> heavyweight switches to the host and back would be against that.
>

It's true that latency would suffer (or alternatively cpu consumption
would increase).

--
error compiling committee.c: too many arguments to function

2008-11-09 12:59:44

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:
>> We've been talking about avoiding hardware passthrough entirely and
>> just backing a virtio-net backend driver by a dedicated VF in the
>> host. That avoids a huge amount of guest-facing complexity, let's
>> migration Just Work, and should give the same level of performance.
>>
>
> Does that involve this patch set? Or a different type of interface.
>

So long as the VF is exposed as a standalone PCI device, it's the same
interface. In fact you can take a random PCI card and expose it to a
guest this way; it doesn't have to be SR-IOV. Of course, with a
standard PCI card you won't get much sharing (a quad port NIC will be
good for four guests).

We'll need other changes in the network stack, but these are orthogonal
to SR-IOV.

--
error compiling committee.c: too many arguments to function

2008-11-09 13:04:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Muli Ben-Yehuda wrote:
>> We've been talking about avoiding hardware passthrough entirely and
>> just backing a virtio-net backend driver by a dedicated VF in the
>> host. That avoids a huge amount of guest-facing complexity, let's
>> migration Just Work, and should give the same level of performance.
>>
>
> I don't believe that it will, and every benchmark I've seen or have
> done so far shows a significant performance gap between virtio and
> direct assignment, even on 1G ethernet. I am willing however to
> reserve judgement until someone implements your suggestion and
> actually measures it, preferably on 10G ethernet.
>

Right now virtio copies data, and has other inefficiencies. With a
dedicated VF, we can eliminate the copies.

CPU utilization and latency will be worse. If we can limit the
slowdowns to an acceptable amount, the simplicity and other advantages
of VF-in-host may outweigh the performance degradation.

> No doubt device assignment---and SR-IOV in particular---are complex,
> but I hardly think ignoring it as you seem to propose is the right
> approach.

I agree. We should hedge our bets and support both models.

--
error compiling committee.c: too many arguments to function

2008-11-09 14:19:33

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

Hi!

> >>> If we do resource rebalance after system is up, do you think there is any
> >>> side effect or impact to other subsystem other than PCI (e.g. MTRR)?
> >> I don't think so.
> >>> I haven't had much thinking on the dynamical resource rebalance. If you
> >>> have any idea about this, can you please suggest?
> >> Yeah, it's going to be hard :)
> >> We've thought about this in the past, and even Microsoft said it was
> >> going to happen for Vista, but they realized in the end, like we did a
> >> few years previously, that it would require full support of all PCI
> >> drivers as well (if you rebalance stuff that is already bound to a
> >> driver.) So they dropped it.
> >> When would you want to do this kind of rebalancing? Before any PCI
> >> driver is bound to any devices? Or afterwards?
> >
> > I guess if we want the rebalance dynamic, then we should have it full --
> > the rebalance would be functional even after the driver is loaded.
> >
> > But in most cases, there will be problem when we unload driver from a hard
> > disk controller, etc. We can mount root on a ramdisk and do the rebalance
> > there, but it's complicated for a real user.
> >
> > So looks like doing rebalancing before any driver is bound to any device is
> > also a nice idea, if user can get a shell to do rebalance before built-in
> > PCI driver grabs device.
>
> That's not going to work, it needs to happen before any PCI device is
> bound, which is before init runs.

We could run a shell from the early initrd... And PCI is not required
for the initrd, right?

(Ok, I guess this is in the "we could do it, but is it worth the cost?"
category...)

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-11-09 14:34:56

by Alexander Graf

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

>>>>> If we do resource rebalance after system is up, do you think
>>>>> there is any
>>>>> side effect or impact to other subsystem other than PCI (e.g.
>>>>> MTRR)?
>>>> I don't think so.
>>>>> I haven't had much thinking on the dynamical resource rebalance.
>>>>> If you
>>>>> have any idea about this, can you please suggest?
>>>> Yeah, it's going to be hard :)
>>>> We've thought about this in the past, and even Microsoft said it
>>>> was
>>>> going to happen for Vista, but they realized in the end, like we
>>>> did a
>>>> few years previously, that it would require full support of all PCI
>>>> drivers as well (if you rebalance stuff that is already bound to a
>>>> driver.) So they dropped it.
>>>> When would you want to do this kind of rebalancing? Before any PCI
>>>> driver is bound to any devices? Or afterwards?
>>>
>>> I guess if we want the rebalance dynamic, then we should have it
>>> full --
>>> the rebalance would be functional even after the driver is loaded.
>>>
>>> But in most cases, there will be problem when we unload driver
>>> from a hard
>>> disk controller, etc. We can mount root on a ramdisk and do the
>>> rebalance
>>> there, but it's complicated for a real user.
>>>
>>> So looks like doing rebalancing before any driver is bound to any
>>> device is
>>> also a nice idea, if user can get a shell to do rebalance before
>>> built-in
>>> PCI driver grabs device.
>>
>> That's not going to work, it needs to happen before any PCI device is
>> bound, which is before init runs.
>
> We could run shell from early initrd... And PCI is not required for
> initrd, right?

You can't be sure of that - compile your ATA driver =y and you'll
definitely end up using PCI.

Alex

2008-11-09 19:28:54

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Sun, Nov 09, 2008 at 02:44:06PM +0200, Avi Kivity wrote:
> Greg KH wrote:
>> It's that "second" part that I'm worried about. How is that going to
>> happen? Do you have any patches that show this kind of "assignment"?
>>
>>
>
> For kvm, this is in 2.6.28-rc.

Where? I just looked and couldn't find anything, but odds are I was
looking in the wrong place :(

> Note there are two ways to assign a device to a guest:
>
> - run the VF driver in the guest: this has the advantage of best
> performance, but requires pinning all guest memory, makes live migration a
> tricky proposition, and ties the guest to the underlying hardware.

Is this what you would prefer for kvm?

> - run the VF driver in the host, and use virtio to connect the guest to the
> host: allows paging the guest and allows straightforward live migration,
> but reduces performance, and hides any features not exposed by virtio from
> the guest.

thanks,

greg k-h

2008-11-09 19:38:48

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:
> On Sun, Nov 09, 2008 at 02:44:06PM +0200, Avi Kivity wrote:
>
>> Greg KH wrote:
>>
>>> It's that "second" part that I'm worried about. How is that going to
>>> happen? Do you have any patches that show this kind of "assignment"?
>>>
>>>
>>>
>> For kvm, this is in 2.6.28-rc.
>>
>
> Where? I just looked and couldn't find anything, but odds are I was
> looking in the wrong place :(
>
>

arch/x86/kvm/vtd.c: iommu integration (allows assigning the device's
memory resources)
virt/kvm/irq*: interrupt redirection (allows assigning the device's
interrupt resources)

the rest (pci config space, pio redirection) are in userspace.

>> Note there are two ways to assign a device to a guest:
>>
>> - run the VF driver in the guest: this has the advantage of best
>> performance, but requires pinning all guest memory, makes live migration a
>> tricky proposition, and ties the guest to the underlying hardware.
>>
>
> Is this what you would prefer for kvm?
>
>

It's not my personal preference, but it is a supported configuration.
For some use cases it is the only one that makes sense.

Again, VF-in-guest and VF-in-host both have their places. And since
Linux can be both guest and host, it's best if the VF driver knows
nothing about SR-IOV; it's just a pci driver. The PF driver should
emulate anything that SR-IOV does not provide (like missing pci config
space).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-11 06:35:19

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Sun, Nov 09, 2008 at 09:37:20PM +0200, Avi Kivity wrote:
> Greg KH wrote:
>> On Sun, Nov 09, 2008 at 02:44:06PM +0200, Avi Kivity wrote:
>>
>>> Greg KH wrote:
>>>
>>>> It's that "second" part that I'm worried about. How is that going to
>>>> happen? Do you have any patches that show this kind of "assignment"?
>>>>
>>>>
>>> For kvm, this is in 2.6.28-rc.
>>>
>>
>> Where? I just looked and couldn't find anything, but odds are I was
>> looking in the wrong place :(
>>
>>
>
> arch/x86/kvm/vtd.c: iommu integration (allows assigning the device's memory
> resources)

That file is not in 2.6.28-rc4 :(


> virt/kvm/irq*: interrupt redirection (allows assigning the device's
> interrupt resources)

I only see virt/kvm/irq_comm.c in 2.6.28-rc4.

> the rest (pci config space, pio redirection) are in userspace.

So you don't need these pci core changes at all?

>>> Note there are two ways to assign a device to a guest:
>>>
>>> - run the VF driver in the guest: this has the advantage of best
>>> performance, but requires pinning all guest memory, makes live migration
>>> a tricky proposition, and ties the guest to the underlying hardware.
>>
>> Is this what you would prefer for kvm?
>>
>
> It's not my personal preference, but it is a supported configuration. For
> some use cases it is the only one that makes sense.
>
> Again, VF-in-guest and VF-in-host both have their places. And since Linux
> can be both guest and host, it's best if the VF driver knows nothing about
> SR-IOV; it's just a pci driver. The PF driver should emulate anything that
> SR-IOV does not provide (like missing pci config space).

Yes, we need both.

thanks,

greg k-h

2008-11-11 09:04:25

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Greg KH wrote:



>> arch/x86/kvm/vtd.c: iommu integration (allows assigning the device's memory
>> resources)
>>
>
> That file is not in 2.6.28-rc4 :(
>
>

Sorry, was moved to virt/kvm/ for ia64's benefit.

>
>> virt/kvm/irq*: interrupt redirection (allows assigning the device's
>> interrupt resources)
>>
>
> I only see virt/kvm/irq_comm.c in 2.6.28-rc4.
>
>

kvm_main.c in that directory also has some related bits.

>> the rest (pci config space, pio redirection) are in userspace.
>>
>
> So you don't need these pci core changes at all?
>
>

Not beyond those required for SR-IOV and iommu support.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2008-11-12 22:41:44

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Andi Kleen wrote:
> Anthony Liguori <[email protected]> writes:
>> What we would rather do in KVM, is have the VFs appear in the host as
>> standard network devices. We would then like to back our existing PV
>> driver to this VF directly bypassing the host networking stack. A key
>> feature here is being able to fill the VF's receive queue with guest
>> memory instead of host kernel memory so that you can get zero-copy
>> receive traffic. This will perform just as well as doing passthrough
>> (at least) and avoid all that ugliness of dealing with SR-IOV in the
>> guest.
>
> But you shift a lot of ugliness into the host network stack again.
> Not sure that is a good trade off.
>
> Also it would always require context switches and I believe one
> of the reasons for the PV/VF model is very low latency IO and having
> heavyweight switches to the host and back would be against that.

I don't think it's established that PV/VF will have less latency than
using virtio-net. virtio-net requires a world switch to send a group of
packets. The cost of this (if it stays in kernel) is only a few
thousand cycles on the most modern processors.

Using VT-d means that for every DMA fetch that misses in the IOTLB, you
potentially have to do four memory fetches to main memory. There will
be additional packet latency using VT-d compared to native, it's just
not known how much at this time.

Regards,

Anthony Liguori


> -Andi
>

2008-11-13 07:47:21

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Fri, Nov 07, 2008 at 11:18:37AM +0800, Greg KH wrote:
> On Fri, Nov 07, 2008 at 11:01:29AM +0800, Zhao, Yu wrote:
> > Greg KH wrote:
> >> On Wed, Nov 05, 2008 at 08:33:18PM -0800, Greg KH wrote:
> >>> On Wed, Oct 22, 2008 at 04:45:15PM +0800, Yu Zhao wrote:
> >>>> Documentation/ABI/testing/sysfs-bus-pci | 33
> >>>> +++++++++++++++++++++++++++++++
> >>>> 1 files changed, 33 insertions(+), 0 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci
> >>>> b/Documentation/ABI/testing/sysfs-bus-pci
> >>>> index ceddcff..41cce8f 100644
> >>>> --- a/Documentation/ABI/testing/sysfs-bus-pci
> >>>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> >>>> @@ -9,3 +9,36 @@ Description:
> >>>> that some devices may have malformatted data. If the
> >>>> underlying VPD has a writable section then the
> >>>> corresponding section of this file will be writable.
> >>>> +
> >>>> +What: /sys/bus/pci/devices/.../iov/enable
> >>> Are you sure this is still the correct location with your change to
> >>> struct device?
> >> Nevermind, this is correct.
> >> But the bigger problem is that userspace doesn't know when these
> >> attributes show up. So tools like udev and HAL and others can't look
> >> for them as they never get notified, and they don't even know if they
> >> should be looking for them or not.
> >> Is there any way to tie these attributes to the "main" pci device so
> >> that they get created before the device is announced to the world?
> >> Doing that would solve this issue.
> >> thanks,
> >> greg k-h
> >
> > Currently PCI subsystem has /sys/.../{vendor,device,...} bundled to the
> > main PCI device (I suppose this means the entries are created by
> > 'device_add')
> >
> > And after the PCI device is announced,
> > /sys/.../{config,resourceX,rom,vpd,iov,...} get created depending on if
> > these features are supported.
>
> And that's a bug. Let's not continue to make the same bug here as well.
>
> > Making dynamic entries tie to the main PCI device would require PCI
> > subsystem to allocate different 'bus_type' for the devices, right?
>
> No, it would just mean they need to be all added before the device is
> fully registered with the driver core.

I looked into the PCI and driver core code again, but didn't figure out
how to do it.

A 'pci_dev' is added by pci_bus_add_device() via device_add(), which
creates sysfs entries according to 'dev_attrs' in the 'pci_bus_type'. If
we want those dynamic entries to appear before the uevent is triggered,
we have to bundle them into the 'dev_attrs'. Is this the right way to
handle the dynamic entries? Or did I miss something?

Thanks,
Yu

2008-11-13 08:46:30

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

On Sat, Nov 08, 2008 at 02:48:25AM +0800, Greg KH wrote:
> On Fri, Nov 07, 2008 at 11:17:40PM +0800, Yu Zhao wrote:
> > While we are arguing what the software model the SR-IOV should be, let me
> > ask two simple questions first:
> >
> > 1, What does the SR-IOV looks like?
> > 2, Why do we need to support it?
>
> I don't think we need to worry about those questions, as we can see what
> the SR-IOV interface looks like by looking at the PCI spec, and we know
> Linux needs to support it, as Linux needs to support everything :)
>
> (note, community members that can not see the PCI specs at this point in
> time, please know that we are working on resolving these issues,
> hopefully we will have some good news within a month or so.)

Thanks for doing this!

>
> > As you know the Linux kernel is the base of various virtual machine
> > monitors such as KVM, Xen, OpenVZ and VServer. We need SR-IOV support in
> > the kernel because mostly it helps high-end users (IT departments, HPC,
> > etc.) to share limited hardware resources among hundreds or even thousands
> > virtual machines and hence reduce the cost. How can we make these virtual
> > machine monitors utilize the advantage of SR-IOV without spending too much
> > effort meanwhile remaining architectural correctness? I believe making VF
> > represent as much closer as a normal PCI device (struct pci_dev) is the
> > best way in current situation, because this is not only what the hardware
> > designers expect us to do but also the usage model that KVM, Xen and other
> > VMMs have already supported.
>
> But would such an api really take advantage of the new IOV interfaces
> that are exposed by the new device type?

SR-IOV is a very straightforward capability -- it can only reside in the
Physical Function's (the real device's) config space, and it controls
the allocation of Virtual Functions through several registers. What we
can do in the PCI layer is make an SR-IOV device spawn VFs upon user
request and register the VFs with the PCI core. The functionality of an
SR-IOV device (both the PF and the VFs) can vary over a wide range, and
their drivers (like normal PCI device drivers) are responsible for
handling the device-specific stuff.

So it looks like we can get all the work done in the PCI layer with only
two interfaces: one for the PF driver to register itself as an SR-IOV
capable driver, expose the sysfs (or ioctl) interface to receive user
requests, and allocate a 'pci_dev' for each VF; and another one to clean
everything up when the PF driver unregisters itself (e.g., the driver is
removed or the device is going into a power-saving mode).
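Purely as a sketch of the shape such a pair of interfaces could take
(these are not the functions from the posted patches; all names and the
callback signature are assumptions for illustration):

#include <linux/pci.h>

struct example_iov_ops {
	/* Called by the PCI core when the user asks (e.g. via sysfs) for
	 * a number of VFs; the PF driver partitions its hardware
	 * resources and returns 0 on success. */
	int (*enable)(struct pci_dev *pf, int numvfs);
	/* Called before the VFs' pci_dev structures are torn down. */
	void (*disable)(struct pci_dev *pf);
};

/* Registration pair used from the PF driver's probe/remove (or
 * suspend) paths. */
int example_iov_register(struct pci_dev *pf,
			 const struct example_iov_ops *ops);
void example_iov_unregister(struct pci_dev *pf);

/* A hypothetical PF driver would then provide something like: */
static int example_pf_enable(struct pci_dev *pf, int numvfs)
{
	/* divide queues, interrupts, etc. among numvfs VFs */
	return 0;
}

static void example_pf_disable(struct pci_dev *pf)
{
	/* undo the partitioning */
}

static const struct example_iov_ops example_pf_iov_ops = {
	.enable		= example_pf_enable,
	.disable	= example_pf_disable,
};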

>
> > I agree that API in the SR-IOV pacth is arguable and the concerns such as
> > lack of PF driver, etc. are also valid. But I personally think these stuff
> > are not essential problems to me and other SR-IOV driver developers.
>
> How can the lack of a PF driver not be a valid concern at this point in
> time? Without such a driver written, how can we know that the SR-IOV
> interface as created is sufficient, or that it even works properly?
>
> Here's what I see we need to have before we can evaluate if the IOV core
> PCI patches are acceptable:
> - a driver that uses this interface
> - a PF driver that uses this interface.
>
> Without those, we can't determine if the infrastructure provided by the
> IOV core even is sufficient, right?

Yes, using a PF driver to evaluate the SR-IOV core is necessary. And only
the PF driver can use the interface since the VF shouldn't have the SR-IOV
capability in its config space according to the spec.

Regards,
Yu

> Rumor has it that there is both of the above things floating around, can
> someone please post them to the linux-pci list so that we can see how
> this all works together?
>
> thanks,
>
> greg k-h

2008-11-14 00:43:24

by Simon Horman

[permalink] [raw]
Subject: Re: [PATCH 2/16 v6] PCI: define PCI resource names in an 'enum'

On Wed, Oct 22, 2008 at 04:40:41PM +0800, Yu Zhao wrote:
> This patch moves all definitions of the PCI resource names to an 'enum',
> and also replaces some hard-coded resource variables with symbol
> names. This change eases introduction of device specific resources.
>
> Cc: Alex Chiang <[email protected]>
> Cc: Grant Grundler <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jesse Barnes <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Randy Dunlap <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Yu Zhao <[email protected]>
>
> ---
> drivers/pci/pci-sysfs.c | 4 +++-
> drivers/pci/pci.c | 19 ++-----------------
> drivers/pci/probe.c | 2 +-
> drivers/pci/proc.c | 7 ++++---
> include/linux/pci.h | 37 ++++++++++++++++++++++++-------------
> 5 files changed, 34 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 110022d..5c456ab 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -101,11 +101,13 @@ resource_show(struct device * dev, struct device_attribute *attr, char * buf)
> struct pci_dev * pci_dev = to_pci_dev(dev);
> char * str = buf;
> int i;
> - int max = 7;
> + int max;
> resource_size_t start, end;
>
> if (pci_dev->subordinate)
> max = DEVICE_COUNT_RESOURCE;
> + else
> + max = PCI_BRIDGE_RESOURCES;
>
> for (i = 0; i < max; i++) {
> struct resource *res = &pci_dev->resource[i];
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ae62f01..40284dc 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -359,24 +359,9 @@ pci_find_parent_resource(const struct pci_dev *dev, struct resource *res)
> static void
> pci_restore_bars(struct pci_dev *dev)
> {
> - int i, numres;
> -
> - switch (dev->hdr_type) {
> - case PCI_HEADER_TYPE_NORMAL:
> - numres = 6;
> - break;
> - case PCI_HEADER_TYPE_BRIDGE:
> - numres = 2;
> - break;
> - case PCI_HEADER_TYPE_CARDBUS:
> - numres = 1;
> - break;
> - default:
> - /* Should never get here, but just in case... */
> - return;
> - }
> + int i;
>
> - for (i = 0; i < numres; i++)
> + for (i = 0; i < PCI_BRIDGE_RESOURCES; i++)
> pci_update_resource(dev, i);
> }
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index aaaf0a1..a52784c 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -426,7 +426,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent,
> child->subordinate = 0xff;
>
> /* Set up default resource pointers and names.. */
> - for (i = 0; i < 4; i++) {
> + for (i = 0; i < PCI_BRIDGE_RES_NUM; i++) {
> child->resource[i] = &bridge->resource[PCI_BRIDGE_RESOURCES+i];
> child->resource[i]->name = child->name;
> }
> diff --git a/drivers/pci/proc.c b/drivers/pci/proc.c
> index e1098c3..f6f2a59 100644
> --- a/drivers/pci/proc.c
> +++ b/drivers/pci/proc.c
> @@ -352,15 +352,16 @@ static int show_device(struct seq_file *m, void *v)
> dev->vendor,
> dev->device,
> dev->irq);
> - /* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */
> - for (i=0; i<7; i++) {
> +
> + /* only print standard and ROM resources to preserve compatibility */
> + for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
> resource_size_t start, end;
> pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
> seq_printf(m, "\t%16llx",
> (unsigned long long)(start |
> (dev->resource[i].flags & PCI_REGION_FLAG_MASK)));
> }
> - for (i=0; i<7; i++) {
> + for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
> resource_size_t start, end;
> pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
> seq_printf(m, "\t%16llx",
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 43e1fc1..2ada2b6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -76,7 +76,30 @@ enum pci_mmap_state {
> #define PCI_DMA_FROMDEVICE 2
> #define PCI_DMA_NONE 3
>
> -#define DEVICE_COUNT_RESOURCE 12
> +/*
> + * For PCI devices, the region numbers are assigned this way:
> + */
> +enum {
> + /* #0-5: standard PCI regions */
> + PCI_STD_RESOURCES,
> + PCI_STD_RESOURCES_END = 5,
> +
> + /* #6: expansion ROM */
> + PCI_ROM_RESOURCE,
> +
> + /* address space assigned to buses behind the bridge */
> +#ifndef PCI_BRIDGE_RES_NUM
> +#define PCI_BRIDGE_RES_NUM 4
> +#endif


Is there any intention to ever set PCI_BRIDGE_RES_NUM to any
value other than 4? I'm confused about why it is protected
by #ifndef as I can't find it declared anywhere else.

> + PCI_BRIDGE_RESOURCES,
> + PCI_BRIDGE_RES_END = PCI_BRIDGE_RESOURCES + PCI_BRIDGE_RES_NUM - 1,
> +
> + /* total resources associated with a PCI device */
> + PCI_NUM_RESOURCES,
> +
> + /* preserve this for compatibility */
> + DEVICE_COUNT_RESOURCE
> +};
>
> typedef int __bitwise pci_power_t;
>
> @@ -262,18 +285,6 @@ static inline void pci_add_saved_cap(struct pci_dev *pci_dev,
> hlist_add_head(&new_cap->next, &pci_dev->saved_cap_space);
> }
>
> -/*
> - * For PCI devices, the region numbers are assigned this way:
> - *
> - * 0-5 standard PCI regions
> - * 6 expansion ROM
> - * 7-10 bridges: address space assigned to buses behind the bridge
> - */
> -
> -#define PCI_ROM_RESOURCE 6
> -#define PCI_BRIDGE_RESOURCES 7
> -#define PCI_NUM_RESOURCES 11
> -
> #ifndef PCI_BUS_NUM_RESOURCES
> #define PCI_BUS_NUM_RESOURCES 16
> #endif

--
Simon Horman
VA Linux Systems Japan K.K., Sydney, Australia Satellite Office
H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en

2008-11-14 00:56:25

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Thu, Nov 13, 2008 at 02:50:24PM +0800, Yu Zhao wrote:
> On Fri, Nov 07, 2008 at 11:18:37AM +0800, Greg KH wrote:
> > On Fri, Nov 07, 2008 at 11:01:29AM +0800, Zhao, Yu wrote:
> > > Greg KH wrote:
> > >> On Wed, Nov 05, 2008 at 08:33:18PM -0800, Greg KH wrote:
> > >>> On Wed, Oct 22, 2008 at 04:45:15PM +0800, Yu Zhao wrote:
> > >>>> Documentation/ABI/testing/sysfs-bus-pci | 33
> > >>>> +++++++++++++++++++++++++++++++
> > >>>> 1 files changed, 33 insertions(+), 0 deletions(-)
> > >>>>
> > >>>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci
> > >>>> b/Documentation/ABI/testing/sysfs-bus-pci
> > >>>> index ceddcff..41cce8f 100644
> > >>>> --- a/Documentation/ABI/testing/sysfs-bus-pci
> > >>>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > >>>> @@ -9,3 +9,36 @@ Description:
> > >>>> that some devices may have malformatted data. If the
> > >>>> underlying VPD has a writable section then the
> > >>>> corresponding section of this file will be writable.
> > >>>> +
> > >>>> +What: /sys/bus/pci/devices/.../iov/enable
> > >>> Are you sure this is still the correct location with your change to
> > >>> struct device?
> > >> Nevermind, this is correct.
> > >> But the bigger problem is that userspace doesn't know when these
> > >> attributes show up. So tools like udev and HAL and others can't look
> > >> for them as they never get notified, and they don't even know if they
> > >> should be looking for them or not.
> > >> Is there any way to tie these attributes to the "main" pci device so
> > >> that they get created before the device is announced to the world?
> > >> Doing that would solve this issue.
> > >> thanks,
> > >> greg k-h
> > >
> > > Currently PCI subsystem has /sys/.../{vendor,device,...} bundled to the
> > > main PCI device (I suppose this means the entries are created by
> > > 'device_add')
> > >
> > > And after the PCI device is announced,
> > > /sys/.../{config,resourceX,rom,vpd,iov,...} get created depending on if
> > > these features are supported.
> >
> > And that's a bug. Let's not continue to make the same bug here as well.
> >
> > > Making dynamic entries tie to the main PCI device would require PCI
> > > subsystem to allocate different 'bus_type' for the devices, right?
> >
> > No, it would just mean they need to be all added before the device is
> > fully registered with the driver core.
>
> I looked into the PCI and driver core code again, but didn't figured out how
> to do it.
>
> A 'pci_dev' is added by pci_bus_add_device() via device_add(), which creates
> sysfs entries according to 'dev_attrs' in the 'pci_bus_type'. If we want those
> dynamic entries to appear before the uevent is triggered, we have to bundle
> them into the 'dev_attrs'. Is this right way for the dynamic entries? Or I
> missed something?

Yes, that is correct.

Or you can add attributes before device_add() is called to the device,
which is probably much easier to do, right?

There are also "conditional" attributes, which get only displayed if
some kind of condition is met, I think you want to use those.

thanks,

greg k-h

2008-11-16 16:07:24

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Anthony Liguori wrote:
> I don't think it's established that PV/VF will have less latency than
> using virtio-net. virtio-net requires a world switch to send a group
> of packets. The cost of this (if it stays in kernel) is only a few
> thousand cycles on the most modern processors.
>
> Using VT-d means that for every DMA fetch that misses in the IOTLB,
> you potentially have to do four memory fetches to main memory. There
> will be additional packet latency using VT-d compared to native, it's
> just not known how much at this time.

If the IOTLB has intermediate TLB entries like the processor, we're
talking just one or two fetches. That's a lot less than the cacheline
bouncing that virtio and kvm interrupt injection incur right now.

--
error compiling committee.c: too many arguments to function

2008-11-17 01:47:18

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Avi Kivity wrote:
> Anthony Liguori wrote:
>> I don't think it's established that PV/VF will have less latency than
>> using virtio-net. virtio-net requires a world switch to send a group
>> of packets. The cost of this (if it stays in kernel) is only a few
>> thousand cycles on the most modern processors.
>>
>> Using VT-d means that for every DMA fetch that misses in the IOTLB,
>> you potentially have to do four memory fetches to main memory. There
>> will be additional packet latency using VT-d compared to native, it's
>> just not known how much at this time.
>
> If the IOTLB has intermediate TLB entries like the processor, we're
> talking just one or two fetches. That's a lot less than the cacheline
> bouncing that virtio and kvm interrupt injection incur right now.
>

The PCI-SIG Address Translation Services (ATS) specification defines a way
of using an Address Translation Cache (ATC) in the endpoint to reduce the
latency.

The Linux kernel support for the ATS capability will come soon.

Thanks,
Yu

2008-11-17 09:06:52

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Fri, Nov 14, 2008 at 08:55:38AM +0800, Greg KH wrote:
> On Thu, Nov 13, 2008 at 02:50:24PM +0800, Yu Zhao wrote:
> > On Fri, Nov 07, 2008 at 11:18:37AM +0800, Greg KH wrote:
> > > On Fri, Nov 07, 2008 at 11:01:29AM +0800, Zhao, Yu wrote:
> > > > Currently PCI subsystem has /sys/.../{vendor,device,...} bundled to the
> > > > main PCI device (I suppose this means the entries are created by
> > > > 'device_add')
> > > >
> > > > And after the PCI device is announced,
> > > > /sys/.../{config,resourceX,rom,vpd,iov,...} get created depending on if
> > > > these features are supported.
> > >
> > > And that's a bug. Let's not continue to make the same bug here as well.
> > >
> > > > Making dynamic entries tie to the main PCI device would require PCI
> > > > subsystem to allocate different 'bus_type' for the devices, right?
> > >
> > > No, it would just mean they need to be all added before the device is
> > > fully registered with the driver core.
> >
> > I looked into the PCI and driver core code again, but didn't figured out how
> > to do it.
> >
> > A 'pci_dev' is added by pci_bus_add_device() via device_add(), which creates
> > sysfs entries according to 'dev_attrs' in the 'pci_bus_type'. If we want those
> > dynamic entries to appear before the uevent is triggered, we have to bundle
> > them into the 'dev_attrs'. Is this right way for the dynamic entries? Or I
> > missed something?
>
> Yes, that is correct.
>
> Or you can add attributes before device_add() is called to the device,
> which is probably much easier to do, right?
>
> There are also "conditional" attributes, which get only displayed if
> some kind of condition is met, I think you want to use those.

The problem is that the sysfs directory of the PCI device is created by
the kobject_add() in device_add() as follows, and the static entries
bundled with the 'pci_bus_type' are created by bus_add_device(). Between
the kobject_add() and the kobject_uevent() there is no other chance to
add the dynamic entries.

In device_add():

error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
...
error = bus_add_device(dev);
...
kobject_uevent(&dev->kobj, KOBJ_ADD);


So it looks like the only way is to bundle the dynamic entries with the
'pci_bus_type', which means they would become static regardless of whether
the device supports them (i.e. the corresponding capabilities) or not.

Thanks,
Yu

2008-11-17 12:03:35

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Rusty Russell wrote:
> On Friday 07 November 2008 18:17:54 Zhao, Yu wrote:
> > Greg KH wrote:
> > > On Thu, Nov 06, 2008 at 04:40:21PM -0600, Anthony Liguori wrote:
> > >> Greg KH wrote:
> > >>> On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
> > >>>> I don't think we really know what the One True Usage model is for VF
> > >>>> devices. Chris Wright has some ideas, I have some ideas and Yu Zhao
> > >>>> has some ideas. I bet there's other people who have other ideas too.
> > >>>
> > >>> I'd love to hear those ideas.
> > >>
> > >> We've been talking about avoiding hardware passthrough entirely and
> > >> just backing a virtio-net backend driver by a dedicated VF in the
> > >> host. That avoids a huge amount of guest-facing complexity, let's
> > >> migration Just Work, and should give the same level of performance.
> >
> > This can be commonly used not only with VF -- devices that have multiple
> > DMA queues (e.g., Intel VMDq, Neterion Xframe) and even traditional
> > devices can also take the advantage of this.
> >
> > CC Rusty Russel in case he has more comments.
>
> Yes, even dumb devices could use this mechanism if you wanted to bind an
> entire device solely to one guest.
>
> We don't have network infrastructure for this today, but my thought was
> to do something in dev_alloc_skb and dev_kfree_skb et al.

Is there any discussion about this on netdev? Any prototype
available? If not, I'd like to create one and evaluate the performance
of the virtio-net solution against hardware passthrough.

Thanks,
Yu

2008-11-18 15:13:23

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Mon, Nov 17, 2008 at 04:09:49PM +0800, Yu Zhao wrote:
> On Fri, Nov 14, 2008 at 08:55:38AM +0800, Greg KH wrote:
> > On Thu, Nov 13, 2008 at 02:50:24PM +0800, Yu Zhao wrote:
> > > On Fri, Nov 07, 2008 at 11:18:37AM +0800, Greg KH wrote:
> > > > On Fri, Nov 07, 2008 at 11:01:29AM +0800, Zhao, Yu wrote:
> > > > > Currently PCI subsystem has /sys/.../{vendor,device,...} bundled to the
> > > > > main PCI device (I suppose this means the entries are created by
> > > > > 'device_add')
> > > > >
> > > > > And after the PCI device is announced,
> > > > > /sys/.../{config,resourceX,rom,vpd,iov,...} get created depending on if
> > > > > these features are supported.
> > > >
> > > > And that's a bug. Let's not continue to make the same bug here as well.
> > > >
> > > > > Making dynamic entries tie to the main PCI device would require PCI
> > > > > subsystem to allocate different 'bus_type' for the devices, right?
> > > >
> > > > No, it would just mean they need to be all added before the device is
> > > > fully registered with the driver core.
> > >
> > > I looked into the PCI and driver core code again, but didn't figured out how
> > > to do it.
> > >
> > > A 'pci_dev' is added by pci_bus_add_device() via device_add(), which creates
> > > sysfs entries according to 'dev_attrs' in the 'pci_bus_type'. If we want those
> > > dynamic entries to appear before the uevent is triggered, we have to bundle
> > > them into the 'dev_attrs'. Is this right way for the dynamic entries? Or I
> > > missed something?
> >
> > Yes, that is correct.
> >
> > Or you can add attributes before device_add() is called to the device,
> > which is probably much easier to do, right?
> >
> > There are also "conditional" attributes, which get only displayed if
> > some kind of condition is met, I think you want to use those.
>
> The problem is that the sysfs directory of the PCI device is created by
> the kobject_add() in device_add() as follows, and the static entries
> bundled with the 'pci_bus_type' are created by bus_add_device(). Between
> the kobject_add() and the kobject_uevent() there is no other chance to
> add the dynamic entries.
>
> In device_add():
>
> error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
> ...
> error = bus_add_device(dev);
> ...
> kobject_uevent(&dev->kobj, KOBJ_ADD);
>
>
> So it looks like the only way is to bundle the dynamic entries with the
> 'pci_bus_type', which means they would become static regardless of whether
> the device supports them (i.e. the corresponding capabilities) or not.

No, this can work, other busses do this. There are "conditional"
attributes that only get enabled if specific things happen, and you can
add attributes before device_add() is called. See the scsi code for
examples of both of these options.
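
Just to sketch the "add attributes before device_add()" option: the
following is illustrative only, not code from this patch series; the
names iov_enable_show, iov_attrs, iov_group, iov_groups and
pdev_supports_sriov() are made up for the example.

static ssize_t iov_enable_show(struct device *dev,
			       struct device_attribute *attr, char *buf)
{
	return sprintf(buf, "%d\n", 0);		/* placeholder value */
}
static DEVICE_ATTR(enable, 0444, iov_enable_show, NULL);

static struct attribute *iov_attrs[] = {
	&dev_attr_enable.attr,
	NULL,
};

static struct attribute_group iov_group = {
	.name	= "iov",
	.attrs	= iov_attrs,
};

static struct attribute_group *iov_groups[] = {
	&iov_group,
	NULL,
};

	/* done anywhere before device_add(&pdev->dev) runs, e.g. at scan time */
	if (pdev_supports_sriov(pdev))		/* hypothetical capability check */
		pdev->dev.groups = iov_groups;

Since device_add() walks dev->groups before it emits KOBJ_ADD, the entries
would already exist when udev sees the device.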

thanks,

greg k-h

2008-11-18 16:49:24

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 15/16 v6] PCI: document the SR-IOV sysfs entries

On Mon, Nov 17, 2008 at 09:09, Yu Zhao <[email protected]> wrote:
> On Fri, Nov 14, 2008 at 08:55:38AM +0800, Greg KH wrote:

>> There are also "conditional" attributes, which get only displayed if
>> some kind of condition is met, I think you want to use those.
>
> The problem is that the sysfs directory of the PCI device is created by
> the kobject_add() in device_add() as follows, and the static entries
> bundled with the 'pci_bus_type' are created by bus_add_device(). Between
> the kobject_add() and the kobject_uevent() there is no other chance to
> add the dynamic entries.
>
> In device_add():
>
> error = kobject_add(&dev->kobj, dev->kobj.parent, "%s", dev->bus_id);
> ...
> error = bus_add_device(dev);
> ...
> kobject_uevent(&dev->kobj, KOBJ_ADD);
>
>
> So it looks like the only way is to bundle the dynamic entries with the
> 'pci_bus_type', which means they would become static regardless of whether
> the device supports them (i.e. the corresponding capabilities) or not.

There is device_add_attrs(), which is called right between the calls you
mention above. Like Greg said, it can add groups, and groups have an
is_visible() callback, which can be used to conditionally create
attributes out of a predefined list.
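
A minimal sketch of that approach, reusing the hypothetical 'enable'
attribute from the sketch above (again, these names are invented for
illustration and not taken from the patch series):

/* return 0 to hide an attribute, or its normal mode to show it */
static mode_t iov_attrs_visible(struct kobject *kobj,
				struct attribute *attr, int n)
{
	struct device *dev = container_of(kobj, struct device, kobj);
	struct pci_dev *pdev = to_pci_dev(dev);

	return pdev_supports_sriov(pdev) ? attr->mode : 0;
}

static struct attribute *iov_attrs[] = {
	&dev_attr_enable.attr,		/* from DEVICE_ATTR(enable, ...) above */
	NULL,
};

static struct attribute_group iov_group = {
	.name		= "iov",
	.attrs		= iov_attrs,
	.is_visible	= iov_attrs_visible,
};

/* hung off pdev->dev.groups so device_add_attrs() processes it */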

Kay

2008-12-11 01:44:29

by Zhao, Yu

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Fri, Nov 07, 2008 at 12:17:22PM +0800, Matthew Wilcox wrote:
> On Fri, Nov 07, 2008 at 11:40:21AM +0800, Zhao, Yu wrote:
> > Greg KH wrote:
> > >We've thought about this in the past, and even Microsoft said it was
> > >going to happen for Vista, but they realized in the end, like we did a
> > >few years previously, that it would require full support of all PCI
> > >drivers as well (if you rebalance stuff that is already bound to a
> > >driver.) So they dropped it.
> > >
> > >When would you want to do this kind of rebalancing? Before any PCI
> > >driver is bound to any devices? Or afterwards?
> >
> > I guess if we want the rebalance dynamic, then we should have it full --
> > the rebalance would be functional even after the driver is loaded.
> >
> > But in most cases, there will be problem when we unload driver from a
> > hard disk controller, etc. We can mount root on a ramdisk and do the
> > rebalance there, but it's complicated for a real user.
> >
> > So looks like doing rebalancing before any driver is bound to any device
> > is also a nice idea, if user can get a shell to do rebalance before
> > built-in PCI driver grabs device.
>
> Can we use the suspend/resume code to do this? Some drivers (sym2 for
> one) would definitely need to rerun some of their init code to cope with
> a BAR address changing.

Yes, that is what I was thinking. But after grepping through the PCI
device drivers, I feel frustrated because all those drivers only do
'ioremap' once, at the 'probe' stage.

I believe this is the only problem that precludes us from having run-time
resource rebalance. And I'm not sure how much effort it would take to fix
it. Any comments?
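
For the record, here is roughly what each driver would have to grow if its
BARs could move at run time (purely a sketch: 'foo' is a made-up driver,
struct foo_priv is hypothetical, and no such callback exists in the PCI
core today):

static int foo_remap_bars(struct pci_dev *pdev)
{
	struct foo_priv *priv = pci_get_drvdata(pdev);	/* driver private data */

	/* drop the mapping that probe() set up once and never touched again */
	if (priv->regs)
		iounmap(priv->regs);

	/* re-read the (possibly relocated) BAR 0 and map it again */
	priv->regs = ioremap(pci_resource_start(pdev, 0),
			     pci_resource_len(pdev, 0));
	if (!priv->regs)
		return -ENOMEM;

	return 0;
}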


Thanks,
Yu

2008-12-11 04:34:06

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

On Thu, Dec 11, 2008 at 09:43:13AM +0800, Yu Zhao wrote:
...
> I believe this is the only problem that precludes us from having run-time
> resource rebalance. And I'm not sure how much effort it would take to fix
> it. Any comments?

Figure out the right sequence for driver resume so the probe function
can call resume as well?

Document the change and then start modifying drivers one-by-one.
API changes are a lot of work.

grant

2008-12-11 15:40:40

by H L

[permalink] [raw]
Subject: Re: [PATCH 16/16 v6] PCI: document the new PCI boot parameters

--- On Wed, 12/10/08, Grant Grundler <[email protected]> wrote:

> Date: Wednesday, December 10, 2008, 10:33 PM
> On Thu, Dec 11, 2008 at 09:43:13AM +0800, Yu Zhao wrote:
> ...
> > I believe this is the only problem that precludes us from having run-time
> > resource rebalance. And I'm not sure how much effort it would take to fix
> > it. Any comments?
>
> Figure out the right sequence for driver resume so the probe function
> can call resume as well?
>
> Document the change and then start modifying drivers one-by-one.
> API changes are a lot of work.
>
> grant
> --


I've been lurking, waiting to see such a discussion. Alerting PCI drivers
that their resources have been changed underneath them, either by extending
the suspend/resume model or perhaps (heresy?) by adding a new callback entry
point specifically for instructing PCI drivers to re-read their BARs, would
be a step in the right direction to enable this whole re-balancing work.
Granted, root/paging devices bound to PCI devices could be tricky. It seems
entirely natural (to me) that Microsoft would shudder at the amount of work
and driver verification required to do this, but the Linux community's
attitude seems to embrace sweeping changes ;-).

--
LH