2013-05-14 02:29:22

by Yinghai Lu

Subject: [PATCH 0/7] PCI: fix pci dev add and remove sequence

This patch set started while we were trying to address the double
removal of PCI devices via sysfs that Gu found.

The main point is from Bjorn: take a reference on the bus. He also
pointed out that the release should be done in pci_release_dev().

After reviewing the add and remove paths, we found more problems that
need to be addressed:
1. proc attach/detach is not balanced.
2. pci_stop_and_remove_bus_device() is not safe to call multiple times.

We also found that the SR-IOV VF add path has a problem: it calls
pci_bus_add_device() too early, which lets the VF's driver get probed
before the PF's driver has finished probing. That also causes a nested
lock problem, sketched below.
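
(An illustrative sketch of the nesting, reconstructed from patch 6's
description; simplified call chain, not actual code:)

	/*
	 * PF driver .probe()                  <- PF probe still running
	 *   pci_enable_sriov()
	 *     virtfn_add()
	 *       pci_bus_add_device(virtfn)    <- VF driver probed here,
	 *                                        nested inside PF probe
	 */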

Please consider these patches as 3.10 material.

Thanks

Yinghai

PCI: move back pci_proc_attach_device calling
PCI: move resources and bus_list releasing to pci_release_dev
PCI: Detach driver in pci_stop_device
PCI: Fix racing for pci device removing via sysfs
PCI, ACPI: Don't glue ACPI dev with pci VFs
PCI: Make sure VF's driver get attached after PF's
PCI: use pf as dma_dev for vf that does not have func0 sibling


2013-05-14 02:28:50

by Yinghai Lu

Subject: [PATCH 1/7] PCI: move back pci_proc_attach_device calling

We detach the /proc entry in pci_stop_dev(), so we should attach it
in pci_bus_add_device().

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/bus.c | 1 +
drivers/pci/probe.c | 2 --
2 files changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -1348,8 +1348,6 @@ void pci_device_add(struct pci_dev *dev,
dev->match_driver = false;
ret = device_add(&dev->dev);
WARN_ON(ret < 0);
-
- pci_proc_attach_device(dev);
}

struct pci_dev *__ref pci_scan_single_device(struct pci_bus *bus, int devfn)
Index: linux-2.6/drivers/pci/bus.c
===================================================================
--- linux-2.6.orig/drivers/pci/bus.c
+++ linux-2.6/drivers/pci/bus.c
@@ -175,6 +175,7 @@ int pci_bus_add_device(struct pci_dev *d
* are not assigned yet for some devices.
*/
pci_fixup_device(pci_fixup_final, dev);
+ pci_proc_attach_device(dev);
pci_create_sysfs_dev_files(dev);

dev->match_driver = true;
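
(After this patch the /proc hooks pair up symmetrically; a condensed
view of the two call sites, simplified from the patched bus.c and the
existing remove.c:)

	int pci_bus_add_device(struct pci_dev *dev)
	{
		...
		pci_fixup_device(pci_fixup_final, dev);
		pci_proc_attach_device(dev);	/* attached when added */
		pci_create_sysfs_dev_files(dev);
		...
	}

	static void pci_stop_dev(struct pci_dev *dev)
	{
		if (dev->is_added) {
			pci_proc_detach_device(dev); /* detached when stopped */
			pci_remove_sysfs_dev_files(dev);
			...
		}
	}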

2013-05-14 02:29:24

by Yinghai Lu

Subject: [PATCH 4/7] PCI: Fix racing for pci device removing via sysfs

Gu found that nested removal via
echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
/sys/bus/pci/devices/0000\:1a\:01.0/remove

crashes the kernel because the bus gets freed.

[ 418.946462] CPU 4
[ 418.968377] Pid: 512, comm: kworker/u:2 Tainted: G W 3.8.0 #2
FUJITSU-SV PRIMEQUEST 1800E/SB
[ 419.081763] RIP: 0010:[<ffffffff8137972e>] [<ffffffff8137972e>]
pci_bus_read_config_word+0x5e/0x90
[ 420.494137] Call Trace:
[ 420.523326] [<ffffffff813851ef>] ? remove_callback+0x1f/0x40
[ 420.591984] [<ffffffff8138044b>] pci_pme_active+0x4b/0x1c0
[ 420.658545] [<ffffffff8137d8e7>] pci_stop_bus_device+0x57/0xb0
[ 420.729259] [<ffffffff8137dab6>] pci_stop_and_remove_bus_device+0x16/0x30
[ 420.811392] [<ffffffff813851fb>] remove_callback+0x2b/0x40
[ 420.877955] [<ffffffff81257a56>] sysfs_schedule_callback_work+0x26/0x70

https://bugzilla.kernel.org/show_bug.cgi?id=54411

We had a patch that made the device hold a reference on its bus to
keep the bus from being freed, but that alone still generated this
warning:

------------[ cut here ]------------
WARNING: at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
Hardware name: PRIMEQUEST 1800E
list_del corruption, ffff8807d1b6c000->next is LIST_POISON1 (dead000000100100)
Call Trace:
[<ffffffff81056d4f>] warn_slowpath_common+0x7f/0xc0
[<ffffffff81056e46>] warn_slowpath_fmt+0x46/0x50
[<ffffffff81280b13>] __list_del_entry+0x63/0xd0
[<ffffffff81280b91>] list_del+0x11/0x40
[<ffffffff81298331>] pci_destroy_dev+0x31/0xc0
[<ffffffff812985bb>] pci_remove_bus_device+0x5b/0x70
[<ffffffff812985ee>] pci_stop_and_remove_bus_device+0x1e/0x30
[<ffffffff8129fc89>] remove_callback+0x29/0x40
[<ffffffff811f3b84>] sysfs_schedule_callback_work+0x24/0x70

Bjorn pointed out that a pci_dev should retain its reference to the
pci_bus for as long as the pci_dev exists, so releasing that reference
should happen in pci_release_dev() instead.

With that, we do not need to touch the pci-sysfs bits at all.

-v6: split the other changes out into separate patches.

Reported-by: Gu Zheng <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/probe.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -1149,6 +1149,7 @@ static void pci_release_dev(struct devic
list_del(&pci_dev->bus_list);
up_write(&pci_bus_sem);
pci_free_resources(pci_dev);
+ put_device(&pci_dev->bus->dev);

pci_release_capabilities(pci_dev);
pci_release_of_node(pci_dev);
@@ -1360,6 +1361,7 @@ void pci_device_add(struct pci_dev *dev,
down_write(&pci_bus_sem);
list_add_tail(&dev->bus_list, &bus->devices);
up_write(&pci_bus_sem);
+ get_device(&bus->dev);

ret = pcibios_add_device(dev);
WARN_ON(ret < 0);
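
(A condensed view of the reference pairing this patch establishes;
illustrative sketch, not a literal excerpt:)

	/*
	 * pci_device_add():
	 *	list_add_tail(&dev->bus_list, &bus->devices);
	 *	get_device(&bus->dev);		<- device takes a bus ref
	 *
	 * pci_release_dev():  runs only at the device's final put_device()
	 *	list_del(&pci_dev->bus_list);
	 *	pci_free_resources(pci_dev);
	 *	put_device(&pci_dev->bus->dev);	<- bus ref dropped last
	 *
	 * So the pci_bus cannot be freed while any of its pci_devs is
	 * still reachable, e.g. from a racing sysfs "remove".
	 */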

2013-05-14 02:29:32

by Yinghai Lu

Subject: [PATCH 7/7] PCI: use pf as dma_dev for vf that does not have func0 sibling

Fix crash:
[ 122.181002] pci 0000:c4:00.4: [10df:e228] type 00 class 0x020000
[ 122.188541] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 122.197324] IP: [<ffffffff81f58f78>] intel_iommu_add_device+0x108/0x1e0
[ 122.204753] PGD 0
[ 122.207019] Oops: 0000 [#1] SMP
[ 122.210655] Modules linked in:
[ 122.214084] CPU: 49 PID: 1 Comm: swapper/0 Tainted: G I 3.9.0-yh-09240-g8a69c01-dirty #1571
...
[ 122.365363] Call Trace:
[ 122.368112] [<ffffffff81f4ee2e>] iommu_bus_notifier+0x3e/0xd0
[ 122.374645] [<ffffffff820b702e>] notifier_call_chain+0x6e/0xb0
[ 122.381277] [<ffffffff810b912e>] __blocking_notifier_call_chain+0x5e/0x90
[ 122.388963] [<ffffffff810b9176>] blocking_notifier_call_chain+0x16/0x20
[ 122.396471] [<ffffffff81762fc5>] device_add+0x475/0x6f0

The PF is c3:00.0, and the VFs start at c4:00.4; there is no c4:00.0,
because we have:
Initial VFs: 64, Total VFs: 64, Number of VFs: 63, Function Dependency Link: 00
VF offset: 260, stride: 1
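
(A worked example of why there is no c4:00.0, using the values above
and assuming the usual SR-IOV routing-ID arithmetic, RID = bus << 8 |
devfn:)

	PF c3:00.0 -> RID 0xc300
	VF offset = 260 (0x104), stride = 1:
	VF0 RID = 0xc300 + 0x104     = 0xc404 -> c4:00.4
	VF1 RID = 0xc300 + 0x104 + 1 = 0xc405 -> c4:00.5
	...

	The VFs land on bus c4 starting at function 4, so
	pci_get_slot(bus, PCI_DEVFN(slot, 0)) finds nothing there.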

Signed-off-by: Yinghai Lu <[email protected]>
Cc: David Woodhouse <[email protected]>
Cc: [email protected]

---
drivers/iommu/intel-iommu.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/drivers/iommu/intel-iommu.c
===================================================================
--- linux-2.6.orig/drivers/iommu/intel-iommu.c
+++ linux-2.6/drivers/iommu/intel-iommu.c
@@ -4185,11 +4185,15 @@ static int intel_iommu_add_device(struct
* required ACS flags, add to the same group as function 0.
*/
if (dma_pdev->multifunction &&
- !pci_acs_enabled(dma_pdev, REQ_ACS_FLAGS))
- swap_pci_ref(&dma_pdev,
- pci_get_slot(dma_pdev->bus,
+ !pci_acs_enabled(dma_pdev, REQ_ACS_FLAGS)) {
+ struct pci_dev *dev_tmp = pci_get_slot(dma_pdev->bus,
PCI_DEVFN(PCI_SLOT(dma_pdev->devfn),
- 0)));
+ 0));
+
+ if (!dev_tmp && dma_pdev->is_virtfn)
+ dev_tmp = pci_dev_get(dma_pdev->physfn);
+ swap_pci_ref(&dma_pdev, dev_tmp);
+ }

/*
* Devices on the root bus go through the iommu. If that's not us,

2013-05-14 02:29:28

by Yinghai Lu

Subject: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

We found that the kernel tries to load the mlx4 driver for VFs before
the PF's driver has really finished loading, when the drivers are
built in and the kernel command line includes probe_vfs=63 and
num_vfs=63.

It turns out this also happens on the hot-add path, even when the
drivers are compiled as modules and already loaded, especially when a
VF shares the same driver with its PF.

calling path:
device driver probe
==> pci_enable_sriov
==> virtfn_add
==> pci_device_add
==> pci_bus_add_device
When pci_bus_add_device() is called, the VF's driver gets attached,
and at that time the PF's driver has not finished probing yet.

We need to move the pci_bus_add_device() call out of virtfn_add() and
call it later. Fix the problem for both paths:
1. hot-add path: use device_schedule_callback().
2. boot path: use an initcall to do the add for all VFs.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: [email protected]

---
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 7 +
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 4 -
drivers/net/ethernet/cisco/enic/enic_main.c | 2
drivers/net/ethernet/emulex/benet/be_main.c | 4 +
drivers/net/ethernet/intel/igb/igb_main.c | 11 ++
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2
drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 9 +-
drivers/net/ethernet/mellanox/mlx4/main.c | 2
drivers/net/ethernet/neterion/vxge/vxge-main.c | 3
drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c | 5 +
drivers/net/ethernet/sfc/efx.c | 1
drivers/pci/iov.c | 73 +++++++++++++++++--
drivers/scsi/lpfc/lpfc_init.c | 2
include/linux/pci.h | 4 +
14 files changed, 115 insertions(+), 14 deletions(-)

Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
+++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2308,6 +2308,8 @@ slave_start:
priv->pci_dev_data = pci_dev_data;
pci_set_drvdata(pdev, dev);

+ pci_bus_add_device_vfs(pdev);
+
return 0;

err_port:
Index: linux-2.6/drivers/pci/iov.c
===================================================================
--- linux-2.6.orig/drivers/pci/iov.c
+++ linux-2.6/drivers/pci/iov.c
@@ -66,7 +66,8 @@ static void virtfn_remove_bus(struct pci
pci_remove_bus(child);
}

-static int virtfn_add(struct pci_dev *dev, int id, int reset)
+static int virtfn_add(struct pci_dev *dev, int id, int reset,
+ struct pci_dev **ret)
{
int i;
int rc;
@@ -116,7 +117,6 @@ static int virtfn_add(struct pci_dev *de
pci_device_add(virtfn, virtfn->bus);
mutex_unlock(&iov->dev->sriov->lock);

- rc = pci_bus_add_device(virtfn);
sprintf(buf, "virtfn%u", id);
rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
if (rc)
@@ -127,6 +127,8 @@ static int virtfn_add(struct pci_dev *de

kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);

+ if (ret)
+ *ret = virtfn;
return 0;

failed2:
@@ -141,6 +143,55 @@ failed1:
return rc;
}

+static void pci_bus_add_vf(struct pci_dev *dev)
+{
+ int rc;
+
+ if (!dev || !dev->is_virtfn || dev->is_added)
+ return;
+
+ rc = pci_bus_add_device(dev);
+}
+
+static void bus_add_vfs(struct device *device)
+{
+ struct pci_dev *dev = to_pci_dev(device);
+ int i, num_vfs = pci_num_vf(dev);
+
+ for (i = 0; i < num_vfs; i++) {
+ struct pci_bus *bus;
+ struct pci_dev *virtfn;
+
+ bus = pci_find_bus(pci_domain_nr(dev->bus), virtfn_bus(dev, i));
+ if (!bus)
+ continue;
+
+ virtfn = pci_get_slot(bus, virtfn_devfn(dev, i));
+ pci_bus_add_vf(virtfn);
+ pci_dev_put(virtfn);
+ }
+}
+void pci_bus_add_device_vfs(struct pci_dev *pdev)
+{
+ if (system_state == SYSTEM_BOOTING)
+ return;
+
+ device_schedule_callback(&pdev->dev, bus_add_vfs);
+}
+EXPORT_SYMBOL_GPL(pci_bus_add_device_vfs);
+
+/* Make sure all VFs get added before pci_sysfs_init */
+static int __init pci_bus_add_device_vfs_booting(void)
+{
+ struct pci_dev *dev = NULL;
+
+ for_each_pci_dev(dev)
+ pci_bus_add_vf(dev);
+
+ return 0;
+}
+device_initcall_sync(pci_bus_add_device_vfs_booting);
+
static void virtfn_remove(struct pci_dev *dev, int id, int reset)
{
char buf[VIRTFN_ID_LEN];
@@ -213,14 +264,22 @@ static void sriov_migration_task(struct
if (state == PCI_SRIOV_VFM_MI) {
writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
state = readb(iov->mstate + i);
- if (state == PCI_SRIOV_VFM_AV)
- virtfn_add(iov->self, i, 1);
+ if (state == PCI_SRIOV_VFM_AV) {
+ struct pci_dev *virtfn = NULL;
+
+ virtfn_add(iov->self, i, 1, &virtfn);
+ pci_bus_add_vf(virtfn);
+ }
} else if (state == PCI_SRIOV_VFM_MO) {
virtfn_remove(iov->self, i, 1);
writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
state = readb(iov->mstate + i);
- if (state == PCI_SRIOV_VFM_AV)
- virtfn_add(iov->self, i, 0);
+ if (state == PCI_SRIOV_VFM_AV) {
+ struct pci_dev *virtfn = NULL;
+
+ virtfn_add(iov->self, i, 0, &virtfn);
+ pci_bus_add_vf(virtfn);
+ }
}
}

@@ -356,7 +415,7 @@ static int sriov_enable(struct pci_dev *
initial = nr_virtfn;

for (i = 0; i < initial; i++) {
- rc = virtfn_add(dev, i, 0);
+ rc = virtfn_add(dev, i, 0, NULL);
if (rc)
goto failed;
}
Index: linux-2.6/include/linux/pci.h
===================================================================
--- linux-2.6.orig/include/linux/pci.h
+++ linux-2.6/include/linux/pci.h
@@ -1659,6 +1659,7 @@ void __iomem *pci_ioremap_bar(struct pci

#ifdef CONFIG_PCI_IOV
int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
+void pci_bus_add_device_vfs(struct pci_dev *dev);
void pci_disable_sriov(struct pci_dev *dev);
irqreturn_t pci_sriov_migration(struct pci_dev *dev);
int pci_num_vf(struct pci_dev *dev);
@@ -1670,6 +1671,9 @@ static inline int pci_enable_sriov(struc
{
return -ENODEV;
}
+static inline void pci_bus_add_device_vfs(struct pci_dev *dev)
+{
+}
static inline void pci_disable_sriov(struct pci_dev *dev)
{
}
Index: linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/intel/igb/igb_main.c
+++ linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2366,6 +2366,9 @@ static int igb_probe(struct pci_dev *pde
}

pm_runtime_put_noidle(&pdev->dev);
+
+ pci_bus_add_device_vfs(pdev);
+
return 0;

err_register:
@@ -7278,8 +7281,12 @@ static int igb_pci_sriov_configure(struc
#ifdef CONFIG_PCI_IOV
if (num_vfs == 0)
return igb_pci_disable_sriov(dev);
- else
- return igb_pci_enable_sriov(dev, num_vfs);
+ else {
+ int ret = igb_pci_enable_sriov(dev, num_vfs);
+ if (ret > 0)
+ pci_bus_add_device_vfs(dev);
+ return ret;
+ }
#endif
return 0;
}
Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -346,8 +346,13 @@ int ixgbe_pci_sriov_configure(struct pci
{
if (num_vfs == 0)
return ixgbe_pci_sriov_disable(dev);
- else
- return ixgbe_pci_sriov_enable(dev, num_vfs);
+ else {
+ int ret = ixgbe_pci_sriov_enable(dev, num_vfs);
+
+ if (ret > 0)
+ pci_bus_add_device_vfs(dev);
+ return ret;
+ }
}

static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter,
Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7658,6 +7658,8 @@ skip_sriov:
IXGBE_LINK_SPEED_10GB_FULL | IXGBE_LINK_SPEED_1GB_FULL,
true);

+ pci_bus_add_device_vfs(pdev);
+
return 0;

err_register:
Index: linux-2.6/drivers/scsi/lpfc/lpfc_init.c
===================================================================
--- linux-2.6.orig/drivers/scsi/lpfc/lpfc_init.c
+++ linux-2.6/drivers/scsi/lpfc/lpfc_init.c
@@ -10582,6 +10582,8 @@ lpfc_pci_probe_one(struct pci_dev *pdev,
else
rc = lpfc_pci_probe_one_s3(pdev, pid);

+ pci_bus_add_device_vfs(pdev);
+
return rc;
}

Index: linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/emulex/benet/be_main.c
+++ linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4119,6 +4119,7 @@ static int lancer_recover_func(struct be
if (status)
goto err;

+ pci_bus_add_device_vfs(adapter->pdev);
if (netif_running(adapter->netdev)) {
status = be_open(adapter->netdev);
if (status)
@@ -4335,6 +4336,8 @@ static int be_probe(struct pci_dev *pdev
dev_info(&pdev->dev, "%s: %s %s port %c\n", nic_name(pdev),
func_name(adapter), mc_name(adapter), port_name);

+ pci_bus_add_device_vfs(pdev);
+
return 0;

unsetup:
@@ -4406,6 +4409,7 @@ static int be_resume(struct pci_dev *pde
rtnl_unlock();
}

+ pci_bus_add_device_vfs(adapter->pdev);
schedule_delayed_work(&adapter->func_recovery_work,
msecs_to_jiffies(1000));
netif_device_attach(netdev);
Index: linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
+++ linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
@@ -568,8 +568,11 @@ int qlcnic_pci_sriov_configure(struct pc

if (num_vfs == 0)
err = qlcnic_pci_sriov_disable(adapter);
- else
+ else {
err = qlcnic_pci_sriov_enable(adapter, num_vfs);
+ if (err > 0)
+ pci_bus_add_device_vfs(dev);
+ }

clear_bit(__QLCNIC_RESETTING, &adapter->state);
return err;
Index: linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
+++ linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
@@ -3048,7 +3048,12 @@ int bnx2x_sriov_configure(struct pci_dev
pci_disable_sriov(dev);
return 0;
} else {
- return bnx2x_enable_sriov(bp);
+ int ret = bnx2x_enable_sriov(bp);
+
+ if (ret > 0)
+ pci_bus_add_device_vfs(dev);
+
+ return 0;
}
}

Index: linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5749,10 +5749,12 @@ static int init_one(struct pci_dev *pdev
sriov:
#ifdef CONFIG_PCI_IOV
if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
- if (pci_enable_sriov(pdev, num_vf[func]) == 0)
+ if (pci_enable_sriov(pdev, num_vf[func]) == 0) {
dev_info(&pdev->dev,
"instantiated %u virtual functions\n",
num_vf[func]);
+ pci_bus_add_device_vfs(pdev);
+ }
#endif
return 0;

Index: linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/cisco/enic/enic_main.c
+++ linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -2524,6 +2524,8 @@ static int enic_probe(struct pci_dev *pd
goto err_out_dev_deinit;
}

+ pci_bus_add_device_vfs(pdev);
+
return 0;

err_out_dev_deinit:
Index: linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/neterion/vxge/vxge-main.c
+++ linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
@@ -4731,6 +4731,9 @@ vxge_probe(struct pci_dev *pdev, const s
vxge_hw_device_trace_level_get(hldev));

kfree(ll_config);
+
+ pci_bus_add_device_vfs(pdev);
+
return 0;

_exit6:
Index: linux-2.6/drivers/net/ethernet/sfc/efx.c
===================================================================
--- linux-2.6.orig/drivers/net/ethernet/sfc/efx.c
+++ linux-2.6/drivers/net/ethernet/sfc/efx.c
@@ -2822,6 +2822,7 @@ static int efx_pci_probe(struct pci_dev
netif_warn(efx, probe, efx->net_dev,
"pci_enable_pcie_error_reporting failed (%d)\n", rc);

+ pci_bus_add_device_vfs(pci_dev);
return 0;

fail4:
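
(The usage pattern this patch establishes for PF drivers, matching the
driver hunks above; a minimal sketch with a hypothetical setup_hw()
helper and num_vfs count, error handling mostly elided:)

	static int pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		int err;

		err = setup_hw(pdev);		/* hypothetical PF init */
		if (err)
			return err;

		/* VFs are created here but no longer added to the bus */
		err = pci_enable_sriov(pdev, num_vfs);
		if (err)
			return err;

		/* PF is fully set up: schedule the VFs to be added and
		 * their driver attached, outside this probe context. */
		pci_bus_add_device_vfs(pdev);
		return 0;
	}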

2013-05-14 02:29:26

by Yinghai Lu

Subject: [PATCH 5/7] PCI, ACPI: Don't glue ACPI dev with pci VFs

When SR-IOV is enabled, the VFs can appear right after the PF in the
PCI tree; for example, c1:00.0 is the PF, and c1:00.1 and later are
VFs.

ACPI may have a device with the same _ADR, which makes them get glued
together wrongly.

Skip the gluing if the device is a VF.

We also need to set is_virtfn before pci_device_add(), because the
gluing is triggered by device_add().

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/iov.c | 6 +++---
drivers/pci/pci-acpi.c | 4 ++++
2 files changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/pci/iov.c
===================================================================
--- linux-2.6.orig/drivers/pci/iov.c
+++ linux-2.6/drivers/pci/iov.c
@@ -110,12 +110,12 @@ static int virtfn_add(struct pci_dev *de
if (reset)
__pci_reset_function(virtfn);

- pci_device_add(virtfn, virtfn->bus);
- mutex_unlock(&iov->dev->sriov->lock);
-
virtfn->physfn = pci_dev_get(dev);
virtfn->is_virtfn = 1;

+ pci_device_add(virtfn, virtfn->bus);
+ mutex_unlock(&iov->dev->sriov->lock);
+
rc = pci_bus_add_device(virtfn);
sprintf(buf, "virtfn%u", id);
rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
Index: linux-2.6/drivers/pci/pci-acpi.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci-acpi.c
+++ linux-2.6/drivers/pci/pci-acpi.c
@@ -321,6 +321,10 @@ static int acpi_pci_find_device(struct d
u64 addr;

pci_dev = to_pci_dev(dev);
+ /* don't mix vf with real pci device */
+ if (pci_dev->is_virtfn)
+ return -ENODEV;
+
/* Please ref to ACPI spec for the syntax of _ADR */
addr = (PCI_SLOT(pci_dev->devfn) << 16) | PCI_FUNC(pci_dev->devfn);
*handle = acpi_get_child(DEVICE_ACPI_HANDLE(dev->parent), addr);
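
(How the bogus gluing happens, worked through with the _ADR encoding
from the code above; illustrative sketch:)

	_ADR = (slot << 16) | function

	PF c1:00.0 -> _ADR 0x00000000
	VF c1:00.1 -> _ADR 0x00000001	<- the same value an ACPI node
					   describing a real function 1
					   in that slot would carry

	With the patch, acpi_pci_find_device() returns -ENODEV for any
	pci_dev with is_virtfn set, so VFs never take part in the lookup.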

2013-05-14 02:29:20

by Yinghai Lu

Subject: [PATCH 3/7] PCI: Detach driver in pci_stop_device

Make pci_stop_dev() only detach the driver, and have pci_destroy_dev()
do the device_del() instead. Then the host bridge can also use
device_unregister() to pair with its earlier device_register().

Add an is_removed flag to record that device_del() has already been
called.

That way, multiple calls will not cause a false removal; only the
final put_device() will actually get the device removed.

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/remove.c | 12 ++++++++----
include/linux/pci.h | 3 ++-
2 files changed, 10 insertions(+), 5 deletions(-)

Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -10,7 +10,7 @@ static void pci_stop_dev(struct pci_dev
if (dev->is_added) {
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
- device_del(&dev->dev);
+ device_release_driver(&dev->dev);
dev->is_added = 0;
}

@@ -20,7 +20,11 @@ static void pci_stop_dev(struct pci_dev

static void pci_destroy_dev(struct pci_dev *dev)
{
- put_device(&dev->dev);
+ if (!dev->is_removed) {
+ device_del(&dev->dev);
+ dev->is_removed = 1;
+ put_device(&dev->dev);
+ }
}

void pci_remove_bus(struct pci_bus *bus)
@@ -107,7 +111,7 @@ void pci_stop_root_bus(struct pci_bus *b
pci_stop_bus_device(child);

/* stop the host bridge */
- device_del(&host_bridge->dev);
+ device_release_driver(&host_bridge->dev);
}

void pci_remove_root_bus(struct pci_bus *bus)
@@ -126,5 +130,5 @@ void pci_remove_root_bus(struct pci_bus
host_bridge->bus = NULL;

/* remove the host bridge */
- put_device(&host_bridge->dev);
+ device_unregister(&host_bridge->dev);
}
Index: linux-2.6/include/linux/pci.h
===================================================================
--- linux-2.6.orig/include/linux/pci.h
+++ linux-2.6/include/linux/pci.h
@@ -307,7 +307,8 @@ struct pci_dev {
unsigned int transparent:1; /* Transparent PCI bridge */
unsigned int multifunction:1;/* Part of multi-function device */
/* keep track of device state */
- unsigned int is_added:1;
+ unsigned int is_added:1; /* driver is attached */
+ unsigned int is_removed:1; /* device_del is called */
unsigned int is_busmaster:1; /* device is busmaster */
unsigned int no_msi:1; /* device may not use msi */
unsigned int block_cfg_access:1; /* config space access is blocked */
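
(The multiple-call behavior this patch gives the remove path;
illustrative sketch:)

	/*
	 * 1st call to pci_stop_and_remove_bus_device(dev):
	 *	pci_stop_dev():    driver detached, is_added = 0
	 *	pci_destroy_dev(): device_del(), is_removed = 1, put_device()
	 *
	 * 2nd (racing) call on the same dev:
	 *	pci_stop_dev():    skipped, is_added is already 0
	 *	pci_destroy_dev(): skipped, is_removed is already 1
	 *
	 * The struct pci_dev itself is freed only when the final
	 * put_device() drops the last reference (see patch 2).
	 */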

2013-05-14 02:29:17

by Yinghai Lu

Subject: [PATCH 2/7] PCI: move resources and bus_list releasing to pci_release_dev

We should not release resources in pci_destroy_dev(); that is too
early, as there could still be other users holding a reference.

Instead, release them and remove the device from the bus devices list
at the very end, in pci_release_dev().

Signed-off-by: Yinghai Lu <[email protected]>

---
drivers/pci/probe.c | 20 ++++++++++++++++++++
drivers/pci/remove.c | 19 -------------------
2 files changed, 20 insertions(+), 19 deletions(-)

Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -1118,6 +1118,20 @@ static void pci_release_capabilities(str
pci_free_cap_save_buffers(dev);
}

+static void pci_free_resources(struct pci_dev *dev)
+{
+ int i;
+
+ msi_remove_pci_irq_vectors(dev);
+
+ pci_cleanup_rom(dev);
+ for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+ struct resource *res = dev->resource + i;
+ if (res->parent)
+ release_resource(res);
+ }
+}
+
/**
* pci_release_dev - free a pci device structure when all users of it are finished.
* @dev: device that's been disconnected
@@ -1130,6 +1144,12 @@ static void pci_release_dev(struct devic
struct pci_dev *pci_dev;

pci_dev = to_pci_dev(dev);
+
+ down_write(&pci_bus_sem);
+ list_del(&pci_dev->bus_list);
+ up_write(&pci_bus_sem);
+ pci_free_resources(pci_dev);
+
pci_release_capabilities(pci_dev);
pci_release_of_node(pci_dev);
kfree(pci_dev);
Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -3,20 +3,6 @@
#include <linux/pci-aspm.h>
#include "pci.h"

-static void pci_free_resources(struct pci_dev *dev)
-{
- int i;
-
- msi_remove_pci_irq_vectors(dev);
-
- pci_cleanup_rom(dev);
- for (i = 0; i < PCI_NUM_RESOURCES; i++) {
- struct resource *res = dev->resource + i;
- if (res->parent)
- release_resource(res);
- }
-}
-
static void pci_stop_dev(struct pci_dev *dev)
{
pci_pme_active(dev, false);
@@ -34,11 +20,6 @@ static void pci_stop_dev(struct pci_dev

static void pci_destroy_dev(struct pci_dev *dev)
{
- down_write(&pci_bus_sem);
- list_del(&dev->bus_list);
- up_write(&pci_bus_sem);
-
- pci_free_resources(dev);
put_device(&dev->dev);
}
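
(The resulting teardown ordering; sketch of the final-reference path:)

	put_device(&dev->dev)			/* last reference dropped */
	  -> pci_release_dev()
		list_del(&pci_dev->bus_list)	/* under pci_bus_sem */
		pci_free_resources(pci_dev)
		pci_release_capabilities(pci_dev)
		pci_release_of_node(pci_dev)
		kfree(pci_dev)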

2013-05-14 03:21:03

by Yijing Wang

Subject: Re: [PATCH 2/7] PCI: move resources and bus_list releasing to pci_release_dev

On 2013/5/14 10:28, Yinghai Lu wrote:
> We should not release resource in pci_destroy that is too early

Hi Yinghai,
"too early" means that after pci_stop_dev(), if someone else
hold the device reference, it still care the device resource ? e.g.?

Thanks!
Yijing.

> as there could be still other use hold reference.
>
> release them or remove it from bus devices list at last
> in pci_release_dev instead.
>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> drivers/pci/probe.c | 20 ++++++++++++++++++++
> drivers/pci/remove.c | 19 -------------------
> 2 files changed, 20 insertions(+), 19 deletions(-)
>
> Index: linux-2.6/drivers/pci/probe.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/probe.c
> +++ linux-2.6/drivers/pci/probe.c
> @@ -1118,6 +1118,20 @@ static void pci_release_capabilities(str
> pci_free_cap_save_buffers(dev);
> }
>
> +static void pci_free_resources(struct pci_dev *dev)
> +{
> + int i;
> +
> + msi_remove_pci_irq_vectors(dev);
> +
> + pci_cleanup_rom(dev);
> + for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> + struct resource *res = dev->resource + i;
> + if (res->parent)
> + release_resource(res);
> + }
> +}
> +
> /**
> * pci_release_dev - free a pci device structure when all users of it are finished.
> * @dev: device that's been disconnected
> @@ -1130,6 +1144,12 @@ static void pci_release_dev(struct devic
> struct pci_dev *pci_dev;
>
> pci_dev = to_pci_dev(dev);
> +
> + down_write(&pci_bus_sem);
> + list_del(&pci_dev->bus_list);
> + up_write(&pci_bus_sem);
> + pci_free_resources(pci_dev);
> +
> pci_release_capabilities(pci_dev);
> pci_release_of_node(pci_dev);
> kfree(pci_dev);
> Index: linux-2.6/drivers/pci/remove.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/remove.c
> +++ linux-2.6/drivers/pci/remove.c
> @@ -3,20 +3,6 @@
> #include <linux/pci-aspm.h>
> #include "pci.h"
>
> -static void pci_free_resources(struct pci_dev *dev)
> -{
> - int i;
> -
> - msi_remove_pci_irq_vectors(dev);
> -
> - pci_cleanup_rom(dev);
> - for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> - struct resource *res = dev->resource + i;
> - if (res->parent)
> - release_resource(res);
> - }
> -}
> -
> static void pci_stop_dev(struct pci_dev *dev)
> {
> pci_pme_active(dev, false);
> @@ -34,11 +20,6 @@ static void pci_stop_dev(struct pci_dev
>
> static void pci_destroy_dev(struct pci_dev *dev)
> {
> - down_write(&pci_bus_sem);
> - list_del(&dev->bus_list);
> - up_write(&pci_bus_sem);
> -
> - pci_free_resources(dev);
> put_device(&dev->dev);
> }
>


--
Thanks!
Yijing

2013-05-14 03:56:55

by Yinghai Lu

Subject: Re: [PATCH 2/7] PCI: move resources and bus_list releasing to pci_release_dev

On Mon, May 13, 2013 at 8:20 PM, Yijing Wang <[email protected]> wrote:
> On 2013/5/14 10:28, Yinghai Lu wrote:
>> We should not release resource in pci_destroy that is too early
>
> Hi Yinghai,
> "too early" means that after pci_stop_dev(), if someone else
> hold the device reference, it still care the device resource ? e.g.?

I don't mean that.

>
>> as there could be still other use hold reference.

The purpose is to move the resource releasing as late as possible, so
that pci_stop_and_remove_bus_device() can be called several times for
nested removal via sysfs.

Yinghai

2013-05-14 06:03:37

by Yijing Wang

Subject: Re: [PATCH 2/7] PCI: move resources and bus_list releasing to pci_release_dev

On 2013/5/14 11:56, Yinghai Lu wrote:
> On Mon, May 13, 2013 at 8:20 PM, Yijing Wang <[email protected]> wrote:
>> On 2013/5/14 10:28, Yinghai Lu wrote:
>>> We should not release resource in pci_destroy that is too early
>>
>> Hi Yinghai,
>> "too early" means that after pci_stop_dev(), if someone else
>> hold the device reference, it still care the device resource ? e.g.?
>
> I don't mean that.
>
>>
>>> as there could be still other use hold reference.
>
> purpose is:
> move res releasing as late as possible. so pci_stop_and_remove_bus_device
> could be called several times for nested removing via sys.

OK, thanks for the explanation.

Thanks!
Yijing.

>
> Yinghai
>
>


--
Thanks!
Yijing

2013-05-14 09:09:02

by Yan Burman

Subject: RE: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's



> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Yinghai Lu
> Sent: Tuesday, May 14, 2013 05:31
> To: Bjorn Helgaas
> Cc: Gu Zheng; [email protected]; [email protected]; Yinghai
> Lu; [email protected]
> Subject: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's
>
> Found kernel try to load mlx4 drivers for VFs before PF's is really loaded when
> the drivers are built-in, and kernel command line include probe_vfs=63,
> num_vfs=63.
>
> It turns that it also happen for hotadd path even drivers are compiled as
> modules and if they loaded. Esp some VF share the same driver with PF.
>
> calling path:
> device driver probe
> ==> pci_enable_sriov
> ==> virtfn_add
> ==> pci_dev_add
> ==> pci_bus_device_add
> when pci_bus_device_add is called, the VF's driver will be attached.
> and at that time PF's driver does not finish yet.
>
> Need to move out pci_bus_device_add from virtfn_add and call it later. Fix
> the problem for two path, 1. hotadd path: use device_schedule_callback.
> 2. for booting path, use initcall to call that for all VF's.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: [email protected]
>
> ---
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 7 +
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 4 -
> drivers/net/ethernet/cisco/enic/enic_main.c | 2
> drivers/net/ethernet/emulex/benet/be_main.c | 4 +
> drivers/net/ethernet/intel/igb/igb_main.c | 11 ++
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2
> drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 9 +-
> drivers/net/ethernet/mellanox/mlx4/main.c | 2
> drivers/net/ethernet/neterion/vxge/vxge-main.c | 3
> drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c | 5 +
> drivers/net/ethernet/sfc/efx.c | 1
> drivers/pci/iov.c | 73 +++++++++++++++++--
> drivers/scsi/lpfc/lpfc_init.c | 2
> include/linux/pci.h | 4 +
> 14 files changed, 115 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2308,6 +2308,8 @@ slave_start:
> priv->pci_dev_data = pci_dev_data;
> pci_set_drvdata(pdev, dev);
>
> + pci_bus_add_device_vfs(pdev);
> +

This seems wrong, since pci_bus_add_device_vfs() is called for VFs as well, and for the PF even when SR-IOV is not enabled.
I can see that there are sanity checks in pci_bus_add_vf() that would prevent this code from doing damage,
but why schedule a callback that will do nothing if we can prevent scheduling altogether?
It would probably be better to do the following:

+ if (dev->flags & MLX4_FLAG_SRIOV)
+ pci_bus_add_device_vfs(pdev);
+

> return 0;
>
> err_port:

2013-05-14 09:46:35

by Sathya Perla

Subject: RE: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's


> -----Original Message-----
> From: [email protected] [mailto:linux-pci-
> [email protected]] On Behalf Of Yinghai Lu
>
> Found kernel try to load mlx4 drivers for VFs before PF's is really loaded
> when the drivers are built-in, and kernel command line include
> probe_vfs=63, num_vfs=63.
>
> It turns that it also happen for hotadd path even drivers are compiled as
> modules and if they loaded. Esp some VF share the same driver with PF.
>
> calling path:
> device driver probe
> ==> pci_enable_sriov
> ==> virtfn_add
> ==> pci_dev_add
> ==> pci_bus_device_add
> when pci_bus_device_add is called, the VF's driver will be attached.
> and at that time PF's driver does not finish yet.
>
> Need to move out pci_bus_device_add from virtfn_add and call it later. Fix
> the problem for two path, 1. hotadd path: use device_schedule_callback.
> 2. for booting path, use initcall to call that for all VF's.
>

...
>
> Index: linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/emulex/benet/be_main.c
> +++ linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -4119,6 +4119,7 @@ static int lancer_recover_func(struct be
> if (status)
> goto err;
>
> + pci_bus_add_device_vfs(adapter->pdev);
> if (netif_running(adapter->netdev)) {
> status = be_open(adapter->netdev);
> if (status)
> @@ -4335,6 +4336,8 @@ static int be_probe(struct pci_dev *pdev
> dev_info(&pdev->dev, "%s: %s %s port %c\n", nic_name(pdev),
> func_name(adapter), mc_name(adapter), port_name);
>
> + pci_bus_add_device_vfs(pdev);
> +
> return 0;
>
> unsetup:
> @@ -4406,6 +4409,7 @@ static int be_resume(struct pci_dev *pde
> rtnl_unlock();
> }
>
> + pci_bus_add_device_vfs(adapter->pdev);
> schedule_delayed_work(&adapter->func_recovery_work,
> msecs_to_jiffies(1000));
> netif_device_attach(netdev);

I fixed this issue in be2net with the following patch (commit b4c1df93)
http://marc.info/?l=linux-netdev&m=136801459808765&w=2

-Sathya

2013-05-14 15:19:05

by Yinghai Lu

Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Tue, May 14, 2013 at 2:46 AM, Perla, Sathya <[email protected]> wrote:
>
>> -----Original Message-----
>> From: [email protected] [mailto:linux-pci-
>> [email protected]] On Behalf Of Yinghai Lu
>>
>> Found kernel try to load mlx4 drivers for VFs before PF's is really loaded
>> when the drivers are built-in, and kernel command line include
>> probe_vfs=63, num_vfs=63.
>>
>> It turns that it also happen for hotadd path even drivers are compiled as
>> modules and if they loaded. Esp some VF share the same driver with PF.
>>
>> calling path:
>> device driver probe
>> ==> pci_enable_sriov
>> ==> virtfn_add
>> ==> pci_dev_add
>> ==> pci_bus_device_add
>> when pci_bus_device_add is called, the VF's driver will be attached.
>> and at that time PF's driver does not finish yet.
>>
>> Need to move out pci_bus_device_add from virtfn_add and call it later. Fix
>> the problem for two path, 1. hotadd path: use device_schedule_callback.
>> 2. for booting path, use initcall to call that for all VF's.
>>
>
> ...
>>
>> Index: linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
>> ==========================================================
>> =========
>> --- linux-2.6.orig/drivers/net/ethernet/emulex/benet/be_main.c
>> +++ linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
>> @@ -4119,6 +4119,7 @@ static int lancer_recover_func(struct be
>> if (status)
>> goto err;
>>
>> + pci_bus_add_device_vfs(adapter->pdev);
>> if (netif_running(adapter->netdev)) {
>> status = be_open(adapter->netdev);
>> if (status)
>> @@ -4335,6 +4336,8 @@ static int be_probe(struct pci_dev *pdev
>> dev_info(&pdev->dev, "%s: %s %s port %c\n", nic_name(pdev),
>> func_name(adapter), mc_name(adapter), port_name);
>>
>> + pci_bus_add_device_vfs(pdev);
>> +
>> return 0;
>>
>> unsetup:
>> @@ -4406,6 +4409,7 @@ static int be_resume(struct pci_dev *pde
>> rtnl_unlock();
>> }
>>
>> + pci_bus_add_device_vfs(adapter->pdev);
>> schedule_delayed_work(&adapter->func_recovery_work,
>> msecs_to_jiffies(1000));
>> netif_device_attach(netdev);
>
> I fixed this issue in be2net with the following patch (commit b4c1df93)
> http://marc.info/?l=linux-netdev&m=136801459808765&w=2

We should fix that in a generic way.

BTW, can you please implement sriov_configure like
intel igb, ixgbe
qlogic: qlcnic
...

A driver really should not call pci_enable_sriov() in its probe path.

Yinghai

2013-05-14 15:43:49

by Yinghai Lu

Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Tue, May 14, 2013 at 1:58 AM, Yan Burman <[email protected]> wrote:
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]]
>> On Behalf Of Yinghai Lu
>> Sent: Tuesday, May 14, 2013 05:31
>> To: Bjorn Helgaas
>> Cc: Gu Zheng; [email protected]; [email protected]; Yinghai
>> Lu; [email protected]
>> Subject: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's
>>
>> Found kernel try to load mlx4 drivers for VFs before PF's is really loaded when
>> the drivers are built-in, and kernel command line include probe_vfs=63,
>> num_vfs=63.
>>
>> It turns that it also happen for hotadd path even drivers are compiled as
>> modules and if they loaded. Esp some VF share the same driver with PF.
>>
>> calling path:
>> device driver probe
>> ==> pci_enable_sriov
>> ==> virtfn_add
>> ==> pci_dev_add
>> ==> pci_bus_device_add
>> when pci_bus_device_add is called, the VF's driver will be attached.
>> and at that time PF's driver does not finish yet.
>>
>> Need to move out pci_bus_device_add from virtfn_add and call it later. Fix
>> the problem for two path, 1. hotadd path: use device_schedule_callback.
>> 2. for booting path, use initcall to call that for all VF's.

>> Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
>> =================================================================
>> ==
>> --- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
>> +++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
>> @@ -2308,6 +2308,8 @@ slave_start:
>> priv->pci_dev_data = pci_dev_data;
>> pci_set_drvdata(pdev, dev);
>>
>> + pci_bus_add_device_vfs(pdev);
>> +
>
> This seems wrong, since pci_bus_add_device_vfs() is being called for VF's as well as for PF when SRIOV is not enabled.
> I can see that there are sanity checks in pci_bus_add_vf() that would prevent this code from doing damage,
> but why schedule a callback that will do nothing if we can prevent scheduling altogether?
> It would probably be better to do the following:
>
> + if (dev->flags & MLX4_FLAG_SRIOV)
> + pci_bus_add_device_vfs(pdev);
> +

Yes, we can add that check.

BTW, do you have any plan to make mlx4 support sriov_configure via sysfs?

Yinghai

2013-05-14 16:00:52

by Duyck, Alexander H

Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/13/2013 07:28 PM, Yinghai Lu wrote:
> Found kernel try to load mlx4 drivers for VFs before
> PF's is really loaded when the drivers are built-in, and kernel
> command line include probe_vfs=63, num_vfs=63.
>
> It turns that it also happen for hotadd path even drivers are
> compiled as modules and if they loaded. Esp some VF share the
> same driver with PF.
>
> calling path:
> device driver probe
> ==> pci_enable_sriov
> ==> virtfn_add
> ==> pci_dev_add
> ==> pci_bus_device_add
> when pci_bus_device_add is called, the VF's driver will be attached.
> and at that time PF's driver does not finish yet.
>
> Need to move out pci_bus_device_add from virtfn_add and call it
> later. Fix the problem for two path,
> 1. hotadd path: use device_schedule_callback.
> 2. for booting path, use initcall to call that for all VF's.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: [email protected]
>

I'm sorry, but what is the point of this patch? With device assignment
it is always possible to have VFs loaded and the PF driver unloaded
since you cannot remove the VFs if they are assigned to a VM.

If there is a driver that has to have the PF driver fully loaded before
it instantiates the VFs then it sounds like a buggy driver to me. The
VF driver should be able to be loaded when the PF driver is not
present. We handle it in igb and ixgbe last I checked, and I don't see
any reason why it cannot be handled in all other VF drivers. I'm not
saying the VF has to be able to fully functional, but it should be able
to detect the PF becoming enabled and then bring itself to a fully
functional state. To not handle that case is a bug.

Thanks,

Alex

2013-05-14 18:44:12

by Yinghai Lu

Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
<[email protected]> wrote:
> On 05/13/2013 07:28 PM, Yinghai Lu wrote:
>> Found kernel try to load mlx4 drivers for VFs before
>> PF's is really loaded when the drivers are built-in, and kernel
>> command line include probe_vfs=63, num_vfs=63.
>>
>> It turns that it also happen for hotadd path even drivers are
>> compiled as modules and if they loaded. Esp some VF share the
>> same driver with PF.
>>
>> calling path:
>> device driver probe
>> ==> pci_enable_sriov
>> ==> virtfn_add
>> ==> pci_dev_add
>> ==> pci_bus_device_add
>> when pci_bus_device_add is called, the VF's driver will be attached.
>> and at that time PF's driver does not finish yet.
>>
>> Need to move out pci_bus_device_add from virtfn_add and call it
>> later. Fix the problem for two path,
>> 1. hotadd path: use device_schedule_callback.
>> 2. for booting path, use initcall to call that for all VF's.
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>> Cc: [email protected]
>>
>
> I'm sorry, but what is the point of this patch? With device assignment
> it is always possible to have VFs loaded and the PF driver unloaded
> since you cannot remove the VFs if they are assigned to a VM.

Doesn't unloading the PF driver call pci_disable_sriov()?

>
> If there is a driver that has to have the PF driver fully loaded before
> it instantiates the VFs then it sounds like a buggy driver to me. The
> VF driver should be able to be loaded when the PF driver is not
> present. We handle it in igb and ixgbe last I checked, and I don't see
> any reason why it cannot be handled in all other VF drivers. I'm not
> saying the VF has to be able to fully functional, but it should be able
> to detect the PF becoming enabled and then bring itself to a fully
> functional state. To not handle that case is a bug.

It's more than that.

There is a work_on_cpu() nested-lock problem from calling
pci_bus_add_device() in a driver's PCI probe function.

[ 181.938110] mlx4_core 0000:02:00.0: Started init_resource_tracker: 80 slaves
[ 181.938759] alloc irq_desc for 1170 on node 0
[ 181.949104] mlx4_core 0000:02:00.0: irq 1170 for MSI-X
[ 181.949404] alloc irq_desc for 1171 on node 0
[ 181.949741] mlx4_core 0000:02:00.0: irq 1171 for MSI-X
[ 181.969253] alloc irq_desc for 1172 on node 0
[ 181.969564] mlx4_core 0000:02:00.0: irq 1172 for MSI-X
[ 181.989137] alloc irq_desc for 1173 on node 0
[ 181.989485] mlx4_core 0000:02:00.0: irq 1173 for MSI-X
[ 182.033789] mlx4_core 0000:02:00.0: NOP command IRQ test passed
[ 182.035380]
[ 182.035473] =============================================
[ 182.049065] [ INFO: possible recursive locking detected ]
[ 182.049349] 3.10.0-rc1-yh-00114-gf59c98e-dirty #1588 Not tainted
[ 182.069079] ---------------------------------------------
[ 182.069354] kworker/0:1/2285 is trying to acquire lock:
[ 182.089080] ((&wfc.work)){+.+.+.}, at: [<ffffffff810ab745>]
flush_work+0x5/0x280
[ 182.089500]
[ 182.089500] but task is already holding lock:
[ 182.109671] ((&wfc.work)){+.+.+.}, at: [<ffffffff810aabe2>]
process_one_work+0x202/0x490
[ 182.129097]
[ 182.129097] other info that might help us debug this:
[ 182.129415] Possible unsafe locking scenario:
[ 182.129415]
[ 182.149275] CPU0
[ 182.149386] ----
[ 182.149513] lock((&wfc.work));
[ 182.149705] lock((&wfc.work));
[ 182.169391]
[ 182.169391] *** DEADLOCK ***
[ 182.169391]
[ 182.169722] May be due to missing lock nesting notation
[ 182.169722]
[ 182.189461] 3 locks held by kworker/0:1/2285:
[ 182.189664] #0: (events){.+.+.+}, at: [<ffffffff810aabe2>]
process_one_work+0x202/0x490
[ 182.209468] #1: ((&wfc.work)){+.+.+.}, at: [<ffffffff810aabe2>]
process_one_work+0x202/0x490
[ 182.229176] #2: (&__lockdep_no_validate__){......}, at:
[<ffffffff81765eea>] device_attach+0x2a/0xc0
[ 182.249108]
[ 182.249108] stack backtrace:
[ 182.249362] CPU: 0 PID: 2285 Comm: kworker/0:1 Not tainted
3.10.0-rc1-yh-00114-gf59c98e-dirty #1588
[ 182.269258] Hardware name: Oracle Corporation unknown /
, BIOS 11016600 05/17/2011
[ 182.289141] Workqueue: events work_for_cpu_fn
[ 182.289410] ffffffff83350bc0 ffff881025c11778 ffffffff82093a74
ffff881025c11838
[ 182.309167] ffffffff810ed194 ffff881025c117b8 ffff881025c38000
0000b787702301dc
[ 182.309587] ffff881000000000 0000000000000002 ffffffff8322cba0
ffff881025c11878
[ 182.329524] Call Trace:
[ 182.329669] [<ffffffff82093a74>] dump_stack+0x19/0x1b
[ 182.349365] [<ffffffff810ed194>] validate_chain.isra.19+0x8f4/0x1210
[ 182.349720] [<ffffffff810ed3b6>] ? validate_chain.isra.19+0xb16/0x1210
[ 182.369261] [<ffffffff810eacf8>] ? trace_hardirqs_off_caller+0x28/0x160
[ 182.389069] [<ffffffff810f0c40>] __lock_acquire+0xac0/0xce0
[ 182.389330] [<ffffffff810f150a>] lock_acquire+0xda/0x130
[ 182.409077] [<ffffffff810ab745>] ? flush_work+0x5/0x280
[ 182.409320] [<ffffffff810ab78c>] flush_work+0x4c/0x280
[ 182.409595] [<ffffffff810ab745>] ? flush_work+0x5/0x280
[ 182.429306] [<ffffffff810ee506>] ? mark_held_locks+0x136/0x150
[ 182.429634] [<ffffffff820a67fb>] ? _raw_spin_unlock+0x2b/0x40
[ 182.449352] [<ffffffff810aa5a5>] ? queue_work_on+0x75/0xa0
[ 182.469088] [<ffffffff810ee78d>] ? trace_hardirqs_on+0xd/0x10
[ 182.469352] [<ffffffff810aba42>] work_on_cpu+0x82/0x90
[ 182.489073] [<ffffffff810a7940>] ? find_worker_executing_work+0x90/0x90
[ 182.489426] [<ffffffff8151e290>] ? pci_device_shutdown+0x70/0x70
[ 182.509188] [<ffffffff8151ebcf>] pci_device_probe+0xaf/0x110
[ 182.509448] [<ffffffff8176608d>] driver_probe_device+0xdd/0x220
[ 182.529193] [<ffffffff81766280>] ? __driver_attach+0xb0/0xb0
[ 182.529516] [<ffffffff817662b3>] __device_attach+0x33/0x50
[ 182.549222] [<ffffffff817640b6>] bus_for_each_drv+0x56/0xa0
[ 182.549503] [<ffffffff81765f48>] device_attach+0x88/0xc0
[ 182.569215] [<ffffffff81515b49>] pci_bus_add_device+0x39/0x60
[ 182.569513] [<ffffffff81540605>] pci_bus_add_vf+0x25/0x40
[ 182.589239] [<ffffffff81540834>] pci_bus_add_device_vfs+0xa4/0xe0
[ 182.589618] [<ffffffff81c1faa6>] __mlx4_init_one+0xa96/0xc90
[ 182.609273] [<ffffffff81c1fd0d>] mlx4_init_one+0x4d/0x60
[ 182.609588] [<ffffffff8151e2db>] local_pci_probe+0x4b/0x80
[ 182.629584] [<ffffffff810a7958>] work_for_cpu_fn+0x18/0x30
[ 182.629869] [<ffffffff810aac6b>] process_one_work+0x28b/0x490
[ 182.649313] [<ffffffff810aabe2>] ? process_one_work+0x202/0x490
[ 182.649608] [<ffffffff810abf68>] ? worker_thread+0x48/0x370
[ 182.669325] [<ffffffff810aae9c>] process_scheduled_works+0x2c/0x40
[ 182.690446] [<ffffffff810ac158>] worker_thread+0x238/0x370
[ 182.690712] [<ffffffff810ee78d>] ? trace_hardirqs_on+0xd/0x10
[ 182.709143] [<ffffffff810abf20>] ? manage_workers.isra.18+0x330/0x330
[ 182.709499] [<ffffffff810b2e78>] kthread+0xe8/0xf0

2013-05-14 19:45:55

by Duyck, Alexander H

Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/14/2013 11:44 AM, Yinghai Lu wrote:
> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
> <[email protected]> wrote:
>> On 05/13/2013 07:28 PM, Yinghai Lu wrote:
>>> Found kernel try to load mlx4 drivers for VFs before
>>> PF's is really loaded when the drivers are built-in, and kernel
>>> command line include probe_vfs=63, num_vfs=63.
>>>
>>> It turns that it also happen for hotadd path even drivers are
>>> compiled as modules and if they loaded. Esp some VF share the
>>> same driver with PF.
>>>
>>> calling path:
>>> device driver probe
>>> ==> pci_enable_sriov
>>> ==> virtfn_add
>>> ==> pci_dev_add
>>> ==> pci_bus_device_add
>>> when pci_bus_device_add is called, the VF's driver will be attached.
>>> and at that time PF's driver does not finish yet.
>>>
>>> Need to move out pci_bus_device_add from virtfn_add and call it
>>> later. Fix the problem for two path,
>>> 1. hotadd path: use device_schedule_callback.
>>> 2. for booting path, use initcall to call that for all VF's.
>>>
>>> Signed-off-by: Yinghai Lu <[email protected]>
>>> Cc: [email protected]
>>>
>> I'm sorry, but what is the point of this patch? With device assignment
>> it is always possible to have VFs loaded and the PF driver unloaded
>> since you cannot remove the VFs if they are assigned to a VM.
> unload PF driver will not call pci_disable_sriov?

You cannot call pci_disable_sriov because you will panic all of the
guests that have devices assigned.

>> If there is a driver that has to have the PF driver fully loaded before
>> it instantiates the VFs then it sounds like a buggy driver to me. The
>> VF driver should be able to be loaded when the PF driver is not
>> present. We handle it in igb and ixgbe last I checked, and I don't see
>> any reason why it cannot be handled in all other VF drivers. I'm not
>> saying the VF has to be able to fully functional, but it should be able
>> to detect the PF becoming enabled and then bring itself to a fully
>> functional state. To not handle that case is a bug.
> more than that.
>
> there is work_on_cpu nested lock problem. from calling pci_bus_add_device
> in driver pci probe function.
>
> [ 181.938110] mlx4_core 0000:02:00.0: Started init_resource_tracker: 80 slaves
> [ 181.938759] alloc irq_desc for 1170 on node 0
> [ 181.949104] mlx4_core 0000:02:00.0: irq 1170 for MSI-X
> [ 181.949404] alloc irq_desc for 1171 on node 0
> [ 181.949741] mlx4_core 0000:02:00.0: irq 1171 for MSI-X
> [ 181.969253] alloc irq_desc for 1172 on node 0
> [ 181.969564] mlx4_core 0000:02:00.0: irq 1172 for MSI-X
> [ 181.989137] alloc irq_desc for 1173 on node 0
> [ 181.989485] mlx4_core 0000:02:00.0: irq 1173 for MSI-X
> [ 182.033789] mlx4_core 0000:02:00.0: NOP command IRQ test passed
> [ 182.035380]
> [ 182.035473] =============================================
> [ 182.049065] [ INFO: possible recursive locking detected ]
> [ 182.049349] 3.10.0-rc1-yh-00114-gf59c98e-dirty #1588 Not tainted
> [ 182.069079] ---------------------------------------------
> [ 182.069354] kworker/0:1/2285 is trying to acquire lock:
> [ 182.089080] ((&wfc.work)){+.+.+.}, at: [<ffffffff810ab745>]
> flush_work+0x5/0x280
> [ 182.089500]
> [ 182.089500] but task is already holding lock:
> [ 182.109671] ((&wfc.work)){+.+.+.}, at: [<ffffffff810aabe2>]
> process_one_work+0x202/0x490
> [ 182.129097]
> [ 182.129097] other info that might help us debug this:
> [ 182.129415] Possible unsafe locking scenario:
> [ 182.129415]
> [ 182.149275] CPU0
> [ 182.149386] ----
> [ 182.149513] lock((&wfc.work));
> [ 182.149705] lock((&wfc.work));
> [ 182.169391]
> [ 182.169391] *** DEADLOCK ***
> [ 182.169391]
> [ 182.169722] May be due to missing lock nesting notation
> [ 182.169722]
> [ 182.189461] 3 locks held by kworker/0:1/2285:
> [ 182.189664] #0: (events){.+.+.+}, at: [<ffffffff810aabe2>]
> process_one_work+0x202/0x490
> [ 182.209468] #1: ((&wfc.work)){+.+.+.}, at: [<ffffffff810aabe2>]
> process_one_work+0x202/0x490
> [ 182.229176] #2: (&__lockdep_no_validate__){......}, at:
> [<ffffffff81765eea>] device_attach+0x2a/0xc0
> [ 182.249108]
> [ 182.249108] stack backtrace:
> [ 182.249362] CPU: 0 PID: 2285 Comm: kworker/0:1 Not tainted
> 3.10.0-rc1-yh-00114-gf59c98e-dirty #1588
> [ 182.269258] Hardware name: Oracle Corporation unknown /
> , BIOS 11016600 05/17/2011
> [ 182.289141] Workqueue: events work_for_cpu_fn
> [ 182.289410] ffffffff83350bc0 ffff881025c11778 ffffffff82093a74
> ffff881025c11838
> [ 182.309167] ffffffff810ed194 ffff881025c117b8 ffff881025c38000
> 0000b787702301dc
> [ 182.309587] ffff881000000000 0000000000000002 ffffffff8322cba0
> ffff881025c11878
> [ 182.329524] Call Trace:
> [ 182.329669] [<ffffffff82093a74>] dump_stack+0x19/0x1b
> [ 182.349365] [<ffffffff810ed194>] validate_chain.isra.19+0x8f4/0x1210
> [ 182.349720] [<ffffffff810ed3b6>] ? validate_chain.isra.19+0xb16/0x1210
> [ 182.369261] [<ffffffff810eacf8>] ? trace_hardirqs_off_caller+0x28/0x160
> [ 182.389069] [<ffffffff810f0c40>] __lock_acquire+0xac0/0xce0
> [ 182.389330] [<ffffffff810f150a>] lock_acquire+0xda/0x130
> [ 182.409077] [<ffffffff810ab745>] ? flush_work+0x5/0x280
> [ 182.409320] [<ffffffff810ab78c>] flush_work+0x4c/0x280
> [ 182.409595] [<ffffffff810ab745>] ? flush_work+0x5/0x280
> [ 182.429306] [<ffffffff810ee506>] ? mark_held_locks+0x136/0x150
> [ 182.429634] [<ffffffff820a67fb>] ? _raw_spin_unlock+0x2b/0x40
> [ 182.449352] [<ffffffff810aa5a5>] ? queue_work_on+0x75/0xa0
> [ 182.469088] [<ffffffff810ee78d>] ? trace_hardirqs_on+0xd/0x10
> [ 182.469352] [<ffffffff810aba42>] work_on_cpu+0x82/0x90
> [ 182.489073] [<ffffffff810a7940>] ? find_worker_executing_work+0x90/0x90
> [ 182.489426] [<ffffffff8151e290>] ? pci_device_shutdown+0x70/0x70
> [ 182.509188] [<ffffffff8151ebcf>] pci_device_probe+0xaf/0x110
> [ 182.509448] [<ffffffff8176608d>] driver_probe_device+0xdd/0x220
> [ 182.529193] [<ffffffff81766280>] ? __driver_attach+0xb0/0xb0
> [ 182.529516] [<ffffffff817662b3>] __device_attach+0x33/0x50
> [ 182.549222] [<ffffffff817640b6>] bus_for_each_drv+0x56/0xa0
> [ 182.549503] [<ffffffff81765f48>] device_attach+0x88/0xc0
> [ 182.569215] [<ffffffff81515b49>] pci_bus_add_device+0x39/0x60
> [ 182.569513] [<ffffffff81540605>] pci_bus_add_vf+0x25/0x40
> [ 182.589239] [<ffffffff81540834>] pci_bus_add_device_vfs+0xa4/0xe0
> [ 182.589618] [<ffffffff81c1faa6>] __mlx4_init_one+0xa96/0xc90
> [ 182.609273] [<ffffffff81c1fd0d>] mlx4_init_one+0x4d/0x60
> [ 182.609588] [<ffffffff8151e2db>] local_pci_probe+0x4b/0x80
> [ 182.629584] [<ffffffff810a7958>] work_for_cpu_fn+0x18/0x30
> [ 182.629869] [<ffffffff810aac6b>] process_one_work+0x28b/0x490
> [ 182.649313] [<ffffffff810aabe2>] ? process_one_work+0x202/0x490
> [ 182.649608] [<ffffffff810abf68>] ? worker_thread+0x48/0x370
> [ 182.669325] [<ffffffff810aae9c>] process_scheduled_works+0x2c/0x40
> [ 182.690446] [<ffffffff810ac158>] worker_thread+0x238/0x370
> [ 182.690712] [<ffffffff810ee78d>] ? trace_hardirqs_on+0xd/0x10
> [ 182.709143] [<ffffffff810abf20>] ? manage_workers.isra.18+0x330/0x330
> [ 182.709499] [<ffffffff810b2e78>] kthread+0xe8/0xf0

So how does your patch actually fix this problem? It seems like it is
just avoiding it.

From what I can tell your problem is originating in pci_call_probe. I
believe it is calling work_on_cpu and that doesn't seem correct since
the work should be taking place on a CPU already local to the PF. You
might want to look there to see why you are trying to schedule work on a
CPU which should be perfectly fine for you to already be doing your work on.

Thanks,

Alex

2013-05-14 19:59:36

by Yinghai Lu

Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
<[email protected]> wrote:
> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>> <[email protected]> wrote:

>>> I'm sorry, but what is the point of this patch? With device assignment
>>> it is always possible to have VFs loaded and the PF driver unloaded
>>> since you cannot remove the VFs if they are assigned to a VM.
>> unload PF driver will not call pci_disable_sriov?
>
> You cannot call pci_disable_sriov because you will panic all of the
> guests that have devices assigned.

ixgbe_remove did call pci_disable_sriov...

As for the guest panic, that is a separate problem: it is just like
doing PCI passthrough of a real PCI device and then hot-removing the
card in the host.

...

> So how does your patch actually fix this problem? It seems like it is
> just avoiding it.

yes, until the first one is done.

>
> From what I can tell your problem is originating in pci_call_probe. I
> believe it is calling work_on_cpu and that doesn't seem correct since
> the work should be taking place on a CPU already local to the PF. You
> might want to look there to see why you are trying to schedule work on a
> CPU which should be perfectly fine for you to already be doing your work on.

It always tries to run on a local CPU in the same PXM (proximity domain).

2013-05-14 21:39:04

by Duyck, Alexander H

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/14/2013 12:59 PM, Yinghai Lu wrote:
> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
> <[email protected]> wrote:
>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>> <[email protected]> wrote:
>>>> I'm sorry, but what is the point of this patch? With device assignment
>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>> since you cannot remove the VFs if they are assigned to a VM.
>>> unload PF driver will not call pci_disable_sriov?
>> You cannot call pci_disable_sriov because you will panic all of the
>> guests that have devices assigned.
> ixgbe_remove did call pci_disable_sriov...
>
> for guest panic, that is another problem.
> just like you pci passthrough with real pci device and hotremove the
> card in host.
>
> ...

I suggest you take another look. In ixgbe_disable_sriov, which is the
function that is actually called, we do a check for assigned VFs. If
they are assigned then we do not call pci_disable_sriov.
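
Roughly, the guard looks like this (a sketch from memory, not copied
from the driver, so ixgbe_vfs_are_assigned and the exact wording of the
warning may differ from the real code):

void ixgbe_disable_sriov(struct ixgbe_adapter *adapter)
{
    /* if any VF is still assigned to a guest, leave SR-IOV enabled */
    if (ixgbe_vfs_are_assigned(adapter)) {
        e_dev_warn("VFs are assigned - not disabling SR-IOV\n");
        return;
    }

    pci_disable_sriov(adapter->pdev);
    /* ... tear down the PF's VF bookkeeping ... */
}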

>
>> So how does your patch actually fix this problem? It seems like it is
>> just avoiding it.
> yes, until the first one is done.

Avoiding the issue doesn't fix the underlying problem; instead you
are likely just introducing more bugs as a result.

>> From what I can tell your problem is originating in pci_call_probe. I
>> believe it is calling work_on_cpu and that doesn't seem correct since
>> the work should be taking place on a CPU already local to the PF. You
>> might want to look there to see why you are trying to schedule work on a
>> CPU which should be perfectly fine for you to already be doing your work on.
> it always try to go with local cpu with same pxm.

The problem is we really shouldn't be calling work_on_cpu in this case
since we are already on the correct CPU. What probably should be
happening is that pci_call_probe should check whether the current CPU is
already contained within the device's node map and, if so, just call
local_pci_probe directly. That way you avoid deadlocking the system by
flushing the work queue of the CPU you are currently on.
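
Something like this is what I have in mind (just a sketch of the idea
against the current pci_call_probe, not a tested patch):

static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
                          const struct pci_device_id *id)
{
    int error, node;
    struct drv_dev_and_id ddi = { drv, dev, id };

    /* Probe on a CPU local to the device's node, but only bounce
     * the work to another CPU if we are not already on one. */
    node = dev_to_node(&dev->dev);
    if (node >= 0 && node != cpu_to_node(raw_smp_processor_id())) {
        int cpu;

        get_online_cpus();
        cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
        if (cpu < nr_cpu_ids)
            error = work_on_cpu(cpu, local_pci_probe, &ddi);
        else
            error = local_pci_probe(&ddi);
        put_online_cpus();
    } else
        error = local_pci_probe(&ddi);

    return error;
}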

Thanks,

Alex

2013-05-16 04:00:18

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Tue, May 14, 2013 at 11:43 AM, Yinghai Lu <[email protected]> wrote:

> BTW, do you have any plan to make mlx4 support sriov_configure via sysfs?

Yes, we do; we're waiting for a firmware change that will allow such a
patch to work.

Please note that this whole lockdep warning was identified as a false
positive by Tejun: http://www.spinics.net/lists/linux-pci/msg21568.html

Or.

2013-05-16 04:39:15

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Wed, May 15, 2013 at 9:00 PM, Or Gerlitz <[email protected]> wrote:
> On Tue, May 14, 2013 at 11:43 AM, Yinghai Lu <[email protected]> wrote:
>
>> BTW, do you have any plan to make mlx4 support sriov_configure via sysfs?
>
> yes we do, we're waiting for a firmware change that will allow for
> such a patch to get working.
>
> Please note that this whole lockdep warning was identified as false
> positive by Tejun http://www.spinics.net/lists/linux-pci/msg21568.html

No; at least one time I noticed a real hang happen.

2013-05-16 04:56:45

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Thu, May 16, 2013 at 12:39 AM, Yinghai Lu <[email protected]> wrote:
>
> On Wed, May 15, 2013 at 9:00 PM, Or Gerlitz <[email protected]> wrote:
> > On Tue, May 14, 2013 at 11:43 AM, Yinghai Lu <[email protected]> wrote:
> >
> >> BTW, do you have any plan to make mlx4 support sriov_configure via
> >> sysfs?
> >
> > yes we do, we're waiting for a firmware change that will allow for
> > such a patch to get working.
> >
> > Please note that this whole lockdep warning was identified as false
> > positive by Tejun http://www.spinics.net/lists/linux-pci/msg21568.html
>
> No, at least one time, I noticed there is one real hang happens.

Tejun, so what's your thinking here?

2013-05-16 06:39:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Mon, May 13, 2013 at 07:28:25PM -0700, Yinghai Lu wrote:
> Found that the kernel tries to load the mlx4 driver for VFs before the
> PF's is really loaded when the drivers are built-in and the kernel
> command line includes probe_vfs=63, num_vfs=63.
>
> It turns out that this also happens on the hotadd path even when the
> drivers are compiled as modules and already loaded, especially when
> some VFs share the same driver with the PF.
>
> calling path:
> device driver probe
> ==> pci_enable_sriov
> ==> virtfn_add
> ==> pci_device_add
> ==> pci_bus_add_device
> When pci_bus_add_device is called, the VF's driver will be attached,
> and at that time the PF's driver has not finished probing yet.

Okay, but why is this a problem?


> Need to move pci_bus_add_device out of virtfn_add and call it
> later. Fix the problem for the two paths:
> 1. hotadd path: use device_schedule_callback.
> 2. boot path: use an initcall to call it for all VFs.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: [email protected]
>
> ---
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 7 +
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 4 -
> drivers/net/ethernet/cisco/enic/enic_main.c | 2
> drivers/net/ethernet/emulex/benet/be_main.c | 4 +
> drivers/net/ethernet/intel/igb/igb_main.c | 11 ++
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2
> drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 9 +-
> drivers/net/ethernet/mellanox/mlx4/main.c | 2
> drivers/net/ethernet/neterion/vxge/vxge-main.c | 3
> drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c | 5 +
> drivers/net/ethernet/sfc/efx.c | 1
> drivers/pci/iov.c | 73 +++++++++++++++++--
> drivers/scsi/lpfc/lpfc_init.c | 2
> include/linux/pci.h | 4 +
> 14 files changed, 115 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ linux-2.6/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2308,6 +2308,8 @@ slave_start:
> priv->pci_dev_data = pci_dev_data;
> pci_set_drvdata(pdev, dev);
>
> + pci_bus_add_device_vfs(pdev);
> +
> return 0;
>
> err_port:
> Index: linux-2.6/drivers/pci/iov.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/iov.c
> +++ linux-2.6/drivers/pci/iov.c
> @@ -66,7 +66,8 @@ static void virtfn_remove_bus(struct pci
> pci_remove_bus(child);
> }
>
> -static int virtfn_add(struct pci_dev *dev, int id, int reset)
> +static int virtfn_add(struct pci_dev *dev, int id, int reset,
> + struct pci_dev **ret)
> {
> int i;
> int rc;
> @@ -116,7 +117,6 @@ static int virtfn_add(struct pci_dev *de
> pci_device_add(virtfn, virtfn->bus);
> mutex_unlock(&iov->dev->sriov->lock);
>
> - rc = pci_bus_add_device(virtfn);
> sprintf(buf, "virtfn%u", id);
> rc = sysfs_create_link(&dev->dev.kobj, &virtfn->dev.kobj, buf);
> if (rc)
> @@ -127,6 +127,8 @@ static int virtfn_add(struct pci_dev *de
>
> kobject_uevent(&virtfn->dev.kobj, KOBJ_CHANGE);
>
> + if (ret)
> + *ret = virtfn;
> return 0;
>
> failed2:
> @@ -141,6 +143,55 @@ failed1:
> return rc;
> }
>
> +static void pci_bus_add_vf(struct pci_dev *dev)
> +{
> + int rc;
> +
> + if (!dev || !dev->is_virtfn || dev->is_added)
> + return;
> +
> + rc = pci_bus_add_device(dev);
> +}
> +
> +static void bus_add_vfs(struct device *device)
> +{
> + struct pci_dev *dev = to_pci_dev(device);
> + int i, num_vfs = pci_num_vf(dev);
> +
> + for (i = 0; i < num_vfs; i++) {
> + struct pci_bus *bus;
> + struct pci_dev *virtfn;
> +
> + bus = pci_find_bus(pci_domain_nr(dev->bus), virtfn_bus(dev, i));
> + if (!bus)
> + continue;
> +
> + virtfn = pci_get_slot(bus, virtfn_devfn(dev, i));
> + pci_bus_add_vf(virtfn);
> + pci_dev_put(virtfn);
> + }
> +}
> +void pci_bus_add_device_vfs(struct pci_dev *pdev)
> +{
> + if (system_state == SYSTEM_BOOTING)
> + return;
> +
> + device_schedule_callback(&pdev->dev, bus_add_vfs);
> +}
> +EXPORT_SYMBOL_GPL(pci_bus_add_device_vfs);
> +
> +/* Make sure all VFs get added before pci_sysfs_init */
> +static int __init pci_bus_add_device_vfs_booting(void)
> +{
> + struct pci_dev *dev = NULL;
> +
> + for_each_pci_dev(dev)
> + pci_bus_add_vf(dev);
> +
> + return 0;
> +}
> +device_initcall_sync(pci_bus_add_device_vfs_booting);
> +
> static void virtfn_remove(struct pci_dev *dev, int id, int reset)
> {
> char buf[VIRTFN_ID_LEN];
> @@ -213,14 +264,22 @@ static void sriov_migration_task(struct
> if (state == PCI_SRIOV_VFM_MI) {
> writeb(PCI_SRIOV_VFM_AV, iov->mstate + i);
> state = readb(iov->mstate + i);
> - if (state == PCI_SRIOV_VFM_AV)
> - virtfn_add(iov->self, i, 1);
> + if (state == PCI_SRIOV_VFM_AV) {
> + struct pci_dev *virtfn = NULL;
> +
> + virtfn_add(iov->self, i, 1, &virtfn);
> + pci_bus_add_vf(virtfn);
> + }
> } else if (state == PCI_SRIOV_VFM_MO) {
> virtfn_remove(iov->self, i, 1);
> writeb(PCI_SRIOV_VFM_UA, iov->mstate + i);
> state = readb(iov->mstate + i);
> - if (state == PCI_SRIOV_VFM_AV)
> - virtfn_add(iov->self, i, 0);
> + if (state == PCI_SRIOV_VFM_AV) {
> + struct pci_dev *virtfn = NULL;
> +
> + virtfn_add(iov->self, i, 0, &virtfn);
> + pci_bus_add_vf(virtfn);
> + }
> }
> }
>
> @@ -356,7 +415,7 @@ static int sriov_enable(struct pci_dev *
> initial = nr_virtfn;
>
> for (i = 0; i < initial; i++) {
> - rc = virtfn_add(dev, i, 0);
> + rc = virtfn_add(dev, i, 0, NULL);
> if (rc)
> goto failed;
> }
> Index: linux-2.6/include/linux/pci.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pci.h
> +++ linux-2.6/include/linux/pci.h
> @@ -1659,6 +1659,7 @@ void __iomem *pci_ioremap_bar(struct pci
>
> #ifdef CONFIG_PCI_IOV
> int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
> +void pci_bus_add_device_vfs(struct pci_dev *dev);
> void pci_disable_sriov(struct pci_dev *dev);
> irqreturn_t pci_sriov_migration(struct pci_dev *dev);
> int pci_num_vf(struct pci_dev *dev);
> @@ -1670,6 +1671,9 @@ static inline int pci_enable_sriov(struc
> {
> return -ENODEV;
> }
> +static inline void pci_bus_add_device_vfs(struct pci_dev *dev)
> +{
> +}
> static inline void pci_disable_sriov(struct pci_dev *dev)
> {
> }
> Index: linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/intel/igb/igb_main.c
> +++ linux-2.6/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -2366,6 +2366,9 @@ static int igb_probe(struct pci_dev *pde
> }
>
> pm_runtime_put_noidle(&pdev->dev);
> +
> + pci_bus_add_device_vfs(pdev);
> +
> return 0;
>
> err_register:
> @@ -7278,8 +7281,12 @@ static int igb_pci_sriov_configure(struc
> #ifdef CONFIG_PCI_IOV
> if (num_vfs == 0)
> return igb_pci_disable_sriov(dev);
> - else
> - return igb_pci_enable_sriov(dev, num_vfs);
> + else {
> + int ret = igb_pci_enable_sriov(dev, num_vfs);
> + if (ret > 0)
> + pci_bus_add_device_vfs(dev);
> + return ret;
> + }
> #endif
> return 0;
> }
> Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> +++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
> @@ -346,8 +346,13 @@ int ixgbe_pci_sriov_configure(struct pci
> {
> if (num_vfs == 0)
> return ixgbe_pci_sriov_disable(dev);
> - else
> - return ixgbe_pci_sriov_enable(dev, num_vfs);
> + else {
> + int ret = ixgbe_pci_sriov_enable(dev, num_vfs);
> +
> + if (ret > 0)
> + pci_bus_add_device_vfs(dev);
> + return ret;
> + }
> }
>
> static int ixgbe_set_vf_multicasts(struct ixgbe_adapter *adapter,
> Index: linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ linux-2.6/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -7658,6 +7658,8 @@ skip_sriov:
> IXGBE_LINK_SPEED_10GB_FULL | IXGBE_LINK_SPEED_1GB_FULL,
> true);
>
> + pci_bus_add_device_vfs(pdev);
> +
> return 0;
>
> err_register:
> Index: linux-2.6/drivers/scsi/lpfc/lpfc_init.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/lpfc/lpfc_init.c
> +++ linux-2.6/drivers/scsi/lpfc/lpfc_init.c
> @@ -10582,6 +10582,8 @@ lpfc_pci_probe_one(struct pci_dev *pdev,
> else
> rc = lpfc_pci_probe_one_s3(pdev, pid);
>
> + pci_bus_add_device_vfs(pdev);
> +
> return rc;
> }
>
> Index: linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/emulex/benet/be_main.c
> +++ linux-2.6/drivers/net/ethernet/emulex/benet/be_main.c
> @@ -4119,6 +4119,7 @@ static int lancer_recover_func(struct be
> if (status)
> goto err;
>
> + pci_bus_add_device_vfs(adapter->pdev);
> if (netif_running(adapter->netdev)) {
> status = be_open(adapter->netdev);
> if (status)
> @@ -4335,6 +4336,8 @@ static int be_probe(struct pci_dev *pdev
> dev_info(&pdev->dev, "%s: %s %s port %c\n", nic_name(pdev),
> func_name(adapter), mc_name(adapter), port_name);
>
> + pci_bus_add_device_vfs(pdev);
> +
> return 0;
>
> unsetup:
> @@ -4406,6 +4409,7 @@ static int be_resume(struct pci_dev *pde
> rtnl_unlock();
> }
>
> + pci_bus_add_device_vfs(adapter->pdev);
> schedule_delayed_work(&adapter->func_recovery_work,
> msecs_to_jiffies(1000));
> netif_device_attach(netdev);
> Index: linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> +++ linux-2.6/drivers/net/ethernet/qlogic/qlcnic/qlcnic_sriov_pf.c
> @@ -568,8 +568,11 @@ int qlcnic_pci_sriov_configure(struct pc
>
> if (num_vfs == 0)
> err = qlcnic_pci_sriov_disable(adapter);
> - else
> + else {
> err = qlcnic_pci_sriov_enable(adapter, num_vfs);
> + if (err > 0)
> + pci_bus_add_device_vfs(dev);
> + }
>
> clear_bit(__QLCNIC_RESETTING, &adapter->state);
> return err;
> Index: linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> +++ linux-2.6/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c
> @@ -3048,7 +3048,12 @@ int bnx2x_sriov_configure(struct pci_dev
> pci_disable_sriov(dev);
> return 0;
> } else {
> - return bnx2x_enable_sriov(bp);
> + int ret = bnx2x_enable_sriov(bp);
> +
> + if (ret > 0)
> + pci_bus_add_device_vfs(dev);
> +
> + return 0;
> }
> }
>
> Index: linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> +++ linux-2.6/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
> @@ -5749,10 +5749,12 @@ static int init_one(struct pci_dev *pdev
> sriov:
> #ifdef CONFIG_PCI_IOV
> if (func < ARRAY_SIZE(num_vf) && num_vf[func] > 0)
> - if (pci_enable_sriov(pdev, num_vf[func]) == 0)
> + if (pci_enable_sriov(pdev, num_vf[func]) == 0) {
> dev_info(&pdev->dev,
> "instantiated %u virtual functions\n",
> num_vf[func]);
> + pci_bus_add_device_vfs(pdev);
> + }
> #endif
> return 0;
>
> Index: linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/cisco/enic/enic_main.c
> +++ linux-2.6/drivers/net/ethernet/cisco/enic/enic_main.c
> @@ -2524,6 +2524,8 @@ static int enic_probe(struct pci_dev *pd
> goto err_out_dev_deinit;
> }
>
> + pci_bus_add_device_vfs(pdev);
> +
> return 0;
>
> err_out_dev_deinit:
> Index: linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/neterion/vxge/vxge-main.c
> +++ linux-2.6/drivers/net/ethernet/neterion/vxge/vxge-main.c
> @@ -4731,6 +4731,9 @@ vxge_probe(struct pci_dev *pdev, const s
> vxge_hw_device_trace_level_get(hldev));
>
> kfree(ll_config);
> +
> + pci_bus_add_device_vfs(pdev);
> +
> return 0;
>
> _exit6:
> Index: linux-2.6/drivers/net/ethernet/sfc/efx.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/ethernet/sfc/efx.c
> +++ linux-2.6/drivers/net/ethernet/sfc/efx.c
> @@ -2822,6 +2822,7 @@ static int efx_pci_probe(struct pci_dev
> netif_warn(efx, probe, efx->net_dev,
> "pci_enable_pcie_error_reporting failed (%d)\n", rc);
>
> + pci_bus_add_device_vfs(pci_dev);
> return 0;
>
> fail4:

2013-05-16 07:54:33

by Gu Zheng

[permalink] [raw]
Subject: Re: [PATCH 4/7] PCI: Fix racing for pci device removing via sysfs

Hi Yinghai,
It works well.

Thanks!
Gu

On 05/14/2013 10:28 AM, Yinghai Lu wrote:

> Gu found nested removing through
> echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
> /sys/bus/pci/devices/0000\:1a\:01.0/remove
>
> will cause kernel crash as bus get freed.
>
> [ 418.946462] CPU 4
> [ 418.968377] Pid: 512, comm: kworker/u:2 Tainted: G W 3.8.0 #2
> FUJITSU-SV PRIMEQUEST 1800E/SB
> [ 419.081763] RIP: 0010:[<ffffffff8137972e>] [<ffffffff8137972e>]
> pci_bus_read_config_word+0x5e/0x90
> [ 420.494137] Call Trace:
> [ 420.523326] [<ffffffff813851ef>] ? remove_callback+0x1f/0x40
> [ 420.591984] [<ffffffff8138044b>] pci_pme_active+0x4b/0x1c0
> [ 420.658545] [<ffffffff8137d8e7>] pci_stop_bus_device+0x57/0xb0
> [ 420.729259] [<ffffffff8137dab6>] pci_stop_and_remove_bus_device+0x16/0x30
> [ 420.811392] [<ffffffff813851fb>] remove_callback+0x2b/0x40
> [ 420.877955] [<ffffffff81257a56>] sysfs_schedule_callback_work+0x26/0x70
>
> https://bugzilla.kernel.org/show_bug.cgi?id=54411
>
> We have one patch that will let device hold bus ref to prevent it from
> being freed, but that will still generate warning.
>
> ------------[ cut here ]------------
> WARNING: at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
> Hardware name: PRIMEQUEST 1800E
> list_del corruption, ffff8807d1b6c000->next is LIST_POISON1 (dead000000100100)
> Call Trace:
> [<ffffffff81056d4f>] warn_slowpath_common+0x7f/0xc0
> [<ffffffff81056e46>] warn_slowpath_fmt+0x46/0x50
> [<ffffffff81280b13>] __list_del_entry+0x63/0xd0
> [<ffffffff81280b91>] list_del+0x11/0x40
> [<ffffffff81298331>] pci_destroy_dev+0x31/0xc0
> [<ffffffff812985bb>] pci_remove_bus_device+0x5b/0x70
> [<ffffffff812985ee>] pci_stop_and_remove_bus_device+0x1e/0x30
> [<ffffffff8129fc89>] remove_callback+0x29/0x40
> [<ffffffff811f3b84>] sysfs_schedule_callback_work+0x24/0x70
>
> Bjorn pointed out that a pci_dev should retain its reference to the pci_bus
> for as long as the pci_dev exists, so releasing that reference should
> happen in pci_release_dev() instead.
>
> That way we will not need to touch the pci-sysfs bits.
>
> -v6: split other changes to other patches.
>
> Reported-by: Gu Zheng <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>


Tested-by: Gu Zheng <[email protected]>

>
> ---
> drivers/pci/probe.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> Index: linux-2.6/drivers/pci/probe.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/probe.c
> +++ linux-2.6/drivers/pci/probe.c
> @@ -1149,6 +1149,7 @@ static void pci_release_dev(struct devic
> list_del(&pci_dev->bus_list);
> up_write(&pci_bus_sem);
> pci_free_resources(pci_dev);
> + put_device(&pci_dev->bus->dev);
>
> pci_release_capabilities(pci_dev);
> pci_release_of_node(pci_dev);
> @@ -1360,6 +1361,7 @@ void pci_device_add(struct pci_dev *dev,
> down_write(&pci_bus_sem);
> list_add_tail(&dev->bus_list, &bus->devices);
> up_write(&pci_bus_sem);
> + get_device(&bus->dev);
>
> ret = pcibios_add_device(dev);
> WARN_ON(ret < 0);
>

2013-05-16 17:53:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Thu, May 16, 2013 at 12:56:42AM -0400, Or Gerlitz wrote:
> > > Please note that this whole lockdep warning was identified as false
> > > positive by Tejun http://www.spinics.net/lists/linux-pci/msg21568.html
> >
> > No, at least one time, I noticed there is one real hang happens.
>
> Tejun, so what your thinking here?

Can't really tell much without more details. Yinghai, do you have any
logs from such a hang? Are you sure it's from this?

Thanks.

--
tejun

2013-05-16 18:36:11

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Thu, May 16, 2013 at 10:53 AM, Tejun Heo <[email protected]> wrote:
> On Thu, May 16, 2013 at 12:56:42AM -0400, Or Gerlitz wrote:
>> > > Please note that this whole lockdep warning was identified as false
>> > > positive by Tejun http://www.spinics.net/lists/linux-pci/msg21568.html
>> >
>> > No, at least one time, I noticed there is one real hang happens.
>>
>> Tejun, so what your thinking here?
>
> Can't really tell much without more details. Yinghai, do you have any
> logs from such hang? Are you sure it's from this?

I did not keep that log.

Will remove AlexD's patch and try several times to see if I can duplicate it again.

Thanks

Yinghai

2013-05-20 12:23:24

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Thu, May 16, 2013 at 9:36 PM, Yinghai Lu <[email protected]> wrote:
> On Thu, May 16, 2013 at 10:53 AM, Tejun Heo <[email protected]> wrote:
>> On Thu, May 16, 2013 at 12:56:42AM -0400, Or Gerlitz wrote:

>>>>> Please note that this whole lockdep warning was identified as false
>>>>> positive by Tejun http://www.spinics.net/lists/linux-pci/msg21568.html

>>>> No, at least one time, I noticed there is one real hang happens.

>>> Tejun, so what your thinking here?

>> Can't really tell much without more details. Yinghai, do you have any
>> logs from such hang? Are you sure it's from this?

> I did not keep that log.
> Will remove AlexD patch, try several times to see if I can duplicate it again.

Any news here?

Or.

2013-05-21 21:30:45

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/14/2013 05:39 PM, Alexander Duyck wrote:
> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>> <[email protected]> wrote:
>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>> <[email protected]> wrote:
>>>>> I'm sorry, but what is the point of this patch? With device assignment
>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>> unload PF driver will not call pci_disable_sriov?
>>> You cannot call pci_disable_sriov because you will panic all of the
>>> guests that have devices assigned.
>> ixgbe_remove did call pci_disable_sriov...
>>
>> for guest panic, that is another problem.
>> just like you pci passthrough with real pci device and hotremove the
>> card in host.
>>
>> ...
>
> I suggest you take another look. In ixgbe_disable_sriov, which is the
> function that is called we do a check for assigned VFs. If they are
> assigned then we do not call pci_disable_sriov.
>
>>
>>> So how does your patch actually fix this problem? It seems like it is
>>> just avoiding it.
>> yes, until the first one is done.
>
> Avoiding the issue doesn't fix the underlying problem and instead you
> are likely just introducing more bugs as a result.
>
>>> From what I can tell your problem is originating in pci_call_probe. I
>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>> the work should be taking place on a CPU already local to the PF. You
>>> might want to look there to see why you are trying to schedule work on a
>>> CPU which should be perfectly fine for you to already be doing your work on.
>> it always try to go with local cpu with same pxm.
>
> The problem is we really shouldn't be calling work_for_cpu in this case
> since we are already on the correct CPU. What probably should be
> happening is that pci_call_probe should be doing a check to see if the
> current CPU is already contained within the device node map and if so
> just call local_pci_probe directly. That way you can avoid deadlocking
> the system by trying to flush the CPU queue of the CPU you are currently on.
>
That's the patch that Michael Tsirkin posted as a fix,
but it was noted that in the case where the _same_ driver is used for VF & PF,
other deadlocks may occur.
It would work in the case of ixgbe/ixgbevf, but not for something like
the Mellanox PF/VF driver (which is one and the same driver).

>
> Alex

2013-05-21 21:32:03

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/21/2013 05:30 PM, Don Dutile wrote:
> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>> <[email protected]> wrote:
>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>> <[email protected]> wrote:
>>>>>> I'm sorry, but what is the point of this patch? With device assignment
>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>> unload PF driver will not call pci_disable_sriov?
>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>> guests that have devices assigned.
>>> ixgbe_remove did call pci_disable_sriov...
>>>
>>> for guest panic, that is another problem.
>>> just like you pci passthrough with real pci device and hotremove the
>>> card in host.
>>>
>>> ...
>>
>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>> function that is called we do a check for assigned VFs. If they are
>> assigned then we do not call pci_disable_sriov.
>>
>>>
>>>> So how does your patch actually fix this problem? It seems like it is
>>>> just avoiding it.
>>> yes, until the first one is done.
>>
>> Avoiding the issue doesn't fix the underlying problem and instead you
>> are likely just introducing more bugs as a result.
>>
>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>> the work should be taking place on a CPU already local to the PF. You
>>>> might want to look there to see why you are trying to schedule work on a
>>>> CPU which should be perfectly fine for you to already be doing your work on.
>>> it always try to go with local cpu with same pxm.
>>
>> The problem is we really shouldn't be calling work_for_cpu in this case
>> since we are already on the correct CPU. What probably should be
>> happening is that pci_call_probe should be doing a check to see if the
>> current CPU is already contained within the device node map and if so
>> just call local_pci_probe directly. That way you can avoid deadlocking
>> the system by trying to flush the CPU queue of the CPU you are currently on.
>>
> That's the patch that Michael Tsirkin posted for a fix,
> but it was noted that if you have the case where the _same_ driver is used for vf & pf,
> other deadlocks may occur.
> It would work in the case of ixgbe/ixgbevf, but not for something like
> the Mellanox pf/vf driver (which is the same).
>
Apologies; here's the thread that discussed the issue:
https://patchwork.kernel.org/patch/2458681/

>>
>> Alex
>

2013-05-21 21:49:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
> >On 05/14/2013 12:59 PM, Yinghai Lu wrote:
> >>On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
> >><[email protected]> wrote:
> >>>On 05/14/2013 11:44 AM, Yinghai Lu wrote:
> >>>>On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
> >>>><[email protected]> wrote:
> >>>>>I'm sorry, but what is the point of this patch? With device assignment
> >>>>>it is always possible to have VFs loaded and the PF driver unloaded
> >>>>>since you cannot remove the VFs if they are assigned to a VM.
> >>>>unload PF driver will not call pci_disable_sriov?
> >>>You cannot call pci_disable_sriov because you will panic all of the
> >>>guests that have devices assigned.
> >>ixgbe_remove did call pci_disable_sriov...
> >>
> >>for guest panic, that is another problem.
> >>just like you pci passthrough with real pci device and hotremove the
> >>card in host.
> >>
> >>...
> >
> >I suggest you take another look. In ixgbe_disable_sriov, which is the
> >function that is called we do a check for assigned VFs. If they are
> >assigned then we do not call pci_disable_sriov.
> >
> >>
> >>>So how does your patch actually fix this problem? It seems like it is
> >>>just avoiding it.
> >>yes, until the first one is done.
> >
> >Avoiding the issue doesn't fix the underlying problem and instead you
> >are likely just introducing more bugs as a result.
> >
> >>> From what I can tell your problem is originating in pci_call_probe. I
> >>>believe it is calling work_on_cpu and that doesn't seem correct since
> >>>the work should be taking place on a CPU already local to the PF. You
> >>>might want to look there to see why you are trying to schedule work on a
> >>>CPU which should be perfectly fine for you to already be doing your work on.
> >>it always try to go with local cpu with same pxm.
> >
> >The problem is we really shouldn't be calling work_for_cpu in this case
> >since we are already on the correct CPU. What probably should be
> >happening is that pci_call_probe should be doing a check to see if the
> >current CPU is already contained within the device node map and if so
> >just call local_pci_probe directly. That way you can avoid deadlocking
> >the system by trying to flush the CPU queue of the CPU you are currently on.
> >
> That's the patch that Michael Tsirkin posted for a fix,
> but it was noted that if you have the case where the _same_ driver is used for vf & pf,
> other deadlocks may occur.
> It would work in the case of ixgbe/ixgbevf, but not for something like
> the Mellanox pf/vf driver (which is the same).
>

I think our conclusion was that this is a false positive for Mellanox.
If not, we need to understand the deadlock better.

> >
> >Alex

2013-05-21 21:59:03

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/21/2013 02:31 PM, Don Dutile wrote:
> On 05/21/2013 05:30 PM, Don Dutile wrote:
>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>> <[email protected]> wrote:
>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>> <[email protected]> wrote:
>>>>>>> I'm sorry, but what is the point of this patch? With device
>>>>>>> assignment
>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>> guests that have devices assigned.
>>>> ixgbe_remove did call pci_disable_sriov...
>>>>
>>>> for guest panic, that is another problem.
>>>> just like you pci passthrough with real pci device and hotremove the
>>>> card in host.
>>>>
>>>> ...
>>>
>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>> function that is called we do a check for assigned VFs. If they are
>>> assigned then we do not call pci_disable_sriov.
>>>
>>>>
>>>>> So how does your patch actually fix this problem? It seems like it is
>>>>> just avoiding it.
>>>> yes, until the first one is done.
>>>
>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>> are likely just introducing more bugs as a result.
>>>
>>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>> might want to look there to see why you are trying to schedule work
>>>>> on a
>>>>> CPU which should be perfectly fine for you to already be doing your
>>>>> work on.
>>>> it always try to go with local cpu with same pxm.
>>>
>>> The problem is we really shouldn't be calling work_for_cpu in this case
>>> since we are already on the correct CPU. What probably should be
>>> happening is that pci_call_probe should be doing a check to see if the
>>> current CPU is already contained within the device node map and if so
>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>> the system by trying to flush the CPU queue of the CPU you are
>>> currently on.
>>>
>> That's the patch that Michael Tsirkin posted for a fix,
>> but it was noted that if you have the case where the _same_ driver is
>> used for vf & pf,
>> other deadlocks may occur.
>> It would work in the case of ixgbe/ixgbevf, but not for something like
>> the Mellanox pf/vf driver (which is the same).
>>
> apologies; here's the thread the discussed the issue:
> https://patchwork.kernel.org/patch/2458681/
>

I found out about that patch after I submitted one that was similar.
The only real complaint I had about his patch was that it was only
looking at the CPU, and he could save himself some trouble by just doing
the work locally if we were already on the correct NUMA node. For
example, if the system only has one node in it, what is the point of
scheduling all of the work on CPU 0? My alternative patch can be found
at: https://patchwork.kernel.org/patch/2568881/

As far as the inter-driver locking issues for the same driver, I don't
think that is really any kind of issue. Most drivers shouldn't be
holding any big locks when they call pci_enable_sriov. If I am not
mistaken, the follow-on patch I submitted, which was similar to
Michael's, was reported to have resolved the issue.

As far as the Mellanox PF/VF goes, the bigger issue is that when they
call pci_enable_sriov they are not ready to handle VFs. There have been
several suggestions on how to resolve it, including -EPROBE_DEFER or the
igbvf/ixgbevf approach of just bringing up the device in a "link down" state.
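
For the deferral option, the VF probe could just bail out until the PF
is ready; a rough sketch (the probe function and the readiness helper
here are made up for illustration, they are not existing mlx4 calls):

static int mlx4_vf_probe(struct pci_dev *pdev,
                         const struct pci_device_id *id)
{
    struct pci_dev *pf = pci_physfn(pdev);  /* PF behind this VF */

    /* hypothetical helper: true once the PF finished its own init */
    if (!mlx4_pf_is_ready(pf))
        return -EPROBE_DEFER;  /* driver core retries this probe later */

    /* ... normal VF initialization ... */
    return 0;
}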

Thanks,

Alex

2013-05-21 22:01:12

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>> <[email protected]> wrote:
>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>> <[email protected]> wrote:
>>>>>>> I'm sorry, but what is the point of this patch? With device assignment
>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>> guests that have devices assigned.
>>>> ixgbe_remove did call pci_disable_sriov...
>>>>
>>>> for guest panic, that is another problem.
>>>> just like you pci passthrough with real pci device and hotremove the
>>>> card in host.
>>>>
>>>> ...
>>>
>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>> function that is called we do a check for assigned VFs. If they are
>>> assigned then we do not call pci_disable_sriov.
>>>
>>>>
>>>>> So how does your patch actually fix this problem? It seems like it is
>>>>> just avoiding it.
>>>> yes, until the first one is done.
>>>
>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>> are likely just introducing more bugs as a result.
>>>
>>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>> might want to look there to see why you are trying to schedule work on a
>>>>> CPU which should be perfectly fine for you to already be doing your work on.
>>>> it always try to go with local cpu with same pxm.
>>>
>>> The problem is we really shouldn't be calling work_for_cpu in this case
>>> since we are already on the correct CPU. What probably should be
>>> happening is that pci_call_probe should be doing a check to see if the
>>> current CPU is already contained within the device node map and if so
>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>> the system by trying to flush the CPU queue of the CPU you are currently on.
>>>
>> That's the patch that Michael Tsirkin posted for a fix,
>> but it was noted that if you have the case where the _same_ driver is used for vf & pf,
>> other deadlocks may occur.
>> It would work in the case of ixgbe/ixgbevf, but not for something like
>> the Mellanox pf/vf driver (which is the same).
>>
>
> I think our conclusion was this is a false positive for Mellanox.
> If not, we need to understand what the deadlock is better.
>

As I understand the issue, the problem is not a deadlock for Mellanox
(at least with either your patch or mine applied); the issue is that the
PF is not ready to handle VFs when pci_enable_sriov is called, due to
some firmware issues.

Thanks,

Alex

2013-05-21 22:09:25

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/21/2013 05:58 PM, Alexander Duyck wrote:
> On 05/21/2013 02:31 PM, Don Dutile wrote:
>> On 05/21/2013 05:30 PM, Don Dutile wrote:
>>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>> <[email protected]> wrote:
>>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>> <[email protected]> wrote:
>>>>>>>> I'm sorry, but what is the point of this patch? With device
>>>>>>>> assignment
>>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>>> guests that have devices assigned.
>>>>> ixgbe_remove did call pci_disable_sriov...
>>>>>
>>>>> for guest panic, that is another problem.
>>>>> just like you pci passthrough with real pci device and hotremove the
>>>>> card in host.
>>>>>
>>>>> ...
>>>>
>>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>>> function that is called we do a check for assigned VFs. If they are
>>>> assigned then we do not call pci_disable_sriov.
>>>>
>>>>>
>>>>>> So how does your patch actually fix this problem? It seems like it is
>>>>>> just avoiding it.
>>>>> yes, until the first one is done.
>>>>
>>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>>> are likely just introducing more bugs as a result.
>>>>
>>>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>>> might want to look there to see why you are trying to schedule work
>>>>>> on a
>>>>>> CPU which should be perfectly fine for you to already be doing your
>>>>>> work on.
>>>>> it always try to go with local cpu with same pxm.
>>>>
>>>> The problem is we really shouldn't be calling work_for_cpu in this case
>>>> since we are already on the correct CPU. What probably should be
>>>> happening is that pci_call_probe should be doing a check to see if the
>>>> current CPU is already contained within the device node map and if so
>>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>>> the system by trying to flush the CPU queue of the CPU you are
>>>> currently on.
>>>>
>>> That's the patch that Michael Tsirkin posted for a fix,
>>> but it was noted that if you have the case where the _same_ driver is
>>> used for vf& pf,
>>> other deadlocks may occur.
>>> It would work in the case of ixgbe/ixgbevf, but not for something like
>>> the Mellanox pf/vf driver (which is the same).
>>>
>> apologies; here's the thread the discussed the issue:
>> https://patchwork.kernel.org/patch/2458681/
>>
>
> I found out about that patch after I submitted one that was similar.
> The only real complaint I had about his patch was that it was only
> looking at the CPU and he could save himself some trouble by just doing
> the work locally if we were on the correct NUMA node. For example if
> the system only has one node in it what is the point in scheduling all
> of the work on CPU 0? My alternative patch can be found at:
> https://patchwork.kernel.org/patch/2568881/
>
> As far as the inter-driver locking issues for same driver I don't think
> that is really any kind if issue. Most drivers shouldn't be holding any
> big locks when they call pci_enable_sriov. If I am not mistaken the
> follow on patch I submitted which was similar to Michaels was reported
> to have resolved the issue.
>
You mean the above patchwork patch, or another one?

> As far as the Mellanox PF/VF the bigger issue is that when they call
> pci_enable_sriov they are not ready to handle VFs. There have been
> several suggestions on how to resolve it including -EPROBE_DEFER or the
> igbvf/ixgbevf approach of just brining up the device in a "link down" state.
>
Thanks for the summary. I was backlogged on email and responding as I
read them; I should have read through the whole thread before chiming in.

> Thanks,
>
> Alex
>

2013-05-21 22:11:37

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Tue, May 21, 2013 at 03:01:08PM -0700, Alexander Duyck wrote:
> On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
> > On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
> >> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
> >>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
> >>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
> >>>> <[email protected]> wrote:
> >>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
> >>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
> >>>>>> <[email protected]> wrote:
> >>>>>>> I'm sorry, but what is the point of this patch? With device assignment
> >>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
> >>>>>>> since you cannot remove the VFs if they are assigned to a VM.
> >>>>>> unload PF driver will not call pci_disable_sriov?
> >>>>> You cannot call pci_disable_sriov because you will panic all of the
> >>>>> guests that have devices assigned.
> >>>> ixgbe_remove did call pci_disable_sriov...
> >>>>
> >>>> for guest panic, that is another problem.
> >>>> just like you pci passthrough with real pci device and hotremove the
> >>>> card in host.
> >>>>
> >>>> ...
> >>>
> >>> I suggest you take another look. In ixgbe_disable_sriov, which is the
> >>> function that is called we do a check for assigned VFs. If they are
> >>> assigned then we do not call pci_disable_sriov.
> >>>
> >>>>
> >>>>> So how does your patch actually fix this problem? It seems like it is
> >>>>> just avoiding it.
> >>>> yes, until the first one is done.
> >>>
> >>> Avoiding the issue doesn't fix the underlying problem and instead you
> >>> are likely just introducing more bugs as a result.
> >>>
> >>>>> From what I can tell your problem is originating in pci_call_probe. I
> >>>>> believe it is calling work_on_cpu and that doesn't seem correct since
> >>>>> the work should be taking place on a CPU already local to the PF. You
> >>>>> might want to look there to see why you are trying to schedule work on a
> >>>>> CPU which should be perfectly fine for you to already be doing your work on.
> >>>> it always try to go with local cpu with same pxm.
> >>>
> >>> The problem is we really shouldn't be calling work_for_cpu in this case
> >>> since we are already on the correct CPU. What probably should be
> >>> happening is that pci_call_probe should be doing a check to see if the
> >>> current CPU is already contained within the device node map and if so
> >>> just call local_pci_probe directly. That way you can avoid deadlocking
> >>> the system by trying to flush the CPU queue of the CPU you are currently on.
> >>>
> >> That's the patch that Michael Tsirkin posted for a fix,
> >> but it was noted that if you have the case where the _same_ driver is used for vf & pf,
> >> other deadlocks may occur.
> >> It would work in the case of ixgbe/ixgbevf, but not for something like
> >> the Mellanox pf/vf driver (which is the same).
> >>
> >
> > I think our conclusion was this is a false positive for Mellanox.
> > If not, we need to understand what the deadlock is better.
> >
>
> As I understand the issue, the problem is not a deadlock for Mellanox
> (At least with either your patch or mine applied), the issue is that the
> PF is not ready to handle VFs when pci_enable_sriov is called due to
> some firmware issues.
>
> Thanks,
>
> Alex

I haven't seen the Mellanox guys say anything like this on the list.
Pointers?
All I saw was some lockdep warnings, and Tejun says they are bogus...


--
MST

2013-05-21 22:12:48

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/21/2013 03:09 PM, Don Dutile wrote:
> On 05/21/2013 05:58 PM, Alexander Duyck wrote:
>> On 05/21/2013 02:31 PM, Don Dutile wrote:
>>> On 05/21/2013 05:30 PM, Don Dutile wrote:
>>>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>>> <[email protected]> wrote:
>>>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> I'm sorry, but what is the point of this patch? With device
>>>>>>>>> assignment
>>>>>>>>> it is always possible to have VFs loaded and the PF driver
>>>>>>>>> unloaded
>>>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>>>> guests that have devices assigned.
>>>>>> ixgbe_remove did call pci_disable_sriov...
>>>>>>
>>>>>> for guest panic, that is another problem.
>>>>>> just like you pci passthrough with real pci device and hotremove the
>>>>>> card in host.
>>>>>>
>>>>>> ...
>>>>>
>>>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>>>> function that is called we do a check for assigned VFs. If they are
>>>>> assigned then we do not call pci_disable_sriov.
>>>>>
>>>>>>
>>>>>>> So how does your patch actually fix this problem? It seems like
>>>>>>> it is
>>>>>>> just avoiding it.
>>>>>> yes, until the first one is done.
>>>>>
>>>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>>>> are likely just introducing more bugs as a result.
>>>>>
>>>>>>> From what I can tell your problem is originating in
>>>>>>> pci_call_probe. I
>>>>>>> believe it is calling work_on_cpu and that doesn't seem correct
>>>>>>> since
>>>>>>> the work should be taking place on a CPU already local to the PF.
>>>>>>> You
>>>>>>> might want to look there to see why you are trying to schedule work
>>>>>>> on a
>>>>>>> CPU which should be perfectly fine for you to already be doing your
>>>>>>> work on.
>>>>>> it always try to go with local cpu with same pxm.
>>>>>
>>>>> The problem is we really shouldn't be calling work_for_cpu in this
>>>>> case
>>>>> since we are already on the correct CPU. What probably should be
>>>>> happening is that pci_call_probe should be doing a check to see if the
>>>>> current CPU is already contained within the device node map and if so
>>>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>>>> the system by trying to flush the CPU queue of the CPU you are
>>>>> currently on.
>>>>>
>>>> That's the patch that Michael Tsirkin posted for a fix,
>>>> but it was noted that if you have the case where the _same_ driver is
>>>> used for vf& pf,
>>>> other deadlocks may occur.
>>>> It would work in the case of ixgbe/ixgbevf, but not for something like
>>>> the Mellanox pf/vf driver (which is the same).
>>>>
>>> apologies; here's the thread the discussed the issue:
>>> https://patchwork.kernel.org/patch/2458681/
>>>
>>
>> I found out about that patch after I submitted one that was similar.
>> The only real complaint I had about his patch was that it was only
>> looking at the CPU and he could save himself some trouble by just doing
>> the work locally if we were on the correct NUMA node. For example if
>> the system only has one node in it what is the point in scheduling all
>> of the work on CPU 0? My alternative patch can be found at:
>> https://patchwork.kernel.org/patch/2568881/
>>
>> As far as the inter-driver locking issues for same driver I don't think
>> that is really any kind if issue. Most drivers shouldn't be holding any
>> big locks when they call pci_enable_sriov. If I am not mistaken the
>> follow on patch I submitted which was similar to Michaels was reported
>> to have resolved the issue.
>>
> You mean the above patchwork patch, or another one?

Well, I know the above patchwork patch resolves it, but I am assuming
Michael's would probably work as well, since it resolves the same
underlying issue.

>> As far as the Mellanox PF/VF the bigger issue is that when they call
>> pci_enable_sriov they are not ready to handle VFs. There have been
>> several suggestions on how to resolve it including -EPROBE_DEFER or the
>> igbvf/ixgbevf approach of just brining up the device in a "link down"
>> state.
>>
> thanks for summary. i was backlogged on email, and responding as i read
> them;
> I should have read through the whole thread before chiming in.
>

No problem. My main concern at this point is that we should probably
get either Michael's patch or mine pulled in, since the potential for
deadlock is still there.

Thanks,

Alex

2013-05-21 22:30:48

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/21/2013 03:11 PM, Michael S. Tsirkin wrote:
> On Tue, May 21, 2013 at 03:01:08PM -0700, Alexander Duyck wrote:
>> On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
>>> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
>>>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>>> <[email protected]> wrote:
>>>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> I'm sorry, but what is the point of this patch? With device assignment
>>>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>>>> guests that have devices assigned.
>>>>>> ixgbe_remove did call pci_disable_sriov...
>>>>>>
>>>>>> for guest panic, that is another problem.
>>>>>> just like you pci passthrough with real pci device and hotremove the
>>>>>> card in host.
>>>>>>
>>>>>> ...
>>>>>
>>>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>>>> function that is called we do a check for assigned VFs. If they are
>>>>> assigned then we do not call pci_disable_sriov.
>>>>>
>>>>>>
>>>>>>> So how does your patch actually fix this problem? It seems like it is
>>>>>>> just avoiding it.
>>>>>> yes, until the first one is done.
>>>>>
>>>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>>>> are likely just introducing more bugs as a result.
>>>>>
>>>>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>>>> might want to look there to see why you are trying to schedule work on a
>>>>>>> CPU which should be perfectly fine for you to already be doing your work on.
>>>>>> it always try to go with local cpu with same pxm.
>>>>>
>>>>> The problem is we really shouldn't be calling work_for_cpu in this case
>>>>> since we are already on the correct CPU. What probably should be
>>>>> happening is that pci_call_probe should be doing a check to see if the
>>>>> current CPU is already contained within the device node map and if so
>>>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>>>> the system by trying to flush the CPU queue of the CPU you are currently on.
>>>>>
>>>> That's the patch that Michael Tsirkin posted for a fix,
>>>> but it was noted that if you have the case where the _same_ driver is used for vf & pf,
>>>> other deadlocks may occur.
>>>> It would work in the case of ixgbe/ixgbevf, but not for something like
>>>> the Mellanox pf/vf driver (which is the same).
>>>>
>>>
>>> I think our conclusion was this is a false positive for Mellanox.
>>> If not, we need to understand what the deadlock is better.
>>>
>>
>> As I understand the issue, the problem is not a deadlock for Mellanox
>> (At least with either your patch or mine applied), the issue is that the
>> PF is not ready to handle VFs when pci_enable_sriov is called due to
>> some firmware issues.
>>
>> Thanks,
>>
>> Alex
>
> I haven't seen Mellanox guys say anything like this on the list.
> Pointers?
> All I saw is some lockdep warnings and Tejun says they are bogus ...

Actually the patch I submitted is at:
https://patchwork.kernel.org/patch/2568881/

It was in response to:
https://patchwork.kernel.org/patch/2562471/

Basically the patch I was responding to was supposed to address both the
lockdep issue and a problem with mlx4 not being able to support the VFs
when pci_enable_sriov is called. Yinghai had specifically called out
the work_on_cpu lockdep issue that you also submitted a patch for.

As per the feedback from Yinghai it seems like my patch does resolve the
lockdep issue that was seen. The other half of the issue was what we
have been discussing with Or in regards to delaying VF driver init via
something like -EPROBE_DEFER instead of trying to split up
pci_enable_sriov and VF probe.

Thanks,

Alex
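
For reference, a minimal sketch of the pci_call_probe() change discussed
above -- probe directly when already on a CPU local to the device's node,
and only use work_on_cpu() for the cross-node case. Names follow
drivers/pci/pci-driver.c, but this is an illustration of the idea, not
the final patch (cpu hotplug locking elided for brevity):

struct drv_dev_and_id {
	struct pci_driver *drv;
	struct pci_dev *dev;
	const struct pci_device_id *id;
};

static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
			  const struct pci_device_id *id)
{
	int error, node;
	struct drv_dev_and_id ddi = { drv, dev, id };

	/* local_pci_probe() runs drv->probe() in the current context */
	node = dev_to_node(&dev->dev);
	if (node >= 0 && node != numa_node_id()) {
		/* device is on another node: probe on a CPU near it */
		int cpu = cpumask_any_and(cpumask_of_node(node),
					  cpu_online_mask);
		if (cpu < nr_cpu_ids)
			error = work_on_cpu(cpu, local_pci_probe, &ddi);
		else
			error = local_pci_probe(&ddi);
	} else {
		/*
		 * Already on the right node (the nested VF-probe case):
		 * probe directly instead of queueing work on, and then
		 * flushing, the very CPU we are running on.
		 */
		error = local_pci_probe(&ddi);
	}
	return error;
}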

2013-05-22 20:16:59

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Wed, May 22, 2013 at 1:30 AM, Alexander Duyck
<[email protected]> wrote:
> On 05/21/2013 03:11 PM, Michael S. Tsirkin wrote:
>> On Tue, May 21, 2013 at 03:01:08PM -0700, Alexander Duyck wrote:
>>> On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
>>>> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
>>>>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>>>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>>>> <[email protected]> wrote:
>>>>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> I'm sorry, but what is the point of this patch? With device assignment
>>>>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>>>>> guests that have devices assigned.
>>>>>>> ixgbe_remove did call pci_disable_sriov...
>>>>>>>
>>>>>>> for guest panic, that is another problem.
>>>>>>> just like you pci passthrough with real pci device and hotremove the
>>>>>>> card in host.
>>>>>>>
>>>>>>> ...
>>>>>>
>>>>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>>>>> function that is called we do a check for assigned VFs. If they are
>>>>>> assigned then we do not call pci_disable_sriov.
>>>>>>
>>>>>>>
>>>>>>>> So how does your patch actually fix this problem? It seems like it is
>>>>>>>> just avoiding it.
>>>>>>> yes, until the first one is done.
>>>>>>
>>>>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>>>>> are likely just introducing more bugs as a result.
>>>>>>
>>>>>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>>>>> might want to look there to see why you are trying to schedule work on a
>>>>>>>> CPU which should be perfectly fine for you to already be doing your work on.
>>>>>>> it always try to go with local cpu with same pxm.
>>>>>>
>>>>>> The problem is we really shouldn't be calling work_for_cpu in this case
>>>>>> since we are already on the correct CPU. What probably should be
>>>>>> happening is that pci_call_probe should be doing a check to see if the
>>>>>> current CPU is already contained within the device node map and if so
>>>>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>>>>> the system by trying to flush the CPU queue of the CPU you are currently on.
>>>>>>
>>>>> That's the patch that Michael Tsirkin posted for a fix,
>>>>> but it was noted that if you have the case where the _same_ driver is used for vf & pf,
>>>>> other deadlocks may occur.
>>>>> It would work in the case of ixgbe/ixgbevf, but not for something like
>>>>> the Mellanox pf/vf driver (which is the same).
>>>>>
>>>>
>>>> I think our conclusion was this is a false positive for Mellanox.
>>>> If not, we need to understand what the deadlock is better.
>>>>
>>>
>>> As I understand the issue, the problem is not a deadlock for Mellanox
>>> (At least with either your patch or mine applied), the issue is that the
>>> PF is not ready to handle VFs when pci_enable_sriov is called due to
>>> some firmware issues.


>> I haven't seen Mellanox guys say anything like this on the list. Pointers?
>> All I saw is some lockdep warnings and Tejun says they are bogus ...
>
> Actually the patch I submitted is at:
> https://patchwork.kernel.org/patch/2568881/
>
> It was in response to:
> https://patchwork.kernel.org/patch/2562471/
>
> Basically the patch I was responding to was supposed to address both the
> lockdep issue and a problem with mlx4 not being able to support the VFs
> when pci_enable_sriov is called. Yinghai had specifically called out
> the work_on_cpu lockdep issue that you also submitted a patch for.
>
> As per the feedback from Yinghai it seems like my patch does resolve the
> lockdep issue that was seen. The other half of the issue was what we
> have been discussing with Or in regards to delaying VF driver init via
> something like -EPROBE_DEFER instead of trying to split up
> pci_enable_sriov and VF probe.


Hi Alex, all, so to clarify:

1. currently, due to a firmware limitation, we must call
pci_enable_sriov before the PF ends its initialization sequence done
in the PCI probe callback, hence

2. we can't move to the new sysfs API for enabling SRIOV

3. as of 3.9-rc1 we see these nested probes, bisected to commit
90888ac01d059e38ffe77a2291d44cafa9016fb "driver core: fix
possible missing of device probe". But we didn't reach consensus
with the author on whether this was possible before the commit, nor
on whether this is something that needs to be avoided; see
http://marc.info/?t=136249697200007&r=1&w=2

4. I am not sure if/how we can modify the PF code to support the case
where VFs are probed and start their initialization sequence before
the PF is done with its initialization

5. etc

all in all, we will look into returning -EPROBE_DEFER from the VFs
when they identify the problematic situation -- so for how long is
the probe deferred? Or, if it isn't time based, what is the logical
condition that, once met, causes the VF probe to be attempted again?


Or.
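
To make the ordering constraint in point 1 concrete, a purely
illustrative PF probe flow (every name except pci_enable_sriov is
hypothetical):

static int pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int num_vfs = 8;	/* placeholder VF count */
	int err;

	err = pf_early_hw_init(pdev);	/* hypothetical helper */
	if (err)
		return err;

	/*
	 * Firmware limitation: SR-IOV must be enabled before the PF
	 * finishes initializing.  VF probes start (nested) here...
	 */
	err = pci_enable_sriov(pdev, num_vfs);
	if (err)
		return err;

	/*
	 * ...but the VFs cannot complete their init/info-xchg until
	 * this has run, which is why they currently fail (or, with
	 * the proposed change, return -EPROBE_DEFER).
	 */
	return pf_finish_init(pdev);	/* hypothetical helper */
}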

2013-05-22 21:41:12

by Donald Dutile

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On 05/22/2013 04:16 PM, Or Gerlitz wrote:
> On Wed, May 22, 2013 at 1:30 AM, Alexander Duyck
> <[email protected]> wrote:
>> On 05/21/2013 03:11 PM, Michael S. Tsirkin wrote:
>>> On Tue, May 21, 2013 at 03:01:08PM -0700, Alexander Duyck wrote:
>>>> On 05/21/2013 02:49 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, May 21, 2013 at 05:30:32PM -0400, Don Dutile wrote:
>>>>>> On 05/14/2013 05:39 PM, Alexander Duyck wrote:
>>>>>>> On 05/14/2013 12:59 PM, Yinghai Lu wrote:
>>>>>>>> On Tue, May 14, 2013 at 12:45 PM, Alexander Duyck
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> On 05/14/2013 11:44 AM, Yinghai Lu wrote:
>>>>>>>>>> On Tue, May 14, 2013 at 9:00 AM, Alexander Duyck
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> I'm sorry, but what is the point of this patch? With device assignment
>>>>>>>>>>> it is always possible to have VFs loaded and the PF driver unloaded
>>>>>>>>>>> since you cannot remove the VFs if they are assigned to a VM.
>>>>>>>>>> unload PF driver will not call pci_disable_sriov?
>>>>>>>>> You cannot call pci_disable_sriov because you will panic all of the
>>>>>>>>> guests that have devices assigned.
>>>>>>>> ixgbe_remove did call pci_disable_sriov...
>>>>>>>>
>>>>>>>> for guest panic, that is another problem.
>>>>>>>> just like you pci passthrough with real pci device and hotremove the
>>>>>>>> card in host.
>>>>>>>>
>>>>>>>> ...
>>>>>>>
>>>>>>> I suggest you take another look. In ixgbe_disable_sriov, which is the
>>>>>>> function that is called we do a check for assigned VFs. If they are
>>>>>>> assigned then we do not call pci_disable_sriov.
>>>>>>>
>>>>>>>>
>>>>>>>>> So how does your patch actually fix this problem? It seems like it is
>>>>>>>>> just avoiding it.
>>>>>>>> yes, until the first one is done.
>>>>>>>
>>>>>>> Avoiding the issue doesn't fix the underlying problem and instead you
>>>>>>> are likely just introducing more bugs as a result.
>>>>>>>
>>>>>>>>> From what I can tell your problem is originating in pci_call_probe. I
>>>>>>>>> believe it is calling work_on_cpu and that doesn't seem correct since
>>>>>>>>> the work should be taking place on a CPU already local to the PF. You
>>>>>>>>> might want to look there to see why you are trying to schedule work on a
>>>>>>>>> CPU which should be perfectly fine for you to already be doing your work on.
>>>>>>>> it always try to go with local cpu with same pxm.
>>>>>>>
>>>>>>> The problem is we really shouldn't be calling work_for_cpu in this case
>>>>>>> since we are already on the correct CPU. What probably should be
>>>>>>> happening is that pci_call_probe should be doing a check to see if the
>>>>>>> current CPU is already contained within the device node map and if so
>>>>>>> just call local_pci_probe directly. That way you can avoid deadlocking
>>>>>>> the system by trying to flush the CPU queue of the CPU you are currently on.
>>>>>>>
>>>>>> That's the patch that Michael Tsirkin posted for a fix,
>>>>>> but it was noted that if you have the case where the _same_ driver is used for vf & pf,
>>>>>> other deadlocks may occur.
>>>>>> It would work in the case of ixgbe/ixgbevf, but not for something like
>>>>>> the Mellanox pf/vf driver (which is the same).
>>>>>>
>>>>>
>>>>> I think our conclusion was this is a false positive for Mellanox.
>>>>> If not, we need to understand what the deadlock is better.
>>>>>
>>>>
>>>> As I understand the issue, the problem is not a deadlock for Mellanox
>>>> (At least with either your patch or mine applied), the issue is that the
>>>> PF is not ready to handle VFs when pci_enable_sriov is called due to
>>>> some firmware issues.
>
>
>>> I haven't seen Mellanox guys say anything like this on the list. Pointers?
>>> All I saw is some lockdep warnings and Tejun says they are bogus ...
>>
>> Actually the patch I submitted is at:
>> https://patchwork.kernel.org/patch/2568881/
>>
>> It was in response to:
>> https://patchwork.kernel.org/patch/2562471/
>>
>> Basically the patch I was responding to was supposed to address both the
>> lockdep issue and a problem with mlx4 not being able to support the VFs
>> when pci_enable_sriov is called. Yinghai had specifically called out
>> the work_on_cpu lockdep issue that you also submitted a patch for.
>>
>> As per the feedback from Yinghai it seems like my patch does resolve the
>> lockdep issue that was seen. The other half of the issue was what we
>> have been discussing with Or in regards to delaying VF driver init via
>> something like -EPROBE_DEFER instead of trying to split up
>> pci_enable_sriov and VF probe.
>
>
> Hi Alex, all, so to clarify:
>
> 1. currently, due to a firmware limitation, we must call
> pci_enable_sriov before the PF ends its initialization sequence done
> in the PCI probe callback, hence
>
> 2. we can't move to the new sysfs API for enabling SRIOV
>
> 3. as of 3.9-rc1 we see these nested probes, bisected to commit
> 90888ac01d059e38ffe77a2291d44cafa9016fb "driver core: fix
> possible missing of device probe". But we didn't reach consensus
> with the author on whether this was possible before the commit, nor
> on whether this is something that needs to be avoided; see
> http://marc.info/?t=136249697200007&r=1&w=2
>
> 4. I am not sure if/how we can modify the PF code to support the case
> where VFs are probed and start their initialization sequence before
> the PF is done with its initialization
>
> 5. etc
>
> all in all, we will look into returning -EPROBE_DEFER from the VFs
> when they identify the problematic situation -- so for how long is
> the probe deferred? Or, if it isn't time based, what is the logical
> condition that, once met, causes the VF probe to be attempted again?
>
ah, sounds device specific.... i.e., it goes back to PF probe....

So, I'm assuming some sort of init/info-xchg is done btwn VF & PF
and has to be done to some level before the PF can continue its
pci-probe operation. In that case, have the VF & PF done sufficient
init/info-xchg on the 1st call that the PF can continue, and then
queue up a sriov-enable at the end of the PF probe?

>
> Or.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-05-22 23:45:40

by Ben Hutchings

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Wed, 2013-05-22 at 23:16 +0300, Or Gerlitz wrote:
[...]
> all in all, we will look into returning -EPROBE_DEFER from the VFs
> when they identify the problematic situation -- so for how long is
> the probe deferred? Or, if it isn't time based, what is the logical
> condition that, once met, causes the VF probe to be attempted again?

My reading is that it will be retried as soon as the PF probe returns.
But I've never tried using this feature myself.

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
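
The retry condition Ben describes lives in the driver core; a
simplified sketch of the drivers/base/dd.c behaviour (bookkeeping
elided, not the exact code):

static void driver_bound(struct device *dev)
{
	/* ... bind bookkeeping elided ... */

	/*
	 * Any successful bind re-triggers probing of everything on the
	 * deferred list, so a VF that returned -EPROBE_DEFER is retried
	 * once the PF's probe completes and the PF binds to its driver.
	 */
	driver_deferred_probe_del(dev);
	driver_deferred_probe_trigger();
}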

2013-05-23 06:32:46

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Thu, May 23, 2013 at 2:45 AM, Ben Hutchings
<[email protected]> wrote:
> On Wed, 2013-05-22 at 23:16 +0300, Or Gerlitz wrote:
> [...]
>> all in all, we will look into returning -EPROBE_DEFER from the VFs
>> when they identify the problematic situation -- so for how long is
>> the probe deferred? Or, if it isn't time based, what is the logical
>> condition that, once met, causes the VF probe to be attempted again?
>
> My reading is that it will be retried as soon as the PF probe returns.
> But I've never tried using this feature myself.

Yes, empirically this is what happens: the VF probe detects that the
PF isn't ready yet, and returns an error. A few seconds later the VF
is probed again and this time it works. Today we return -EIO, so we
need to change that to -EPROBE_DEFER to make this more robust.

Or.
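
In driver terms the change Or describes is small; a hypothetical VF
probe, where vf_pf_is_ready() and mlx4_vf_init() are stand-ins for
the device-specific code:

static int vf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* PF hasn't finished initializing: defer instead of failing */
	if (!vf_pf_is_ready(pdev))
		return -EPROBE_DEFER;	/* previously: return -EIO; */

	return mlx4_vf_init(pdev);	/* normal VF initialization */
}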

2013-05-23 06:43:23

by Or Gerlitz

[permalink] [raw]
Subject: Re: [PATCH 6/7] PCI: Make sure VF's driver get attached after PF's

On Thu, May 23, 2013 at 12:40 AM, Don Dutile <[email protected]> wrote:
> On 05/22/2013 04:16 PM, Or Gerlitz wrote:
[...]
>> Hi Alex, all, so to clarify:
>>
>> 1. currently, due to a firmware limitation, we must call
>> pci_enable_sriov before the PF ends its initialization sequence done
>> in the PCI probe callback, hence
>>
>> 2. we can't move to the new sysfs API for enabling SRIOV
>>
>> 3. as of 3.9-rc1 we see these nested probes, bisected to commit
>> 90888ac01d059e38ffe77a2291d44cafa9016fb "driver core: fix
>> possible missing of device probe". But we didn't reach consensus
>> with the author on whether this was possible before the commit, nor
>> on whether this is something that needs to be avoided; see
>> http://marc.info/?t=136249697200007&r=1&w=2
>>
>> 4. I am not sure if/how we can modify the PF code to support the case
>> where VFs are probed and start their initialization sequence before
>> the PF is done with its initialization
>>
>> 5. etc
>>
>> all in all, we will look into returning -EPROBE_DEFER from the VFs
>> when they identify the problematic situation -- so for how long is
>> the probe deferred? Or, if it isn't time based, what is the logical
>> condition that, once met, causes the VF probe to be attempted again?
>>
> ah, sounds device specific.... i.e., it goes back to PF probe....
>
> So, I'm assuming some sort of init/info-xchg is done btwn VF & PF
> and has to be done to some level before PF can continue it's pci-probe
> operation. In that case, has the VF & PF done sufficient init/info-xchg
> on 1st call, that the PF can continue, and then queue up a sriov-enable
> at the end of PF probe ?

not exactly (sorry if repeating myself) -- currently we must call
pci_enable_sriov before the PF ends the initialization sequence done
in its probe function. As soon as pci_enable_sriov is called, VFs
start to get probed, and they detect that the PF has not done its
initialization, hence the VF init/info-xchg can't be done. As to your
question, we can't allow the VFs to start their probing before the PF
did sriov-enable b/c of that very same limitation.

Summing up my recollection of the 2-3 related threads that were around lately:

1. when the VF see they can't probe, we will return -EPROBE_DEFER -
this will fix the mlx4 specific issue

2. once the limitation is removed, mlx4 will implement the sysfs
method so doing sriov-enable will be decoupled from PF probing

3. since nested probing is more general, still need to see if/which of
the Michael/Tejun/Alex patches needs to go in

makes sense?

Or.