2009-03-20 20:56:10

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 00/13] PCI core learns hotplug

Here is v5 of this series that addresses the review comments from v4.

I've pushed out a new test tree that contains the updated series for
testing. The tree consists of:

Linus's v2.6.28
+ Jesse's linux-next queue (as of 19 March 2009)
+ the 13 patches in this series

It does not contain the sysfs callback patch that is required for
interface torture tests. That can be had here:

http://thread.gmane.org/gmane.linux.kernel/806648

You can pull my tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/achiang/pci-hotplug.git test-20090319

This is also a nice chance to test the shiny new patchwork instance
that we have running on kernel.org that we hope to use for linux-pci
in the future:

http://patchwork.kernel.org/project/linux-pci/list/

I still haven't figured out the locking issue that Kenji-san pointed
out from last time, because I haven't been able to reproduce it.

Any further testing/review is welcomed and appreciated.

Thanks!

/ac

v4 -> v5:
- acquire pci_bus_sem when walking device list in pci_rescan_bus
- keep CONFIG_HOTPLUG (instead of CONFIG_HOTPLUG_PCI)
- comment why we need device_schedule_callback()
- remove CAP_SYS_ADMIN check
- do not remove primary bus in remove_callback()
- do not enable bridges multiple times
- checkpatch cleanups

v3 -> v4:
- protect sysfs interfaces with mutex
- undo changes in pci_do_scan_bus
- introduce pci_rescan_bus instead
- do not initialize bridges more than once

v2 -> v3:
- properly remove device with internal bridge
- added Kenji Kaneshige's pci_is_root_bus() interface
- dropped whitespace cleanups for another time

v1 -> v2:
- incorporated lots of Trent Piepho's work
- beefed up pci_do_scan_bus as heavy lifter for rescanning
- small bugfixes folded into earlier patches to get everything working

---

Alex Chiang (9):
PCI Hotplug: schedule fakephp for feature removal
PCI Hotplug: rename legacy_fakephp to fakephp
PCI: Introduce /sys/bus/pci/devices/.../rescan
PCI: Introduce /sys/bus/pci/devices/.../remove
PCI: Introduce /sys/bus/pci/rescan
PCI: Introduce pci_rescan_bus()
PCI: do not enable bridges more than once
PCI: do not initialize bridges more than once
PCI: always scan child buses

Kenji Kaneshige (1):
PCI: pci_is_root_bus helper

Trent Piepho (3):
PCI Hotplug: restore fakephp interface with complete reimplementation
PCI: pci_scan_slot() returns newly found devices
PCI: don't scan existing devices


Documentation/ABI/testing/sysfs-bus-pci | 27 ++
Documentation/feature-removal-schedule.txt | 33 ++
Documentation/filesystems/sysfs-pci.txt | 10 +
drivers/pci/bus.c | 6
drivers/pci/hotplug/fakephp.c | 443 +++++++---------------------
drivers/pci/pci-driver.c | 1
drivers/pci/pci-sysfs.c | 81 +++++
drivers/pci/pci.h | 6
drivers/pci/probe.c | 112 +++++--
drivers/pci/setup-bus.c | 3
include/linux/pci.h | 12 +
11 files changed, 356 insertions(+), 378 deletions(-)


2009-03-20 20:56:35

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 01/13] PCI: pci_is_root_bus helper

From: Kenji Kaneshige <[email protected]>

Introduce pci_is_root_bus helper function. This will help make code
more consistent, as well as prevent incorrect assumptions (such as
pci_bus->self == NULL on a root bus, which is not always true).

Signed-off-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

include/linux/pci.h | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1f6c5dd..72ccf15 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -341,6 +341,15 @@ struct pci_bus {
#define pci_bus_b(n) list_entry(n, struct pci_bus, node)
#define to_pci_bus(n) container_of(n, struct pci_bus, dev)

+/*
+ * Returns true if the pci bus is root (behind host-pci bridge),
+ * false otherwise
+ */
+static inline bool pci_is_root_bus(struct pci_bus *pbus)
+{
+ return !(pbus->parent);
+}
+
#ifdef CONFIG_PCI_MSI
static inline bool pci_dev_msi_enabled(struct pci_dev *pci_dev)
{

2009-03-20 20:56:56

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 02/13] PCI: don't scan existing devices

From: Trent Piepho <[email protected]>

pci_scan_single_device is supposed to add newly discovered
devices to pci_bus->devices, but doesn't check to see if the
device has already been added. This can cause problems if we ever
want to use this interface to rescan the PCI bus.

If the device is already added, just return it.

Signed-off-by: Trent Piepho <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/probe.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 579a56c..b89a86e 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1013,6 +1013,12 @@ struct pci_dev *__ref pci_scan_single_device(struct pci_bus *bus, int devfn)
{
struct pci_dev *dev;

+ dev = pci_get_slot(bus, devfn);
+ if (dev) {
+ pci_dev_put(dev);
+ return dev;
+ }
+
dev = pci_scan_device(bus, devfn);
if (!dev)
return NULL;

2009-03-20 20:57:21

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 03/13] PCI: pci_scan_slot() returns newly found devices

From: Trent Piepho <[email protected]>

pci_scan_slot() has been rewritten to be less complex and will now
return the number of *new* devices found.

Existing callers need not worry because they already assume that
they can't call pci_scan_slot() on an already-scanned slot.

Thus, there is no semantic change for existing callers: returning
newly found devices (this patch) is exactly equal to returning all
found devices (before this patch).

This patch adds some more groundwork to allow us to rescan the
PCI bus during runtime to discover newly added devices.

[[email protected]: fix checkpatch errors]
Signed-off-by: Trent Piepho <[email protected]>
Reviewed-by: Alex Chiang <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/probe.c | 40 ++++++++++++++++------------------------
1 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b89a86e..f261741 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1037,35 +1037,27 @@ EXPORT_SYMBOL(pci_scan_single_device);
* Scan a PCI slot on the specified PCI bus for devices, adding
* discovered devices to the @bus->devices list. New devices
* will not have is_added set.
+ *
+ * Returns the number of new devices found.
*/
int pci_scan_slot(struct pci_bus *bus, int devfn)
{
- int func, nr = 0;
- int scan_all_fns;
-
- scan_all_fns = pcibios_scan_all_fns(bus, devfn);
-
- for (func = 0; func < 8; func++, devfn++) {
- struct pci_dev *dev;
-
- dev = pci_scan_single_device(bus, devfn);
- if (dev) {
- nr++;
+ int fn, nr = 0;
+ struct pci_dev *dev;

- /*
- * If this is a single function device,
- * don't scan past the first function.
- */
- if (!dev->multifunction) {
- if (func > 0) {
- dev->multifunction = 1;
- } else {
- break;
- }
+ dev = pci_scan_single_device(bus, devfn);
+ if (dev && !dev->is_added) /* new device? */
+ nr++;
+
+ if ((dev && dev->multifunction) ||
+ (!dev && pcibios_scan_all_fns(bus, devfn))) {
+ for (fn = 1; fn < 8; fn++) {
+ dev = pci_scan_single_device(bus, devfn + fn);
+ if (dev) {
+ if (!dev->is_added)
+ nr++;
+ dev->multifunction = 1;
}
- } else {
- if (func == 0 && !scan_all_fns)
- break;
}
}

2009-03-20 20:57:54

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 04/13] PCI: always scan child buses

While scanning bridges, we stop our scan if we encounter a bus
that we've seen before, to work around some buggy chipsets. This
is a good idea, but prevents us from fully scanning the PCI bus
at a future time (to find newly hot-added devices, for example).

Change the logic so that we skip _re-adding_ an existing bus
that we've seen before, but also allow the scan to descend to
all child buses.

Now that we're potentially scanning our child buses again, we
also need to be sure not to attempt re-initializing their BARs
so we avoid that.

This patch lays the groundwork to allow the user to issue a
rescan of the PCI bus at any time.

Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/probe.c | 34 ++++++++++++++++++++--------------
1 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index f261741..b75ed4b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -511,21 +511,21 @@ int __devinit pci_scan_bridge(struct pci_bus *bus, struct pci_dev *dev, int max,

/*
* If we already got to this bus through a different bridge,
- * ignore it. This can happen with the i450NX chipset.
+ * don't re-add it. This can happen with the i450NX chipset.
+ *
+ * However, we continue to descend down the hierarchy and
+ * scan remaining child buses.
*/
- if (pci_find_bus(pci_domain_nr(bus), busnr)) {
- dev_info(&dev->dev, "bus %04x:%02x already known\n",
- pci_domain_nr(bus), busnr);
- goto out;
+ child = pci_find_bus(pci_domain_nr(bus), busnr);
+ if (!child) {
+ child = pci_add_new_bus(bus, dev, busnr);
+ if (!child)
+ goto out;
+ child->primary = buses & 0xFF;
+ child->subordinate = (buses >> 16) & 0xFF;
+ child->bridge_ctl = bctl;
}

- child = pci_add_new_bus(bus, dev, busnr);
- if (!child)
- goto out;
- child->primary = buses & 0xFF;
- child->subordinate = (buses >> 16) & 0xFF;
- child->bridge_ctl = bctl;
-
cmax = pci_scan_child_bus(child);
if (cmax > max)
max = cmax;
@@ -1083,8 +1083,14 @@ unsigned int __devinit pci_scan_child_bus(struct pci_bus *bus)
* After performing arch-dependent fixup of the bus, look behind
* all PCI-to-PCI bridges on this bus.
*/
- pr_debug("PCI: Fixups for bus %04x:%02x\n", pci_domain_nr(bus), bus->number);
- pcibios_fixup_bus(bus);
+ if (!bus->is_added) {
+ pr_debug("PCI: Fixups for bus %04x:%02x\n",
+ pci_domain_nr(bus), bus->number);
+ pcibios_fixup_bus(bus);
+ if (pci_is_root_bus(bus))
+ bus->is_added = 1;
+ }
+
for (pass=0; pass < 2; pass++)
list_for_each_entry(dev, &bus->devices, bus_list) {
if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE ||

2009-03-20 20:57:37

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 05/13] PCI: do not initialize bridges more than once

In preparation for PCI core hotplug, we need to ensure that we do
not attempt to re-initialize bridges that have already been initialized.

We only need to worry about non-root buses, since we will not allow
root bus removal.

Reported-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/setup-bus.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 170a3ed..334285a 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -144,6 +144,9 @@ static void pci_setup_bridge(struct pci_bus *bus)
struct pci_bus_region region;
u32 l, bu, lu, io_upper16;

+ if (!pci_is_root_bus(bus) && bus->is_added)
+ return;
+
dev_info(&bridge->dev, "PCI bridge, secondary bus %04x:%02x\n",
pci_domain_nr(bus), bus->number);

2009-03-20 20:58:24

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 06/13] PCI: do not enable bridges more than once

In preparation for PCI core hotplug, we need to ensure that we do
not attempt to re-enable bridges that have already been enabled.

Reported-by: Kenji Kaneshige <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/bus.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 118c777..68f91a2 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -184,8 +184,10 @@ void pci_enable_bridges(struct pci_bus *bus)

list_for_each_entry(dev, &bus->devices, bus_list) {
if (dev->subordinate) {
- retval = pci_enable_device(dev);
- pci_set_master(dev);
+ if (atomic_read(&dev->enable_cnt) == 0) {
+ retval = pci_enable_device(dev);
+ pci_set_master(dev);
+ }
pci_enable_bridges(dev->subordinate);
}
}

2009-03-20 20:58:39

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 07/13] PCI: Introduce pci_rescan_bus()

This API is used by the PCI core to rescan a bus and rediscover
newly added devices.

Over time, it is expected that the various PCI hotplug drivers
will migrate to this interface and away from the old
pci_do_scan_bus() interface.

Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/hotplug/fakephp.c | 6 +++---
drivers/pci/probe.c | 32 ++++++++++++++++++++++++++++++++
include/linux/pci.h | 3 +++
3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/fakephp.c b/drivers/pci/hotplug/fakephp.c
index d8649e1..1606374 100644
--- a/drivers/pci/hotplug/fakephp.c
+++ b/drivers/pci/hotplug/fakephp.c
@@ -245,12 +245,12 @@ static int pci_rescan_slot(struct pci_dev *temp)


/**
- * pci_rescan_bus - Rescan PCI bus
+ * pci_rescan_bus_local - fakephp version of rescan PCI bus
* @bus: the PCI bus to rescan
*
* Call pci_rescan_slot for each possible function of the bus.
*/
-static void pci_rescan_bus(const struct pci_bus *bus)
+static void pci_rescan_bus_local(const struct pci_bus *bus)
{
unsigned int devfn;
struct pci_dev *dev;
@@ -291,7 +291,7 @@ static void pci_rescan_buses(const struct list_head *list)
const struct list_head *l;
list_for_each(l,list) {
const struct pci_bus *b = pci_bus_b(l);
- pci_rescan_bus(b);
+ pci_rescan_bus_local(b);
pci_rescan_buses(&b->children);
}
}
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b75ed4b..53edb7e 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1203,6 +1203,38 @@ struct pci_bus * __devinit pci_scan_bus_parented(struct device *parent,
EXPORT_SYMBOL(pci_scan_bus_parented);

#ifdef CONFIG_HOTPLUG
+/**
+ * pci_rescan_bus - scan a PCI bus for devices.
+ * @bus: PCI bus to scan
+ *
+ * Scan a PCI bus and child buses for new devices, adds them,
+ * and enables them.
+ *
+ * Returns the max number of subordinate bus discovered.
+ */
+unsigned int __devinit pci_rescan_bus(struct pci_bus *bus)
+{
+ unsigned int max;
+ struct pci_dev *dev;
+
+ max = pci_scan_child_bus(bus);
+
+ up_read(&pci_bus_sem);
+ list_for_each_entry(dev, &bus->devices, bus_list)
+ if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE ||
+ dev->hdr_type == PCI_HEADER_TYPE_CARDBUS)
+ if (dev->subordinate)
+ pci_bus_size_bridges(dev->subordinate);
+ down_read(&pci_bus_sem);
+
+ pci_bus_assign_resources(bus);
+ pci_enable_bridges(bus);
+ pci_bus_add_devices(bus);
+
+ return max;
+}
+EXPORT_SYMBOL_GPL(pci_rescan_bus);
+
EXPORT_SYMBOL(pci_add_new_bus);
EXPORT_SYMBOL(pci_scan_slot);
EXPORT_SYMBOL(pci_scan_bridge);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 72ccf15..39e3029 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -710,6 +710,9 @@ int pci_back_from_sleep(struct pci_dev *dev);

/* Functions for PCI Hotplug drivers to use */
int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);
+#ifdef CONFIG_HOTPLUG
+unsigned int pci_rescan_bus(struct pci_bus *bus);
+#endif

/* Vital product data routines */
ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);

2009-03-20 20:58:58

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 08/13] PCI: Introduce /sys/bus/pci/rescan

This interface allows the user to force a rescan of all PCI buses
in system, and rediscover devices that have been removed earlier.

pci_bus_attrs implementation from Trent Piepho.

Thanks to Vegard Nossum for discovering locking issues with the
sysfs interface.

Cc: Trent Piepho <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

Documentation/ABI/testing/sysfs-bus-pci | 9 +++++++++
drivers/pci/pci-driver.c | 1 +
drivers/pci/pci-sysfs.c | 26 ++++++++++++++++++++++++++
drivers/pci/pci.h | 6 ++++++
drivers/pci/probe.c | 4 ++--
5 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 3d29793..1697a16 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -57,6 +57,15 @@ Description:
match the driver to the device. For example:
# echo "8086 10f5" > /sys/bus/pci/drivers/foo/remove_id

+What: /sys/bus/pci/rescan
+Date: January 2009
+Contact: Linux PCI developers <[email protected]>
+Description:
+ Writing a non-zero value to this attribute will
+ force a rescan of all PCI buses in the system, and
+ re-discover previously removed devices.
+ Depends on CONFIG_HOTPLUG.
+
What: /sys/bus/pci/devices/.../vpd
Date: February 2008
Contact: Ben Hutchings <[email protected]>
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 87a5ddb..95d1985 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -1049,6 +1049,7 @@ struct bus_type pci_bus_type = {
.remove = pci_device_remove,
.shutdown = pci_device_shutdown,
.dev_attrs = pci_dev_attrs,
+ .bus_attrs = pci_bus_attrs,
.pm = PCI_PM_OPS_PTR,
};

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index ec7a175..be7468a 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -219,6 +219,32 @@ msi_bus_store(struct device *dev, struct device_attribute *attr,
return count;
}

+#ifdef CONFIG_HOTPLUG
+static DEFINE_MUTEX(pci_remove_rescan_mutex);
+static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
+ size_t count)
+{
+ unsigned long val;
+ struct pci_bus *b = NULL;
+
+ if (strict_strtoul(buf, 0, &val) < 0)
+ return -EINVAL;
+
+ if (val) {
+ mutex_lock(&pci_remove_rescan_mutex);
+ while ((b = pci_find_next_bus(b)) != NULL)
+ pci_rescan_bus(b);
+ mutex_unlock(&pci_remove_rescan_mutex);
+ }
+ return count;
+}
+
+struct bus_attribute pci_bus_attrs[] = {
+ __ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, bus_rescan_store),
+ __ATTR_NULL
+};
+#endif
+
struct device_attribute pci_dev_attrs[] = {
__ATTR_RO(resource),
__ATTR_RO(vendor),
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 2cd1cba..5b2eb30 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -136,6 +136,12 @@ extern int pcie_mch_quirk;
extern struct device_attribute pci_dev_attrs[];
extern struct device_attribute dev_attr_cpuaffinity;
extern struct device_attribute dev_attr_cpulistaffinity;
+#ifdef CONFIG_HOTPLUG
+extern struct bus_attribute pci_bus_attrs[];
+#else
+#define pci_bus_attrs NULL
+#endif
+

/**
* pci_match_one_device - Tell if a PCI device structure has a matching
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 53edb7e..1541899 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1219,13 +1219,13 @@ unsigned int __devinit pci_rescan_bus(struct pci_bus *bus)

max = pci_scan_child_bus(bus);

- up_read(&pci_bus_sem);
+ down_read(&pci_bus_sem);
list_for_each_entry(dev, &bus->devices, bus_list)
if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE ||
dev->hdr_type == PCI_HEADER_TYPE_CARDBUS)
if (dev->subordinate)
pci_bus_size_bridges(dev->subordinate);
- down_read(&pci_bus_sem);
+ up_read(&pci_bus_sem);

pci_bus_assign_resources(bus);
pci_enable_bridges(bus);

2009-03-20 20:59:28

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

This patch adds an attribute named "remove" to a PCI device's sysfs
directory. Writing a non-zero value to this attribute will remove the PCI
device and any children of it.

Trent Piepho wrote the original implementation and documentation.

Thanks to Vegard Nossum for testing under kmemcheck and finding locking
issues with the sysfs interface.

Cc: Trent Piepho <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

Documentation/ABI/testing/sysfs-bus-pci | 8 +++++++
Documentation/filesystems/sysfs-pci.txt | 10 +++++++++
drivers/pci/pci-sysfs.c | 36 +++++++++++++++++++++++++++++++
3 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 1697a16..1350fa6 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -66,6 +66,14 @@ Description:
re-discover previously removed devices.
Depends on CONFIG_HOTPLUG.

+What: /sys/bus/pci/devices/.../remove
+Date: January 2009
+Contact: Linux PCI developers <[email protected]>
+Description:
+ Writing a non-zero value to this attribute will
+ hot-remove the PCI device and any of its children.
+ Depends on CONFIG_HOTPLUG.
+
What: /sys/bus/pci/devices/.../vpd
Date: February 2008
Contact: Ben Hutchings <[email protected]>
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
index 9f8740c..26e4b8b 100644
--- a/Documentation/filesystems/sysfs-pci.txt
+++ b/Documentation/filesystems/sysfs-pci.txt
@@ -12,6 +12,7 @@ that support it. For example, a given bus might look like this:
| |-- enable
| |-- irq
| |-- local_cpus
+ | |-- remove
| |-- resource
| |-- resource0
| |-- resource1
@@ -36,6 +37,7 @@ files, each with their own function.
enable Whether the device is enabled (ascii, rw)
irq IRQ number (ascii, ro)
local_cpus nearby CPU mask (cpumask, ro)
+ remove remove device from kernel's list (ascii, wo)
resource PCI resource host addresses (ascii, ro)
resource0..N PCI resource N, if present (binary, mmap)
resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
@@ -46,6 +48,7 @@ files, each with their own function.

ro - read only file
rw - file is readable and writable
+ wo - write only file
mmap - file is mmapable
ascii - file contains ascii text
binary - file contains binary data
@@ -73,6 +76,13 @@ that the device must be enabled for a rom read to return data succesfully.
In the event a driver is not bound to the device, it can be enabled using the
'enable' file, documented above.

+The 'remove' file is used to remove the PCI device, by writing a non-zero
+integer to the file. This does not involve any kind of hot-plug functionality,
+e.g. powering off the device. The device is removed from the kernel's list of
+PCI devices, the sysfs directory for it is removed, and the device will be
+removed from any drivers attached to it. Removal of PCI root buses is
+disallowed.
+
Accessing legacy resources through sysfs
----------------------------------------

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index be7468a..e16990e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -243,6 +243,39 @@ struct bus_attribute pci_bus_attrs[] = {
__ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, bus_rescan_store),
__ATTR_NULL
};
+
+static void remove_callback(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ mutex_lock(&pci_remove_rescan_mutex);
+ pci_remove_bus_device(pdev);
+ mutex_unlock(&pci_remove_rescan_mutex);
+}
+
+static ssize_t
+remove_store(struct device *dev, struct device_attribute *dummy,
+ const char *buf, size_t count)
+{
+ int ret = 0;
+ unsigned long val;
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (strict_strtoul(buf, 0, &val) < 0)
+ return -EINVAL;
+
+ if (pci_is_root_bus(pdev->bus))
+ return -EBUSY;
+
+ /* An attribute cannot be unregistered by one of its own methods,
+ * so we have to use this roundabout approach.
+ */
+ if (val)
+ ret = device_schedule_callback(dev, remove_callback);
+ if (ret)
+ count = ret;
+ return count;
+}
#endif

struct device_attribute pci_dev_attrs[] = {
@@ -263,6 +296,9 @@ struct device_attribute pci_dev_attrs[] = {
__ATTR(broken_parity_status,(S_IRUGO|S_IWUSR),
broken_parity_status_show,broken_parity_status_store),
__ATTR(msi_bus, 0644, msi_bus_show, msi_bus_store),
+#ifdef CONFIG_HOTPLUG
+ __ATTR(remove, (S_IWUSR|S_IWGRP), NULL, remove_store),
+#endif
__ATTR_NULL,
};

2009-03-20 20:59:44

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 10/13] PCI: Introduce /sys/bus/pci/devices/.../rescan

This interface allows the user to force a rescan of the device's
parent bus and all subordinate buses, and rediscover devices removed
earlier from this part of the device tree.

Cc: Trent Piepho <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

Documentation/ABI/testing/sysfs-bus-pci | 10 ++++++++++
drivers/pci/pci-sysfs.c | 19 +++++++++++++++++++
2 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 1350fa6..ba0f0e7 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -74,6 +74,16 @@ Description:
hot-remove the PCI device and any of its children.
Depends on CONFIG_HOTPLUG.

+What: /sys/bus/pci/devices/.../rescan
+Date: January 2009
+Contact: Linux PCI developers <[email protected]>
+Description:
+ Writing a non-zero value to this attribute will
+ force a rescan of the device's parent bus and all
+ child buses, and re-discover devices removed earlier
+ from this part of the device tree.
+ Depends on CONFIG_HOTPLUG.
+
What: /sys/bus/pci/devices/.../vpd
Date: February 2008
Contact: Ben Hutchings <[email protected]>
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index e16990e..e9a8706 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -244,6 +244,24 @@ struct bus_attribute pci_bus_attrs[] = {
__ATTR_NULL
};

+static ssize_t
+dev_rescan_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ unsigned long val;
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (strict_strtoul(buf, 0, &val) < 0)
+ return -EINVAL;
+
+ if (val) {
+ mutex_lock(&pci_remove_rescan_mutex);
+ pci_rescan_bus(pdev->bus);
+ mutex_unlock(&pci_remove_rescan_mutex);
+ }
+ return count;
+}
+
static void remove_callback(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
@@ -298,6 +316,7 @@ struct device_attribute pci_dev_attrs[] = {
__ATTR(msi_bus, 0644, msi_bus_show, msi_bus_store),
#ifdef CONFIG_HOTPLUG
__ATTR(remove, (S_IWUSR|S_IWGRP), NULL, remove_store),
+ __ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, dev_rescan_store),
#endif
__ATTR_NULL,
};

2009-03-20 21:00:06

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 11/13] PCI Hotplug: restore fakephp interface with complete reimplementation

From: Trent Piepho <[email protected]>

A complete re-implementation of fakephp is necessary if it is to
present its former interface (pre-2.6.27, when it broke). The
reason is that PCI hotplug drivers call pci_hp_register(), which
enforces the rule that only one /sys/bus/pci/slots/ file may be
created per physical slot.

The change breaks the old fakephp's assumption that it could
create a file per function. So we re-implement fakephp to avoid
using the standard PCI hotplug API so that we can restore the old
fakephp user interface.

It puts entries in /sys/bus/pci/slots with the names of all PCI
devices/functions, exactly symmetrical to what is shown in
/sys/bus/pci/devices. Each slots/ entry has a "power" attribute,
which works the same way as the fakephp driver's power attribute
has worked.

There are a few improvements over old fakephp, which couldn't handle
PCI devices being added or removed via a means outside of
fakephp's knowledge. If a device was added another way, old fakephp
didn't notice and didn't create the fake slot for it. If a
device was removed another way, old fakephp didn't delete the fake
slot for it (and accessing the stale slot caused an oops).

The new implementation overcomes these limitations. As a
consequence, removing a bridge with other devices behind it now
works as well, which is something else old fakephp couldn't do
previously.

This duplicates a tiny bit of the code in the PCI core that does
this same function. Re-using that code ends up being more
complex than duplicating it, and it makes code in the PCI core
more ugly just to support this legacy fakephp interface
compatibility layer.

[[email protected]: rewrite changelog]
[[email protected]: use pci_rescan_bus()]
Reviewed-by: James Cameron <[email protected]>
Signed-off-by: Trent Piepho <[email protected]>
Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/hotplug/Makefile | 2
drivers/pci/hotplug/fakephp.c | 395 ----------------------------------
drivers/pci/hotplug/legacy_fakephp.c | 162 ++++++++++++++
3 files changed, 163 insertions(+), 396 deletions(-)
delete mode 100644 drivers/pci/hotplug/fakephp.c
create mode 100644 drivers/pci/hotplug/legacy_fakephp.c

diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
index 2aa117c..3314fe8 100644
--- a/drivers/pci/hotplug/Makefile
+++ b/drivers/pci/hotplug/Makefile
@@ -20,7 +20,7 @@ obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR) += rpadlpar_io.o
obj-$(CONFIG_HOTPLUG_PCI_SGI) += sgi_hotplug.o

# Link this last so it doesn't claim devices that have a real hotplug driver
-obj-$(CONFIG_HOTPLUG_PCI_FAKE) += fakephp.o
+obj-$(CONFIG_HOTPLUG_PCI_FAKE) += legacy_fakephp.o

pci_hotplug-objs := pci_hotplug_core.o

diff --git a/drivers/pci/hotplug/fakephp.c b/drivers/pci/hotplug/fakephp.c
deleted file mode 100644
index 1606374..0000000
--- a/drivers/pci/hotplug/fakephp.c
+++ /dev/null
@@ -1,395 +0,0 @@
-/*
- * Fake PCI Hot Plug Controller Driver
- *
- * Copyright (C) 2003 Greg Kroah-Hartman <[email protected]>
- * Copyright (C) 2003 IBM Corp.
- * Copyright (C) 2003 Rolf Eike Beer <[email protected]>
- *
- * Based on ideas and code from:
- * Vladimir Kondratiev <[email protected]>
- * Rolf Eike Beer <[email protected]>
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation, version 2 of the License.
- *
- * Send feedback to <[email protected]>
- */
-
-/*
- *
- * This driver will "emulate" removing PCI devices from the system. If
- * the "power" file is written to with "0" then the specified PCI device
- * will be completely removed from the kernel.
- *
- * WARNING, this does NOT turn off the power to the PCI device. This is
- * a "logical" removal, not a physical or electrical removal.
- *
- * Use this module at your own risk, you have been warned!
- *
- * Enabling PCI devices is left as an exercise for the reader...
- *
- */
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/pci.h>
-#include <linux/pci_hotplug.h>
-#include <linux/init.h>
-#include <linux/string.h>
-#include <linux/slab.h>
-#include <linux/workqueue.h>
-#include "../pci.h"
-
-#if !defined(MODULE)
- #define MY_NAME "fakephp"
-#else
- #define MY_NAME THIS_MODULE->name
-#endif
-
-#define dbg(format, arg...) \
- do { \
- if (debug) \
- printk(KERN_DEBUG "%s: " format, \
- MY_NAME , ## arg); \
- } while (0)
-#define err(format, arg...) printk(KERN_ERR "%s: " format, MY_NAME , ## arg)
-#define info(format, arg...) printk(KERN_INFO "%s: " format, MY_NAME , ## arg)
-
-#define DRIVER_AUTHOR "Greg Kroah-Hartman <[email protected]>"
-#define DRIVER_DESC "Fake PCI Hot Plug Controller Driver"
-
-struct dummy_slot {
- struct list_head node;
- struct hotplug_slot *slot;
- struct pci_dev *dev;
- struct work_struct remove_work;
- unsigned long removed;
-};
-
-static int debug;
-static int dup_slots;
-static LIST_HEAD(slot_list);
-static struct workqueue_struct *dummyphp_wq;
-
-static void pci_rescan_worker(struct work_struct *work);
-static DECLARE_WORK(pci_rescan_work, pci_rescan_worker);
-
-static int enable_slot (struct hotplug_slot *slot);
-static int disable_slot (struct hotplug_slot *slot);
-
-static struct hotplug_slot_ops dummy_hotplug_slot_ops = {
- .owner = THIS_MODULE,
- .enable_slot = enable_slot,
- .disable_slot = disable_slot,
-};
-
-static void dummy_release(struct hotplug_slot *slot)
-{
- struct dummy_slot *dslot = slot->private;
-
- list_del(&dslot->node);
- kfree(dslot->slot->info);
- kfree(dslot->slot);
- pci_dev_put(dslot->dev);
- kfree(dslot);
-}
-
-#define SLOT_NAME_SIZE 8
-
-static int add_slot(struct pci_dev *dev)
-{
- struct dummy_slot *dslot;
- struct hotplug_slot *slot;
- char name[SLOT_NAME_SIZE];
- int retval = -ENOMEM;
- static int count = 1;
-
- slot = kzalloc(sizeof(struct hotplug_slot), GFP_KERNEL);
- if (!slot)
- goto error;
-
- slot->info = kzalloc(sizeof(struct hotplug_slot_info), GFP_KERNEL);
- if (!slot->info)
- goto error_slot;
-
- slot->info->power_status = 1;
- slot->info->max_bus_speed = PCI_SPEED_UNKNOWN;
- slot->info->cur_bus_speed = PCI_SPEED_UNKNOWN;
-
- dslot = kzalloc(sizeof(struct dummy_slot), GFP_KERNEL);
- if (!dslot)
- goto error_info;
-
- if (dup_slots)
- snprintf(name, SLOT_NAME_SIZE, "fake");
- else
- snprintf(name, SLOT_NAME_SIZE, "fake%d", count++);
- dbg("slot->name = %s\n", name);
- slot->ops = &dummy_hotplug_slot_ops;
- slot->release = &dummy_release;
- slot->private = dslot;
-
- retval = pci_hp_register(slot, dev->bus, PCI_SLOT(dev->devfn), name);
- if (retval) {
- err("pci_hp_register failed with error %d\n", retval);
- goto error_dslot;
- }
-
- dbg("slot->name = %s\n", hotplug_slot_name(slot));
- dslot->slot = slot;
- dslot->dev = pci_dev_get(dev);
- list_add (&dslot->node, &slot_list);
- return retval;
-
-error_dslot:
- kfree(dslot);
-error_info:
- kfree(slot->info);
-error_slot:
- kfree(slot);
-error:
- return retval;
-}
-
-static int __init pci_scan_buses(void)
-{
- struct pci_dev *dev = NULL;
- int lastslot = 0;
-
- while ((dev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, dev)) != NULL) {
- if (PCI_FUNC(dev->devfn) > 0 &&
- lastslot == PCI_SLOT(dev->devfn))
- continue;
- lastslot = PCI_SLOT(dev->devfn);
- add_slot(dev);
- }
-
- return 0;
-}
-
-static void remove_slot(struct dummy_slot *dslot)
-{
- int retval;
-
- dbg("removing slot %s\n", hotplug_slot_name(dslot->slot));
- retval = pci_hp_deregister(dslot->slot);
- if (retval)
- err("Problem unregistering a slot %s\n",
- hotplug_slot_name(dslot->slot));
-}
-
-/* called from the single-threaded workqueue handler to remove a slot */
-static void remove_slot_worker(struct work_struct *work)
-{
- struct dummy_slot *dslot =
- container_of(work, struct dummy_slot, remove_work);
- remove_slot(dslot);
-}
-
-/**
- * pci_rescan_slot - Rescan slot
- * @temp: Device template. Should be set: bus and devfn.
- *
- * Tries hard not to re-enable already existing devices;
- * also handles scanning of subfunctions.
- */
-static int pci_rescan_slot(struct pci_dev *temp)
-{
- struct pci_bus *bus = temp->bus;
- struct pci_dev *dev;
- int func;
- u8 hdr_type;
- int count = 0;
-
- if (!pci_read_config_byte(temp, PCI_HEADER_TYPE, &hdr_type)) {
- temp->hdr_type = hdr_type & 0x7f;
- if ((dev = pci_get_slot(bus, temp->devfn)) != NULL)
- pci_dev_put(dev);
- else {
- dev = pci_scan_single_device(bus, temp->devfn);
- if (dev) {
- dbg("New device on %s function %x:%x\n",
- bus->name, temp->devfn >> 3,
- temp->devfn & 7);
- count++;
- }
- }
- /* multifunction device? */
- if (!(hdr_type & 0x80))
- return count;
-
- /* continue scanning for other functions */
- for (func = 1, temp->devfn++; func < 8; func++, temp->devfn++) {
- if (pci_read_config_byte(temp, PCI_HEADER_TYPE, &hdr_type))
- continue;
- temp->hdr_type = hdr_type & 0x7f;
-
- if ((dev = pci_get_slot(bus, temp->devfn)) != NULL)
- pci_dev_put(dev);
- else {
- dev = pci_scan_single_device(bus, temp->devfn);
- if (dev) {
- dbg("New device on %s function %x:%x\n",
- bus->name, temp->devfn >> 3,
- temp->devfn & 7);
- count++;
- }
- }
- }
- }
-
- return count;
-}
-
-
-/**
- * pci_rescan_bus_local - fakephp version of rescan PCI bus
- * @bus: the PCI bus to rescan
- *
- * Call pci_rescan_slot for each possible function of the bus.
- */
-static void pci_rescan_bus_local(const struct pci_bus *bus)
-{
- unsigned int devfn;
- struct pci_dev *dev;
- int retval;
- int found = 0;
- dev = alloc_pci_dev();
- if (!dev)
- return;
-
- dev->bus = (struct pci_bus*)bus;
- dev->sysdata = bus->sysdata;
- for (devfn = 0; devfn < 0x100; devfn += 8) {
- dev->devfn = devfn;
- found += pci_rescan_slot(dev);
- }
-
- if (found) {
- pci_bus_assign_resources(bus);
- list_for_each_entry(dev, &bus->devices, bus_list) {
- /* Skip already-added devices */
- if (dev->is_added)
- continue;
- retval = pci_bus_add_device(dev);
- if (retval)
- dev_err(&dev->dev,
- "Error adding device, continuing\n");
- else
- add_slot(dev);
- }
- pci_bus_add_devices(bus);
- }
- kfree(dev);
-}
-
-/* recursively scan all buses */
-static void pci_rescan_buses(const struct list_head *list)
-{
- const struct list_head *l;
- list_for_each(l,list) {
- const struct pci_bus *b = pci_bus_b(l);
- pci_rescan_bus_local(b);
- pci_rescan_buses(&b->children);
- }
-}
-
-/* initiate rescan of all pci buses */
-static inline void pci_rescan(void) {
- pci_rescan_buses(&pci_root_buses);
-}
-
-/* called from the single-threaded workqueue handler to rescan all pci buses */
-static void pci_rescan_worker(struct work_struct *work)
-{
- pci_rescan();
-}
-
-static int enable_slot(struct hotplug_slot *hotplug_slot)
-{
- /* mis-use enable_slot for rescanning of the pci bus */
- cancel_work_sync(&pci_rescan_work);
- queue_work(dummyphp_wq, &pci_rescan_work);
- return 0;
-}
-
-static int disable_slot(struct hotplug_slot *slot)
-{
- struct dummy_slot *dslot;
- struct pci_dev *dev;
- int func;
-
- if (!slot)
- return -ENODEV;
- dslot = slot->private;
-
- dbg("%s - physical_slot = %s\n", __func__, hotplug_slot_name(slot));
-
- for (func = 7; func >= 0; func--) {
- dev = pci_get_slot(dslot->dev->bus, dslot->dev->devfn + func);
- if (!dev)
- continue;
-
- if (test_and_set_bit(0, &dslot->removed)) {
- dbg("Slot already scheduled for removal\n");
- pci_dev_put(dev);
- return -ENODEV;
- }
-
- /* remove the device from the pci core */
- pci_remove_bus_device(dev);
-
- /* queue work item to blow away this sysfs entry and other
- * parts.
- */
- INIT_WORK(&dslot->remove_work, remove_slot_worker);
- queue_work(dummyphp_wq, &dslot->remove_work);
-
- pci_dev_put(dev);
- }
- return 0;
-}
-
-static void cleanup_slots (void)
-{
- struct list_head *tmp;
- struct list_head *next;
- struct dummy_slot *dslot;
-
- destroy_workqueue(dummyphp_wq);
- list_for_each_safe (tmp, next, &slot_list) {
- dslot = list_entry (tmp, struct dummy_slot, node);
- remove_slot(dslot);
- }
-
-}
-
-static int __init dummyphp_init(void)
-{
- info(DRIVER_DESC "\n");
-
- dummyphp_wq = create_singlethread_workqueue(MY_NAME);
- if (!dummyphp_wq)
- return -ENOMEM;
-
- return pci_scan_buses();
-}
-
-
-static void __exit dummyphp_exit(void)
-{
- cleanup_slots();
-}
-
-module_init(dummyphp_init);
-module_exit(dummyphp_exit);
-
-MODULE_AUTHOR(DRIVER_AUTHOR);
-MODULE_DESCRIPTION(DRIVER_DESC);
-MODULE_LICENSE("GPL");
-module_param(debug, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(debug, "Debugging mode enabled or not");
-module_param(dup_slots, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(dup_slots, "Force duplicate slot names for debugging");
diff --git a/drivers/pci/hotplug/legacy_fakephp.c b/drivers/pci/hotplug/legacy_fakephp.c
new file mode 100644
index 0000000..2dc7828
--- /dev/null
+++ b/drivers/pci/hotplug/legacy_fakephp.c
@@ -0,0 +1,162 @@
+/* Works like the fakephp driver used to, except a little better.
+ *
+ * - It's possible to remove devices with subordinate busses.
+ * - New PCI devices that appear via any method, not just a fakephp triggered
+ * rescan, will be noticed.
+ * - Devices that are removed via any method, not just a fakephp triggered
+ * removal, will also be noticed.
+ *
+ * Uses nothing from the pci-hotplug subsystem.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include "../pci.h"
+
+struct legacy_slot {
+ struct kobject kobj;
+ struct pci_dev *dev;
+ struct list_head list;
+};
+
+static LIST_HEAD(legacy_list);
+
+static ssize_t legacy_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
+ strcpy(buf, "1\n");
+ return 2;
+}
+
+static void remove_callback(void *data)
+{
+ pci_remove_bus_device((struct pci_dev *)data);
+}
+
+static ssize_t legacy_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t len)
+{
+ struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
+ unsigned long val;
+
+ if (strict_strtoul(buf, 0, &val) < 0)
+ return -EINVAL;
+
+ if (val)
+ pci_rescan_bus(slot->dev->bus);
+ else
+ sysfs_schedule_callback(&slot->dev->dev.kobj, remove_callback,
+ slot->dev, THIS_MODULE);
+ return len;
+}
+
+static struct attribute *legacy_attrs[] = {
+ &(struct attribute){ .name = "power", .mode = 0644 },
+ NULL,
+};
+
+static void legacy_release(struct kobject *kobj)
+{
+ struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
+
+ pci_dev_put(slot->dev);
+ kfree(slot);
+}
+
+static struct kobj_type legacy_ktype = {
+ .sysfs_ops = &(struct sysfs_ops){
+ .store = legacy_store, .show = legacy_show
+ },
+ .release = &legacy_release,
+ .default_attrs = legacy_attrs,
+};
+
+static int legacy_add_slot(struct pci_dev *pdev)
+{
+ struct legacy_slot *slot = kzalloc(sizeof(*slot), GFP_KERNEL);
+
+ if (!slot)
+ return -ENOMEM;
+
+ if (kobject_init_and_add(&slot->kobj, &legacy_ktype,
+ &pci_slots_kset->kobj, "%s",
+ pdev->dev.bus_id)) {
+ dev_warn(&pdev->dev, "Failed to created legacy fake slot\n");
+ return -EINVAL;
+ }
+ slot->dev = pci_dev_get(pdev);
+
+ list_add(&slot->list, &legacy_list);
+
+ return 0;
+}
+
+static int legacy_notify(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct pci_dev *pdev = to_pci_dev(data);
+
+ if (action == BUS_NOTIFY_ADD_DEVICE) {
+ legacy_add_slot(pdev);
+ } else if (action == BUS_NOTIFY_DEL_DEVICE) {
+ struct legacy_slot *slot;
+
+ list_for_each_entry(slot, &legacy_list, list)
+ if (slot->dev == pdev)
+ goto found;
+
+ dev_warn(&pdev->dev, "Missing legacy fake slot?");
+ return -ENODEV;
+found:
+ kobject_del(&slot->kobj);
+ list_del(&slot->list);
+ kobject_put(&slot->kobj);
+ }
+
+ return 0;
+}
+
+static struct notifier_block legacy_notifier = {
+ .notifier_call = legacy_notify
+};
+
+static int __init init_legacy(void)
+{
+ struct pci_dev *pdev = NULL;
+
+ /* Add existing devices */
+ while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)))
+ legacy_add_slot(pdev);
+
+ /* Be alerted of any new ones */
+ bus_register_notifier(&pci_bus_type, &legacy_notifier);
+ return 0;
+}
+module_init(init_legacy);
+
+static void __exit remove_legacy(void)
+{
+ struct legacy_slot *slot, *tmp;
+
+ bus_unregister_notifier(&pci_bus_type, &legacy_notifier);
+
+ list_for_each_entry_safe(slot, tmp, &legacy_list, list) {
+ list_del(&slot->list);
+ kobject_del(&slot->kobj);
+ kobject_put(&slot->kobj);
+ }
+}
+module_exit(remove_legacy);
+
+
+MODULE_AUTHOR("Trent Piepho <[email protected]>");
+MODULE_DESCRIPTION("Legacy version of the fakephp interface");
+MODULE_LICENSE("GPL");

2009-03-20 21:00:41

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 12/13] PCI Hotplug: rename legacy_fakephp to fakephp

We wanted to replace fakephp wholesale, so rename legacy_fakephp back
to fakephp. Yes, this is a silly commit, but it produces a much easier
patch to read and review.

Signed-off-by: Alex Chiang <[email protected]>
---

drivers/pci/hotplug/Makefile | 2
drivers/pci/hotplug/fakephp.c | 162 ++++++++++++++++++++++++++++++++++
drivers/pci/hotplug/legacy_fakephp.c | 162 ----------------------------------
3 files changed, 163 insertions(+), 163 deletions(-)
create mode 100644 drivers/pci/hotplug/fakephp.c
delete mode 100644 drivers/pci/hotplug/legacy_fakephp.c

diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile
index 3314fe8..2aa117c 100644
--- a/drivers/pci/hotplug/Makefile
+++ b/drivers/pci/hotplug/Makefile
@@ -20,7 +20,7 @@ obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR) += rpadlpar_io.o
obj-$(CONFIG_HOTPLUG_PCI_SGI) += sgi_hotplug.o

# Link this last so it doesn't claim devices that have a real hotplug driver
-obj-$(CONFIG_HOTPLUG_PCI_FAKE) += legacy_fakephp.o
+obj-$(CONFIG_HOTPLUG_PCI_FAKE) += fakephp.o

pci_hotplug-objs := pci_hotplug_core.o

diff --git a/drivers/pci/hotplug/fakephp.c b/drivers/pci/hotplug/fakephp.c
new file mode 100644
index 0000000..2dc7828
--- /dev/null
+++ b/drivers/pci/hotplug/fakephp.c
@@ -0,0 +1,162 @@
+/* Works like the fakephp driver used to, except a little better.
+ *
+ * - It's possible to remove devices with subordinate busses.
+ * - New PCI devices that appear via any method, not just a fakephp triggered
+ * rescan, will be noticed.
+ * - Devices that are removed via any method, not just a fakephp triggered
+ * removal, will also be noticed.
+ *
+ * Uses nothing from the pci-hotplug subsystem.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/init.h>
+#include <linux/pci.h>
+#include "../pci.h"
+
+struct legacy_slot {
+ struct kobject kobj;
+ struct pci_dev *dev;
+ struct list_head list;
+};
+
+static LIST_HEAD(legacy_list);
+
+static ssize_t legacy_show(struct kobject *kobj, struct attribute *attr,
+ char *buf)
+{
+ struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
+ strcpy(buf, "1\n");
+ return 2;
+}
+
+static void remove_callback(void *data)
+{
+ pci_remove_bus_device((struct pci_dev *)data);
+}
+
+static ssize_t legacy_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t len)
+{
+ struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
+ unsigned long val;
+
+ if (strict_strtoul(buf, 0, &val) < 0)
+ return -EINVAL;
+
+ if (val)
+ pci_rescan_bus(slot->dev->bus);
+ else
+ sysfs_schedule_callback(&slot->dev->dev.kobj, remove_callback,
+ slot->dev, THIS_MODULE);
+ return len;
+}
+
+static struct attribute *legacy_attrs[] = {
+ &(struct attribute){ .name = "power", .mode = 0644 },
+ NULL,
+};
+
+static void legacy_release(struct kobject *kobj)
+{
+ struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
+
+ pci_dev_put(slot->dev);
+ kfree(slot);
+}
+
+static struct kobj_type legacy_ktype = {
+ .sysfs_ops = &(struct sysfs_ops){
+ .store = legacy_store, .show = legacy_show
+ },
+ .release = &legacy_release,
+ .default_attrs = legacy_attrs,
+};
+
+static int legacy_add_slot(struct pci_dev *pdev)
+{
+ struct legacy_slot *slot = kzalloc(sizeof(*slot), GFP_KERNEL);
+
+ if (!slot)
+ return -ENOMEM;
+
+ if (kobject_init_and_add(&slot->kobj, &legacy_ktype,
+ &pci_slots_kset->kobj, "%s",
+ pdev->dev.bus_id)) {
+ dev_warn(&pdev->dev, "Failed to created legacy fake slot\n");
+ return -EINVAL;
+ }
+ slot->dev = pci_dev_get(pdev);
+
+ list_add(&slot->list, &legacy_list);
+
+ return 0;
+}
+
+static int legacy_notify(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct pci_dev *pdev = to_pci_dev(data);
+
+ if (action == BUS_NOTIFY_ADD_DEVICE) {
+ legacy_add_slot(pdev);
+ } else if (action == BUS_NOTIFY_DEL_DEVICE) {
+ struct legacy_slot *slot;
+
+ list_for_each_entry(slot, &legacy_list, list)
+ if (slot->dev == pdev)
+ goto found;
+
+ dev_warn(&pdev->dev, "Missing legacy fake slot?");
+ return -ENODEV;
+found:
+ kobject_del(&slot->kobj);
+ list_del(&slot->list);
+ kobject_put(&slot->kobj);
+ }
+
+ return 0;
+}
+
+static struct notifier_block legacy_notifier = {
+ .notifier_call = legacy_notify
+};
+
+static int __init init_legacy(void)
+{
+ struct pci_dev *pdev = NULL;
+
+ /* Add existing devices */
+ while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)))
+ legacy_add_slot(pdev);
+
+ /* Be alerted of any new ones */
+ bus_register_notifier(&pci_bus_type, &legacy_notifier);
+ return 0;
+}
+module_init(init_legacy);
+
+static void __exit remove_legacy(void)
+{
+ struct legacy_slot *slot, *tmp;
+
+ bus_unregister_notifier(&pci_bus_type, &legacy_notifier);
+
+ list_for_each_entry_safe(slot, tmp, &legacy_list, list) {
+ list_del(&slot->list);
+ kobject_del(&slot->kobj);
+ kobject_put(&slot->kobj);
+ }
+}
+module_exit(remove_legacy);
+
+
+MODULE_AUTHOR("Trent Piepho <[email protected]>");
+MODULE_DESCRIPTION("Legacy version of the fakephp interface");
+MODULE_LICENSE("GPL");
diff --git a/drivers/pci/hotplug/legacy_fakephp.c b/drivers/pci/hotplug/legacy_fakephp.c
deleted file mode 100644
index 2dc7828..0000000
--- a/drivers/pci/hotplug/legacy_fakephp.c
+++ /dev/null
@@ -1,162 +0,0 @@
-/* Works like the fakephp driver used to, except a little better.
- *
- * - It's possible to remove devices with subordinate busses.
- * - New PCI devices that appear via any method, not just a fakephp triggered
- * rescan, will be noticed.
- * - Devices that are removed via any method, not just a fakephp triggered
- * removal, will also be noticed.
- *
- * Uses nothing from the pci-hotplug subsystem.
- *
- */
-
-#include <linux/module.h>
-#include <linux/kernel.h>
-#include <linux/types.h>
-#include <linux/list.h>
-#include <linux/kobject.h>
-#include <linux/sysfs.h>
-#include <linux/init.h>
-#include <linux/pci.h>
-#include "../pci.h"
-
-struct legacy_slot {
- struct kobject kobj;
- struct pci_dev *dev;
- struct list_head list;
-};
-
-static LIST_HEAD(legacy_list);
-
-static ssize_t legacy_show(struct kobject *kobj, struct attribute *attr,
- char *buf)
-{
- struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
- strcpy(buf, "1\n");
- return 2;
-}
-
-static void remove_callback(void *data)
-{
- pci_remove_bus_device((struct pci_dev *)data);
-}
-
-static ssize_t legacy_store(struct kobject *kobj, struct attribute *attr,
- const char *buf, size_t len)
-{
- struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
- unsigned long val;
-
- if (strict_strtoul(buf, 0, &val) < 0)
- return -EINVAL;
-
- if (val)
- pci_rescan_bus(slot->dev->bus);
- else
- sysfs_schedule_callback(&slot->dev->dev.kobj, remove_callback,
- slot->dev, THIS_MODULE);
- return len;
-}
-
-static struct attribute *legacy_attrs[] = {
- &(struct attribute){ .name = "power", .mode = 0644 },
- NULL,
-};
-
-static void legacy_release(struct kobject *kobj)
-{
- struct legacy_slot *slot = container_of(kobj, typeof(*slot), kobj);
-
- pci_dev_put(slot->dev);
- kfree(slot);
-}
-
-static struct kobj_type legacy_ktype = {
- .sysfs_ops = &(struct sysfs_ops){
- .store = legacy_store, .show = legacy_show
- },
- .release = &legacy_release,
- .default_attrs = legacy_attrs,
-};
-
-static int legacy_add_slot(struct pci_dev *pdev)
-{
- struct legacy_slot *slot = kzalloc(sizeof(*slot), GFP_KERNEL);
-
- if (!slot)
- return -ENOMEM;
-
- if (kobject_init_and_add(&slot->kobj, &legacy_ktype,
- &pci_slots_kset->kobj, "%s",
- pdev->dev.bus_id)) {
- dev_warn(&pdev->dev, "Failed to created legacy fake slot\n");
- return -EINVAL;
- }
- slot->dev = pci_dev_get(pdev);
-
- list_add(&slot->list, &legacy_list);
-
- return 0;
-}
-
-static int legacy_notify(struct notifier_block *nb,
- unsigned long action, void *data)
-{
- struct pci_dev *pdev = to_pci_dev(data);
-
- if (action == BUS_NOTIFY_ADD_DEVICE) {
- legacy_add_slot(pdev);
- } else if (action == BUS_NOTIFY_DEL_DEVICE) {
- struct legacy_slot *slot;
-
- list_for_each_entry(slot, &legacy_list, list)
- if (slot->dev == pdev)
- goto found;
-
- dev_warn(&pdev->dev, "Missing legacy fake slot?");
- return -ENODEV;
-found:
- kobject_del(&slot->kobj);
- list_del(&slot->list);
- kobject_put(&slot->kobj);
- }
-
- return 0;
-}
-
-static struct notifier_block legacy_notifier = {
- .notifier_call = legacy_notify
-};
-
-static int __init init_legacy(void)
-{
- struct pci_dev *pdev = NULL;
-
- /* Add existing devices */
- while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)))
- legacy_add_slot(pdev);
-
- /* Be alerted of any new ones */
- bus_register_notifier(&pci_bus_type, &legacy_notifier);
- return 0;
-}
-module_init(init_legacy);
-
-static void __exit remove_legacy(void)
-{
- struct legacy_slot *slot, *tmp;
-
- bus_unregister_notifier(&pci_bus_type, &legacy_notifier);
-
- list_for_each_entry_safe(slot, tmp, &legacy_list, list) {
- list_del(&slot->list);
- kobject_del(&slot->kobj);
- kobject_put(&slot->kobj);
- }
-}
-module_exit(remove_legacy);
-
-
-MODULE_AUTHOR("Trent Piepho <[email protected]>");
-MODULE_DESCRIPTION("Legacy version of the fakephp interface");
-MODULE_LICENSE("GPL");

2009-03-20 21:00:55

by Alex Chiang

[permalink] [raw]
Subject: [PATCH v5 13/13] PCI Hotplug: schedule fakephp for feature removal

Now that the PCI core is capable of function-level remove and rescan
as well as bus-level rescan, there's no functional need to keep fakephp
anymore.

We keep it around for userspace compatibility reasons, schedule removal
in three years.

Signed-off-by: Alex Chiang <[email protected]>
---

Documentation/feature-removal-schedule.txt | 33 ++++++++++++++++++++++++++++
1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 20d3b94..8851eea 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -335,6 +335,7 @@ Why: In 2.6.18 the Secmark concept was introduced to replace the "compat_net"
Secmark, it is time to deprecate the older mechanism and start the
process of removing the old code.
Who: Paul Moore <[email protected]>
+
---------------------------

What: sysfs ui for changing p4-clockmod parameters
@@ -344,3 +345,35 @@ Why: See commits 129f8ae9b1b5be94517da76009ea956e89104ce8 and
Removal is subject to fixing any remaining bugs in ACPI which may
cause the thermal throttling not to happen at the right time.
Who: Dave Jones <[email protected]>, Matthew Garrett <[email protected]>
+
+---------------------------
+
+What: fakephp and associated sysfs files in /sys/bus/pci/slots/
+When: 2011
+Why: In 2.6.27, the semantics of /sys/bus/pci/slots was redefined to
+ represent a machine's physical PCI slots. The change in semantics
+ had userspace implications, as the hotplug core no longer allowed
+ drivers to create multiple sysfs files per physical slot (required
+ for multi-function devices, e.g.). fakephp was seen as a developer's
+ tool only, and its interface changed. Too late, we learned that
+ there were some users of the fakephp interface.
+
+ In 2.6.30, the original fakephp interface was restored. At the same
+ time, the PCI core gained the ability that fakephp provided, namely
+ function-level hot-remove and hot-add.
+
+ Since the PCI core now provides the same functionality, exposed in:
+
+ /sys/bus/pci/rescan
+ /sys/bus/pci/devices/.../remove
+ /sys/bus/pci/devices/.../rescan
+
+ there is no functional reason to maintain fakephp as well.
+
+ We will keep the existing module so that 'modprobe fakephp' will
+ present the old /sys/bus/pci/slots/... interface for compatibility,
+ but users are urged to migrate their applications to the API above.
+
+ After a reasonable transition period, we will remove the legacy
+ fakephp interface.
+Who: Alex Chiang <[email protected]>

2009-03-20 22:00:29

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH v5 01/13] PCI: pci_is_root_bus helper

On Fri, 20 Mar 2009 14:55:55 -0600
Alex Chiang <[email protected]> wrote:

> From: Kenji Kaneshige <[email protected]>
>
> Introduce pci_is_root_bus helper function. This will help make code
> more consistent, as well as prevent incorrect assumptions (such as
> pci_bus->self == NULL on a root bus, which is not always true).
>
> Signed-off-by: Kenji Kaneshige <[email protected]>
> Signed-off-by: Alex Chiang <[email protected]>
> ---

Applied this series, thanks everyone!

--
Jesse Barnes, Intel Open Source Technology Center

2009-03-23 09:01:25

by Kenji Kaneshige

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

Alex Chiang wrote:
> This patch adds an attribute named "remove" to a PCI device's sysfs
> directory. Writing a non-zero value to this attribute will remove the PCI
> device and any children of it.
>
> Trent Piepho wrote the original implementation and documentation.
>
> Thanks to Vegard Nossum for testing under kmemcheck and finding locking
> issues with the sysfs interface.
>
> Cc: Trent Piepho <[email protected]>
> Signed-off-by: Alex Chiang <[email protected]>
> ---
>
> Documentation/ABI/testing/sysfs-bus-pci | 8 +++++++
> Documentation/filesystems/sysfs-pci.txt | 10 +++++++++
> drivers/pci/pci-sysfs.c | 36 +++++++++++++++++++++++++++++++
> 3 files changed, 54 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 1697a16..1350fa6 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -66,6 +66,14 @@ Description:
> re-discover previously removed devices.
> Depends on CONFIG_HOTPLUG.
>
> +What: /sys/bus/pci/devices/.../remove
> +Date: January 2009
> +Contact: Linux PCI developers <[email protected]>
> +Description:
> + Writing a non-zero value to this attribute will
> + hot-remove the PCI device and any of its children.
> + Depends on CONFIG_HOTPLUG.
> +
> What: /sys/bus/pci/devices/.../vpd
> Date: February 2008
> Contact: Ben Hutchings <[email protected]>
> diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt
> index 9f8740c..26e4b8b 100644
> --- a/Documentation/filesystems/sysfs-pci.txt
> +++ b/Documentation/filesystems/sysfs-pci.txt
> @@ -12,6 +12,7 @@ that support it. For example, a given bus might look like this:
> | |-- enable
> | |-- irq
> | |-- local_cpus
> + | |-- remove
> | |-- resource
> | |-- resource0
> | |-- resource1
> @@ -36,6 +37,7 @@ files, each with their own function.
> enable Whether the device is enabled (ascii, rw)
> irq IRQ number (ascii, ro)
> local_cpus nearby CPU mask (cpumask, ro)
> + remove remove device from kernel's list (ascii, wo)
> resource PCI resource host addresses (ascii, ro)
> resource0..N PCI resource N, if present (binary, mmap)
> resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap)
> @@ -46,6 +48,7 @@ files, each with their own function.
>
> ro - read only file
> rw - file is readable and writable
> + wo - write only file
> mmap - file is mmapable
> ascii - file contains ascii text
> binary - file contains binary data
> @@ -73,6 +76,13 @@ that the device must be enabled for a rom read to return data succesfully.
> In the event a driver is not bound to the device, it can be enabled using the
> 'enable' file, documented above.
>
> +The 'remove' file is used to remove the PCI device, by writing a non-zero
> +integer to the file. This does not involve any kind of hot-plug functionality,
> +e.g. powering off the device. The device is removed from the kernel's list of
> +PCI devices, the sysfs directory for it is removed, and the device will be
> +removed from any drivers attached to it. Removal of PCI root buses is
> +disallowed.
> +
> Accessing legacy resources through sysfs
> ----------------------------------------
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index be7468a..e16990e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -243,6 +243,39 @@ struct bus_attribute pci_bus_attrs[] = {
> __ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, bus_rescan_store),
> __ATTR_NULL
> };
> +
> +static void remove_callback(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + mutex_lock(&pci_remove_rescan_mutex);
> + pci_remove_bus_device(pdev);
> + mutex_unlock(&pci_remove_rescan_mutex);
> +}
> +
> +static ssize_t
> +remove_store(struct device *dev, struct device_attribute *dummy,
> + const char *buf, size_t count)
> +{
> + int ret = 0;
> + unsigned long val;
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + if (strict_strtoul(buf, 0, &val) < 0)
> + return -EINVAL;
> +
> + if (pci_is_root_bus(pdev->bus))
> + return -EBUSY;
> +
> + /* An attribute cannot be unregistered by one of its own methods,
> + * so we have to use this roundabout approach.
> + */
> + if (val)
> + ret = device_schedule_callback(dev, remove_callback);
> + if (ret)
> + count = ret;
> + return count;
> +}
> #endif
>

I still have the following kernel error messages in testing with your
latest set of patches (Jesse's linux-next). The test case is removing
e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
.../remove".

[ 537.379995] =============================================
[ 537.380124] [ INFO: possible recursive locking detected ]
[ 537.380128] 2.6.29-rc8-kk #1
[ 537.380128] ---------------------------------------------
[ 537.380128] events/4/56 is trying to acquire lock:
[ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
[ 537.380128]
[ 537.380128] but task is already holding lock:
[ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
[ 537.380128]
[ 537.380128] other info that might help us debug this:
[ 537.380128] 3 locks held by events/4/56:
[ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
[ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
[ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
[ 537.380128]
[ 537.380128] stack backtrace:
[ 537.380128] Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
[ 537.380128] Call Trace:
[ 537.380128] [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
[ 537.380128] [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
[ 537.380128] [<ffffffff8026f148>] lock_acquire+0x58/0x80
[ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
[ 537.380128] [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
[ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
[ 537.383380] [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
[ 537.383380] [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
[ 537.383380] [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
[ 537.383380] [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
[ 537.383380] [<ffffffff80441da9>] __device_release_driver+0x59/0x90
[ 537.383380] [<ffffffff80441edb>] device_release_driver+0x2b/0x40
[ 537.383380] [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
[ 537.384382] [<ffffffff8043e46b>] device_del+0x12b/0x190
[ 537.384382] [<ffffffff8043e4f6>] device_unregister+0x26/0x70
[ 537.384382] [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
[ 537.384382] [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
[ 537.384382] [<ffffffff803c10d9>] remove_callback+0x29/0x40
[ 537.384382] [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
[ 537.384382] [<ffffffff8025769a>] run_workqueue+0x15a/0x230
[ 537.384382] [<ffffffff80257648>] ? run_workqueue+0x108/0x230
[ 537.384382] [<ffffffff8025846f>] worker_thread+0x9f/0x100
[ 537.384382] [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
[ 537.384382] [<ffffffff802583d0>] ? worker_thread+0x0/0x100
[ 537.384382] [<ffffffff8025b89d>] kthread+0x4d/0x80
[ 537.384382] [<ffffffff8020d4ba>] child_rip+0xa/0x20
[ 537.386380] [<ffffffff8020cebc>] ? restore_args+0x0/0x30
[ 537.386380] [<ffffffff8025b850>] ? kthread+0x0/0x80
[ 537.386380] [<ffffffff8020d4b0>] ? child_rip+0x0/0x20

I think the cause of this error message is flush_workqueue() from the
work of keventd. When removing device using "/sys/bus/pci/devices/.../
remove", pci_remove_bus_device() is executed by the keventd's work
through device_schedule_callback(), and it invokes e1000e's remove
callback. And then, e1000e's remove callback invokes flush_workqueue().
Actually, the kernel error messages are not displayed when I changed
e1000e driver to not call flush_workqueue(). In my understanding, flush_workqueue() from the work must be avoided because it can cause
a deadlock. Please note that this is not a problem of e1000e driver.
Drivers can use flush_workqueue(), of course.

BTW, I also have another worry about executing pci_remove_bus_device()
by the work of keventd. The pci_remove_bus_device() will take a long
time especially when the bridge device near the root bus is specified.
The long delay of keventd's work will have bad effects to other works
on the workqueue.

Thanks,
Kenji Kaneshige



> struct device_attribute pci_dev_attrs[] = {
> @@ -263,6 +296,9 @@ struct device_attribute pci_dev_attrs[] = {
> __ATTR(broken_parity_status,(S_IRUGO|S_IWUSR),
> broken_parity_status_show,broken_parity_status_store),
> __ATTR(msi_bus, 0644, msi_bus_show, msi_bus_store),
> +#ifdef CONFIG_HOTPLUG
> + __ATTR(remove, (S_IWUSR|S_IWGRP), NULL, remove_store),
> +#endif
> __ATTR_NULL,
> };
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

2009-03-24 03:23:21

by Alex Chiang

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

Hi Ingo,

* Kenji Kaneshige <[email protected]>:
> Alex Chiang wrote:
>> This patch adds an attribute named "remove" to a PCI device's sysfs
>> directory. Writing a non-zero value to this attribute will remove the PCI
>> device and any children of it.
>>
>> Trent Piepho wrote the original implementation and documentation.
>>
>> Thanks to Vegard Nossum for testing under kmemcheck and finding locking
>> issues with the sysfs interface.
>>
>> Cc: Trent Piepho <[email protected]>
>> Signed-off-by: Alex Chiang <[email protected]>

[snip part of patch]

>> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>> index be7468a..e16990e 100644
>> --- a/drivers/pci/pci-sysfs.c
>> +++ b/drivers/pci/pci-sysfs.c
>> @@ -243,6 +243,39 @@ struct bus_attribute pci_bus_attrs[] = {
>> __ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, bus_rescan_store),
>> __ATTR_NULL
>> };
>> +
>> +static void remove_callback(struct device *dev)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> + mutex_lock(&pci_remove_rescan_mutex);
>> + pci_remove_bus_device(pdev);
>> + mutex_unlock(&pci_remove_rescan_mutex);
>> +}
>> +
>> +static ssize_t
>> +remove_store(struct device *dev, struct device_attribute *dummy,
>> + const char *buf, size_t count)
>> +{
>> + int ret = 0;
>> + unsigned long val;
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> + if (strict_strtoul(buf, 0, &val) < 0)
>> + return -EINVAL;
>> +
>> + if (pci_is_root_bus(pdev->bus))
>> + return -EBUSY;
>> +
>> + /* An attribute cannot be unregistered by one of its own methods,
>> + * so we have to use this roundabout approach.
>> + */
>> + if (val)
>> + ret = device_schedule_callback(dev, remove_callback);
>> + if (ret)
>> + count = ret;
>> + return count;
>> +}
>> #endif
>>

Kenji Kaneshige reported the below lockdep problem when testing
my patch on one of his machines.

> I still have the following kernel error messages in testing with your
> latest set of patches (Jesse's linux-next). The test case is removing
> e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
> .../remove".
>
> [ 537.379995] =============================================
> [ 537.380124] [ INFO: possible recursive locking detected ]
> [ 537.380128] 2.6.29-rc8-kk #1
> [ 537.380128] ---------------------------------------------
> [ 537.380128] events/4/56 is trying to acquire lock:
> [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> [ 537.380128]
> [ 537.380128] but task is already holding lock:
> [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> [ 537.380128]
> [ 537.380128] other info that might help us debug this:
> [ 537.380128] 3 locks held by events/4/56:
> [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
> [ 537.380128]
> [ 537.380128] stack backtrace:
> [ 537.380128] Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
> [ 537.380128] Call Trace:
> [ 537.380128] [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
> [ 537.380128] [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
> [ 537.380128] [<ffffffff8026f148>] lock_acquire+0x58/0x80
> [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> [ 537.380128] [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
> [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> [ 537.383380] [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
> [ 537.383380] [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
> [ 537.383380] [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
> [ 537.383380] [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
> [ 537.383380] [<ffffffff80441da9>] __device_release_driver+0x59/0x90
> [ 537.383380] [<ffffffff80441edb>] device_release_driver+0x2b/0x40
> [ 537.383380] [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
> [ 537.384382] [<ffffffff8043e46b>] device_del+0x12b/0x190
> [ 537.384382] [<ffffffff8043e4f6>] device_unregister+0x26/0x70
> [ 537.384382] [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
> [ 537.384382] [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
> [ 537.384382] [<ffffffff803c10d9>] remove_callback+0x29/0x40
> [ 537.384382] [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
> [ 537.384382] [<ffffffff8025769a>] run_workqueue+0x15a/0x230
> [ 537.384382] [<ffffffff80257648>] ? run_workqueue+0x108/0x230
> [ 537.384382] [<ffffffff8025846f>] worker_thread+0x9f/0x100
> [ 537.384382] [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
> [ 537.384382] [<ffffffff802583d0>] ? worker_thread+0x0/0x100
> [ 537.384382] [<ffffffff8025b89d>] kthread+0x4d/0x80
> [ 537.384382] [<ffffffff8020d4ba>] child_rip+0xa/0x20
> [ 537.386380] [<ffffffff8020cebc>] ? restore_args+0x0/0x30
> [ 537.386380] [<ffffffff8025b850>] ? kthread+0x0/0x80
> [ 537.386380] [<ffffffff8020d4b0>] ? child_rip+0x0/0x20
>
> I think the cause of this error message is flush_workqueue()
> from the work of keventd. When removing device using
> "/sys/bus/pci/devices/.../ remove", pci_remove_bus_device() is
> executed by the keventd's work through
> device_schedule_callback(), and it invokes e1000e's remove
> callback. And then, e1000e's remove callback invokes
> flush_workqueue(). Actually, the kernel error messages are not
> displayed when I changed e1000e driver to not call
> flush_workqueue(). In my understanding, flush_workqueue() from
> the work must be avoided because it can cause a deadlock.
> Please note that this is not a problem of e1000e driver.
> Drivers can use flush_workqueue(), of course.

I agree with this analysis; the reason we're seeing this lockdep
warning is because the sysfs attributed scheduled a removal for
itself using device_schedule_callback(). This is necessary
because sysfs attributes can't remove themselves due to other
locking issues.

My question is -- is it a bug to call flush_workqueue during
run_workqueue?

Conceptually, I don't think it should be a bug; it should be a
nop, since run_workqueue _is_ flushing the work queue.

Thoughts?

> BTW, I also have another worry about executing pci_remove_bus_device()
> by the work of keventd. The pci_remove_bus_device() will take a long
> time especially when the bridge device near the root bus is specified.
> The long delay of keventd's work will have bad effects to other works
> on the workqueue.

The real fix is to fix sysfs so that attributes can remove
themselves directly. I will work with Tejun Heo on getting this
working sooner rather than later. That will avoid the locking
issue you discovered above as well as the concern you point out
about putting long running tasks in the keventd work queue.

Thanks.

/ac

>
> Thanks,
> Kenji Kaneshige
>
>
>
>> struct device_attribute pci_dev_attrs[] = {
>> @@ -263,6 +296,9 @@ struct device_attribute pci_dev_attrs[] = {
>> __ATTR(broken_parity_status,(S_IRUGO|S_IWUSR),
>> broken_parity_status_show,broken_parity_status_store),
>> __ATTR(msi_bus, 0644, msi_bus_show, msi_bus_store),
>> +#ifdef CONFIG_HOTPLUG
>> + __ATTR(remove, (S_IWUSR|S_IWGRP), NULL, remove_store),
>> +#endif
>> __ATTR_NULL,
>> };
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>
>

2009-03-24 09:26:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove


( Cc:-ed a few more interested parties - the thread is about
workqueue dependency lockdep coverage. )

* Alex Chiang <[email protected]> wrote:

> Hi Ingo,
>
> * Kenji Kaneshige <[email protected]>:
> > Alex Chiang wrote:
> >> This patch adds an attribute named "remove" to a PCI device's sysfs
> >> directory. Writing a non-zero value to this attribute will remove the PCI
> >> device and any children of it.
> >>
> >> Trent Piepho wrote the original implementation and documentation.
> >>
> >> Thanks to Vegard Nossum for testing under kmemcheck and finding locking
> >> issues with the sysfs interface.
> >>
> >> Cc: Trent Piepho <[email protected]>
> >> Signed-off-by: Alex Chiang <[email protected]>
>
> [snip part of patch]
>
> >> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> >> index be7468a..e16990e 100644
> >> --- a/drivers/pci/pci-sysfs.c
> >> +++ b/drivers/pci/pci-sysfs.c
> >> @@ -243,6 +243,39 @@ struct bus_attribute pci_bus_attrs[] = {
> >> __ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, bus_rescan_store),
> >> __ATTR_NULL
> >> };
> >> +
> >> +static void remove_callback(struct device *dev)
> >> +{
> >> + struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >> + mutex_lock(&pci_remove_rescan_mutex);
> >> + pci_remove_bus_device(pdev);
> >> + mutex_unlock(&pci_remove_rescan_mutex);
> >> +}
> >> +
> >> +static ssize_t
> >> +remove_store(struct device *dev, struct device_attribute *dummy,
> >> + const char *buf, size_t count)
> >> +{
> >> + int ret = 0;
> >> + unsigned long val;
> >> + struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >> + if (strict_strtoul(buf, 0, &val) < 0)
> >> + return -EINVAL;
> >> +
> >> + if (pci_is_root_bus(pdev->bus))
> >> + return -EBUSY;
> >> +
> >> + /* An attribute cannot be unregistered by one of its own methods,
> >> + * so we have to use this roundabout approach.
> >> + */
> >> + if (val)
> >> + ret = device_schedule_callback(dev, remove_callback);
> >> + if (ret)
> >> + count = ret;
> >> + return count;
> >> +}
> >> #endif
> >>
>
> Kenji Kaneshige reported the below lockdep problem when testing
> my patch on one of his machines.
>
> > I still have the following kernel error messages in testing with your
> > latest set of patches (Jesse's linux-next). The test case is removing
> > e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
> > .../remove".
> >
> > [ 537.379995] =============================================
> > [ 537.380124] [ INFO: possible recursive locking detected ]
> > [ 537.380128] 2.6.29-rc8-kk #1
> > [ 537.380128] ---------------------------------------------
> > [ 537.380128] events/4/56 is trying to acquire lock:
> > [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> > [ 537.380128]
> > [ 537.380128] but task is already holding lock:
> > [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > [ 537.380128]
> > [ 537.380128] other info that might help us debug this:
> > [ 537.380128] 3 locks held by events/4/56:
> > [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
> > [ 537.380128]
> > [ 537.380128] stack backtrace:
> > [ 537.380128] Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
> > [ 537.380128] Call Trace:
> > [ 537.380128] [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
> > [ 537.380128] [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
> > [ 537.380128] [<ffffffff8026f148>] lock_acquire+0x58/0x80
> > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > [ 537.380128] [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
> > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > [ 537.383380] [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
> > [ 537.383380] [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
> > [ 537.383380] [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
> > [ 537.383380] [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
> > [ 537.383380] [<ffffffff80441da9>] __device_release_driver+0x59/0x90
> > [ 537.383380] [<ffffffff80441edb>] device_release_driver+0x2b/0x40
> > [ 537.383380] [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
> > [ 537.384382] [<ffffffff8043e46b>] device_del+0x12b/0x190
> > [ 537.384382] [<ffffffff8043e4f6>] device_unregister+0x26/0x70
> > [ 537.384382] [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
> > [ 537.384382] [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
> > [ 537.384382] [<ffffffff803c10d9>] remove_callback+0x29/0x40
> > [ 537.384382] [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
> > [ 537.384382] [<ffffffff8025769a>] run_workqueue+0x15a/0x230
> > [ 537.384382] [<ffffffff80257648>] ? run_workqueue+0x108/0x230
> > [ 537.384382] [<ffffffff8025846f>] worker_thread+0x9f/0x100
> > [ 537.384382] [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
> > [ 537.384382] [<ffffffff802583d0>] ? worker_thread+0x0/0x100
> > [ 537.384382] [<ffffffff8025b89d>] kthread+0x4d/0x80
> > [ 537.384382] [<ffffffff8020d4ba>] child_rip+0xa/0x20
> > [ 537.386380] [<ffffffff8020cebc>] ? restore_args+0x0/0x30
> > [ 537.386380] [<ffffffff8025b850>] ? kthread+0x0/0x80
> > [ 537.386380] [<ffffffff8020d4b0>] ? child_rip+0x0/0x20
> >
> > I think the cause of this error message is flush_workqueue()
> > from the work of keventd. When removing device using
> > "/sys/bus/pci/devices/.../ remove", pci_remove_bus_device() is
> > executed by the keventd's work through
> > device_schedule_callback(), and it invokes e1000e's remove
> > callback. And then, e1000e's remove callback invokes
> > flush_workqueue(). Actually, the kernel error messages are not
> > displayed when I changed e1000e driver to not call
> > flush_workqueue(). In my understanding, flush_workqueue() from
> > the work must be avoided because it can cause a deadlock.
> > Please note that this is not a problem of e1000e driver.
> > Drivers can use flush_workqueue(), of course.
>
> I agree with this analysis; the reason we're seeing this lockdep
> warning is because the sysfs attributed scheduled a removal for
> itself using device_schedule_callback(). This is necessary
> because sysfs attributes can't remove themselves due to other
> locking issues.
>
> My question is -- is it a bug to call flush_workqueue during
> run_workqueue?

Yes, it generally is.

> Conceptually, I don't think it should be a bug; it should be a
> nop, since run_workqueue _is_ flushing the work queue.
>
> Thoughts?

well ... but running a work item holds up further processing of the
queue - and there lies the deadlock potential. (but ... i have not
looked deeply, there's always the possibility of a false positive.)

Ingo
>
> > BTW, I also have another worry about executing pci_remove_bus_device()
> > by the work of keventd. The pci_remove_bus_device() will take a long
> > time especially when the bridge device near the root bus is specified.
> > The long delay of keventd's work will have bad effects to other works
> > on the workqueue.
>
> The real fix is to fix sysfs so that attributes can remove
> themselves directly. I will work with Tejun Heo on getting this
> working sooner rather than later. That will avoid the locking
> issue you discovered above as well as the concern you point out
> about putting long running tasks in the keventd work queue.

2009-03-24 11:00:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

On Tue, 24 Mar 2009 10:25:25 +0100 Ingo Molnar <[email protected]> wrote:

>
> ( Cc:-ed a few more interested parties - the thread is about
> workqueue dependency lockdep coverage. )
>
> * Alex Chiang <[email protected]> wrote:
>
> > Hi Ingo,
> >
> > * Kenji Kaneshige <[email protected]>:
> > > Alex Chiang wrote:
> > >> This patch adds an attribute named "remove" to a PCI device's sysfs
> > >> directory. Writing a non-zero value to this attribute will remove the PCI
> > >> device and any children of it.
> > >>
> > >> Trent Piepho wrote the original implementation and documentation.
> > >>
> > >> Thanks to Vegard Nossum for testing under kmemcheck and finding locking
> > >> issues with the sysfs interface.
> > >>
> > >> Cc: Trent Piepho <[email protected]>
> > >> Signed-off-by: Alex Chiang <[email protected]>
> >
> > [snip part of patch]
> >
> > >> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > >> index be7468a..e16990e 100644
> > >> --- a/drivers/pci/pci-sysfs.c
> > >> +++ b/drivers/pci/pci-sysfs.c
> > >> @@ -243,6 +243,39 @@ struct bus_attribute pci_bus_attrs[] = {
> > >> __ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, bus_rescan_store),
> > >> __ATTR_NULL
> > >> };
> > >> +
> > >> +static void remove_callback(struct device *dev)
> > >> +{
> > >> + struct pci_dev *pdev = to_pci_dev(dev);
> > >> +
> > >> + mutex_lock(&pci_remove_rescan_mutex);
> > >> + pci_remove_bus_device(pdev);
> > >> + mutex_unlock(&pci_remove_rescan_mutex);
> > >> +}
> > >> +
> > >> +static ssize_t
> > >> +remove_store(struct device *dev, struct device_attribute *dummy,
> > >> + const char *buf, size_t count)
> > >> +{
> > >> + int ret = 0;
> > >> + unsigned long val;
> > >> + struct pci_dev *pdev = to_pci_dev(dev);
> > >> +
> > >> + if (strict_strtoul(buf, 0, &val) < 0)
> > >> + return -EINVAL;
> > >> +
> > >> + if (pci_is_root_bus(pdev->bus))
> > >> + return -EBUSY;
> > >> +
> > >> + /* An attribute cannot be unregistered by one of its own methods,
> > >> + * so we have to use this roundabout approach.
> > >> + */
> > >> + if (val)
> > >> + ret = device_schedule_callback(dev, remove_callback);
> > >> + if (ret)
> > >> + count = ret;
> > >> + return count;
> > >> +}
> > >> #endif
> > >>
> >
> > Kenji Kaneshige reported the below lockdep problem when testing
> > my patch on one of his machines.
> >
> > > I still have the following kernel error messages in testing with your
> > > latest set of patches (Jesse's linux-next). The test case is removing
> > > e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
> > > .../remove".
> > >
> > > [ 537.379995] =============================================
> > > [ 537.380124] [ INFO: possible recursive locking detected ]
> > > [ 537.380128] 2.6.29-rc8-kk #1
> > > [ 537.380128] ---------------------------------------------
> > > [ 537.380128] events/4/56 is trying to acquire lock:
> > > [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> > > [ 537.380128]
> > > [ 537.380128] but task is already holding lock:
> > > [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > [ 537.380128]
> > > [ 537.380128] other info that might help us debug this:
> > > [ 537.380128] 3 locks held by events/4/56:
> > > [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
> > > [ 537.380128]
> > > [ 537.380128] stack backtrace:
> > > [ 537.380128] Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
> > > [ 537.380128] Call Trace:
> > > [ 537.380128] [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
> > > [ 537.380128] [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
> > > [ 537.380128] [<ffffffff8026f148>] lock_acquire+0x58/0x80
> > > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > > [ 537.380128] [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
> > > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > > [ 537.383380] [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
> > > [ 537.383380] [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
> > > [ 537.383380] [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
> > > [ 537.383380] [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
> > > [ 537.383380] [<ffffffff80441da9>] __device_release_driver+0x59/0x90
> > > [ 537.383380] [<ffffffff80441edb>] device_release_driver+0x2b/0x40
> > > [ 537.383380] [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
> > > [ 537.384382] [<ffffffff8043e46b>] device_del+0x12b/0x190
> > > [ 537.384382] [<ffffffff8043e4f6>] device_unregister+0x26/0x70
> > > [ 537.384382] [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
> > > [ 537.384382] [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
> > > [ 537.384382] [<ffffffff803c10d9>] remove_callback+0x29/0x40
> > > [ 537.384382] [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
> > > [ 537.384382] [<ffffffff8025769a>] run_workqueue+0x15a/0x230
> > > [ 537.384382] [<ffffffff80257648>] ? run_workqueue+0x108/0x230
> > > [ 537.384382] [<ffffffff8025846f>] worker_thread+0x9f/0x100
> > > [ 537.384382] [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
> > > [ 537.384382] [<ffffffff802583d0>] ? worker_thread+0x0/0x100
> > > [ 537.384382] [<ffffffff8025b89d>] kthread+0x4d/0x80
> > > [ 537.384382] [<ffffffff8020d4ba>] child_rip+0xa/0x20
> > > [ 537.386380] [<ffffffff8020cebc>] ? restore_args+0x0/0x30
> > > [ 537.386380] [<ffffffff8025b850>] ? kthread+0x0/0x80
> > > [ 537.386380] [<ffffffff8020d4b0>] ? child_rip+0x0/0x20
> > >
> > > I think the cause of this error message is flush_workqueue()
> > > from the work of keventd. When removing device using
> > > "/sys/bus/pci/devices/.../ remove", pci_remove_bus_device() is
> > > executed by the keventd's work through
> > > device_schedule_callback(), and it invokes e1000e's remove
> > > callback. And then, e1000e's remove callback invokes
> > > flush_workqueue(). Actually, the kernel error messages are not
> > > displayed when I changed e1000e driver to not call
> > > flush_workqueue(). In my understanding, flush_workqueue() from
> > > the work must be avoided because it can cause a deadlock.
> > > Please note that this is not a problem of e1000e driver.
> > > Drivers can use flush_workqueue(), of course.
> >
> > I agree with this analysis; the reason we're seeing this lockdep
> > warning is because the sysfs attributed scheduled a removal for
> > itself using device_schedule_callback(). This is necessary
> > because sysfs attributes can't remove themselves due to other
> > locking issues.
> >
> > My question is -- is it a bug to call flush_workqueue during
> > run_workqueue?
>
> Yes, it generally is.
>
> > Conceptually, I don't think it should be a bug; it should be a
> > nop, since run_workqueue _is_ flushing the work queue.
> >
> > Thoughts?
>
> well ... but running a work item holds up further processing of the
> queue - and there lies the deadlock potential. (but ... i have not
> looked deeply, there's always the possibility of a false positive.)
>

Thing is, we've always supported kevetnd-calls-flush_work(). That's what
"morton gets to eat his hat" in run_workqueue() is all about.

Now, -mm's workqueue-avoid-recursion-in-run_workqueue.patch changes all of
that. And that patch recently triggered a warning due to some games which
USB was playing. I was told this is because USB is being bad.

But I don't think we've seen a coherent description of what's actually
_wrong_ with the current code. flush_cpu_workqueue() has been handling
this case for many years with no problems reported as far as I know.

So what has caused this sudden flurry of reports? Did something change in
lockdep? What is this

[ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
[ 537.380128]
[ 537.380128] but task is already holding lock:
[ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230

supposed to mean? "events" isn't a lock - it's the name of a kernel
thread, isn't it? If this is supposed to be deadlockable then how?

Because I don't immediately see what's wrong with e1000_remove() calling
flush_work(). It's undesirable, and we can perhaps improve it via some
means, but where is the bug?


> >
> > > BTW, I also have another worry about executing pci_remove_bus_device()
> > > by the work of keventd. The pci_remove_bus_device() will take a long
> > > time especially when the bridge device near the root bus is specified.
> > > The long delay of keventd's work will have bad effects to other works
> > > on the workqueue.
> >
> > The real fix is to fix sysfs so that attributes can remove
> > themselves directly. I will work with Tejun Heo on getting this
> > working sooner rather than later. That will avoid the locking
> > issue you discovered above as well as the concern you point out
> > about putting long running tasks in the keventd work queue.
>

2009-03-24 11:18:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

On Tue, 2009-03-24 at 03:46 -0700, Andrew Morton wrote:
>
> Thing is, we've always supported kevetnd-calls-flush_work(). That's what
> "morton gets to eat his hat" in run_workqueue() is all about.

Supported as in not complained about it, but its always presented a
deadlock scenario.

> Now, -mm's workqueue-avoid-recursion-in-run_workqueue.patch changes all of
> that.

See the discussions around that patch, Lai Jiangshan discovered that it
had more deadlock potential than we even suspected.

To quote:

---
On 02/06, Lai Jiangshan wrote:
>
> Oleg Nesterov wrote:
> > On 02/05, Lai Jiangshan wrote:
> >> DEADLOCK EXAMPLE for explain my above option:
> >>
> >> (work_func0() and work_func1() are work callback, and they
> >> calls flush_workqueue())
> >>
> >> CPU#0 CPU#1
> >> run_workqueue() run_workqueue()
> >> work_func0() work_func1()
> >> flush_workqueue() flush_workqueue()
> >> flush_cpu_workqueue(0) .
> >> flush_cpu_workqueue(cpu#1) flush_cpu_workqueue(cpu#0)
> >> waiting work_func1() in cpu#1 waiting work_func0 in cpu#0
> >>
> >> DEADLOCK!
> >
> > I am not sure. Note that when work_func0() calls run_workqueue(),
> > it will clear cwq->current_work, so another flush_ on CPU#1 will
> > not wait for work_func0, no?
>
> cwq->current_work is changed only when
> !list_empty(&cwq->worklist)
> in run_workqueue().
>
> so cwq->current_work may not be changed.

Ah, indeed.

Thanks for correcting me!
---

> And that patch recently triggered a warning due to some games which
> USB was playing. I was told this is because USB is being bad.
>
> But I don't think we've seen a coherent description of what's actually
> _wrong_ with the current code. flush_cpu_workqueue() has been handling
> this case for many years with no problems reported as far as I know.

Might be sheer luck, but afaik we did have some actual deadlocks due to
workqueue flushing -- a particular one I can remember was cpu-hotplug vs
cpufreq.

> So what has caused this sudden flurry of reports? Did something change in
> lockdep? What is this
>
> [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> [ 537.380128]
> [ 537.380128] but task is already holding lock:
> [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>
> supposed to mean? "events" isn't a lock - it's the name of a kernel
> thread, isn't it?

No workqueue lockdep support has been in there for a while now. /me
pokes at git for a bit..

4e6045f134784f4b158b3c0f7a282b04bd816887 -- Oct 2007, ca. 2.6.24-rc1

What it does it gives the workqueue a lock-object and each worklet. It
then validates that you only get:

workqueue
worklet

nestings, eg. calling flush_workqueue() from a worklet will generate

workqueue <-.
worklet |
workqueue -'

recursion, IOW the above splat.

Another thing it does is connect the lockchains of workqueue callers
with those of the worklet. eg.

---
code path 1:
my_function() -> lock(L1); ...; flush_workqueue(); ...

code path 2:
run_workqueue() -> my_work() -> ...; lock(L1); ...

you can get a deadlock when my_work() is queued or running
but my_function() has acquired L1 already.
---

> If this is supposed to be deadlockable then how?
>
> Because I don't immediately see what's wrong with e1000_remove() calling
> flush_work(). It's undesirable, and we can perhaps improve it via some
> means, but where is the bug?

I hope the above answers why flushing a workqueue from within that same
workqueue is a very bad thing.

2009-03-24 12:34:00

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

On Tue, 2009-03-24 at 03:46 -0700, Andrew Morton wrote:

> But I don't think we've seen a coherent description of what's actually
> _wrong_ with the current code. flush_cpu_workqueue() has been handling
> this case for many years with no problems reported as far as I know.
>
> So what has caused this sudden flurry of reports? Did something change in
> lockdep? What is this
>
> [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> [ 537.380128]
> [ 537.380128] but task is already holding lock:
> [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>
> supposed to mean? "events" isn't a lock - it's the name of a kernel
> thread, isn't it? If this is supposed to be deadlockable then how?

events is indeed the schedule_work workqueue thread name -- I just used
that for lack of a better name.

> Because I don't immediately see what's wrong with e1000_remove() calling
> flush_work(). It's undesirable, and we can perhaps improve it via some
> means, but where is the bug?

There is no bug -- it's a false positive in a way. I've pointed this out
in the original thread, see
http://thread.gmane.org/gmane.linux.kernel/550877/focus=550932

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2009-03-24 13:22:44

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

On Tue, 2009-03-24 at 12:17 +0100, Peter Zijlstra wrote:

> > But I don't think we've seen a coherent description of what's actually
> > _wrong_ with the current code. flush_cpu_workqueue() has been handling
> > this case for many years with no problems reported as far as I know.
>
> Might be sheer luck, but afaik we did have some actual deadlocks due to
> workqueue flushing -- a particular one I can remember was cpu-hotplug vs
> cpufreq.

Two cases are relevant here actually -- the recursion which hasn't ever
shown up before, and a number of possible deadlocks of e.g. some people
doing, effectively:

rtnl_lock();
flush_scheduled_work();
rtnl_unlock();

vs. the linkwatch work that can, at this point in time, still be queued,
and needs the rtnl as well.


A little digging through git logs finds more references, e.g. commits
f90d4118bacef87894621a3e8aba853fa0c89abc and
fd781fa25c9e9c6fd1599df060b05e7c4ad724e5.

Some others were fixed that I remember, but apparently without putting
the lockdep report into the commit log.

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2009-03-24 16:18:36

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

On 03/24, Ingo Molnar wrote:
> >
> > * Kenji Kaneshige <[email protected]>:
> >
> > Kenji Kaneshige reported the below lockdep problem when testing
> > my patch on one of his machines.
> >
> > > I still have the following kernel error messages in testing with your
> > > latest set of patches (Jesse's linux-next). The test case is removing
> > > e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
> > > .../remove".
> > >
> > > [ 537.379995] =============================================
> > > [ 537.380124] [ INFO: possible recursive locking detected ]
> > > [ 537.380128] 2.6.29-rc8-kk #1
> > > [ 537.380128] ---------------------------------------------
> > > [ 537.380128] events/4/56 is trying to acquire lock:
> > > [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> > > [ 537.380128]
> > > [ 537.380128] but task is already holding lock:
> > > [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > [ 537.380128]
> > > [ 537.380128] other info that might help us debug this:
> > > [ 537.380128] 3 locks held by events/4/56:
> > > [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
> > > [ 537.380128]
> > > [ 537.380128] stack backtrace:
> > > [ 537.380128] Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
> > > [ 537.380128] Call Trace:
> > > [ 537.380128] [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
> > > [ 537.380128] [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
> > > [ 537.380128] [<ffffffff8026f148>] lock_acquire+0x58/0x80
> > > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > > [ 537.380128] [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
> > > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > > [ 537.383380] [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
> > > [ 537.383380] [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
> > > [ 537.383380] [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
> > > [ 537.383380] [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
> > > [ 537.383380] [<ffffffff80441da9>] __device_release_driver+0x59/0x90
> > > [ 537.383380] [<ffffffff80441edb>] device_release_driver+0x2b/0x40
> > > [ 537.383380] [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
> > > [ 537.384382] [<ffffffff8043e46b>] device_del+0x12b/0x190
> > > [ 537.384382] [<ffffffff8043e4f6>] device_unregister+0x26/0x70
> > > [ 537.384382] [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
> > > [ 537.384382] [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
> > > [ 537.384382] [<ffffffff803c10d9>] remove_callback+0x29/0x40
> > > [ 537.384382] [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
> > > [ 537.384382] [<ffffffff8025769a>] run_workqueue+0x15a/0x230
> > > [ 537.384382] [<ffffffff80257648>] ? run_workqueue+0x108/0x230
> > > [ 537.384382] [<ffffffff8025846f>] worker_thread+0x9f/0x100
> > > [ 537.384382] [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
> > > [ 537.384382] [<ffffffff802583d0>] ? worker_thread+0x0/0x100
> > > [ 537.384382] [<ffffffff8025b89d>] kthread+0x4d/0x80
> > > [ 537.384382] [<ffffffff8020d4ba>] child_rip+0xa/0x20
> > > [ 537.386380] [<ffffffff8020cebc>] ? restore_args+0x0/0x30
> > > [ 537.386380] [<ffffffff8025b850>] ? kthread+0x0/0x80
> > > [ 537.386380] [<ffffffff8020d4b0>] ? child_rip+0x0/0x20
> > >
> > > I think the cause of this error message is flush_workqueue()
> > > from the work of keventd. When removing device using
> > > "/sys/bus/pci/devices/.../ remove", pci_remove_bus_device() is
> > > executed by the keventd's work through
> > > device_schedule_callback(), and it invokes e1000e's remove
> > > callback. And then, e1000e's remove callback invokes
> > > flush_workqueue(). Actually, the kernel error messages are not
> > > displayed when I changed e1000e driver to not call
> > > flush_workqueue(). In my understanding, flush_workqueue() from
> > > the work must be avoided because it can cause a deadlock.
> > > Please note that this is not a problem of e1000e driver.
> > > Drivers can use flush_workqueue(), of course.
> >
> > I agree with this analysis; the reason we're seeing this lockdep
> > warning is because the sysfs attributed scheduled a removal for
> > itself using device_schedule_callback(). This is necessary
> > because sysfs attributes can't remove themselves due to other
> > locking issues.
> >
> > My question is -- is it a bug to call flush_workqueue during
> > run_workqueue?
>
> Yes, it generally is.
>
> > Conceptually, I don't think it should be a bug; it should be a
> > nop, since run_workqueue _is_ flushing the work queue.

As it was already said, we can deadlock.

Can't e1000_remove() avoid flush_scheduled_work() ? (and it should
be always avoided when possible).

Of course, I don't understand this code. But afaics e1000_remove()
can just cancel its own works (in struct e1000_adapter), no?

cancel_work_sync(work) from run_workqueue() should be OK even if
this work is queued on the same wq. If it is queued on the same CPU
cancel_work_sync() won't block because we are ->current_work.


Btw. Again, I don't understand the code, but this looks suspicious:

e1000_remove:

set_bit(__E1000_DOWN, &adapter->state);
del_timer_sync(&adapter->watchdog_timer);
flush_scheduled_work();

What if e1000_watchdog_task() is running, has already checked
!test_bit(__E1000_DOWN, &adapter->state), but didn't call
mod_timer(&adapter->phy_info_timer) yet?

Oleg.

2009-03-24 17:24:33

by Alex Chiang

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

* Johannes Berg <[email protected]>:
> On Tue, 2009-03-24 at 03:46 -0700, Andrew Morton wrote:
>
> > But I don't think we've seen a coherent description of what's actually
> > _wrong_ with the current code. flush_cpu_workqueue() has been handling
> > this case for many years with no problems reported as far as I know.
> >
> > So what has caused this sudden flurry of reports? Did something change in
> > lockdep? What is this
> >
> > [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> > [ 537.380128]
> > [ 537.380128] but task is already holding lock:
> > [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> >
> > supposed to mean? "events" isn't a lock - it's the name of a kernel
> > thread, isn't it? If this is supposed to be deadlockable then how?
>
> events is indeed the schedule_work workqueue thread name -- I just used
> that for lack of a better name.
>
> > Because I don't immediately see what's wrong with e1000_remove() calling
> > flush_work(). It's undesirable, and we can perhaps improve it via some
> > means, but where is the bug?
>
> There is no bug -- it's a false positive in a way. I've pointed this out
> in the original thread, see
> http://thread.gmane.org/gmane.linux.kernel/550877/focus=550932

I'm actually a bit confused now.

Peter explained why flushing a workqueue from the same queue is
bad, and in general I agree, but what do you mean by "false
positive"?

By the way, this scenario:

code path 1:
my_function() -> lock(L1); ...; flush_workqueue(); ...

code path 2:
run_workqueue() -> my_work() -> ...; lock(L1); ...

is _not_ what is happening here.

sysfs_schedule_callback() is an ugly piece of code that exists
because a sysfs attribute cannot remove itself without
deadlocking. So the callback mechanism was created to allow a
different kernel thread to remove the sysfs attribute and avoid
deadlock.

So what you really have going on is:

sysfs callback -> add remove callback to global workqueue
remove callback fires off (pci_remove_bus_device) and we do...
device_unregister
driver's ->remove method called
driver's ->remove method calls flush_scheduled_work

Yes, after read the thread I agree that generically calling
flush_workqueue in the middle of run_workqueue is bad, but the
lockdep warning that Kenji showed us really won't deadlock.

This is because pci_remove_bus_device() will not acquire any lock
L1 that an individual device driver will attempt to acquire in
the remove path. If that were the case, we would deadlock every
time you rmmod'ed a device driver's module or every time you shut
your machine down.

I think from my end, there are 2 things I need to do:

a) make sysfs_schedule_callback() use its own work queue
instead of global work queue, because too many drivers
call flush_scheduled_work in their remove path

b) give sysfs attributes the ability to commit suicide

(a) is short term work, 2.6.30 timeframe, since it doesn't
involve any large conceptual changes.

(b) is picking up Tejun Heo's existing work, but that was a bit
controversial last time, and I'm not sure it will make it during
this merge window.

Question for the lockdep folks though -- given what I described,
do you agree that the warning we saw was a false positive? Or am
I off in left field?

Thanks.

/ac

2009-03-24 17:33:26

by Alex Chiang

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

(Adding some e1000e folks to answer questions about their driver)

* Oleg Nesterov <[email protected]>:
> On 03/24, Ingo Molnar wrote:
> > >
> > > * Kenji Kaneshige <[email protected]>:
> > >
> > > Kenji Kaneshige reported the below lockdep problem when testing
> > > my patch on one of his machines.
> > >
> > > > I still have the following kernel error messages in testing with your
> > > > latest set of patches (Jesse's linux-next). The test case is removing
> > > > e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
> > > > .../remove".
> > > >
> > > > [ 537.379995] =============================================
> > > > [ 537.380124] [ INFO: possible recursive locking detected ]
> > > > [ 537.380128] 2.6.29-rc8-kk #1
> > > > [ 537.380128] ---------------------------------------------
> > > > [ 537.380128] events/4/56 is trying to acquire lock:
> > > > [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> > > > [ 537.380128]
> > > > [ 537.380128] but task is already holding lock:
> > > > [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > > [ 537.380128]
> > > > [ 537.380128] other info that might help us debug this:
> > > > [ 537.380128] 3 locks held by events/4/56:
> > > > [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > > [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> > > > [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
> > > > [ 537.380128]
> > > > [ 537.380128] stack backtrace:
> > > > [ 537.380128] Pid: 56, comm: events/4 Not tainted 2.6.29-rc8-kk #1
> > > > [ 537.380128] Call Trace:
> > > > [ 537.380128] [<ffffffff8026dfcd>] validate_chain+0xb7d/0x1260
> > > > [ 537.380128] [<ffffffff8026eade>] __lock_acquire+0x42e/0xa40
> > > > [ 537.380128] [<ffffffff8026f148>] lock_acquire+0x58/0x80
> > > > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > > > [ 537.380128] [<ffffffff8025800d>] flush_workqueue+0x4d/0xa0
> > > > [ 537.380128] [<ffffffff80257fc0>] ? flush_workqueue+0x0/0xa0
> > > > [ 537.383380] [<ffffffff80258070>] flush_scheduled_work+0x10/0x20
> > > > [ 537.383380] [<ffffffffa0144065>] e1000_remove+0x55/0xfe [e1000e]
> > > > [ 537.383380] [<ffffffff8033ee30>] ? sysfs_schedule_callback_work+0x0/0x50
> > > > [ 537.383380] [<ffffffff803bfeb2>] pci_device_remove+0x32/0x70
> > > > [ 537.383380] [<ffffffff80441da9>] __device_release_driver+0x59/0x90
> > > > [ 537.383380] [<ffffffff80441edb>] device_release_driver+0x2b/0x40
> > > > [ 537.383380] [<ffffffff804419d6>] bus_remove_device+0xa6/0x120
> > > > [ 537.384382] [<ffffffff8043e46b>] device_del+0x12b/0x190
> > > > [ 537.384382] [<ffffffff8043e4f6>] device_unregister+0x26/0x70
> > > > [ 537.384382] [<ffffffff803ba969>] pci_stop_dev+0x49/0x60
> > > > [ 537.384382] [<ffffffff803baab0>] pci_remove_bus_device+0x40/0xc0
> > > > [ 537.384382] [<ffffffff803c10d9>] remove_callback+0x29/0x40
> > > > [ 537.384382] [<ffffffff8033ee4f>] sysfs_schedule_callback_work+0x1f/0x50
> > > > [ 537.384382] [<ffffffff8025769a>] run_workqueue+0x15a/0x230
> > > > [ 537.384382] [<ffffffff80257648>] ? run_workqueue+0x108/0x230
> > > > [ 537.384382] [<ffffffff8025846f>] worker_thread+0x9f/0x100
> > > > [ 537.384382] [<ffffffff8025bce0>] ? autoremove_wake_function+0x0/0x40
> > > > [ 537.384382] [<ffffffff802583d0>] ? worker_thread+0x0/0x100
> > > > [ 537.384382] [<ffffffff8025b89d>] kthread+0x4d/0x80
> > > > [ 537.384382] [<ffffffff8020d4ba>] child_rip+0xa/0x20
> > > > [ 537.386380] [<ffffffff8020cebc>] ? restore_args+0x0/0x30
> > > > [ 537.386380] [<ffffffff8025b850>] ? kthread+0x0/0x80
> > > > [ 537.386380] [<ffffffff8020d4b0>] ? child_rip+0x0/0x20
> > > >
> > > > I think the cause of this error message is flush_workqueue()
> > > > from the work of keventd. When removing device using
> > > > "/sys/bus/pci/devices/.../ remove", pci_remove_bus_device() is
> > > > executed by the keventd's work through
> > > > device_schedule_callback(), and it invokes e1000e's remove
> > > > callback. And then, e1000e's remove callback invokes
> > > > flush_workqueue(). Actually, the kernel error messages are not
> > > > displayed when I changed e1000e driver to not call
> > > > flush_workqueue(). In my understanding, flush_workqueue() from
> > > > the work must be avoided because it can cause a deadlock.
> > > > Please note that this is not a problem of e1000e driver.
> > > > Drivers can use flush_workqueue(), of course.
> > >
> > > I agree with this analysis; the reason we're seeing this lockdep
> > > warning is because the sysfs attributed scheduled a removal for
> > > itself using device_schedule_callback(). This is necessary
> > > because sysfs attributes can't remove themselves due to other
> > > locking issues.
> > >
> > > My question is -- is it a bug to call flush_workqueue during
> > > run_workqueue?
> >
> > Yes, it generally is.
> >
> > > Conceptually, I don't think it should be a bug; it should be a
> > > nop, since run_workqueue _is_ flushing the work queue.
>
> As it was already said, we can deadlock.

As answered in another email, I agree that generically, it can
deadlock. I think this particular warning is a false positive
though, because pci_remove_bus_device() definitely cannot acquire
any locks that lower level device drivers want to acquire.

If that were true, then hotplug would never work, and you would
also see hangs every time you shut your machine off.

I can't answer the e1000e questions below but...

> Can't e1000_remove() avoid flush_scheduled_work() ? (and it should
> be always avoided when possible).
>
> Of course, I don't understand this code. But afaics e1000_remove()
> can just cancel its own works (in struct e1000_adapter), no?
>
> cancel_work_sync(work) from run_workqueue() should be OK even if
> this work is queued on the same wq. If it is queued on the same CPU
> cancel_work_sync() won't block because we are ->current_work.
>
>
> Btw. Again, I don't understand the code, but this looks suspicious:
>
> e1000_remove:
>
> set_bit(__E1000_DOWN, &adapter->state);
> del_timer_sync(&adapter->watchdog_timer);
> flush_scheduled_work();
>
> What if e1000_watchdog_task() is running, has already checked
> !test_bit(__E1000_DOWN, &adapter->state), but didn't call
> mod_timer(&adapter->phy_info_timer) yet?

Even if e1000e is doing something funny, the conclusion I'm
coming to is that the sysfs_schedule_callback implementation
needs to change, because there are simply too many drivers that
call flush_scheduled_work.

Thanks.

/ac

2009-03-24 19:29:20

by Alex Chiang

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

* Kenji Kaneshige <[email protected]>:
>
> I still have the following kernel error messages in testing with your
> latest set of patches (Jesse's linux-next). The test case is removing
> e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
> .../remove".
>
> [ 537.379995] =============================================
> [ 537.380124] [ INFO: possible recursive locking detected ]
> [ 537.380128] 2.6.29-rc8-kk #1
> [ 537.380128] ---------------------------------------------
> [ 537.380128] events/4/56 is trying to acquire lock:
> [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> [ 537.380128]
> [ 537.380128] but task is already holding lock:
> [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> [ 537.380128]
> [ 537.380128] other info that might help us debug this:
> [ 537.380128] 3 locks held by events/4/56:
> [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40

I still cannot reproduce this lockdep issue, even using your
.config with an e1000e device on an x86_64 kernel. :(

I tried removing the endpoint, an intermediate bridge device, and
the parent bus. I don't know what I'm doing wrong...

Can you please try this patch though, and see if it fixes the
warning? It applies on top of my other sysfs patch that
introduces a mutex in sysfs_schedule_callback.

Thanks.

/ac

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 289c43a..7cc4dc0 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -667,6 +667,7 @@ struct sysfs_schedule_callback_struct {
struct work_struct work;
};

+static struct workqueue_struct *sysfs_workqueue;
static DEFINE_MUTEX(sysfs_workq_mutex);
static LIST_HEAD(sysfs_workq);
static void sysfs_schedule_callback_work(struct work_struct *work)
@@ -720,6 +721,12 @@ int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
}
mutex_unlock(&sysfs_workq_mutex);

+ if (sysfs_workqueue == NULL) {
+ sysfs_workqueue = create_workqueue("sysfsqueue");
+ if (sysfs_workqueue == NULL)
+ return -ENOMEM;
+ }
+
ss = kmalloc(sizeof(*ss), GFP_KERNEL);
if (!ss) {
module_put(owner);
@@ -735,7 +742,7 @@ int sysfs_schedule_callback(struct kobject *kobj, void (*func)(void *),
mutex_lock(&sysfs_workq_mutex);
list_add_tail(&ss->workq_list, &sysfs_workq);
mutex_unlock(&sysfs_workq_mutex);
- schedule_work(&ss->work);
+ queue_work(sysfs_workqueue, &ss->work);
return 0;
}
EXPORT_SYMBOL_GPL(sysfs_schedule_callback);

2009-03-24 20:23:18

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

On Tue, 2009-03-24 at 11:23 -0600, Alex Chiang wrote:

> > There is no bug -- it's a false positive in a way. I've pointed this out
> > in the original thread, see
> > http://thread.gmane.org/gmane.linux.kernel/550877/focus=550932
>
> I'm actually a bit confused now.

Sorry.

> Peter explained why flushing a workqueue from the same queue is
> bad, and in general I agree, but what do you mean by "false
> positive"?

Well, even though generally flushing it from within is bad, the actual
thing lockdep reports is bogus -- it's reporting a nested locking.

> By the way, this scenario:
>
> code path 1:
> my_function() -> lock(L1); ...; flush_workqueue(); ...
>
> code path 2:
> run_workqueue() -> my_work() -> ...; lock(L1); ...
>
> is _not_ what is happening here.

Indeed.

> So what you really have going on is:
>
> sysfs callback -> add remove callback to global workqueue
> remove callback fires off (pci_remove_bus_device) and we do...
> device_unregister
> driver's ->remove method called
> driver's ->remove method calls flush_scheduled_work
>
> Yes, after read the thread I agree that generically calling
> flush_workqueue in the middle of run_workqueue is bad, but the
> lockdep warning that Kenji showed us really won't deadlock.

Exactly that is what I meant by "false positive".

> This is because pci_remove_bus_device() will not acquire any lock
> L1 that an individual device driver will attempt to acquire in
> the remove path. If that were the case, we would deadlock every
> time you rmmod'ed a device driver's module or every time you shut
> your machine down.
>
> I think from my end, there are 2 things I need to do:
>
> a) make sysfs_schedule_callback() use its own work queue
> instead of global work queue, because too many drivers
> call flush_scheduled_work in their remove path
>
> b) give sysfs attributes the ability to commit suicide
>
> (a) is short term work, 2.6.30 timeframe, since it doesn't
> involve any large conceptual changes.
>
> (b) is picking up Tejun Heo's existing work, but that was a bit
> controversial last time, and I'm not sure it will make it during
> this merge window.
>
> Question for the lockdep folks though -- given what I described,
> do you agree that the warning we saw was a false positive? Or am
> I off in left field?

I think we're not sure yet -- it seems Lai Jiangshan described a
scenario in which flushing from within the work actually _can_ deadlock.

johannes


Attachments:
signature.asc (836.00 B)
This is a digitally signed message part

2009-03-25 05:06:49

by Kenji Kaneshige

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

Alex Chiang wrote:
> * Kenji Kaneshige <[email protected]>:
>> I still have the following kernel error messages in testing with your
>> latest set of patches (Jesse's linux-next). The test case is removing
>> e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
>> .../remove".
>>
>> [ 537.379995] =============================================
>> [ 537.380124] [ INFO: possible recursive locking detected ]
>> [ 537.380128] 2.6.29-rc8-kk #1
>> [ 537.380128] ---------------------------------------------
>> [ 537.380128] events/4/56 is trying to acquire lock:
>> [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
>> [ 537.380128]
>> [ 537.380128] but task is already holding lock:
>> [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>> [ 537.380128]
>> [ 537.380128] other info that might help us debug this:
>> [ 537.380128] 3 locks held by events/4/56:
>> [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>> [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>> [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
>
> I still cannot reproduce this lockdep issue, even using your
> .config with an e1000e device on an x86_64 kernel. :(
>
> I tried removing the endpoint, an intermediate bridge device, and
> the parent bus. I don't know what I'm doing wrong...
>

I don't know either...
The reproducibility is 100% on my environment. The steps are
just boot the system and remove the device.

> Can you please try this patch though, and see if it fixes the
> warning? It applies on top of my other sysfs patch that
> introduces a mutex in sysfs_schedule_callback.

Anyway, I confirmed the kernel error messages were gone with
the patch against sysfs. Note that I used the following patch
I made for testing instead since your patch could not be
applied to Jesse's linux-next.

Thanks,
Kenji Kaneshige



fs/sysfs/file.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

Index: linux-next-20090323/fs/sysfs/file.c
===================================================================
--- linux-next-20090323.orig/fs/sysfs/file.c 2009-03-25 12:09:37.000000000 +0900
+++ linux-next-20090323/fs/sysfs/file.c 2009-03-25 13:40:10.000000000 +0900
@@ -677,6 +677,7 @@
kfree(ss);
}

+static struct workqueue_struct *sysfsd_wq;
/**
* sysfs_schedule_callback - helper to schedule a callback for a kobject
* @kobj: object we're acting for.
@@ -704,6 +705,17 @@

if (!try_module_get(owner))
return -ENODEV;
+
+ if (!sysfsd_wq) {
+ sysfsd_wq = create_workqueue("sysfsd");
+ if (!sysfsd_wq) {
+ printk(KERN_ERR
+ "%s: Could not create workqueue\n", __func__);
+ WARN_ON(1);
+ return -ENOMEM;
+ }
+ }
+
ss = kmalloc(sizeof(*ss), GFP_KERNEL);
if (!ss) {
module_put(owner);
@@ -715,7 +727,7 @@
ss->data = data;
ss->owner = owner;
INIT_WORK(&ss->work, sysfs_schedule_callback_work);
- schedule_work(&ss->work);
+ queue_work(sysfsd_wq, &ss->work);
return 0;
}
EXPORT_SYMBOL_GPL(sysfs_schedule_callback);

2009-03-25 05:22:30

by Alex Chiang

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

* Kenji Kaneshige <[email protected]>:
> Alex Chiang wrote:
> > * Kenji Kaneshige <[email protected]>:
> >> I still have the following kernel error messages in testing with your
> >> latest set of patches (Jesse's linux-next). The test case is removing
> >> e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
> >> .../remove".
> >>
> >> [ 537.379995] =============================================
> >> [ 537.380124] [ INFO: possible recursive locking detected ]
> >> [ 537.380128] 2.6.29-rc8-kk #1
> >> [ 537.380128] ---------------------------------------------
> >> [ 537.380128] events/4/56 is trying to acquire lock:
> >> [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
> >> [ 537.380128]
> >> [ 537.380128] but task is already holding lock:
> >> [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> >> [ 537.380128]
> >> [ 537.380128] other info that might help us debug this:
> >> [ 537.380128] 3 locks held by events/4/56:
> >> [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> >> [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
> >> [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
> >
> > I still cannot reproduce this lockdep issue, even using your
> > .config with an e1000e device on an x86_64 kernel. :(
> >
> > I tried removing the endpoint, an intermediate bridge device, and
> > the parent bus. I don't know what I'm doing wrong...
> >
>
> I don't know either...
> The reproducibility is 100% on my environment. The steps are
> just boot the system and remove the device.
>
> > Can you please try this patch though, and see if it fixes the
> > warning? It applies on top of my other sysfs patch that
> > introduces a mutex in sysfs_schedule_callback.
>
> Anyway, I confirmed the kernel error messages were gone with
> the patch against sysfs. Note that I used the following patch
> I made for testing instead since your patch could not be
> applied to Jesse's linux-next.

Great, thank you for testing Kenji-san.

/ac

2009-03-25 05:39:24

by Kenji Kaneshige

[permalink] [raw]
Subject: Re: [PATCH v5 09/13] PCI: Introduce /sys/bus/pci/devices/.../remove

Alex Chiang wrote:
> * Kenji Kaneshige <[email protected]>:
>> Alex Chiang wrote:
>>> * Kenji Kaneshige <[email protected]>:
>>>> I still have the following kernel error messages in testing with your
>>>> latest set of patches (Jesse's linux-next). The test case is removing
>>>> e1000e device or its parent bridge by "echo 1 > /sys/bus/pci/devices/
>>>> .../remove".
>>>>
>>>> [ 537.379995] =============================================
>>>> [ 537.380124] [ INFO: possible recursive locking detected ]
>>>> [ 537.380128] 2.6.29-rc8-kk #1
>>>> [ 537.380128] ---------------------------------------------
>>>> [ 537.380128] events/4/56 is trying to acquire lock:
>>>> [ 537.380128] (events){--..}, at: [<ffffffff80257fc0>] flush_workqueue+0x0/0xa0
>>>> [ 537.380128]
>>>> [ 537.380128] but task is already holding lock:
>>>> [ 537.380128] (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>>>> [ 537.380128]
>>>> [ 537.380128] other info that might help us debug this:
>>>> [ 537.380128] 3 locks held by events/4/56:
>>>> [ 537.380128] #0: (events){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>>>> [ 537.380128] #1: (&ss->work){--..}, at: [<ffffffff80257648>] run_workqueue+0x108/0x230
>>>> [ 537.380128] #2: (pci_remove_rescan_mutex){--..}, at: [<ffffffff803c10d1>] remove_callback+0x21/0x40
>>> I still cannot reproduce this lockdep issue, even using your
>>> .config with an e1000e device on an x86_64 kernel. :(
>>>
>>> I tried removing the endpoint, an intermediate bridge device, and
>>> the parent bus. I don't know what I'm doing wrong...
>>>
>> I don't know either...
>> The reproducibility is 100% on my environment. The steps are
>> just boot the system and remove the device.
>>
>>> Can you please try this patch though, and see if it fixes the
>>> warning? It applies on top of my other sysfs patch that
>>> introduces a mutex in sysfs_schedule_callback.
>> Anyway, I confirmed the kernel error messages were gone with
>> the patch against sysfs. Note that I used the following patch
>> I made for testing instead since your patch could not be
>> applied to Jesse's linux-next.
>
> Great, thank you for testing Kenji-san.
>

You're welcome.

Just in case, my patch is just for testing, and it is very buggy
(no destroy operation, lack of module_put() in error code path,
and so on). Please consider it as just for testing.

Thanks,
Kenji Kaneshige