2018-08-17 04:52:03

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH 0/6] pci: Rework is_added race fix and address bridge enable races

This is really two series but since they conflict a bit separately
here they are in one:

First we undo the mess of those atomic priv_flags. The atomicity
doesn't provide any security since there's no locking against
the other state pertaining to those flags, it only protects the
flags themselves.

The is_added mess is fixed much more simply by moving the assignment
of the flag to before we start the driver. This is in line with the
rest of the PCI code: until bound to the device model, we are essentially
assuming a single threaded environment. is_added is a flag that is
logically owned by that part of the PCI code, and thus should be set
and cleared within that "safer" environment.

This removes the horrid relative includes in the powerpc code as well.

The second part aims at fixing the enable/disable/set_master races,
and does so by providing a framework for future device state locking
issues.

It introduces a pci_dev->state_mutex which is used at a lower level
than the device_lock (the device lock isn't suitable, as explained
in the cset comments) and uses it to protect enablement and set_master.




2018-08-17 04:52:29

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.

The new pci state mutex provides a simpler way of addressing
this race and avoids the relative includes added to the powerpc
code.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
arch/powerpc/kernel/pci-common.c | 4 +---
arch/powerpc/platforms/powernv/pci-ioda.c | 3 +--
arch/powerpc/platforms/pseries/setup.c | 3 +--
drivers/pci/bus.c | 6 +++---
drivers/pci/hotplug/acpiphp_glue.c | 2 +-
drivers/pci/pci.h | 11 -----------
drivers/pci/probe.c | 4 ++--
drivers/pci/remove.c | 5 ++---
include/linux/pci.h | 1 +
9 files changed, 12 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 471aac313b89..fe9733ffffaa 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -42,8 +42,6 @@
#include <asm/ppc-pci.h>
#include <asm/eeh.h>

-#include "../../../drivers/pci/pci.h"
-
/* hose_spinlock protects accesses to the the phb_bitmap. */
static DEFINE_SPINLOCK(hose_spinlock);
LIST_HEAD(hose_list);
@@ -1016,7 +1014,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
/* Cardbus can call us to add new devices to a bus, so ignore
* those who are already fully discovered
*/
- if (pci_dev_is_added(dev))
+ if (dev->is_added)
continue;

pcibios_setup_device(dev);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 70b2e1e0f23c..5bd0eb6681bc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -46,7 +46,6 @@

#include "powernv.h"
#include "pci.h"
-#include "../../../../drivers/pci/pci.h"

#define PNV_IODA1_M64_NUM 16 /* Number of M64 BARs */
#define PNV_IODA1_M64_SEGS 8 /* Segments per M64 BAR */
@@ -3139,7 +3138,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
struct pci_dn *pdn;
int mul, total_vfs;

- if (!pdev->is_physfn || pci_dev_is_added(pdev))
+ if (!pdev->is_physfn || pdev->is_added)
return;

pdn = pci_get_pdn(pdev);
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index 8a4868a3964b..139f0af6c3d9 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -71,7 +71,6 @@
#include <asm/security_features.h>

#include "pseries.h"
-#include "../../../../drivers/pci/pci.h"

int CMO_PrPSP = -1;
int CMO_SecPSP = -1;
@@ -665,7 +664,7 @@ static void pseries_pci_fixup_iov_resources(struct pci_dev *pdev)
const int *indexes;
struct device_node *dn = pci_device_to_OF_node(pdev);

- if (!pdev->is_physfn || pci_dev_is_added(pdev))
+ if (!pdev->is_physfn || pdev->is_added)
return;
/*Firmware must support open sriov otherwise dont configure*/
indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 5cb40b2518f9..35b7fc87eac5 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -330,7 +330,7 @@ void pci_bus_add_device(struct pci_dev *dev)
return;
}

- pci_dev_assign_added(dev, true);
+ dev->is_added = 1;
}
EXPORT_SYMBOL_GPL(pci_bus_add_device);

@@ -347,14 +347,14 @@ void pci_bus_add_devices(const struct pci_bus *bus)

list_for_each_entry(dev, &bus->devices, bus_list) {
/* Skip already-added devices */
- if (pci_dev_is_added(dev))
+ if (dev->is_added)
continue;
pci_bus_add_device(dev);
}

list_for_each_entry(dev, &bus->devices, bus_list) {
/* Skip if device attach failed */
- if (!pci_dev_is_added(dev))
+ if (!dev->is_added)
continue;
child = dev->subordinate;
if (child)
diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
index ef0b1b6ba86f..3a17b290df5d 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -509,7 +509,7 @@ static void enable_slot(struct acpiphp_slot *slot)

list_for_each_entry(dev, &bus->devices, bus_list) {
/* Assume that newly added devices are powered on already. */
- if (!pci_dev_is_added(dev))
+ if (!dev->is_added)
dev->current_state = PCI_D0;
}

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6e0d1528d471..473aa10a5dbf 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -295,7 +295,6 @@ struct pci_sriov {

/* pci_dev priv_flags */
#define PCI_DEV_DISCONNECTED 0
-#define PCI_DEV_ADDED 1

static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
{
@@ -308,16 +307,6 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
}

-static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
-{
- assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
-}
-
-static inline bool pci_dev_is_added(const struct pci_dev *dev)
-{
- return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
-}
-
#ifdef CONFIG_PCIEAER
#include <linux/aer.h>

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index ec784009a36b..440445ac7dfa 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2525,13 +2525,13 @@ int pci_scan_slot(struct pci_bus *bus, int devfn)
dev = pci_scan_single_device(bus, devfn);
if (!dev)
return 0;
- if (!pci_dev_is_added(dev))
+ if (!dev->is_added)
nr++;

for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn)) {
dev = pci_scan_single_device(bus, devfn + fn);
if (dev) {
- if (!pci_dev_is_added(dev))
+ if (!dev->is_added)
nr++;
dev->multifunction = 1;
}
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 461e7fd2756f..01ec7fcb5634 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -18,12 +18,11 @@ static void pci_stop_dev(struct pci_dev *dev)
{
pci_pme_active(dev, false);

- if (pci_dev_is_added(dev)) {
+ if (dev->is_added) {
device_release_driver(&dev->dev);
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
-
- pci_dev_assign_added(dev, false);
+ dev->is_added = 0;
}

if (dev->bus->self)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9b87f1936906..9799109c5e25 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -373,6 +373,7 @@ struct pci_dev {
unsigned int transparent:1; /* Subtractive decode bridge */
unsigned int multifunction:1; /* Multi-function device */

+ unsigned int is_added:1;
unsigned int is_busmaster:1; /* Is busmaster */
unsigned int no_msi:1; /* May not use MSI */
unsigned int no_64bit_msi:1; /* May only use 32-bit MSIs */
--
2.17.1


2018-08-17 04:59:09

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Fri, 2018-08-17 at 14:48 +1000, Benjamin Herrenschmidt wrote:
> This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
>
> The new pci state mutex provides a simpler way of addressing
> this race and avoids the relative includes added to the powerpc
> code.

Ignore the cset comment, my "fix" no longer relies on a state mutex,
I'll re-post with a better comment after discussions.

The actual fix is in patch 2:

[RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

(and yes that was supposed be device_attach ... ugh, not enough
caffeine today).

> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
> ---
> arch/powerpc/kernel/pci-common.c | 4 +---
> arch/powerpc/platforms/powernv/pci-ioda.c | 3 +--
> arch/powerpc/platforms/pseries/setup.c | 3 +--
> drivers/pci/bus.c | 6 +++---
> drivers/pci/hotplug/acpiphp_glue.c | 2 +-
> drivers/pci/pci.h | 11 -----------
> drivers/pci/probe.c | 4 ++--
> drivers/pci/remove.c | 5 ++---
> include/linux/pci.h | 1 +
> 9 files changed, 12 insertions(+), 27 deletions(-)
>
> diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
> index 471aac313b89..fe9733ffffaa 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -42,8 +42,6 @@
> #include <asm/ppc-pci.h>
> #include <asm/eeh.h>
>
> -#include "../../../drivers/pci/pci.h"
> -
> /* hose_spinlock protects accesses to the the phb_bitmap. */
> static DEFINE_SPINLOCK(hose_spinlock);
> LIST_HEAD(hose_list);
> @@ -1016,7 +1014,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
> /* Cardbus can call us to add new devices to a bus, so ignore
> * those who are already fully discovered
> */
> - if (pci_dev_is_added(dev))
> + if (dev->is_added)
> continue;
>
> pcibios_setup_device(dev);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 70b2e1e0f23c..5bd0eb6681bc 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -46,7 +46,6 @@
>
> #include "powernv.h"
> #include "pci.h"
> -#include "../../../../drivers/pci/pci.h"
>
> #define PNV_IODA1_M64_NUM 16 /* Number of M64 BARs */
> #define PNV_IODA1_M64_SEGS 8 /* Segments per M64 BAR */
> @@ -3139,7 +3138,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
> struct pci_dn *pdn;
> int mul, total_vfs;
>
> - if (!pdev->is_physfn || pci_dev_is_added(pdev))
> + if (!pdev->is_physfn || pdev->is_added)
> return;
>
> pdn = pci_get_pdn(pdev);
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index 8a4868a3964b..139f0af6c3d9 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -71,7 +71,6 @@
> #include <asm/security_features.h>
>
> #include "pseries.h"
> -#include "../../../../drivers/pci/pci.h"
>
> int CMO_PrPSP = -1;
> int CMO_SecPSP = -1;
> @@ -665,7 +664,7 @@ static void pseries_pci_fixup_iov_resources(struct pci_dev *pdev)
> const int *indexes;
> struct device_node *dn = pci_device_to_OF_node(pdev);
>
> - if (!pdev->is_physfn || pci_dev_is_added(pdev))
> + if (!pdev->is_physfn || pdev->is_added)
> return;
> /*Firmware must support open sriov otherwise dont configure*/
> indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 5cb40b2518f9..35b7fc87eac5 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -330,7 +330,7 @@ void pci_bus_add_device(struct pci_dev *dev)
> return;
> }
>
> - pci_dev_assign_added(dev, true);
> + dev->is_added = 1;
> }
> EXPORT_SYMBOL_GPL(pci_bus_add_device);
>
> @@ -347,14 +347,14 @@ void pci_bus_add_devices(const struct pci_bus *bus)
>
> list_for_each_entry(dev, &bus->devices, bus_list) {
> /* Skip already-added devices */
> - if (pci_dev_is_added(dev))
> + if (dev->is_added)
> continue;
> pci_bus_add_device(dev);
> }
>
> list_for_each_entry(dev, &bus->devices, bus_list) {
> /* Skip if device attach failed */
> - if (!pci_dev_is_added(dev))
> + if (!dev->is_added)
> continue;
> child = dev->subordinate;
> if (child)
> diff --git a/drivers/pci/hotplug/acpiphp_glue.c b/drivers/pci/hotplug/acpiphp_glue.c
> index ef0b1b6ba86f..3a17b290df5d 100644
> --- a/drivers/pci/hotplug/acpiphp_glue.c
> +++ b/drivers/pci/hotplug/acpiphp_glue.c
> @@ -509,7 +509,7 @@ static void enable_slot(struct acpiphp_slot *slot)
>
> list_for_each_entry(dev, &bus->devices, bus_list) {
> /* Assume that newly added devices are powered on already. */
> - if (!pci_dev_is_added(dev))
> + if (!dev->is_added)
> dev->current_state = PCI_D0;
> }
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 6e0d1528d471..473aa10a5dbf 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -295,7 +295,6 @@ struct pci_sriov {
>
> /* pci_dev priv_flags */
> #define PCI_DEV_DISCONNECTED 0
> -#define PCI_DEV_ADDED 1
>
> static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
> {
> @@ -308,16 +307,6 @@ static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
> return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
> }
>
> -static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
> -{
> - assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
> -}
> -
> -static inline bool pci_dev_is_added(const struct pci_dev *dev)
> -{
> - return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
> -}
> -
> #ifdef CONFIG_PCIEAER
> #include <linux/aer.h>
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index ec784009a36b..440445ac7dfa 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2525,13 +2525,13 @@ int pci_scan_slot(struct pci_bus *bus, int devfn)
> dev = pci_scan_single_device(bus, devfn);
> if (!dev)
> return 0;
> - if (!pci_dev_is_added(dev))
> + if (!dev->is_added)
> nr++;
>
> for (fn = next_fn(bus, dev, 0); fn > 0; fn = next_fn(bus, dev, fn)) {
> dev = pci_scan_single_device(bus, devfn + fn);
> if (dev) {
> - if (!pci_dev_is_added(dev))
> + if (!dev->is_added)
> nr++;
> dev->multifunction = 1;
> }
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index 461e7fd2756f..01ec7fcb5634 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -18,12 +18,11 @@ static void pci_stop_dev(struct pci_dev *dev)
> {
> pci_pme_active(dev, false);
>
> - if (pci_dev_is_added(dev)) {
> + if (dev->is_added) {
> device_release_driver(&dev->dev);
> pci_proc_detach_device(dev);
> pci_remove_sysfs_dev_files(dev);
> -
> - pci_dev_assign_added(dev, false);
> + dev->is_added = 0;
> }
>
> if (dev->bus->self)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 9b87f1936906..9799109c5e25 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -373,6 +373,7 @@ struct pci_dev {
> unsigned int transparent:1; /* Subtractive decode bridge */
> unsigned int multifunction:1; /* Multi-function device */
>
> + unsigned int is_added:1;
> unsigned int is_busmaster:1; /* Is busmaster */
> unsigned int no_msi:1; /* May not use MSI */
> unsigned int no_64bit_msi:1; /* May only use 32-bit MSIs */


2018-08-17 04:59:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

s/device_add/device_attach .. ugh.

On Fri, 2018-08-17 at 14:48 +1000, Benjamin Herrenschmidt wrote:
> This re-fixes the bug reported by Hari Vyas <[email protected]>
> after my revert of his commit but in a much simpler way.
>
> The main issues is that is_added was being set after the driver
> got bound and started, and thus setting it could race with other
> changes to struct pci_dev.
>
> This fixes it by setting the flag first, which also has the
> advantage of matching the fact that we are clearing it *after*
> unbinding in the remove path, thus the flag is now symtetric
> and always set while the driver code is running.
>
> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
> ---
> drivers/pci/bus.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 35b7fc87eac5..48ae63673aa8 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -321,16 +321,16 @@ void pci_bus_add_device(struct pci_dev *dev)
> pci_proc_attach_device(dev);
> pci_bridge_d3_update(dev);
>
> + dev->is_added = 1;
> dev->match_driver = true;
> retval = device_attach(&dev->dev);
> if (retval < 0 && retval != -EPROBE_DEFER) {
> + dev->is_added = 0;
> pci_warn(dev, "device attach failed (%d)\n", retval);
> pci_proc_detach_device(dev);
> pci_remove_sysfs_dev_files(dev);
> return;
> }
> -
> - dev->is_added = 1;
> }
> EXPORT_SYMBOL_GPL(pci_bus_add_device);
>


2018-08-17 05:05:27

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] pci: Rework is_added race fix and address bridge enable races

On Fri, 2018-08-17 at 14:48 +1000, Benjamin Herrenschmidt wrote:
> The second part aims at fixing the enable/disable/set_master races,
> and does so by providing a framework for future device state locking
> issues.
>
> It introduces a pci_dev->state_mutex which is used at a lower level
> than the device_lock (the device lock isn't suitable, as explained
> in the cset comments) and uses it to protect enablement and set_master.

As discussed in the series, I'm not using the device_lock because I
don't like it creeping out of drivers/base too much unless you
explicitly try to lock against concurrent add/remove.

That being said, if we decided we prefer using it to solve the
enable/disable race, then we have to be careful of a few things:

- Driver callbacks hold it, so we can't take it from within
pci_enable_device(), pci_set_master() etc... themselves. We'll have to
assume the caller has it

- The above means auditing callers of these various APIs that might be
calling them from outside of a driver callback and add the necessary
lock

- We need to take great care about the possibility of the parent
device(s) lock being held. It can happen for USB for example. Now we
don't have USB->PCI adapters that I know of that uses the PCI stack in
Linux but generally assuming your parent lock isn't held is risky. So I
would be a bit weary of walking up the bridge chain and taking it
unconditionally.

Alternatively, we could hijack an existing global lock, for example
mvoe the bridge_mutex outside of ACPI and use it for the upwards bridge
walk, and ignore the other possible races with enable/disable.

Cheers,
Ben.




2018-08-17 05:07:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

This protects enable/disable operations using the state mutex to
avoid races with, for example, concurrent enables on a bridge.

The bus hierarchy is walked first before taking the lock to
avoid lock nesting (though it would be ok if we ensured that
we always nest them bottom-up, it is better to just avoid the
issue alltogether, especially as we might find cases where
we want to take it top-down later).

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
drivers/pci/pci.c | 51 ++++++++++++++++++++++++++++++++---------------
1 file changed, 35 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 62591c153999..68152de2b5a0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1540,26 +1540,33 @@ static int do_pci_enable_device(struct pci_dev *dev, int bars)
*/
int pci_reenable_device(struct pci_dev *dev)
{
+ int rc = 0;
+
+ pci_dev_state_lock(dev);
if (pci_is_enabled(dev))
- return do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
- return 0;
+ rc = do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
+ pci_dev_state_unlock(dev);
+ return rc;
}
EXPORT_SYMBOL(pci_reenable_device);

static void pci_enable_bridge(struct pci_dev *dev)
{
struct pci_dev *bridge;
- int retval;
+ int retval, enabled;

bridge = pci_upstream_bridge(dev);
if (bridge)
pci_enable_bridge(bridge);

- if (pci_is_enabled(dev)) {
- if (!dev->is_busmaster)
- pci_set_master(dev);
+ /* Already enabled ? */
+ pci_dev_state_lock(dev);
+ enabled = pci_is_enabled(dev);
+ if (enabled && !dev->is_busmaster)
+ pci_set_master(dev);
+ pci_dev_state_unlock(dev);
+ if (enabled)
return;
- }

retval = pci_enable_device(dev);
if (retval)
@@ -1571,9 +1578,16 @@ static void pci_enable_bridge(struct pci_dev *dev)
static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
{
struct pci_dev *bridge;
- int err;
+ int err = 0;
int i, bars = 0;

+ /* Handle upstream bridges first to avoid locking issues */
+ bridge = pci_upstream_bridge(dev);
+ if (bridge)
+ pci_enable_bridge(bridge);
+
+ pci_dev_state_lock(dev);
+
/*
* Power state could be unknown at this point, either due to a fresh
* boot or a device removal call. So get the current power state
@@ -1586,12 +1600,9 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
}

+ /* Already enabled ? */
if (atomic_inc_return(&dev->enable_cnt) > 1)
- return 0; /* already enabled */
-
- bridge = pci_upstream_bridge(dev);
- if (bridge)
- pci_enable_bridge(bridge);
+ goto bail;

/* only skip sriov related */
for (i = 0; i <= PCI_ROM_RESOURCE; i++)
@@ -1604,6 +1615,8 @@ static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
err = do_pci_enable_device(dev, bars);
if (err < 0)
atomic_dec(&dev->enable_cnt);
+ bail:
+ pci_dev_state_unlock(dev);
return err;
}

@@ -1820,12 +1833,16 @@ static void do_pci_disable_device(struct pci_dev *dev)
* @dev: PCI device to disable
*
* NOTE: This function is a backend of PCI power management routines and is
- * not supposed to be called drivers.
+ * not supposed to be called drivers. It will keep enable_cnt and is_busmaster
+ * unmodified so that the resume code knows how to restore the corresponding
+ * command register bits.
*/
void pci_disable_enabled_device(struct pci_dev *dev)
{
+ pci_dev_state_lock(dev);
if (pci_is_enabled(dev))
do_pci_disable_device(dev);
+ pci_dev_state_unlock(dev);
}

/**
@@ -1849,12 +1866,14 @@ void pci_disable_device(struct pci_dev *dev)
dev_WARN_ONCE(&dev->dev, atomic_read(&dev->enable_cnt) <= 0,
"disabling already-disabled device");

+ pci_dev_state_lock(dev);
if (atomic_dec_return(&dev->enable_cnt) != 0)
- return;
+ goto bail;

do_pci_disable_device(dev);
-
dev->is_busmaster = 0;
+ bail:
+ pci_dev_state_unlock(dev);
}
EXPORT_SYMBOL(pci_disable_device);

--
2.17.1


2018-08-17 05:13:07

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

This re-fixes the bug reported by Hari Vyas <[email protected]>
after my revert of his commit but in a much simpler way.

The main issues is that is_added was being set after the driver
got bound and started, and thus setting it could race with other
changes to struct pci_dev.

This fixes it by setting the flag first, which also has the
advantage of matching the fact that we are clearing it *after*
unbinding in the remove path, thus the flag is now symtetric
and always set while the driver code is running.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
drivers/pci/bus.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 35b7fc87eac5..48ae63673aa8 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -321,16 +321,16 @@ void pci_bus_add_device(struct pci_dev *dev)
pci_proc_attach_device(dev);
pci_bridge_d3_update(dev);

+ dev->is_added = 1;
dev->match_driver = true;
retval = device_attach(&dev->dev);
if (retval < 0 && retval != -EPROBE_DEFER) {
+ dev->is_added = 0;
pci_warn(dev, "device attach failed (%d)\n", retval);
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
return;
}
-
- dev->is_added = 1;
}
EXPORT_SYMBOL_GPL(pci_bus_add_device);

--
2.17.1


2018-08-17 05:15:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH v2 3/6] pci: Remove priv_flags and use dev->error_state for "disconnected" status

This already represents whether a device is accessible or not,
creating a new flag isn't particularly helpful.

dev->error_state being an int, assigning it doesn't require
an atomic operation per-se. The existing atomic bitop only
protects the field, not anything else anyway.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---

v2. Initialize the io channel state before we try to read the header

drivers/pci/access.c | 12 ++++++------
drivers/pci/msi.c | 4 ++--
drivers/pci/pci.c | 2 +-
drivers/pci/pci.h | 11 ++---------
drivers/pci/probe.c | 4 +++-
include/linux/pci.h | 2 --
6 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index a3ad2fe185b9..ad97a3544636 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -535,7 +535,7 @@ EXPORT_SYMBOL(pcie_capability_clear_and_set_dword);

int pci_read_config_byte(const struct pci_dev *dev, int where, u8 *val)
{
- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -545,7 +545,7 @@ EXPORT_SYMBOL(pci_read_config_byte);

int pci_read_config_word(const struct pci_dev *dev, int where, u16 *val)
{
- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -556,7 +556,7 @@ EXPORT_SYMBOL(pci_read_config_word);
int pci_read_config_dword(const struct pci_dev *dev, int where,
u32 *val)
{
- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -566,7 +566,7 @@ EXPORT_SYMBOL(pci_read_config_dword);

int pci_write_config_byte(const struct pci_dev *dev, int where, u8 val)
{
- if (pci_dev_is_disconnected(dev))
+ if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_byte(dev->bus, dev->devfn, where, val);
}
@@ -574,7 +574,7 @@ EXPORT_SYMBOL(pci_write_config_byte);

int pci_write_config_word(const struct pci_dev *dev, int where, u16 val)
{
- if (pci_dev_is_disconnected(dev))
+ if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_word(dev->bus, dev->devfn, where, val);
}
@@ -583,7 +583,7 @@ EXPORT_SYMBOL(pci_write_config_word);
int pci_write_config_dword(const struct pci_dev *dev, int where,
u32 val)
{
- if (pci_dev_is_disconnected(dev))
+ if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
}
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..2f872db40042 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -298,7 +298,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
struct pci_dev *dev = msi_desc_to_pci_dev(entry);

- if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
+ if (dev->current_state != PCI_D0 || pci_channel_offline(dev)) {
/* Don't touch the hardware now */
} else if (entry->msi_attrib.is_msix) {
void __iomem *base = pci_msix_desc_addr(entry);
@@ -975,7 +975,7 @@ static void pci_msix_shutdown(struct pci_dev *dev)
if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;

- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
dev->msix_enabled = 0;
return;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..62591c153999 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5713,7 +5713,7 @@ bool pci_device_is_present(struct pci_dev *pdev)
{
u32 v;

- if (pci_dev_is_disconnected(pdev))
+ if (pci_channel_offline(pdev))
return false;
return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 473aa10a5dbf..f7704333e6a1 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -293,18 +293,11 @@ struct pci_sriov {
bool drivers_autoprobe; /* Auto probing of VFs by driver */
};

-/* pci_dev priv_flags */
-#define PCI_DEV_DISCONNECTED 0
-
static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
{
- set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
- return 0;
-}
+ dev->error_state = pci_channel_io_perm_failure;

-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
- return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
+ return 0;
}

#ifdef CONFIG_PCIEAER
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 440445ac7dfa..5cf982b3031b 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1581,6 +1581,9 @@ int pci_setup_device(struct pci_dev *dev)
struct pci_bus_region region;
struct resource *res;

+ /* This must be set before any config space access */
+ dev->error_state = pci_channel_io_normal;
+
hdr_type = pci_hdr_type(dev);

dev->sysdata = dev->bus->sysdata;
@@ -1588,7 +1591,6 @@ int pci_setup_device(struct pci_dev *dev)
dev->dev.bus = &pci_bus_type;
dev->hdr_type = hdr_type & 0x7f;
dev->multifunction = !!(hdr_type & 0x80);
- dev->error_state = pci_channel_io_normal;
set_pcie_port_type(dev);

pci_dev_assign_slot(dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9799109c5e25..f58bda204f09 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -443,8 +443,6 @@ struct pci_dev {
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */
char *driver_override; /* Driver name to force a match */
-
- unsigned long priv_flags; /* Private flags for the PCI driver */
};

static inline struct pci_dev *pci_physfn(struct pci_dev *dev)



2018-08-17 05:25:29

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH 3/6] pci: Remove priv_flags and use dev->error_state for "disconnected" status

This already represents whether a device is accessible or not,
creating a new flag isn't particularly helpful.

dev->error_state being an int, assigning it doesn't require
an atomic operation per-se. The existing atomic bitop only
protects the field, not anything else anyway.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
drivers/pci/access.c | 12 ++++++------
drivers/pci/msi.c | 4 ++--
drivers/pci/pci.c | 2 +-
drivers/pci/pci.h | 11 ++---------
include/linux/pci.h | 2 --
5 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index a3ad2fe185b9..ad97a3544636 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -535,7 +535,7 @@ EXPORT_SYMBOL(pcie_capability_clear_and_set_dword);

int pci_read_config_byte(const struct pci_dev *dev, int where, u8 *val)
{
- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -545,7 +545,7 @@ EXPORT_SYMBOL(pci_read_config_byte);

int pci_read_config_word(const struct pci_dev *dev, int where, u16 *val)
{
- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -556,7 +556,7 @@ EXPORT_SYMBOL(pci_read_config_word);
int pci_read_config_dword(const struct pci_dev *dev, int where,
u32 *val)
{
- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -566,7 +566,7 @@ EXPORT_SYMBOL(pci_read_config_dword);

int pci_write_config_byte(const struct pci_dev *dev, int where, u8 val)
{
- if (pci_dev_is_disconnected(dev))
+ if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_byte(dev->bus, dev->devfn, where, val);
}
@@ -574,7 +574,7 @@ EXPORT_SYMBOL(pci_write_config_byte);

int pci_write_config_word(const struct pci_dev *dev, int where, u16 val)
{
- if (pci_dev_is_disconnected(dev))
+ if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_word(dev->bus, dev->devfn, where, val);
}
@@ -583,7 +583,7 @@ EXPORT_SYMBOL(pci_write_config_word);
int pci_write_config_dword(const struct pci_dev *dev, int where,
u32 val)
{
- if (pci_dev_is_disconnected(dev))
+ if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
}
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..2f872db40042 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -298,7 +298,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct msi_msg *msg)
{
struct pci_dev *dev = msi_desc_to_pci_dev(entry);

- if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
+ if (dev->current_state != PCI_D0 || pci_channel_offline(dev)) {
/* Don't touch the hardware now */
} else if (entry->msi_attrib.is_msix) {
void __iomem *base = pci_msix_desc_addr(entry);
@@ -975,7 +975,7 @@ static void pci_msix_shutdown(struct pci_dev *dev)
if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;

- if (pci_dev_is_disconnected(dev)) {
+ if (pci_channel_offline(dev)) {
dev->msix_enabled = 0;
return;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..62591c153999 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5713,7 +5713,7 @@ bool pci_device_is_present(struct pci_dev *pdev)
{
u32 v;

- if (pci_dev_is_disconnected(pdev))
+ if (pci_channel_offline(pdev))
return false;
return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, &v, 0);
}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 473aa10a5dbf..f7704333e6a1 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -293,18 +293,11 @@ struct pci_sriov {
bool drivers_autoprobe; /* Auto probing of VFs by driver */
};

-/* pci_dev priv_flags */
-#define PCI_DEV_DISCONNECTED 0
-
static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
{
- set_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
- return 0;
-}
+ dev->error_state = pci_channel_io_perm_failure;

-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
- return test_bit(PCI_DEV_DISCONNECTED, &dev->priv_flags);
+ return 0;
}

#ifdef CONFIG_PCIEAER
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9799109c5e25..f58bda204f09 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -443,8 +443,6 @@ struct pci_dev {
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */
char *driver_override; /* Driver name to force a match */
-
- unsigned long priv_flags; /* Private flags for the PCI driver */
};

static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
--
2.17.1


2018-08-17 05:26:24

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH 6/6] pci: Protect is_busmaster using the state lock

This wraps pci_set_master() and pci_clear_master() with the pci_dev
state lock. The clearing of is_busmaster in pci_disable_device()
is already covered.

This also adds a comment explaining why is_busmaster must not be
checked in pci_set_master() due to how the power management code
uses it.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
drivers/pci/pci.c | 9 +++++++++
include/linux/pci.h | 2 +-
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 68152de2b5a0..13d988d5b2a3 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4052,11 +4052,18 @@ void __weak pcibios_set_master(struct pci_dev *dev)
*
* Enables bus-mastering on the device and calls pcibios_set_master()
* to do the needed arch specific settings.
+ *
+ * Note: This must not check dev->is_busmaster because the power management
+ * code will call this in order to restore the config space to the
+ * state of is_busmaster, thus is_busmaster might be set but the
+ * config space bit cleared.
*/
void pci_set_master(struct pci_dev *dev)
{
+ pci_dev_state_lock(dev);
__pci_set_master(dev, true);
pcibios_set_master(dev);
+ pci_dev_state_unlock(dev);
}
EXPORT_SYMBOL(pci_set_master);

@@ -4066,7 +4073,9 @@ EXPORT_SYMBOL(pci_set_master);
*/
void pci_clear_master(struct pci_dev *dev)
{
+ pci_dev_state_lock(dev);
__pci_set_master(dev, false);
+ pci_dev_state_unlock(dev);
}
EXPORT_SYMBOL(pci_clear_master);

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0d4fc22df190..a5bac5b21454 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -375,7 +375,6 @@ struct pci_dev {
unsigned int multifunction:1; /* Multi-function device */

unsigned int is_added:1;
- unsigned int is_busmaster:1; /* Is busmaster */
unsigned int no_msi:1; /* May not use MSI */
unsigned int no_64bit_msi:1; /* May only use 32-bit MSIs */
unsigned int block_cfg_access:1; /* Config space access blocked */
@@ -450,6 +449,7 @@ struct pci_dev {
struct mutex state_lock; /* Protect local state bits */

/* --- Fields below this line are protected by the state_lock mutex */
+ unsigned int is_busmaster:1; /* Is busmaster */
};

static inline void pci_dev_state_lock(struct pci_dev *dev)
--
2.17.1


2018-08-17 05:51:20

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: [RFC PATCH 4/6] pci: Add a mutex to pci_dev to protect device state

This adds a pci_dev mutex which will be used to fix a number
of races related to the various state bits maintained in
that structure.

We call it "state_lock" with similarly named accessors to avoid
confusion with the pci_dev_lock() which uses the device_lock().

This is a low level mutex meant to protect the mapping between
the state fields and the hardware state, for example enabling
disabling, setting/clearing bus master etc...

These operations can happen while the device_lock() is already
held (but don't have to) so a separate mutex is preferable.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---
drivers/pci/probe.c | 1 +
include/linux/pci.h | 17 +++++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 440445ac7dfa..3ce287ab6150 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2155,6 +2155,7 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
INIT_LIST_HEAD(&dev->bus_list);
dev->dev.type = &pci_dev_type;
dev->bus = pci_bus_get(bus);
+ mutex_init(&dev->state_lock);

return dev;
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f58bda204f09..0d4fc22df190 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -33,6 +33,7 @@
#include <linux/io.h>
#include <linux/resource_ext.h>
#include <uapi/linux/pci.h>
+#include <linux/mutex.h>

#include <linux/pci_ids.h>

@@ -443,8 +444,24 @@ struct pci_dev {
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */
char *driver_override; /* Driver name to force a match */
+
+ unsigned long priv_flags; /* Private flags for the PCI driver */
+
+ struct mutex state_lock; /* Protect local state bits */
+
+ /* --- Fields below this line are protected by the state_lock mutex */
};

+static inline void pci_dev_state_lock(struct pci_dev *dev)
+{
+ mutex_lock(&dev->state_lock);
+}
+
+static inline void pci_dev_state_unlock(struct pci_dev *dev)
+{
+ mutex_unlock(&dev->state_lock);
+}
+
static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
{
#ifdef CONFIG_PCI_IOV
--
2.17.1


2018-08-17 08:32:07

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

On Fri, 2018-08-17 at 10:09 +0200, Marta Rybczynska wrote:
>
> ----- On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt [email protected] wrote:
>
> > This protects enable/disable operations using the state mutex to
> > avoid races with, for example, concurrent enables on a bridge.
> >
> > The bus hierarchy is walked first before taking the lock to
> > avoid lock nesting (though it would be ok if we ensured that
> > we always nest them bottom-up, it is better to just avoid the
> > issue alltogether, especially as we might find cases where
> > we want to take it top-down later).
> >
> > Signed-off-by: Benjamin Herrenschmidt <[email protected]>
>
>
> >
> > static void pci_enable_bridge(struct pci_dev *dev)
> > {
> > struct pci_dev *bridge;
> > - int retval;
> > + int retval, enabled;
> >
> > bridge = pci_upstream_bridge(dev);
> > if (bridge)
> > pci_enable_bridge(bridge);
> >
> > - if (pci_is_enabled(dev)) {
> > - if (!dev->is_busmaster)
> > - pci_set_master(dev);
> > + /* Already enabled ? */
> > + pci_dev_state_lock(dev);
> > + enabled = pci_is_enabled(dev);
> > + if (enabled && !dev->is_busmaster)
> > + pci_set_master(dev);
> > + pci_dev_state_unlock(dev);
> > + if (enabled)
> > return;
> > - }
> >
>
> This looks complicated too me especially with the double locking. What do you
> think about a function doing that check that used an unlocked version of
> pcie_set_master?
>
> Like:
>
> dev_state_lock(dev);
> enabled = pci_is_enabled(dev);
> if (enabled && !dev->is_busmaster)
> pci_set_master_unlocked();
> pci_dev_state_unlock(dev);
>
> BTW If I remember correctly the code today can set bus mastering multiple
> times without checking if already done.

I don't mind but I tend to dislike all those _unlocked() suffixes, I
suppose my generation is typing adverse enough that we call these
__something instead :)

As for setting multiple times, yes pci_set_master() doesn't check but
look at the "-" hunks of my patch, I'm not changing the existing test
here. Not that it matters much, it's an optimization.

In fact my original version just completely removed the whole lot
and just unconditionally did pci_enable_device() + pci_set_master(),
just ignoring the existing state :-)

But I decided to keep the patch functionally equivalent so I added it
back.

Cheers,
Ben.



2018-08-17 08:38:23

by Marta Rybczynska

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex



----- On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt [email protected] wrote:

> This protects enable/disable operations using the state mutex to
> avoid races with, for example, concurrent enables on a bridge.
>
> The bus hierarchy is walked first before taking the lock to
> avoid lock nesting (though it would be ok if we ensured that
> we always nest them bottom-up, it is better to just avoid the
> issue alltogether, especially as we might find cases where
> we want to take it top-down later).
>
> Signed-off-by: Benjamin Herrenschmidt <[email protected]>


>
> static void pci_enable_bridge(struct pci_dev *dev)
> {
> struct pci_dev *bridge;
> - int retval;
> + int retval, enabled;
>
> bridge = pci_upstream_bridge(dev);
> if (bridge)
> pci_enable_bridge(bridge);
>
> - if (pci_is_enabled(dev)) {
> - if (!dev->is_busmaster)
> - pci_set_master(dev);
> + /* Already enabled ? */
> + pci_dev_state_lock(dev);
> + enabled = pci_is_enabled(dev);
> + if (enabled && !dev->is_busmaster)
> + pci_set_master(dev);
> + pci_dev_state_unlock(dev);
> + if (enabled)
> return;
> - }
>

This looks complicated too me especially with the double locking. What do you
think about a function doing that check that used an unlocked version of
pcie_set_master?

Like:

dev_state_lock(dev);
enabled = pci_is_enabled(dev);
if (enabled && !dev->is_busmaster)
pci_set_master_unlocked();
pci_dev_state_unlock(dev);

BTW If I remember correctly the code today can set bus mastering multiple
times without checking if already done.

Marta

2018-08-17 09:02:08

by Hari Vyas

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

On Fri, Aug 17, 2018 at 2:00 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Fri, 2018-08-17 at 10:09 +0200, Marta Rybczynska wrote:
>>
>> ----- On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt [email protected] wrote:
>>
>> > This protects enable/disable operations using the state mutex to
>> > avoid races with, for example, concurrent enables on a bridge.
>> >
>> > The bus hierarchy is walked first before taking the lock to
>> > avoid lock nesting (though it would be ok if we ensured that
>> > we always nest them bottom-up, it is better to just avoid the
>> > issue alltogether, especially as we might find cases where
>> > we want to take it top-down later).
>> >
>> > Signed-off-by: Benjamin Herrenschmidt <[email protected]>
>>
>>
>> >
>> > static void pci_enable_bridge(struct pci_dev *dev)
>> > {
>> > struct pci_dev *bridge;
>> > - int retval;
>> > + int retval, enabled;
>> >
>> > bridge = pci_upstream_bridge(dev);
>> > if (bridge)
>> > pci_enable_bridge(bridge);
>> >
>> > - if (pci_is_enabled(dev)) {
>> > - if (!dev->is_busmaster)
>> > - pci_set_master(dev);
>> > + /* Already enabled ? */
>> > + pci_dev_state_lock(dev);
>> > + enabled = pci_is_enabled(dev);
>> > + if (enabled && !dev->is_busmaster)
>> > + pci_set_master(dev);
>> > + pci_dev_state_unlock(dev);
>> > + if (enabled)
>> > return;
>> > - }
>> >
>>
>> This looks complicated too me especially with the double locking. What do you
>> think about a function doing that check that used an unlocked version of
>> pcie_set_master?
>>
>> Like:
>>
>> dev_state_lock(dev);
>> enabled = pci_is_enabled(dev);
>> if (enabled && !dev->is_busmaster)
>> pci_set_master_unlocked();
>> pci_dev_state_unlock(dev);
>>
>> BTW If I remember correctly the code today can set bus mastering multiple
>> times without checking if already done.
>
> I don't mind but I tend to dislike all those _unlocked() suffixes, I
> suppose my generation is typing adverse enough that we call these
> __something instead :)
>
> As for setting multiple times, yes pci_set_master() doesn't check but
> look at the "-" hunks of my patch, I'm not changing the existing test
> here. Not that it matters much, it's an optimization.
>
> In fact my original version just completely removed the whole lot
> and just unconditionally did pci_enable_device() + pci_set_master(),
> just ignoring the existing state :-)
>
> But I decided to keep the patch functionally equivalent so I added it
> back.
>
> Cheers,
> Ben.
>
>
So many mail threads for common issues going so just trying to
summarize concern from my side.
1) HW level
PCI Config data overwritten (ex: PCI_COMMAND bits etc) was happening
at lower layer so from my
perspective, it is best to handle concern at lower level with locking
mechanism and that is what
I proposed in my upcoming patch. Without that, I guess, we may end up
in issues later with some weird scenario
which may not be known today and we will again be fine tuning stuff
here and there.
2) SW level(internal data structure):
About is_added,is_busmaster: These all are bit fields so infact I too
suggested to remove those bit fields and
make separate variables in pci_dev structure. This will avoid all
data-overwritten,concurrency issues but ofcourse
at the level of space cost.
Other possibility will be either to use atomic or locking mechanism
for these. My point here is also same, better
avoid fine tuning later stage.
Moving is_added up and down doesn't look like good as we are just
shifting issue up and down.

Please check and decide accordingly. I will hold my to-be-submitted
modify() patch about handling hw level
over-written case with locking around read-write operation.

Regards,
hari

2018-08-17 09:41:42

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

On Fri, 2018-08-17 at 14:30 +0530, Hari Vyas wrote:
> So many mail threads for common issues going so just trying to
> summarize concern from my side.
> 1) HW level
> PCI Config data overwritten (ex: PCI_COMMAND bits etc) was happening
> at lower layer so from my
> perspective, it is best to handle concern at lower level with locking
> mechanism and that is what
> I proposed in my upcoming patch. Without that, I guess, we may end up
> in issues later with some weird scenario
> which may not be known today and we will again be fine tuning stuff
> here and there.

Maybe... at this point I'm not trying to address that specific issue.
That being said, the PCI_COMMAND register should be under control of
the driver at runtime and thus shouldn't be exposed to races like that
except in the usual case of races in configuring bridges. Those races
definitely need to be handled at the higher level.

So I'm rather dubious of adding a whole new layer of "modify" callbacks
to config space accessors for that, especially since they won't do any
good on architectures with lockless config accesses such as ... x86

> 2) SW level(internal data structure):
> About is_added,is_busmaster: These all are bit fields so infact I too
> suggested to remove those bit fields and
> make separate variables in pci_dev structure.
> This will avoid all data-overwritten,concurrency issues but ofcourse
> at the level of space cost.

It's unnecessary to do blanket changes without first understanding the
details of what's going on. A lot of these things are never touched
outside of the overall single threaded environment of
discovery/assignment or under driver control, in which case it's
expectd that the driver doesn't call them in racy ways

That said, I'm happy to move some of them under my proposed
dev_state_lock.

For is_added specifically, the field is simply set at the wrong time as
you can see in my previous patch.

> Other possibility will be either to use atomic

Atomic is almost always wrong. It doesn't synchronize anything other
than the field, so either it's a big waste of it gives a false sense of
security for something that's almost always still racy.

I'd rather keep the indication that a bunch of those fields *are*
unprotected and rely on the outer layers being single threaded.

> or locking mechanism
> for these. My point here is also same, better
> avoid fine tuning later stage.
> Moving is_added up and down doesn't look like good as we are just
> shifting issue up and down.

No, you are wrong. I also originally covered is_added with my new mutex
but in hindsight it's the wrong thing to do, and went back to the
correct fix at this point which is precisely to move it up.

is_added is essentially owned by the part of the PCI probe code that
runs single threaded before the device gets exposed to the device
model.

Pretty much all of the code, including all the corresponding fields
(struct resources etc...) in pci_dev are unprotected and rely on being
accessed in a single threaded manner. is_added is no exception.

It was simply a mistake to set it after device_attach().

Thus moving it back up fixes it *correctly* by making it follow the
same rules as all the related fields.

That said, if we want to start adding protection to all those other
fields, then this is a different matter alltogether. But at this point,
this is the best fix for the problem at hand.

> Please check and decide accordingly. I will hold my to-be-submitted
> modify() patch about handling hw level
> over-written case with locking around read-write operation.

Can you remind us in this thread which specific cases of RMW races of
config space you were trying to address ?

Cheers,
Ben.



2018-08-17 10:12:51

by Hari Vyas

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

On Fri, Aug 17, 2018 at 3:09 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Fri, 2018-08-17 at 14:30 +0530, Hari Vyas wrote:
>> So many mail threads for common issues going so just trying to
>> summarize concern from my side.
>> 1) HW level
>> PCI Config data overwritten (ex: PCI_COMMAND bits etc) was happening
>> at lower layer so from my
>> perspective, it is best to handle concern at lower level with locking
>> mechanism and that is what
>> I proposed in my upcoming patch. Without that, I guess, we may end up
>> in issues later with some weird scenario
>> which may not be known today and we will again be fine tuning stuff
>> here and there.
>
> Maybe... at this point I'm not trying to address that specific issue.
> That being said, the PCI_COMMAND register should be under control of
> the driver at runtime and thus shouldn't be exposed to races like that
> except in the usual case of races in configuring bridges. Those races
> definitely need to be handled at the higher level.
>
> So I'm rather dubious of adding a whole new layer of "modify" callbacks
> to config space accessors for that, especially since they won't do any
> good on architectures with lockless config accesses such as ... x86
>
modify() is optional. host/controller layers may override it.
It was just a proposal to avoid race issues which happens in SMP environment and
solves other issues. I agree, nothing great anyhow can be done in
lockless config

>> 2) SW level(internal data structure):
>> About is_added,is_busmaster: These all are bit fields so infact I too
>> suggested to remove those bit fields and
>> make separate variables in pci_dev structure.
>> This will avoid all data-overwritten,concurrency issues but ofcourse
>> at the level of space cost.
>
> It's unnecessary to do blanket changes without first understanding the
> details of what's going on. A lot of these things are never touched
> outside of the overall single threaded environment of
> discovery/assignment or under driver control, in which case it's
> expectd that the driver doesn't call them in racy ways
>
> That said, I'm happy to move some of them under my proposed
> dev_state_lock.
>
> For is_added specifically, the field is simply set at the wrong time as
> you can see in my previous patch.
>
Issue needs to be addressed and that is our goal.
Some times simple mistakes need lot of debugging which happened in my case and
my suggestion is to just avoid. SMP related issues are popping up now
so we just need
to be careful.
>> Other possibility will be either to use atomic
>
> Atomic is almost always wrong. It doesn't synchronize anything other
> than the field, so either it's a big waste of it gives a false sense of
> security for something that's almost always still racy.
>
> I'd rather keep the indication that a bunch of those fields *are*
> unprotected and rely on the outer layers being single threaded.
>
>> or locking mechanism
>> for these. My point here is also same, better
>> avoid fine tuning later stage.
>> Moving is_added up and down doesn't look like good as we are just
>> shifting issue up and down.
>
> No, you are wrong. I also originally covered is_added with my new mutex
> but in hindsight it's the wrong thing to do, and went back to the
> correct fix at this point which is precisely to move it up.
>
> is_added is essentially owned by the part of the PCI probe code that
> runs single threaded before the device gets exposed to the device
> model.
>
> Pretty much all of the code, including all the corresponding fields
> (struct resources etc...) in pci_dev are unprotected and rely on being
> accessed in a single threaded manner. is_added is no exception.
>
> It was simply a mistake to set it after device_attach().
>
> Thus moving it back up fixes it *correctly* by making it follow the
> same rules as all the related fields.
>
> That said, if we want to start adding protection to all those other
> fields, then this is a different matter alltogether. But at this point,
> this is the best fix for the problem at hand.
>
>> Please check and decide accordingly. I will hold my to-be-submitted
>> modify() patch about handling hw level
>> over-written case with locking around read-write operation.
>
> Can you remind us in this thread which specific cases of RMW races of
> config space you were trying to address ?
>
Same pci bridge master, memory bit setting concern only (which my
colleague Srinath figured
out after lot of effort some time back) where only one bit in
PCI_COMMAND was getting set.
(Bug 200793 - PCI: read-write config operation doesn't look like SMP safe)
My approach is to handle with modify operations at lower level so bits
are not over-written or lost.


As stated earlier, issue should be just resolved in better way. No
issue in going with majority
Regards,
hari
> Cheers,
> Ben.
>
>

2018-08-17 10:26:36

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

On Fri, 2018-08-17 at 15:40 +0530, Hari Vyas wrote:
>
> > So I'm rather dubious of adding a whole new layer of "modify" callbacks
> > to config space accessors for that, especially since they won't do any
> > good on architectures with lockless config accesses such as ... x86
> >
>
> modify() is optional. host/controller layers may override it.
> It was just a proposal to avoid race issues which happens in SMP environment and
> solves other issues. I agree, nothing great anyhow can be done in
> lockless config

Right so let's not go down the path of creating low level accessors
providing a semantic that cannot actually be provided by some
architectures :-)

> > > 2) SW level(internal data structure):
> > > About is_added,is_busmaster: These all are bit fields so infact I too
> > > suggested to remove those bit fields and
> > > make separate variables in pci_dev structure.
> > > This will avoid all data-overwritten,concurrency issues but ofcourse
> > > at the level of space cost.
> >
> > It's unnecessary to do blanket changes without first understanding the
> > details of what's going on. A lot of these things are never touched
> > outside of the overall single threaded environment of
> > discovery/assignment or under driver control, in which case it's
> > expectd that the driver doesn't call them in racy ways
> >
> > That said, I'm happy to move some of them under my proposed
> > dev_state_lock.
> >
> > For is_added specifically, the field is simply set at the wrong time as
> > you can see in my previous patch.
> >
> Issue needs to be addressed and that is our goal.

Yes.

> Some times simple mistakes need lot of debugging which happened in my case and
> my suggestion is to just avoid. SMP related issues are popping up now
> so we just need to be careful.

Oh I agree, and I took me a while to re-debug independently some of the
issues on my side too :-) I should be clearer that I very much do
appreciate your debugging work and finding those problems !

I might disagree on the solution but your effort is very valuable and
I'm sure we can converge on a solution that works for everybody.

.../...

> > Can you remind us in this thread which specific cases of RMW races of
> > config space you were trying to address ?
> >
>
> Same pci bridge master, memory bit setting concern only (which my
> colleague Srinath figured out after lot of effort some time back) where only one bit in
> PCI_COMMAND was getting set.
> (Bug 200793 - PCI: read-write config operation doesn't look like SMP safe)
> My approach is to handle with modify operations at lower level so bits
> are not over-written or lost.

Right so those are the same old races with pci_set_master() on the
bridge upward chain ?

If yes then this should be addressed by my patches but there are still
some debate about whether to add that new mutex or try to use an
existing one, so let's see what Bjorn has to say.

I think the set-master and enable/disable issues are intimately related
and should be solved together by some higher level locking.

This is a typical case where the "atomic" enable_cnt provided people
with a false sense of safety while the code in practice was still
completely racy.

> As stated earlier, issue should be just resolved in better way. No
> issue in going with majority

Hehe, ok. I think the is_added fix is simply to move it up since it
matches the rest of the code in that area, but let's see what Bjorn has
to say in that regard.

As for the enable/disable/set_master races, Ive wrote my arguments in
favor of a dedicated lock, let's see what others have to respond.

Cheers,
Ben.

> Regards,
> hari
> > Cheers,
> > Ben.
> >
> >


2018-08-17 15:46:00

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
> This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.

Just to be clear, if I understand correctly, this is a pure revert of
44bda4b7d26e and as such it reintroduces the problem solved by that
commit.

If your solution turns out to be better, that's great, but it would be
nice to avoid the bisection hole of reintroducing the problem, then
fixing it again later.

> The new pci state mutex provides a simpler way of addressing
> this race and avoids the relative includes added to the powerpc
> code.

2018-08-17 16:26:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

On Fri, Aug 17, 2018 at 02:48:58PM +1000, Benjamin Herrenschmidt wrote:
> This re-fixes the bug reported by Hari Vyas <[email protected]>
> after my revert of his commit but in a much simpler way.
>
> The main issues is that is_added was being set after the driver
> got bound and started, and thus setting it could race with other
> changes to struct pci_dev.

The "bind driver, then set dev->added = 1" order seems to have been
there since the beginning of dev->is_added:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a1bc9013a03

This patch seems reasonable, but I'm a little dubious about the
existence of "is_added" in the first place. As far as I can tell, the
only other buses with something similar are the MEN Chameleon bus and
the Intel Management Engine Interface.

The PCI uses of "is_added" don't seem *that* critical or unique to
PCI, so I'm not 100% convinced we need it at all. But I haven't
looked into it enough to be able to propose an alternative.

> This fixes it by setting the flag first, which also has the
> advantage of matching the fact that we are clearing it *after*
> unbinding in the remove path, thus the flag is now symtetric
> and always set while the driver code is running.
>
> Signed-off-by: Benjamin Herrenschmidt <[email protected]>
> ---
> drivers/pci/bus.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 35b7fc87eac5..48ae63673aa8 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -321,16 +321,16 @@ void pci_bus_add_device(struct pci_dev *dev)
> pci_proc_attach_device(dev);
> pci_bridge_d3_update(dev);
>
> + dev->is_added = 1;
> dev->match_driver = true;
> retval = device_attach(&dev->dev);
> if (retval < 0 && retval != -EPROBE_DEFER) {
> + dev->is_added = 0;
> pci_warn(dev, "device attach failed (%d)\n", retval);
> pci_proc_detach_device(dev);
> pci_remove_sysfs_dev_files(dev);
> return;
> }
> -
> - dev->is_added = 1;
> }
> EXPORT_SYMBOL_GPL(pci_bus_add_device);
>
> --
> 2.17.1
>

2018-08-17 18:18:22

by Lukas Wunner

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

On Fri, Aug 17, 2018 at 11:25:34AM -0500, Bjorn Helgaas wrote:
> The "bind driver, then set dev->added = 1" order seems to have been
> there since the beginning of dev->is_added:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a1bc9013a03
>
> This patch seems reasonable, but I'm a little dubious about the
> existence of "is_added" in the first place. As far as I can tell, the
> only other buses with something similar are the MEN Chameleon bus and
> the Intel Management Engine Interface.
>
> The PCI uses of "is_added" don't seem *that* critical or unique to
> PCI, so I'm not 100% convinced we need it at all. But I haven't
> looked into it enough to be able to propose an alternative.

Addition of new PCI devices, e.g. on boot or on hotplug, consists of two
stages: First, new devices are created by pci_scan_slot(), afterwards
they're bound to a driver by pci_bus_add_devices(). The idea is that
the bus is scanned in full before drivers are attached to devices.

In the second step, i.e. in pci_bus_add_devices(), the is_added flag is
set on devices. The flag is significant because pci_scan_slot() returns
the number of newly discovered PCI functions in the given slot, and it
calculates that number based on the is_added flag.

More specifically, it invokes pci_scan_single_device() which either
returns an existing device or a newly created device. The is_added flag
is basically a way for pci_scan_single_device() to communicate back to
pci_scan_slot() which of the two code paths it took.

The two steps (enumeration and driver attachment) are protected by
pci_lock_rescan_remove(). Initially, when a pci_dev is newly allocated
with kzalloc(), is_added is 0. The transition from 0 to 1 happens while
under pci_lock_rescan_remove(). When that lock is released, is_added is
always 1, barring a device_attach() error, in which case it will remain
at 0.

AFAICS, there is no second mutex needed to ensure synchronization of
is_added, the existing mutex should be sufficient and the only problem
are RMW races when accessing adjacent flags. Those were correctly addressed
by Hari's patch. Benjamin seems to be alleging that concurrency issues
exist beyond the RMW races, I fail to see them, sorry.

Thanks,

Lukas

2018-08-18 03:28:50

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
> On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
> > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
>
> Just to be clear, if I understand correctly, this is a pure revert of
> 44bda4b7d26e and as such it reintroduces the problem solved by that
> commit.
>
> If your solution turns out to be better, that's great, but it would be
> nice to avoid the bisection hole of reintroducing the problem, then
> fixing it again later.

There is no way to do that other than merging the revert and the fix
into one. That said, the race is sufficiently hard to hit that I
wouldn't worry too much about it.

> > The new pci state mutex provides a simpler way of addressing
> > this race and avoids the relative includes added to the powerpc
> > code.


2018-08-18 03:30:03

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

On Fri, 2018-08-17 at 11:25 -0500, Bjorn Helgaas wrote:
> On Fri, Aug 17, 2018 at 02:48:58PM +1000, Benjamin Herrenschmidt wrote:
> > This re-fixes the bug reported by Hari Vyas <[email protected]>
> > after my revert of his commit but in a much simpler way.
> >
> > The main issues is that is_added was being set after the driver
> > got bound and started, and thus setting it could race with other
> > changes to struct pci_dev.
>
> The "bind driver, then set dev->added = 1" order seems to have been
> there since the beginning of dev->is_added:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a1bc9013a03
>
> This patch seems reasonable, but I'm a little dubious about the
> existence of "is_added" in the first place. As far as I can tell, the
> only other buses with something similar are the MEN Chameleon bus and
> the Intel Management Engine Interface.
>
> The PCI uses of "is_added" don't seem *that* critical or unique to
> PCI, so I'm not 100% convinced we need it at all. But I haven't
> looked into it enough to be able to propose an alternative.

This is a whole different conversation you are taking us into :-)

is_added is currently needed for a number of reasons, mostly relating
to partial hotplug, and historically comes from the fact that we
separated the PCI probing & tree construction from the registration
with the device-model. This of course comes from the fact that the
device model didn't actually exist yet when the PCI code was
created :-)

So let's keep things separate shall we ? I'd rather fix this correctly
now, and get rid of that pesky atomic priv_flags which I think is just
going to be a long term add to the mess rather than an improvement, and
separately we can discuss whether is_added is something that can go
away, but I suspect this will come in the form of either a deeper
rework of how we do PCI probing, or simply finding a struct device/kobj
field we can use as a hint that we've added the device already for
hotplug.

> > This fixes it by setting the flag first, which also has the
> > advantage of matching the fact that we are clearing it *after*
> > unbinding in the remove path, thus the flag is now symtetric
> > and always set while the driver code is running.
> >
> > Signed-off-by: Benjamin Herrenschmidt <[email protected]>
> > ---
> > drivers/pci/bus.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index 35b7fc87eac5..48ae63673aa8 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -321,16 +321,16 @@ void pci_bus_add_device(struct pci_dev *dev)
> > pci_proc_attach_device(dev);
> > pci_bridge_d3_update(dev);
> >
> > + dev->is_added = 1;
> > dev->match_driver = true;
> > retval = device_attach(&dev->dev);
> > if (retval < 0 && retval != -EPROBE_DEFER) {
> > + dev->is_added = 0;
> > pci_warn(dev, "device attach failed (%d)\n", retval);
> > pci_proc_detach_device(dev);
> > pci_remove_sysfs_dev_files(dev);
> > return;
> > }
> > -
> > - dev->is_added = 1;
> > }
> > EXPORT_SYMBOL_GPL(pci_bus_add_device);
> >
> > --
> > 2.17.1
> >


2018-08-18 03:45:49

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

On Fri, 2018-08-17 at 20:15 +0200, Lukas Wunner wrote:
>
> The two steps (enumeration and driver attachment) are protected by
> pci_lock_rescan_remove(). Initially, when a pci_dev is newly allocated
> with kzalloc(), is_added is 0. The transition from 0 to 1 happens while
> under pci_lock_rescan_remove(). When that lock is released, is_added is
> always 1, barring a device_attach() error, in which case it will remain
> at 0.
>
> AFAICS, there is no second mutex needed to ensure synchronization of
> is_added, the existing mutex should be sufficient and the only problem
> are RMW races when accessing adjacent flags. Those were correctly addressed
> by Hari's patch. Benjamin seems to be alleging that concurrency issues
> exist beyond the RMW races, I fail to see them, sorry.

Im saying that fixing those issues using atomic bitops is a generally
unsafe practice even if it happens to work in a specific case.

In this one, I argue that the root problem was simply that is_added was
set in the wrong place. The "fix" from Hari touches 9 files, adds
horrible relative includes to an architecture and generally bloats
things while a single 3 liner would have been sufficient.

The current use of is_added is asymetric (it's cleared during
device_attach but set during detach) which is bogus and the entire race
stems from the fact that it is set after device_attach returns.

So setting it early is, imho, the right fix, is much simpler, and fixes
the odd imbalance we have to begin with.

Ben.



2018-08-19 02:28:14

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Sat, Aug 18, 2018 at 01:24:51PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
> > On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
> > > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
> >
> > Just to be clear, if I understand correctly, this is a pure revert of
> > 44bda4b7d26e and as such it reintroduces the problem solved by that
> > commit.
> >
> > If your solution turns out to be better, that's great, but it would be
> > nice to avoid the bisection hole of reintroducing the problem, then
> > fixing it again later.
>
> There is no way to do that other than merging the revert and the fix
> into one. That said, the race is sufficiently hard to hit that I
> wouldn't worry too much about it.

OK, then at least mention that in the changelog.

> > > The new pci state mutex provides a simpler way of addressing
> > > this race and avoids the relative includes added to the powerpc
> > > code.
>

2018-08-20 02:14:04

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Sat, 2018-08-18 at 21:24 -0500, Bjorn Helgaas wrote:
> On Sat, Aug 18, 2018 at 01:24:51PM +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
> > > On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
> > > > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
> > >
> > > Just to be clear, if I understand correctly, this is a pure revert of
> > > 44bda4b7d26e and as such it reintroduces the problem solved by that
> > > commit.
> > >
> > > If your solution turns out to be better, that's great, but it would be
> > > nice to avoid the bisection hole of reintroducing the problem, then
> > > fixing it again later.
> >
> > There is no way to do that other than merging the revert and the fix
> > into one. That said, the race is sufficiently hard to hit that I
> > wouldn't worry too much about it.
>
> OK, then at least mention that in the changelog.

Sure will do. This is just RFC at this stage :-)

As for the race with enable, what's your take on my approach ? The
basic premise is that we need some kind of mutex to make the updates to
enable_cnt and the actual config space changes atomic. While at it we
can fold pci_set_master vs. is_busmaster as well as those are racy too.

I chose to create a new mutex which we should be able to address other
similar races if we find them. The other solutions that I dismissed
were:

- Using the device_lock. A explained previously, this is tricky, I
prefer not using this for anything other than locking against
concurrent add/remove. The main issue is that drivers will be sometimes
called in context where that's already held, so we can't take it inside
pci_enable_device() and I'd rather not add new constraints such as
"pci_enable_device() must be only called from probe() unless you also
take the device lock". It would be tricky to audit everybody...

- Using a global mutex. We could move the bridge lock from AER to core
code for example, and use that. But it doesn't buy us much, and
slightly redecuces parallelism. It also makes it a little bit more
messy to walk up the bridge chain, we'd have to do a
pci_enable_device_unlocked or something, messy.

So are you ok with the approach ? Do you prefer one of the above
regardless ? Something else ?

Cheers,
Ben.



2018-08-20 06:27:26

by Hari Vyas

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Mon, Aug 20, 2018 at 7:40 AM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Sat, 2018-08-18 at 21:24 -0500, Bjorn Helgaas wrote:
>> On Sat, Aug 18, 2018 at 01:24:51PM +1000, Benjamin Herrenschmidt wrote:
>> > On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
>> > > On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
>> > > > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
>> > >
>> > > Just to be clear, if I understand correctly, this is a pure revert of
>> > > 44bda4b7d26e and as such it reintroduces the problem solved by that
>> > > commit.
>> > >
>> > > If your solution turns out to be better, that's great, but it would be
>> > > nice to avoid the bisection hole of reintroducing the problem, then
>> > > fixing it again later.
>> >
>> > There is no way to do that other than merging the revert and the fix
>> > into one. That said, the race is sufficiently hard to hit that I
>> > wouldn't worry too much about it.
>>
>> OK, then at least mention that in the changelog.
>
> Sure will do. This is just RFC at this stage :-)
>
> As for the race with enable, what's your take on my approach ? The
> basic premise is that we need some kind of mutex to make the updates to
> enable_cnt and the actual config space changes atomic. While at it we
> can fold pci_set_master vs. is_busmaster as well as those are racy too.
>
> I chose to create a new mutex which we should be able to address other
> similar races if we find them. The other solutions that I dismissed
> were:
>
> - Using the device_lock. A explained previously, this is tricky, I
> prefer not using this for anything other than locking against
> concurrent add/remove. The main issue is that drivers will be sometimes
> called in context where that's already held, so we can't take it inside
> pci_enable_device() and I'd rather not add new constraints such as
> "pci_enable_device() must be only called from probe() unless you also
> take the device lock". It would be tricky to audit everybody...
>
> - Using a global mutex. We could move the bridge lock from AER to core
> code for example, and use that. But it doesn't buy us much, and
> slightly redecuces parallelism. It also makes it a little bit more
> messy to walk up the bridge chain, we'd have to do a
> pci_enable_device_unlocked or something, messy.
>
> So are you ok with the approach ? Do you prefer one of the above
> regardless ? Something else ?
>
> Cheers,
> Ben.
>
>
Some concern was raised about race situation so just to be more clear
about race condition.
Situation is described in Bug 200283 - PCI: Data corruption happening
due to a race condition.
Issue was discovered by our broadcom QA team.
Initially commit was to use one separate lock only for avoiding race
condition but after review, approach was slightly changed to move
is_added to pci_dev private flags as it was completely
internal/private variable of pci core driver.
Powerpc legacy arch code was using is_added flag directly which looks
bit strange so ../../ type of inclusion of pci.h was required. I know
it looks ugly but it is being used in some legacy code still.
Anyway, as stated earlier too, problem should be just solved in better way.

Regards,
hari

2018-08-20 07:19:41

by Lukas Wunner

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Mon, Aug 20, 2018 at 12:10:59PM +1000, Benjamin Herrenschmidt wrote:
> I chose to create a new mutex which we should be able to address other
> similar races if we find them. The other solutions that I dismissed
> were:
>
> - Using the device_lock. A explained previously, this is tricky, I
> prefer not using this for anything other than locking against
> concurrent add/remove. The main issue is that drivers will be sometimes
> called in context where that's already held, so we can't take it inside
> pci_enable_device() and I'd rather not add new constraints such as
> "pci_enable_device() must be only called from probe() unless you also
> take the device lock". It would be tricky to audit everybody...
>
> - Using a global mutex. We could move the bridge lock from AER to core
> code for example, and use that. But it doesn't buy us much, and
> slightly redecuces parallelism. It also makes it a little bit more
> messy to walk up the bridge chain, we'd have to do a
> pci_enable_device_unlocked or something, messy.

+1 from my side for adding a struct mutex to struct pci_dev to protect
state changes.

The device_lock() primarily protects binding / unbinding of the device
and pci_dev state may have to be changed while binding / unbinding.

A global lock invites deadlocks if multiple devices are added / removed
concurrently where one is a parent of the other. (Think hot-removal of
multiple devices on a Thunderbolt daisy-chain.)

As said I'd also welcome folding PCI_DEV_DISCONNECTED into enum
pci_channel_state, either as an additional state or by using
pci_channel_io_perm_failure.

Thanks,

Lukas

2018-08-20 11:11:22

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Mon, 2018-08-20 at 11:55 +0530, Hari Vyas wrote:
> On Mon, Aug 20, 2018 at 7:40 AM, Benjamin Herrenschmidt
> <[email protected]> wrote:
> > On Sat, 2018-08-18 at 21:24 -0500, Bjorn Helgaas wrote:
> > > On Sat, Aug 18, 2018 at 01:24:51PM +1000, Benjamin Herrenschmidt wrote:
> > > > On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
> > > > > On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
> > > > > > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
> > > > >
> > > > > Just to be clear, if I understand correctly, this is a pure revert of
> > > > > 44bda4b7d26e and as such it reintroduces the problem solved by that
> > > > > commit.
> > > > >
> > > > > If your solution turns out to be better, that's great, but it would be
> > > > > nice to avoid the bisection hole of reintroducing the problem, then
> > > > > fixing it again later.
> > > >
> > > > There is no way to do that other than merging the revert and the fix
> > > > into one. That said, the race is sufficiently hard to hit that I
> > > > wouldn't worry too much about it.
> > >
> > > OK, then at least mention that in the changelog.
> >
> > Sure will do. This is just RFC at this stage :-)
> >
> > As for the race with enable, what's your take on my approach ? The
> > basic premise is that we need some kind of mutex to make the updates to
> > enable_cnt and the actual config space changes atomic. While at it we
> > can fold pci_set_master vs. is_busmaster as well as those are racy too.
> >
> > I chose to create a new mutex which we should be able to address other
> > similar races if we find them. The other solutions that I dismissed
> > were:
> >
> > - Using the device_lock. A explained previously, this is tricky, I
> > prefer not using this for anything other than locking against
> > concurrent add/remove. The main issue is that drivers will be sometimes
> > called in context where that's already held, so we can't take it inside
> > pci_enable_device() and I'd rather not add new constraints such as
> > "pci_enable_device() must be only called from probe() unless you also
> > take the device lock". It would be tricky to audit everybody...
> >
> > - Using a global mutex. We could move the bridge lock from AER to core
> > code for example, and use that. But it doesn't buy us much, and
> > slightly redecuces parallelism. It also makes it a little bit more
> > messy to walk up the bridge chain, we'd have to do a
> > pci_enable_device_unlocked or something, messy.
> >
> > So are you ok with the approach ? Do you prefer one of the above
> > regardless ? Something else ?
> >
> > Cheers,
> > Ben.
> >
> >
>
> Some concern was raised about race situation so just to be more clear
> about race condition.

This is not what the above discussion is about.

The race with is is_added is due to the fact that the bit is set too
later as discussed previously and in my changelog.

The race I'm talking about here is the race related to multiple
subtrees trying to simultaneously enable a parent bridge.

> Situation is described in Bug 200283 - PCI: Data corruption happening
> due to a race condition.
> Issue was discovered by our broadcom QA team.
> Initially commit was to use one separate lock only for avoiding race
> condition but after review, approach was slightly changed to move
> is_added to pci_dev private flags as it was completely
> internal/private variable of pci core driver.
> Powerpc legacy arch code was using is_added flag directly which looks
> bit strange so ../../ type of inclusion of pci.h was required. I know
> it looks ugly but it is being used in some legacy code still.
> Anyway, as stated earlier too, problem should be just solved in better way.

The is_added race can be fixed with a 3 lines patch moving is_added up
to before device_attach() I believe. I yet have to find a scenario
where doing so would break something but it's possible that I missed a
case.

At this stage, I'm more intested however in us agreeing how to fix the
other race, the one with enabling. As I wrote above, I proposed an
approach based on a mutex in pci_dev, and this is what I would like
discussed.

This race is currently causing our large systems to randomly error out
at boot time when probing some PCIe devices. I'm putting a band-aid in
the powerpc code for now to pre-enable bridges at boot, but I'd rather
have the race fixed in the generic code.

Ben.



2018-08-20 11:13:57

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Mon, 2018-08-20 at 09:17 +0200, Lukas Wunner wrote:
> On Mon, Aug 20, 2018 at 12:10:59PM +1000, Benjamin Herrenschmidt wrote:
> > I chose to create a new mutex which we should be able to address other
> > similar races if we find them. The other solutions that I dismissed
> > were:
> >
> > - Using the device_lock. A explained previously, this is tricky, I
> > prefer not using this for anything other than locking against
> > concurrent add/remove. The main issue is that drivers will be sometimes
> > called in context where that's already held, so we can't take it inside
> > pci_enable_device() and I'd rather not add new constraints such as
> > "pci_enable_device() must be only called from probe() unless you also
> > take the device lock". It would be tricky to audit everybody...
> >
> > - Using a global mutex. We could move the bridge lock from AER to core
> > code for example, and use that. But it doesn't buy us much, and
> > slightly redecuces parallelism. It also makes it a little bit more
> > messy to walk up the bridge chain, we'd have to do a
> > pci_enable_device_unlocked or something, messy.
>
> +1 from my side for adding a struct mutex to struct pci_dev to protect
> state changes.

Ok thanks. This is what my patch proposes. We can use it later to
protect more things if we wish to do so.

> The device_lock() primarily protects binding / unbinding of the device
> and pci_dev state may have to be changed while binding / unbinding.

Yup, precisely.

> A global lock invites deadlocks if multiple devices are added / removed
> concurrently where one is a parent of the other. (Think hot-removal of
> multiple devices on a Thunderbolt daisy-chain.)

Yes.

> As said I'd also welcome folding PCI_DEV_DISCONNECTED into enum
> pci_channel_state, either as an additional state or by using
> pci_channel_io_perm_failure.

Ok. I have that in my tentative series but I think for robustness, I
should make the error_state field atomically updated in order to ensure
that no transition out of "disconnected" can happen while racing with
concurrent error_state updates at interrupt time (at least with EEH, it
can be updated from any read{b,w,l,q}).

I'll do a bit more work on the patches this week as time permits and
send a non-RFC series.

Cheers,
Ben.

> Thanks,
>
> Lukas


2018-08-20 11:45:02

by Hari Vyas

[permalink] [raw]
Subject: Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

On Mon, Aug 20, 2018 at 4:39 PM, Benjamin Herrenschmidt
<[email protected]> wrote:
> On Mon, 2018-08-20 at 11:55 +0530, Hari Vyas wrote:
>> On Mon, Aug 20, 2018 at 7:40 AM, Benjamin Herrenschmidt
>> <[email protected]> wrote:
>> > On Sat, 2018-08-18 at 21:24 -0500, Bjorn Helgaas wrote:
>> > > On Sat, Aug 18, 2018 at 01:24:51PM +1000, Benjamin Herrenschmidt wrote:
>> > > > On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
>> > > > > On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
>> > > > > > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
>> > > > >
>> > > > > Just to be clear, if I understand correctly, this is a pure revert of
>> > > > > 44bda4b7d26e and as such it reintroduces the problem solved by that
>> > > > > commit.
>> > > > >
>> > > > > If your solution turns out to be better, that's great, but it would be
>> > > > > nice to avoid the bisection hole of reintroducing the problem, then
>> > > > > fixing it again later.
>> > > >
>> > > > There is no way to do that other than merging the revert and the fix
>> > > > into one. That said, the race is sufficiently hard to hit that I
>> > > > wouldn't worry too much about it.
>> > >
>> > > OK, then at least mention that in the changelog.
>> >
>> > Sure will do. This is just RFC at this stage :-)
>> >
>> > As for the race with enable, what's your take on my approach ? The
>> > basic premise is that we need some kind of mutex to make the updates to
>> > enable_cnt and the actual config space changes atomic. While at it we
>> > can fold pci_set_master vs. is_busmaster as well as those are racy too.
>> >
>> > I chose to create a new mutex which we should be able to address other
>> > similar races if we find them. The other solutions that I dismissed
>> > were:
>> >
>> > - Using the device_lock. A explained previously, this is tricky, I
>> > prefer not using this for anything other than locking against
>> > concurrent add/remove. The main issue is that drivers will be sometimes
>> > called in context where that's already held, so we can't take it inside
>> > pci_enable_device() and I'd rather not add new constraints such as
>> > "pci_enable_device() must be only called from probe() unless you also
>> > take the device lock". It would be tricky to audit everybody...
>> >
>> > - Using a global mutex. We could move the bridge lock from AER to core
>> > code for example, and use that. But it doesn't buy us much, and
>> > slightly redecuces parallelism. It also makes it a little bit more
>> > messy to walk up the bridge chain, we'd have to do a
>> > pci_enable_device_unlocked or something, messy.
>> >
>> > So are you ok with the approach ? Do you prefer one of the above
>> > regardless ? Something else ?
>> >
>> > Cheers,
>> > Ben.
>> >
>> >
>>
>> Some concern was raised about race situation so just to be more clear
>> about race condition.
>
> This is not what the above discussion is about.
>
> The race with is is_added is due to the fact that the bit is set too
> later as discussed previously and in my changelog.
>
> The race I'm talking about here is the race related to multiple
> subtrees trying to simultaneously enable a parent bridge.
>
>> Situation is described in Bug 200283 - PCI: Data corruption happening
>> due to a race condition.
>> Issue was discovered by our broadcom QA team.
>> Initially commit was to use one separate lock only for avoiding race
>> condition but after review, approach was slightly changed to move
>> is_added to pci_dev private flags as it was completely
>> internal/private variable of pci core driver.
>> Powerpc legacy arch code was using is_added flag directly which looks
>> bit strange so ../../ type of inclusion of pci.h was required. I know
>> it looks ugly but it is being used in some legacy code still.
>> Anyway, as stated earlier too, problem should be just solved in better way.
>
> The is_added race can be fixed with a 3 lines patch moving is_added up
> to before device_attach() I believe. I yet have to find a scenario
> where doing so would break something but it's possible that I missed a
> case.
>
> At this stage, I'm more intested however in us agreeing how to fix the
> other race, the one with enabling. As I wrote above, I proposed an
> approach based on a mutex in pci_dev, and this is what I would like
> discussed.
>
> This race is currently causing our large systems to randomly error out
> at boot time when probing some PCIe devices. I'm putting a band-aid in
> the powerpc code for now to pre-enable bridges at boot, but I'd rather
> have the race fixed in the generic code.
>
> Ben.
>
>
I was initially using spin lock in
"PATCH v1] PCI: Data corruption happening due to race condition" and didn't
face issue at-least in our environment for our race condition.
Anyway, we can leave this minor is_added issue time-being and concentrate on
current pci bridge concern. Could you re-share your latest patch in
your environment
and your first doubt where race situation could happen. May be forum
can suggest
some-thing good.


Regard,
hari.