2017-03-03 02:09:55

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 1/4] pci: Allow lockless access path to PCI mmconfig

From: Andi Kleen <[email protected]>

The Intel uncore driver can do a lot of PCI config accesses to read
performance counters. I had a situation on a 4S system where it
was spending 40+% of CPU time grabbing the pci_cfg_lock due to that.

For 64bit x86 with MMCONFIG there isn't really any reason to take
a lock. The access is directly mapped to an underlying MMIO area,
which can fully operate lockless.

Add a new flag that allows the PCI mid layer to skip the lock
and set it for the 64bit mmconfig code.

There's a small risk that someone relies on this lock for synchronization,
but I think that's unlikely because there isn't really any useful
synchronization at this individual operation level. Any useful
synchronization would likely need to protect at least a
read-modify-write or similar. So I made it unconditional without opt-in.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/pci/mmconfig_64.c | 1 +
drivers/pci/access.c | 14 ++++++++++----
include/linux/pci.h | 2 ++
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/x86/pci/mmconfig_64.c b/arch/x86/pci/mmconfig_64.c
index bea52496aea6..8bf10f41e626 100644
--- a/arch/x86/pci/mmconfig_64.c
+++ b/arch/x86/pci/mmconfig_64.c
@@ -121,6 +121,7 @@ int __init pci_mmcfg_arch_init(void)
}

raw_pci_ext_ops = &pci_mmcfg;
+ pci_root_ops.ll_allowed = true;

return 1;
}
diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index db239547fefd..22552c6606c1 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -32,11 +32,14 @@ int pci_bus_read_config_##size \
int res; \
unsigned long flags; \
u32 data = 0; \
+ bool ll_allowed = bus->ops->ll_allowed; \
if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER; \
- raw_spin_lock_irqsave(&pci_lock, flags); \
+ if (!ll_allowed) \
+ raw_spin_lock_irqsave(&pci_lock, flags); \
res = bus->ops->read(bus, devfn, pos, len, &data); \
*value = (type)data; \
- raw_spin_unlock_irqrestore(&pci_lock, flags); \
+ if (!ll_allowed) \
+ raw_spin_unlock_irqrestore(&pci_lock, flags); \
return res; \
}

@@ -46,10 +49,13 @@ int pci_bus_write_config_##size \
{ \
int res; \
unsigned long flags; \
+ bool ll_allowed = bus->ops->ll_allowed; \
if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER; \
- raw_spin_lock_irqsave(&pci_lock, flags); \
+ if (!ll_allowed) \
+ raw_spin_lock_irqsave(&pci_lock, flags); \
res = bus->ops->write(bus, devfn, pos, len, value); \
- raw_spin_unlock_irqrestore(&pci_lock, flags); \
+ if (!ll_allowed) \
+ raw_spin_unlock_irqrestore(&pci_lock, flags); \
return res; \
}

diff --git a/include/linux/pci.h b/include/linux/pci.h
index e2d1a124216a..9b234cbc7ae1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -612,6 +612,8 @@ struct pci_ops {
void __iomem *(*map_bus)(struct pci_bus *bus, unsigned int devfn, int where);
int (*read)(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *val);
int (*write)(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 val);
+ /* Set to true when pci_lock is not needed for read/write */
+ bool ll_allowed;
};

/*
--
2.9.3


2017-03-03 01:48:24

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 2/4] pci: Add generic pci_bus_force_mmconfig interface

From: Andi Kleen <[email protected]>

x86 traditionally used mmconfig only for extended config space accesses
with offsets larger than 256. For lower offsets it uses the classic
Type 1 IO port access. This is quite slow and also requires taking
a global spin lock to protect the Type 1 IO port mailbox.

IIRC (I added it originally) it was merely to be conservative;
I don't remember any actual cases where mmconfig did not work
after passing the other sanity checks. But most devices
don't use extended config space, so most devices were never
tested with MMCONFIG. Starting to use MMCONFIG everywhere
unconditionally seems somewhat risky as we never tested this

Also for most drivers it doesn't really matter because they only
use config accesses during driver initialization, which is
not performance critical.

But some drivers can do a lot of config bus accesses < 256.
I ran into this problem with the Intel uncore PMU drivers
when sampling uncore counters in many PMUs at a high frequency
on a 4S system. This showed significant spin lock contention
and also large overhead in accessing the IO port.

For drivers like this is useful to be able to force MMCONFIG
access. The interface is intended for devices integrated on a SOC,
so there is no concern that some strange PCI bridges may
incorrectly handle mmconfig.

This patch adds a new function pci_bus_force_mmconfig() where
a driver can force MMCONFIG for all accesses on the bus.

This is a stub implementation, but the next patch will
fill it in for x86.

Signed-off-by: Andi Kleen <[email protected]>
---
drivers/pci/pci.c | 15 +++++++++++++++
include/linux/pci.h | 2 ++
2 files changed, 17 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 7904d02ffdb9..af68f1d9eed7 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5236,6 +5236,21 @@ int pci_bus_find_domain_nr(struct pci_bus *bus, struct device *parent)
#endif

/**
+ * pci_bus_force_mmconfig - enable mmconfig for all config accesses on bus.
+ *
+ * Allow mmconfig for all accesses to devices. This is mainly a performance
+ * optimization, and should be only used if the driver "knows" all the bridges
+ * involved (e.g. SOC on chip devices)
+ *
+ * Default implementation. Architectures can override this.
+ */
+__weak int pci_bus_force_mmconfig(struct pci_bus *bus)
+{
+ return 0;
+}
+EXPORT_SYMBOL(pci_bus_force_mmconfig);
+
+/**
* pci_ext_cfg_avail - can we access extended PCI config space?
*
* Returns 1 if we can access PCI extended config space (offsets
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9b234cbc7ae1..61b4437dc8e5 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -891,6 +891,8 @@ void pci_sort_breadthfirst(void);

/* Generic PCI functions exported to card drivers */

+int pci_bus_force_mmconfig(struct pci_bus *bus);
+
enum pci_lost_interrupt_reason {
PCI_LOST_IRQ_NO_INFORMATION = 0,
PCI_LOST_IRQ_DISABLE_MSI,
--
2.9.3

2017-03-03 02:05:51

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 4/4] perf/x86/intel/uncore: Enable forced mmconfig for Intel uncore

From: Andi Kleen <[email protected]>

On Intel systems some uncore counters are located in PCI config space.
On 4S systems with many uncore events being sampled at a high frequency
we can see significant overhead from the type 1 accesses: both
from the IO port accesses and also from lock contention on the locks
protection the IO port.

This patch uses the pci_bus_force_mmconfig() interface earlier
to force lockless MMCONFIG for the bus the uncore devices are
residing on, which significantly lowers overhead. These
busses only contain Intel on chip devices which support mmconfig.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/events/intel/uncore.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c
index 758c1aa5009d..4cc2ee3d0165 100644
--- a/arch/x86/events/intel/uncore.c
+++ b/arch/x86/events/intel/uncore.c
@@ -875,6 +875,12 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
struct intel_uncore_box *box;
int phys_id, pkg, ret;

+ /*
+ * Force MMCONFIG for all accesses, as we can otherwise
+ * have significant lock contention on the type1 IO port spinlock.
+ */
+ pci_bus_force_mmconfig(pdev->bus);
+
phys_id = uncore_pcibus_to_physid(pdev->bus);
if (phys_id < 0)
return -ENODEV;
--
2.9.3

2017-03-03 05:05:17

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 3/4] x86, pci: Add interface to force mmconfig

From: Andi Kleen <[email protected]>

This fills in the pci_bus_force_mmconfig interface that was
added earlier for x86 to allow drivers to optimize config
space accesses. The implementation is straight forward
and uses the existing mmconfig access functions, just forcing
mmconfig access.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/pci/mmconfig-shared.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index dd30b7e08bc2..bb56533290aa 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -816,3 +816,31 @@ int pci_mmconfig_delete(u16 seg, u8 start, u8 end)

return -ENOENT;
}
+
+static int pci_mmconfig_read(struct pci_bus *bus, unsigned int devfn,
+ int where, int size, u32 *value)
+{
+ return raw_pci_ext_ops->read(pci_domain_nr(bus), bus->number,
+ devfn, where, size, value);
+}
+
+static int pci_mmconfig_write(struct pci_bus *bus, unsigned int devfn,
+ int where, int size, u32 value)
+{
+ return raw_pci_ext_ops->write(pci_domain_nr(bus), bus->number,
+ devfn, where, size, value);
+}
+
+struct pci_ops pci_mmconfig_ops = {
+ .read = pci_mmconfig_read,
+ .write = pci_mmconfig_write,
+};
+
+/* Force all config accesses to go through mmconfig. */
+int pci_bus_force_mmconfig(struct pci_bus *bus)
+{
+ if (!raw_pci_ext_ops)
+ return -1;
+ bus->ops = &pci_mmconfig_ops;
+ return 0;
+}
--
2.9.3

2017-03-14 13:06:30

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/4] pci: Allow lockless access path to PCI mmconfig

On Thu, 2 Mar 2017, Andi Kleen wrote:

> From: Andi Kleen <[email protected]>
>
> The Intel uncore driver can do a lot of PCI config accesses to read
> performance counters. I had a situation on a 4S system where it
> was spending 40+% of CPU time grabbing the pci_cfg_lock due to that.
>
> For 64bit x86 with MMCONFIG there isn't really any reason to take
> a lock. The access is directly mapped to an underlying MMIO area,
> which can fully operate lockless.
>
> Add a new flag that allows the PCI mid layer to skip the lock
> and set it for the 64bit mmconfig code.
>
> There's a small risk that someone relies on this lock for synchronization,
> but I think that's unlikely because there isn't really any useful
> synchronization at this individual operation level. Any useful
> synchronization would likely need to protect at least a
> read-modify-write or similar. So I made it unconditional without opt-in.

This part of the changelog is just crap.

The reason why pci_lock exists and is taken for each single read/write
config is that some ops implementations, e.g. the generic ones, must
protect at this granularity level because

ops->map_bus()
read/writeX()

needs to be 'atomic'.

MMCONFIG obviously does not require this at all because it's a simple
byte/word/dword read/write which is serialized by itself. So it's obvious
that the serialization with pci_lock is pointless in this case.

It's not that hard to figure it out and write up a proper changelog instead
of handwaving about risk and whatever.

Thanks,

tglx

2017-03-14 13:55:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Thu, 2 Mar 2017, Andi Kleen wrote:
> +struct pci_ops pci_mmconfig_ops = {
> + .read = pci_mmconfig_read,
> + .write = pci_mmconfig_write,
> +};
> +
> +/* Force all config accesses to go through mmconfig. */
> +int pci_bus_force_mmconfig(struct pci_bus *bus)
> +{
> + if (!raw_pci_ext_ops)
> + return -1;

We have error defines. That aside, the weak version of this returns 0,
i.e. success, but here you return fail. Consistency is overrated, right?

> + bus->ops = &pci_mmconfig_ops;

What guarantees that raw_pci_ext_ops == pci_mmcfg? Nothing as far as I can
tell. So that function name is nonsensical. It does not force anything.

And the way how this function is used is a horrible hack. It's called from
a random driver at some random point in time.

The proper solution is to identify the bus at the point where the bus is
discovered and switch it to mmconfig if possible.

Thanks,

tglx







2017-03-14 15:42:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

> And the way how this function is used is a horrible hack. It's called from
> a random driver at some random point in time.
>
> The proper solution is to identify the bus at the point where the bus is
> discovered and switch it to mmconfig if possible.

But how would you know that it is safe?
AFAIK the only one who knows "this is an internal SOC device" is the driver.

Or are you saying it should be enabled unconditionally for everything?
Or define some quirk table just for this purpose?

-Andi

2017-03-14 16:41:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Tue, 14 Mar 2017, Andi Kleen wrote:

> > And the way how this function is used is a horrible hack. It's called from
> > a random driver at some random point in time.
> >
> > The proper solution is to identify the bus at the point where the bus is
> > discovered and switch it to mmconfig if possible.
>
> But how would you know that it is safe?
> AFAIK the only one who knows "this is an internal SOC device" is the driver.

The driver handles a specific device, but you apply that tweak to the bus
of which the device hangs off.

> Or are you saying it should be enabled unconditionally for everything?

No.

> Or define some quirk table just for this purpose?

Nope. It's about identifying the bus.

The bus which contains the uncore devices:

The Uncore devices reside on CPUBUSNO(1), which is the PCI bus assigned
for the processor socket. The bus number is derived from the max bus range
setting and the processor socket number.

So there should be a way to detect that and it would be appreciated if you
could talk to your hardware folks what's the programmatic way to figure it
out. Maybe there is information in ACPI as well.

Thanks,

tglx


2017-03-14 17:03:08

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

> > Or define some quirk table just for this purpose?
>
> Nope. It's about identifying the bus.

PCI just has no good way to identify busses.

>
> The bus which contains the uncore devices:
>
> The Uncore devices reside on CPUBUSNO(1), which is the PCI bus assigned
> for the processor socket. The bus number is derived from the max bus range
> setting and the processor socket number.

I have had bad experiences with hard coding topology like this. These
things tend to be very configurable by BIOS.

>
> So there should be a way to detect that and it would be appreciated if you
> could talk to your hardware folks what's the programmatic way to figure it
> out.

There's likely some hidden register somewhere. But then it would be

check model number
look for hidden register
configure these busses

and it will likely change often.

Frankly I fail to see how such a thing is better than just using
the device ID. Perhaps the driver is not the right place for it,
but the ID seems to be.

The other option is to simply make it unconditional. That would
be even simpler, but it is a bit more risky. Hmm, I suppose may
be worth trying to find out what Windows uses. If they get away
with MMCONFIG everywhere we should be too.

Or the third option would be to move the ops pointer into the
PCI device, so it's a per device setting. If people are ok with that
I can look into it.

> Maybe there is information in ACPI as well.

I don't think ACPI has any concept of "internal SOC busses"

-Andi

2017-03-14 17:29:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 1/4] pci: Allow lockless access path to PCI mmconfig

On 03/02/17 15:21, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> The Intel uncore driver can do a lot of PCI config accesses to read
> performance counters. I had a situation on a 4S system where it
> was spending 40+% of CPU time grabbing the pci_cfg_lock due to that.
>
> For 64bit x86 with MMCONFIG there isn't really any reason to take
> a lock. The access is directly mapped to an underlying MMIO area,
> which can fully operate lockless.
>
> Add a new flag that allows the PCI mid layer to skip the lock
> and set it for the 64bit mmconfig code.
>
> There's a small risk that someone relies on this lock for synchronization,
> but I think that's unlikely because there isn't really any useful
> synchronization at this individual operation level. Any useful
> synchronization would likely need to protect at least a
> read-modify-write or similar. So I made it unconditional without opt-in.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> arch/x86/pci/mmconfig_64.c | 1 +
> drivers/pci/access.c | 14 ++++++++++----
> include/linux/pci.h | 2 ++
> 3 files changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/pci/mmconfig_64.c b/arch/x86/pci/mmconfig_64.c
> index bea52496aea6..8bf10f41e626 100644
> --- a/arch/x86/pci/mmconfig_64.c
> +++ b/arch/x86/pci/mmconfig_64.c
> @@ -121,6 +121,7 @@ int __init pci_mmcfg_arch_init(void)
> }
>
> raw_pci_ext_ops = &pci_mmcfg;
> + pci_root_ops.ll_allowed = true;
>

"ll_allowed" is pretty awful naming... you spend almost all the
characters telling us nothing. I spend several seconds trying to figure
out what "ll" stood for, and without the context of the patch I'd have
had to go a massive grep. Just call it "lockless" or something.

-hpa


2017-03-14 17:35:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/4] pci: Add generic pci_bus_force_mmconfig interface

On 03/02/17 15:21, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> x86 traditionally used mmconfig only for extended config space accesses
> with offsets larger than 256. For lower offsets it uses the classic
> Type 1 IO port access. This is quite slow and also requires taking
> a global spin lock to protect the Type 1 IO port mailbox.
>
> IIRC (I added it originally) it was merely to be conservative;
> I don't remember any actual cases where mmconfig did not work
> after passing the other sanity checks. But most devices
> don't use extended config space, so most devices were never
> tested with MMCONFIG. Starting to use MMCONFIG everywhere
> unconditionally seems somewhat risky as we never tested this
>

I would rather enable it by default, but having a command-line switch to
disable it. This seems a helluva lot saner than mucking with the entire
PCI bus from an individual device driver.

Also, it seems crazy to say we will take this penalty forever. My
opinion is we should enable it in the absence of evidence of
malfunctions, but if we end up having problems we could put an ACPI date
cutoff quirk in.

-hpa


2017-03-14 17:56:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Tue, 14 Mar 2017, Andi Kleen wrote:
> The other option is to simply make it unconditional. That would
> be even simpler, but it is a bit more risky. Hmm, I suppose may
> be worth trying to find out what Windows uses. If they get away
> with MMCONFIG everywhere we should be too.

>From the PCIe spec:

The PCI 3.0 compatible Configuration Space can be accessed using either
the mechanism defined in the PCI Local Bus Specification or the PCI
Express Enhanced Configuration Access Mechanism (ECAM) described later in
this section. Accesses made using either access mechanism are
equivalent. The PCI Express Extended Configuration Space can only be
accessed by using the ECAM.

The 1.0 spec has a similar wording (s/3.0/2.3).

Definitely the best way to go.

Thanks,

tglx

2017-03-14 19:47:25

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Tue, Mar 14, 2017 at 06:56:26PM +0100, Thomas Gleixner wrote:
> On Tue, 14 Mar 2017, Andi Kleen wrote:
> > The other option is to simply make it unconditional. That would
> > be even simpler, but it is a bit more risky. Hmm, I suppose may
> > be worth trying to find out what Windows uses. If they get away
> > with MMCONFIG everywhere we should be too.
>
> From the PCIe spec:
>
> The PCI 3.0 compatible Configuration Space can be accessed using either
> the mechanism defined in the PCI Local Bus Specification or the PCI
> Express Enhanced Configuration Access Mechanism (ECAM) described later in
> this section. Accesses made using either access mechanism are
> equivalent. The PCI Express Extended Configuration Space can only be
> accessed by using the ECAM.
>
> The 1.0 spec has a similar wording (s/3.0/2.3).
>
> Definitely the best way to go.

I agree that it should be fairly safe to do ECAM/MMCONFIG without
locking. Can we handle the decision part by adding a "lockless" bit
to struct pci_ops? Old ops don't mention that bit, so it will be
initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops
can set it and we can skip the locking.

Bjorn

2017-03-15 02:24:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

> I agree that it should be fairly safe to do ECAM/MMCONFIG without
> locking. Can we handle the decision part by adding a "lockless" bit
> to struct pci_ops? Old ops don't mention that bit, so it will be
> initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops
> can set it and we can skip the locking.

That's what my other patch already did.

-Andi

2017-03-15 02:55:55

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote:
> > I agree that it should be fairly safe to do ECAM/MMCONFIG without
> > locking. Can we handle the decision part by adding a "lockless" bit
> > to struct pci_ops? Old ops don't mention that bit, so it will be
> > initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops
> > can set it and we can skip the locking.
>
> That's what my other patch already did.

Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops.

What I was wondering, but didn't explain very well, was whether
instead of setting that bit at run-time in pci_mmcfg_arch_init(), we
could set it statically in the pci_ops definition, e.g.,

static struct pci_ops ecam_ops = {
.lockless = 1,
.read = ecam_read,
.write = ecam_write,
};

I think it would be easier to read if the lockless-ness were declared
right next to the accessors that need it (or don't need it).

But it is a little confusing with all the different paths, at least on
x86, so maybe it wouldn't be quite that simple.

Bjorn

2017-03-15 10:00:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Tue, 14 Mar 2017, Bjorn Helgaas wrote:
> On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote:
> > > I agree that it should be fairly safe to do ECAM/MMCONFIG without
> > > locking. Can we handle the decision part by adding a "lockless" bit
> > > to struct pci_ops? Old ops don't mention that bit, so it will be
> > > initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops
> > > can set it and we can skip the locking.
> >
> > That's what my other patch already did.
>
> Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops.
>
> What I was wondering, but didn't explain very well, was whether
> instead of setting that bit at run-time in pci_mmcfg_arch_init(), we
> could set it statically in the pci_ops definition, e.g.,
>
> static struct pci_ops ecam_ops = {
> .lockless = 1,
> .read = ecam_read,
> .write = ecam_write,
> };
>
> I think it would be easier to read if the lockless-ness were declared
> right next to the accessors that need it (or don't need it).
>
> But it is a little confusing with all the different paths, at least on
> x86, so maybe it wouldn't be quite that simple.

The pci_ops in x86 are a complete mess. We have

struct pci_ops pci_root_ops = {
.read = pci_read,
.write = pci_write,
};

That's the default and the r/w functions look like this:

pci_read/write()
{
if (domain == 0 && reg && raw_pci_ops)
return raw_pci_ops->read/write();
if (raw_pci_ext_ops)
return raw_pci_ext_ops->read/write();
return -EINVAL;
}

raw_pci_ops and raw_pci_ext_ops are setup through an impenetrable maze of
functions. Some of them overwrite pci_root_ops to something entirely
different.

pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
argument for any bus segment no matter which type it is.

The locking aspect is interesting as well. The type0/1 functions are having
their own internal locking. Oh, well.

What we really want is to differentiate bus segments. That means a PCIe
segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
can do what you suggested above, i.e. marking the ecam/mmconfig ops as
lockless.

Sure that's more work than just whacking a sloppy quirk into the code, but
the right thing to do.

Thanks,

tglx




2017-03-15 14:09:34

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Wed, Mar 15, 2017 at 11:00:22AM +0100, Thomas Gleixner wrote:
> On Tue, 14 Mar 2017, Bjorn Helgaas wrote:
> > On Tue, Mar 14, 2017 at 07:24:14PM -0700, Andi Kleen wrote:
> > > > I agree that it should be fairly safe to do ECAM/MMCONFIG without
> > > > locking. Can we handle the decision part by adding a "lockless" bit
> > > > to struct pci_ops? Old ops don't mention that bit, so it will be
> > > > initialized to zero and we'll do locking as today. ECAM/MMCONFIG ops
> > > > can set it and we can skip the locking.
> > >
> > > That's what my other patch already did.
> >
> > Yes, your 1/4 patch does add the "ll_allowed" bit in struct pci_ops.
> >
> > What I was wondering, but didn't explain very well, was whether
> > instead of setting that bit at run-time in pci_mmcfg_arch_init(), we
> > could set it statically in the pci_ops definition, e.g.,
> >
> > static struct pci_ops ecam_ops = {
> > .lockless = 1,
> > .read = ecam_read,
> > .write = ecam_write,
> > };
> >
> > I think it would be easier to read if the lockless-ness were declared
> > right next to the accessors that need it (or don't need it).
> >
> > But it is a little confusing with all the different paths, at least on
> > x86, so maybe it wouldn't be quite that simple.
>
> The pci_ops in x86 are a complete mess.

That's certainly a pithy summary :)

> pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
> argument for any bus segment no matter which type it is.
>
> The locking aspect is interesting as well. The type0/1 functions are having
> their own internal locking. Oh, well.
>
> What we really want is to differentiate bus segments. That means a PCIe
> segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
> can do what you suggested above, i.e. marking the ecam/mmconfig ops as
> lockless.

If we were starting from scratch, I think we would probably put the
locking inside the device-specific config accessors at the lowest
level. Then it would be directly at the place where it's obvious
what's needed, and it would be easy to do no locking, per-host bridge
locking, or system-wide locking. Right now we have many places that
implicitly depend on pci_lock but there's no direct connection.

We could conceivably migrate to that, but it would be a fair amount of
work.

Bjorn

2017-03-16 00:03:46

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

> pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
> argument for any bus segment no matter which type it is.

mmconfig is only initialized after PCI is initialized
(an ordering problem with ACPI). So it would require
updating existing busses with likely interesting race
conditions.

There are also other ordering problems in the PCI layer,
that is one of the reason early and raw PCI accesses even exist.

>
> The locking aspect is interesting as well. The type0/1 functions are having
> their own internal locking. Oh, well.

Right it could set lockless too. The internal locking is still needed
because there are other users too.

> What we really want is to differentiate bus segments. That means a PCIe
> segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
> can do what you suggested above, i.e. marking the ecam/mmconfig ops as
> lockless.

There's no need to separate PCIe and PCI. mmconfig has nothing to do
with that.

> Sure that's more work than just whacking a sloppy quirk into the code, but
> the right thing to do.

Before proposing grandiose plans it would be better if you
acquired some basic understanding of the constraints this
code is operating under first.

-Andi

2017-03-16 22:48:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/4] x86, pci: Add interface to force mmconfig

On Wed, 15 Mar 2017, Andi Kleen wrote:
> > pci_root_ops is what is finally handed in to pci_scan_root_bus() as ops
> > argument for any bus segment no matter which type it is.
>
> mmconfig is only initialized after PCI is initialized (an ordering
> problem with ACPI).

Wrong. It can be initialized before that and it actually is on most of my
machines. Unfortunately its not guaranteed.

> So it would require updating existing busses with likely interesting race
> conditions.

More racy than switching it from a random driver after the PCI bus has been
completely initialized and is already in use? Surely not.

> There are also other ordering problems in the PCI layer,
> that is one of the reason early and raw PCI accesses even exist.

Early accesses are a different class for PCI accesses _before_
pci_arch_init() or acpi_init() has been invoked. That's handled by the
early accessors which are hardcoded to use PCI type 1 configuration access
via CF8/CFC. These are completely seperate and not in any way related to
this.

So lets look how this works:

pci_arch_init()

Setup of raw_pci_ops and raw_pci_ext_ops

This sets mmconfig, when the information is available already.

acpi_init()

Parses the ACPI tables and sets up the PCI root.

Sets mmconfig when not yet set.

pci_subsys_init()

Final x86 pci init calls, which might affect pci ops.

So ideally we would switch to ECAM before acpi_init(), but

- mmconfig might not yet be available

- x86_init.pci.init() which is called from pci_subsys_init() can modify
pci_root_ops or raw_pci_ops

Though that's a non issue simple because after x86_init.pci.init() still
nothing operates on PCI devices and it's safe and simple to replace the
pci_root_ops read and write pointers with ECAM based variants.

> > The locking aspect is interesting as well. The type0/1 functions are having
> > their own internal locking. Oh, well.

> Right it could set lockless too. The internal locking is still needed
> because there are other users too.

Looking at the x86 pci ops variants, there is only the ce4100 one, which
relies on the external locking in the generic pci code. That's reasonable
easy to fix and once that is done the whole conditional locking in the
generic PCI accessors can be avoided. The locking can simply be compiled
out.

> > What we really want is to differentiate bus segments. That means a PCIe
> > segment takes mmconfig ops and a PCI segment the type0/1 ops. That way we
> > can do what you suggested above, i.e. marking the ecam/mmconfig ops as
> > lockless.
>
> There's no need to separate PCIe and PCI. mmconfig has nothing to do
> with that.

What? If the system does not have a PCIe compliant root complex/host
bridge, then you cannot use mmconfig at all. So yes, there needs to be a
decision made.

Sure, we don't have to treat PCI busses behind a PCIe to PCI(-X) bridge
differently as that handled by the host bridge and the PCIe/PCI(-X)
bridge. There might be dragons lurking, but those can be handled with a
date cutoff or a small set of quirks.

> > Sure that's more work than just whacking a sloppy quirk into the code, but
> > the right thing to do.
>
> Before proposing grandiose plans it would be better if you acquired some
> basic understanding of the constraints this code is operating under
> first.

Contrary to you I studied the code and the spec before making uneducated
claims and accusations.

And contrary to you I care about the correctness and the maintainability of
the code. Your works for me and know everything better attitude is the main
reason for the mess which exists today. Your thought termination cliche,
that others do not understand what they are talking about has been proven
wrong over and over.

I did not claim that it's simple and I merily talked about the ideal
solution while I was well aware that there are dependencies and corner
cases.

It took me a only couple of hours to analyze all possible corner cases
which reconfigure pci_root_ops or raw_pci_*ops to find a spot where this
can be done in a sane way.

Patches come with a seperate mail. They get rid of the global pci_lock in
the generic accessors completely and avoid the extra pointer indirection
and do not even get near a driver.

It might look like a grandiose plan to you, but that might be due to a
gross overestimation of the complexity of that code or the lack of basic
engineering principles.

Thanks,

tglx