2004-11-10 21:57:20

by long

[permalink] [raw]
Subject: [PATCH] pci-mmconfig fix for 2.6.9

Here I have attached pci mmconfig fix for 2.6.9 kernel.

This will fix the flush error in pci_mmcfg_write.

When pci_mmcfg_write is used to program the PMCSR in the Power
Management Capability structure of PCI config space in the PCI Express
device to a different power state, the dummy readl to flush the previous
write violates the transition delay specified in the PCI power
management spec. Please see PCI Power Management Spec. 1.2 Table 5-6.
For example, while changing the power state of the device through PMCSR
register, a transition delay of 10msec is required before any access can
be made to the device.

Since the configuration write access for PCI Express is non posted,
flushing is not necessary and it will be safe to remove the dummy
readl.

This patch will remove dummy readl function implemented in
"pci_mmcfg_write" and use set_fixmap_nocahe instead of set_fixmap.

Signed-off-by: Sundarapandian Durairaj <[email protected]>
Signed-off-by: T. Long Nguyen <[email protected]>

--------------------------------------------------------------------------
diff -urpN linux-2.6.9/arch/i386/pci/mmconfig.c linux-2.6.9-mmconfig-patch/arch/i386/pci/mmconfig.c
--- linux-2.6.9/arch/i386/pci/mmconfig.c 2004-10-18 17:54:38.000000000 -0400
+++ linux-2.6.9-mmconfig-patch/arch/i386/pci/mmconfig.c 2004-11-10 13:36:29.287713784 -0500
@@ -30,7 +30,7 @@ static inline void pci_exp_set_dev_base(
u32 dev_base = pci_mmcfg_base_addr | (bus << 20) | (devfn << 12);
if (dev_base != mmcfg_last_accessed_device) {
mmcfg_last_accessed_device = dev_base;
- set_fixmap(FIX_PCIE_MCFG, dev_base);
+ set_fixmap_nocache(FIX_PCIE_MCFG, dev_base);
}
}

@@ -85,9 +85,6 @@ static int pci_mmcfg_write(int seg, int
break;
}

- /* Dummy read to flush PCI write */
- readl(mmcfg_virt_addr);
-
spin_unlock_irqrestore(&pci_config_lock, flags);

return 0;
diff -urpN linux-2.6.9/arch/x86_64/pci/mmconfig.c linux-2.6.9-mmconfig-patch/arch/x86_64/pci/mmconfig.c
--- linux-2.6.9/arch/x86_64/pci/mmconfig.c 2004-10-18 17:53:41.000000000 -0400
+++ linux-2.6.9-mmconfig-patch/arch/x86_64/pci/mmconfig.c 2004-11-10 13:37:49.585506664 -0500
@@ -63,9 +63,6 @@ static int pci_mmcfg_write(int seg, int
break;
}

- /* Dummy read to flush PCI write */
- readl(addr);
-
return 0;
}


2004-11-11 01:40:42

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Wed, Nov 10, 2004 at 02:20:23PM -0800, long wrote:
> Here I have attached pci mmconfig fix for 2.6.9 kernel.
>
> This will fix the flush error in pci_mmcfg_write.

Applied, thanks.

greg k-h

2004-11-11 07:58:31

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

> Since the configuration write access for PCI Express is non posted,
> flushing is not necessary and it will be safe to remove the dummy
> readl.

Where is it guaranteed that these writes are non posted?

-Andi

2004-11-11 16:35:37

by Nguyen, Tom L

[permalink] [raw]
Subject: RE: [PATCH] pci-mmconfig fix for 2.6.9

On Wednesday, November 10, 2004 11:58 PM Andi Kleen wrote:
>Where is it guaranteed that these writes are non posted?

Please refer to section 2.6.1 Flow Control Rules of PCI Express Base
Specs Rev1.0.

Thanks,
Long

2004-11-12 17:53:28

by Michael Chan

[permalink] [raw]
Subject: RE: [PATCH] pci-mmconfig fix for 2.6.9

On Wednesday, November 10, 2004 11:58 PM Andi Kleen wrote:
> Where is it guaranteed that these writes are non posted?

Intel chipset engineer confirmed that they are non-posted. Here are
excerpts from some email exchange with Intel. We reported the
out-of-spec dummy read problem to Intel a while back.

Begin forwarded message:

Hi Michael,

I have checked our chipset engineer. He specified that the mmconfig is
truly non posted and memory cycles will be completed only after the
config write are finished.

So I think flushing is not necessary and readl can be removed.

Thanks,
Sundar

-----Original Message-----
From: Michael Chan [mailto:[email protected]]
Sent: Friday, November 05, 2004 5:08 AM
To: Durairaj, Sundarapandian
Subject: RE: MMCONFIG Bug

Hi Sundar,

Thanks for the update. I agree that config cycles are non-posted and
therefore flushing is unnecessary. However, since config cycles are not
directly generated by the CPU, it is a bit more complicated. When the
CPU issues the memory cycle (writel) to the chipset, the chipset will
translate the memory cycle into a PCI Express config request on the
appropriate PCI Express link. The target device will then return a
completion.

Is the memory cycle to the chipset posted or not? In other words, does
the chipset complete the memory cycle before issuing the config cycle,
or does it issue the config cycle and wait for completion before
completing the memory cycle (writel) from the CPU? If the latter is
true, then it is unnecessary to flush the writel as it is truly
non-posted. If the former is true, I think flushing is still necessary.
Just wanted to confirm this.

Thanks,
Michael


2004-11-12 18:35:57

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Fri, Nov 12, 2004 at 09:52:17AM -0800, Michael Chan wrote:
> On Wednesday, November 10, 2004 11:58 PM Andi Kleen wrote:
> > Where is it guaranteed that these writes are non posted?
>
> Intel chipset engineer confirmed that they are non-posted.

Michael,
Thanks for digging that up.
I think Andi was looking for references to the PCI-E spec.
I found such a statement in "PCI Express(TM) Base Specification
Revision 1.0a".

Table 2-3 on page 47 says:
| Cpl 00 0 1010 Completion without Data ? Used for I/O and
| Configuration Write Completions and Read
| Completions (I/O, Configuration, or
| Memory) with Completion Status other than
| Successful Completion.

Section "2.2. Transaction Layer Protocol - Packet Definition"
| Transactions are carried using Requests and Completions. Completions
| are used only where required, for example, to return read data, or
| to acknowledge Completion of I/O and Configuration Write Transactions.
| Completions are associated with their corresponding Requests by the value
| in the Transaction ID field of the Packet header.

And "2.6.1 Flow Control Rules":
| Flow Control distinguishes three types of TLPs (note relationship
| to ordering rules ? see Section 2.4):
| ? Posted Requests (P) ? Messages and Memory Writes
| ? Non-Posted Requests (NP) ? All Reads, I/O, and Configuration Writes
| ? Completions (CPL) ? Associated with corresponding NP Requests


hth,
grant

2004-11-12 19:39:07

by Michael Chan

[permalink] [raw]
Subject: RE: [PATCH] pci-mmconfig fix for 2.6.9

On Fri, 12 Nov 2004, 11:35:47 -0700, Grant Grundler wrote:

> Michael,
> Thanks for digging that up.
> I think Andi was looking for references to the PCI-E spec.
> I found such a statement in "PCI Express(TM) Base
> Specification Revision 1.0a".
>

Hi Grant,

I think it is well documented that config cycles are non-posted in PCI,
PCIX, and PCI Express specs as you pointed out. The only ambiguity is
whether the mmconfig memory cycle from the CPU to the chipset is posted
or not. The Implementation Note in the MMCONFIG ECN from pcisig (link
below) allows the mmconfig write cycle to be posted, meaning mmconfig
write cycle can complete before the real config write cycle completes.
That's why we needed confirmations from chipset engineers.

Michael

http://www.pcisig.com/specifications/pciexpress/specifications/specifica
tions

2004-11-12 20:55:24

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Fri, Nov 12, 2004 at 11:23:06AM -0800, Michael Chan wrote:
> Hi Grant,
>
> I think it is well documented that config cycles are non-posted in PCI,
> PCIX, and PCI Express specs as you pointed out. The only ambiguity is
> whether the mmconfig memory cycle from the CPU to the chipset is posted
> or not.

sorry - I was wrongly assuming mmconfig has to follow the same
semantics as config since it's intended as a replacement.

In short, the ECN answers Andi's question with "Yes" - thanks for
pointing it out.
For those who don't want to read the whole ECN, bits of it below.

> The Implementation Note in the MMCONFIG ECN from pcisig (link
> below) allows the mmconfig write cycle to be posted, meaning mmconfig
> write cycle can complete before the real config write cycle completes.

Yes. I found it on page 5 of PciEx_ECN_MMCONFIG_040217.pdf.
AFAICT, this section only applies to "systems that implement a
processor-architecture-specific firmware interface standard".
e.g. ia64 SAL calls.

> That's why we needed confirmations from chipset engineers.

Well, Intel confirmed existing chipset comply with this bit of the ECN:
| For systems that are PC-compatible, or that do not implement a
| processor-architecture-specific firmware interface standard that
| allows access to the Configuration Space, the enhanced configuration
| access mechanism is required as defined in this section.
....
| The system hardware must provide a method for the system software
| to guarantee that a write transaction using the enhanced configuration
| access mechanism is completed by the completer before system software
| execution continues.


> http://www.pcisig.com/specifications/pciexpress/specifications/specifica
> tions

I assumed this one was meant:
http://www.pcisig.com/specifications/pciexpress/PciEx_ECN_MMCONFIG_040217.pdf

thanks,
grant

2004-11-12 21:53:40

by Michael Chan

[permalink] [raw]
Subject: RE: [PATCH] pci-mmconfig fix for 2.6.9

On Fri, 12 Nov 2004, 13:55:09 -0700, Grant Grundler wrote:

> Yes. I found it on page 5 of PciEx_ECN_MMCONFIG_040217.pdf.
> AFAICT, this section only applies to "systems that implement
> a processor-architecture-specific firmware interface
> standard". e.g. ia64 SAL calls.
>

Hi Grant,

I disagree with your interpretations of the ECN. Here's my
interpretations:

1. PC-compatible systems or systems that do not implement a
processor-architecture-specific firmware interface must implement
mmconfig as specified in the ECN.

2. mmconfig implementation must provide a method for software to
guarantee that the config access has completed before software execution
continues. In Implementation Note, it provides some examples on how to
do this. One example is to make mmconfig non-posted. But there are other
examples.

In short, I believe mmconfig is allowed to be posted or non-posted. If
it is posted, there must be a method to allow software to flush it.

Michael


2004-11-12 22:32:08

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Fri, Nov 12, 2004 at 01:49:18PM -0800, Michael Chan wrote:
> I disagree with your interpretations of the ECN.

Yeah - I think the alternatives suggested in the new Implementation
Note are confusing and distracting from the actual definitions
and declarations in the previous parts of the spec. The Implementation
Note is NOT the spec. It's just advisory.

The ECN starts out by defining two classes of systems:

| Make the Enhanced Configuration Access Mechanism required for
| PC-compatible platforms, but optional for platforms based on other
| processor/system architectures where firmware abstractions are
| provided for the configuration space access (e.g., DIG64 compliant systems).

The last phrase "where firmware abstractions" is the key bit.


> 2. mmconfig implementation must provide a method for software to
> guarantee that the config access has completed before software execution
> continues.

Agreed.

> In Implementation Note, it provides some examples on how to
> do this. One example is to make mmconfig non-posted. But there are other
> examples.

Yes, but the patch only modifies code for arches which use
direct access to generate the mmconfig cycles. I believe the
posted write examples are for systems which provide "an architected
firmware interface". I'm pretty sure "software" in this context
refers to the "firmware" (e.g. SAL). This spec wasn't written exclusively
for OS dorks like us.

> In short, I believe mmconfig is allowed to be posted or non-posted. If
> it is posted, there must be a method to allow software to flush it.

Yes. Agreed.
But existing direct access methods must implement non-postable writes
to be compliant.

E.g. the second paragraph of the Implementation Note:
| In those cases in which the software must know that a posted
| transaction is completed by the completer, ...

IMHO, "In those cases" refers to the second class of systems.
i386 and x86_64 are (still) in the first class of "legacy" systems.

hth,
grant

2004-11-13 00:05:38

by Michael Chan

[permalink] [raw]
Subject: RE: [PATCH] pci-mmconfig fix for 2.6.9

On Fri, 12 Nov 2004, 15:31:59 -0700, Grant Grundler wrote:

>
> On Fri, Nov 12, 2004 at 01:49:18PM -0800, Michael Chan wrote:
> > In short, I believe mmconfig is allowed to be posted or
> non-posted. If
> > it is posted, there must be a method to allow software to flush it.
>
> Yes. Agreed.
> But existing direct access methods must implement
> non-postable writes to be compliant.
>
> E.g. the second paragraph of the Implementation Note:
> | In those cases in which the software must know that a posted
> | transaction is completed by the completer, ...
>
> IMHO, "In those cases" refers to the second class of systems.
> i386 and x86_64 are (still) in the first class of "legacy" systems.

Hi Grant,

I think we are almost in agreement. IMHO, the entire ECN applies the
PC-compatible systems and any examples in the Implementation Note is a
valid implementation. mmconfig can be posted, as long as there is a way
to guarantee that the write has completed, such as flush by reading. We
disagree on this point.

The Intel mmconfig implementation is non-posted which is a valid
implementation. Therefore readl is unnecessary and can be removed.

Michael

2004-11-13 09:23:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

> The Intel mmconfig implementation is non-posted which is a valid
> implementation. Therefore readl is unnecessary and can be removed.

If I got the discussion so far correctly then the PCI-SGI spec does not
guarantee that there is no posting, but you know that the chipset
you are using right now doesn't do it.

The problem with the explanation is that there will be very soon
chipsets not from Intel that also implement PCI-Express. And also
even systems with non Intel CPUs that also do PCI-Express and
mmconfig.

Did you check with other chipset vendors like Nvidia or VIA too?

I would like us to not fall into the "all world runs a Intel chipset"
trap.

-Andi

2004-11-13 16:23:07

by Michael Chan

[permalink] [raw]
Subject: RE: [PATCH] pci-mmconfig fix for 2.6.9

> If I got the discussion so far correctly then the PCI-SGI spec does not
> guarantee that there is no posting, but you know that the chipset
> you are using right now doesn't do it.

Yes, that's my understanding of the spec. Grant Grundler does not agree and thinks that non-posting is the only compliant implementation. I wish he was right as it would be the easiest to deal with. We contacted Intel about the out-of-spec readl when writing to the PMCSR to change power state as they were the original author of the mmconfig code. Their solution was to remove the readl after confirming that mmconfig was non-posted on their chipsets.

> The problem with the explanation is that there will be very soon
> chipsets not from Intel that also implement PCI-Express. And also
> even systems with non Intel CPUs that also do PCI-Express and
> mmconfig.

I agree with you. Having the readl would be the safest and the most conservative approach. But it is out-of-spec to have a readl immediately following writel to the PMCSR when power state is changed. So should we have a relaxed version of pci_write_config_* with no flushing read for programming the PMCSR?

Michael


2004-11-13 19:46:38

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Sat, Nov 13, 2004 at 08:22:50AM -0800, Michael Chan wrote:
> > If I got the discussion so far correctly then the PCI-SGI spec does not
> > guarantee that there is no posting, but you know that the chipset
> > you are using right now doesn't do it.
>
> Yes, that's my understanding of the spec. Grant Grundler does not agree
> and thinks that non-posting is the only compliant implementation.

That's not what I said. I think we do agree. I'll rephrase.
The code currently in arch/i386 and arch/x86_64 support a chipset that
is compliant with the part of the spec that requires non-postable
config writes.

Other chipsets can implement postable config space. To be compliant
with the ECN, the architecture must define a method to guarantee
the posted writes have reached the target device. I think the
ECN we've been talking about assumes that method will be implemented
in firmware somehow and NOT as a direct access method in the OS.

> I wish he was right as it would be the easiest to deal with.
> We contacted Intel about the out-of-spec readl when writing to
> the PMCSR to change power state as they were the original author
> of the mmconfig code. Their solution was to remove the readl after
> confirming that mmconfig was non-posted on their chipsets.

That means someone has to introduce a new method to access
mmconfig if they implement postable writes.

hth,
grant

2004-11-14 08:58:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Sat, Nov 13, 2004 at 12:46:34PM -0700, Grant Grundler wrote:
> On Sat, Nov 13, 2004 at 08:22:50AM -0800, Michael Chan wrote:
> > > If I got the discussion so far correctly then the PCI-SGI spec does not
> > > guarantee that there is no posting, but you know that the chipset
> > > you are using right now doesn't do it.
> >
> > Yes, that's my understanding of the spec. Grant Grundler does not agree
> > and thinks that non-posting is the only compliant implementation.
>
> That's not what I said. I think we do agree. I'll rephrase.
> The code currently in arch/i386 and arch/x86_64 support a chipset that
> is compliant with the part of the spec that requires non-postable
> config writes.
>
> Other chipsets can implement postable config space. To be compliant
> with the ECN, the architecture must define a method to guarantee
> the posted writes have reached the target device. I think the
> ECN we've been talking about assumes that method will be implemented
> in firmware somehow and NOT as a direct access method in the OS.

Hmm, but there is no way for the chipset to tell us that this
is needed.

Perhaps we really need to special case this and add posted pci config
writes to handle Michael's power management issue properly.

That would be definitely the safer approach.

>
> > I wish he was right as it would be the easiest to deal with.
> > We contacted Intel about the out-of-spec readl when writing to
> > the PMCSR to change power state as they were the original author
> > of the mmconfig code. Their solution was to remove the readl after
> > confirming that mmconfig was non-posted on their chipsets.
>
> That means someone has to introduce a new method to access
> mmconfig if they implement postable writes.


Problem is that it adds silently a very subtle bug and there
is no way I know of for ACPI to tell the firmware it shouldn't use
posting. The driver should know when the read is forbidden though.

-Andi

2004-11-15 06:00:34

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Sun, Nov 14, 2004 at 09:58:31AM +0100, Andi Kleen wrote:
> > Other chipsets can implement postable config space. To be compliant
> > with the ECN, the architecture must define a method to guarantee
> > the posted writes have reached the target device. I think the
> > ECN we've been talking about assumes that method will be implemented
> > in firmware somehow and NOT as a direct access method in the OS.
>
> Hmm, but there is no way for the chipset to tell us that this
> is needed.

Why not?
Can't we just "know" if specific chipsets implement posted
mmconfig writes or not?

Intel folks have confirmed their chipsets to date implement
non-postable mmconfig writes.


> Perhaps we really need to special case this and add posted pci config
> writes to handle Michael's power management issue properly.
> That would be definitely the safer approach.

hrm...are you suggesting another entry point in the struct raw_pci_ops?

There are two tests in arch/i386/pci/mmconfig.c:pci_mmcfg_init() before
the raw_pci_ops is set to &pci_mmcfg. Perhaps some additional crude tests
could select a different set of pci_raw_ops to deal with posted writes
to mmconfig space. Someone more familiar with those chipsets might
find a more elegant solution.

> > That means someone has to introduce a new method to access
> > mmconfig if they implement postable writes.
>
>
> Problem is that it adds silently a very subtle bug and there
> is no way I know of for ACPI to tell the firmware it shouldn't use
> posting.

Uhm, ACPI needs to tell the firmware?
I would expect firmware to be platform/chip specific and "just know".

If you meant OS, we already embed knowledge about specific chipsets
for bug workarounds (e.g. tg3 driver). I think that's an option here too.
I mean tweaking mmconfig.c to install a (possibly) chip specific
method (raw_pci_ops) to flush posted mmconfig writes.

> The driver should know when the read is forbidden though.

Yeah, I would expect it to also. But I certainly don't expect
it to know how to flush a posted write with anything but a
config read.

If the chipset that implemented posted mmconfig writes is compliant
with the ECN we've been talking about, then there must be some other
method to guarantee the writes reaches the device. I would expect
another raw_pci_ops method to be able to deal with it but can't
promise that's true for every chipset out there.

grant

2004-11-15 07:07:38

by Michael Chan

[permalink] [raw]
Subject: RE: [PATCH] pci-mmconfig fix for 2.6.9

Grant Grundler wrote:

> There are two tests in arch/i386/pci/mmconfig.c:pci_mmcfg_init() before
> the raw_pci_ops is set to &pci_mmcfg. Perhaps some additional crude tests
> could select a different set of pci_raw_ops to deal with posted writes
> to mmconfig space. Someone more familiar with those chipsets might
> find a more elegant solution.

Do you mean something like pci_mmcfg1 for Intel chipsets that implement non-posted mmconfig and pci_mmcfg2 for other chipsets that may implement posted mmconfig? pci_mmcfg2's write method will guarantee that the write has reached the target before returning. If pci_mmcfg2's write method uses read from the target to flush the write, we are back to the original problem of out-of-spec read when writing the PMCSR register. If the flush does not require reading from the target device, then it's fine.

Michael




2004-11-15 08:01:22

by Grant Grundler

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Sun, Nov 14, 2004 at 11:07:13PM -0800, Michael Chan wrote:
> Grant Grundler wrote:
>
> > There are two tests in arch/i386/pci/mmconfig.c:pci_mmcfg_init() before
> > the raw_pci_ops is set to &pci_mmcfg. Perhaps some additional crude tests
> > could select a different set of pci_raw_ops to deal with posted writes
> > to mmconfig space. Someone more familiar with those chipsets might
> > find a more elegant solution.
>
> Do you mean something like pci_mmcfg1 for Intel chipsets that implement
> non-posted mmconfig and pci_mmcfg2 for other chipsets that may implement
> posted mmconfig?

Yes.

> pci_mmcfg2's write method will guarantee that the write has reached
> the target before returning. If pci_mmcfg2's write method uses read
> from the target to flush the write, we are back to the original problem
> of out-of-spec read when writing the PMCSR register. If the flush does
> not require reading from the target device, then it's fine.

Yes - agreed.

I do expect chip designers are aware of this problem.
It's not a new problem.
But it's certainly possible some are not. :^(

thanks,
grant

>
> Michael
>
>
>

2004-11-15 09:00:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] pci-mmconfig fix for 2.6.9

On Sun, Nov 14, 2004 at 11:00:21PM -0700, Grant Grundler wrote:
> > writes to handle Michael's power management issue properly.
> > That would be definitely the safer approach.
>
> hrm...are you suggesting another entry point in the struct raw_pci_ops?

Yes.

>
> There are two tests in arch/i386/pci/mmconfig.c:pci_mmcfg_init() before
> the raw_pci_ops is set to &pci_mmcfg. Perhaps some additional crude tests
> could select a different set of pci_raw_ops to deal with posted writes
> to mmconfig space. Someone more familiar with those chipsets might
> find a more elegant solution.

I cannot think of a generic good way to detect posting from the software
side.

>
> > > That means someone has to introduce a new method to access
> > > mmconfig if they implement postable writes.
> >
> >
> > Problem is that it adds silently a very subtle bug and there
> > is no way I know of for ACPI to tell the firmware it shouldn't use
> > posting.
>
> Uhm, ACPI needs to tell the firmware?
> I would expect firmware to be platform/chip specific and "just know".

x86-64 always uses direct hardware access, x86 can use the 32bit PCI BIOS,
but it's now discouraged.


>
> If you meant OS, we already embed knowledge about specific chipsets
> for bug workarounds (e.g. tg3 driver). I think that's an option here too.
> I mean tweaking mmconfig.c to install a (possibly) chip specific
> method (raw_pci_ops) to flush posted mmconfig writes.

Possibly yes. One issue is that we have a subtle bug until we notice
though.

-Andi