LinuxLists.cc - Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

2010-02-28 22:19:17

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

2010/2/28 Gábor Stefanik <[email protected]>:
> OK, this dump shows the 0x280a write happening with core 3, i.e. PCIE,
> active. So, it is indeed probably the "PCIE misc configuration"
> routine. Why it's 0x280a is still a mystery to me, it should be 0x100a
> according to the specs.
Unless I'm reading the logs wrong, isn't wl setting bit 0x8000 when
core 1 is mapped (0 indexed cores, 0x18001000 mapped to space 0)?
And b43 appears to do it when core 0 is mapped (0x18000000 mapped to
space 0). b43 also reads from 0x100a after writing to 0x280a, and it
reads as 0x8000 not set (while the 0x280a check shows it is set).

This is when comparing the wl_cold and b43_cold logs.

-Nate

2010-03-02 21:58:46

by Michael Büsch

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

On Monday 01 March 2010 01:22:50 Michael Buesch wrote:
> Well, you are confusing address spaces here.
>
> On a PCI based SSB device all host-side MMIO transfers go into
> the PCI device's address space first. The core-switching moves the window of
> the SSB address space that is mapped into 0-0xFFF of the PCI address space.
> So if you write to anything above 0xFFF on the PCI device, the write will
> not (directly) map to the SSB bus or any device on it.
> On the PCI device there is more stuff mapped on top of the SSB sliding
> window. For example the SPROM is mapped right on top of it.
>
> So it might be the case that on a PCI-E device the PCI-E-core's registers
> are permanently mapped into 0x2000 of the PCI address space. This is to
> avoid sliding the SSB address space window when accessing the PCI-E core.
> This can have several reasons: For one speed (unlikely) and for another
> to avoid concurrency and ugly races when we need to access the PCI-E core
> while the wireless core is already running and generating interrupts.
> Note that this is a GUESS, but it would make sense to me.
> It would be cool if somebody could compare more registers of the PCI-E
> core using the sliding window and the 0x2000 + reg method to check my theory.
>

So what's the status on this? I think the fact that the testing patch showed some
improvement is a clear indicator that something in the PCI-E core init is wrong.
It's also not surprising that something is going wrong there. The whole PCI-E core
code basically is undebugged. I wrote most of it a long time ago, but I still
don't have a device that tests it (and probably won't get one anytime soon).
So I'm really not surprised that there are bugs. There also are missing parts.

A bug in the PCI-E core code is able to show such behavior, because all memory
transfers (MMIO and DMA) from the PCI device to the wireless core are translated
by the PCI-E core.
I think the whole PCI-E core code has to be audited (also the specs, probably).

--
Greetings, Michael.

2010-03-04 00:30:41

by Larry Finger

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

On 03/02/2010 03:57 PM, Michael Buesch wrote:

> A bug in the PCI-E core code is able to show such behavior, because all memory
> transfers (MMIO and DMA) from the PCI device to the wireless core are translated
> by the PCI-E core.
> I think the whole PCI-E core code has to be audited (also the specs, probably).

I have nearly finished the update on the code section of the specs page at
http://bcm-v4.sipsolutions.net/PCI-E. The part that is not done involves the
sections that read an address from the SPROM and perform operations on that address.

I found that the chip common registers are mapped at 12K for newer cores on
PCIe. This explains the 0x3XXX addresses. Similarly, the PCIe registers are
mapped at 8K - the 0x2XXX addresses. The SPROM is shadowed at 4K or 0x1XXX.
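
As a rough illustration of the layout described above, the following Python sketch classifies an offset in the device's PCI memory window by 4 KiB region. The region table only restates the 4K/8K/12K figures from this message. The function name `classify`, the region labels, and the assumption that each region spans exactly 4 KiB are illustrative and are not taken from b43 or the specs.

```python
# Hypothetical decoder for offsets inside the device's PCI memory window.
# Region starts follow the message above (4K, 8K, 12K); the labels and the
# assumption that every region is exactly 4 KiB long are illustrative only.
REGION_SIZE = 0x1000

REGIONS = {
    0: "SSB sliding window",
    1: "SPROM shadow",
    2: "PCIe core registers",
    3: "ChipCommon registers",
}


def classify(offset):
    """Return (region name, offset within that region) for a window offset."""
    index, reg = divmod(offset, REGION_SIZE)
    if index not in REGIONS:
        raise ValueError(f"offset {offset:#x} is outside the 16 KiB modeled here")
    return REGIONS[index], reg


for off in (0x100a, 0x280a, 0x3000):
    name, reg = classify(off)
    print(f"{off:#06x} -> {name}, register offset {reg:#05x}")
```

Under these assumptions, 0x280a falls in the PCIe region at register offset 0x80a, while 0x100a falls in the SPROM shadow at offset 0x00a.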

Larry

2010-02-28 23:03:27

by Gábor Stefanik

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

2010/2/28 Nathan Schulte <[email protected]>:
> 2010/2/28 Gábor Stefanik <[email protected]>:
>> OK, this dump shows the 0x280a write happening with core 3, i.e. PCIE,
>> active. So, it is indeed probably the "PCIE misc configuration"
>> routine. Why it's 0x280a is still a mystery to me, it should be 0x100a
>> according to the specs.
> Unless I'm reading the logs wrong, isn't wl setting bit 0x8000 when
> core 1 is mapped (0 indexed cores, 0x18001000 mapped to space 0)?
> And b43 appears to do it when core 0 is mapped (0x18000000 mapped to
> space 0). b43 also reads from 0x100a after writing to 0x280a, and it
> reads as 0x8000 not set (while the 0x280a check shows it is set).
>
> This is when comparing the wl_cold and b43_cold logs.
>
> -Nate
>

The latest patch, which is a partial success according to some
testers, writes to core 1 (PCI-E) instead of core 0 (ChipCommon).

--
Vista: [V]iruses, [I]ntruders, [S]pyware, [T]rojans and [A]dware. :-)

2010-02-28 23:38:17

by Nathan Schulte

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

2010/2/28 Gábor Stefanik <[email protected]>:
> The latest patch, which is a partial success according to some
> testers, writes to core 1 (PCI-E) instead of core 0 (ChipCommon).
Then either I am misinterpreting the logs, or the last patch in this
thread is not the patch you are referring to.

A successful write/read to PCI config register 0x80 indicates that any
following MMIO read/writes will be done on that core, correct?

With the lpphy-test.patch you posted earlier, I see the following
output from b43:
Wrote 0x18003000 to pos 0x80
Read 0x18003000 from pos 0x80
MAP 1 0xf4000000 0xffffc900225b8000 0x4000 0x0 0
[snip some mmio read/writes and some PCI config read/writes]
Wrote 0x18000000 to pos 0x80
Read 0x18000000 from pos 0x80
R 4 1 0xf400280a 0x6dbe 0x0 0
W 4 1 0xf400280a 0xedbe 0x0 0

This first maps core 3, does some read/writes with it, then maps core
0, and sets bit 0x8000, correct?

Also, is the address space limited to the 4k range? wl maps core 1,
but sets bit 0x8000 at address 0x280a, which when added to 0x18001000
is 0x1800380a, right in the PCIE core's address space (for address
0x100a).
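
The numbers quoted from the log can be checked mechanically. The Python sketch below is only an arithmetic check, not a statement about how the hardware decodes addresses. The field positions used to split the trace lines, the 4 KiB per-core spacing, and all helper names are assumptions made for illustration.

```python
# Values copied from the b43 log excerpt above. The field positions used when
# splitting the trace lines are an assumption about the log format.
MAP_BASE = 0xf4000000   # physical base taken from the MAP line
SSB_BASE = 0x18000000   # window base that selects core 0
CORE_SIZE = 0x1000      # assumed 4 KiB of backplane space per core


def core_index(window_base):
    """Core number selected by a value written to PCI config register 0x80."""
    return (window_base - SSB_BASE) // CORE_SIZE


def parse_access(line):
    """Split a trace line into (operation, offset from MAP_BASE, value)."""
    fields = line.split()
    return fields[0], int(fields[3], 16) - MAP_BASE, int(fields[4], 16)


read_op = parse_access("R 4 1 0xf400280a 0x6dbe 0x0 0")
write_op = parse_access("W 4 1 0xf400280a 0xedbe 0x0 0")

print("cores selected:", core_index(0x18003000), core_index(0x18000000))
print("offset accessed:", hex(read_op[1]))
print("bits changed by the write:", hex(read_op[2] ^ write_op[2]))
print("0x18001000 + 0x280a =", hex(0x18001000 + 0x280a))
```

The check confirms that the two window writes select cores 3 and 0, that the access is at offset 0x280a from the mapped base, and that the write differs from the read only in bit 0x8000. Whether adding 0x280a to a core base is a meaningful operation is a separate question.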

And finally, unless the [email protected] mailing list is
currently down as well, I think gmail may be the one causing my
messages to not be delivered to the list, as my last message isn't
showing up on the linux-wireless list either.

-Nate

2010-03-02 22:50:58

by William Bourque

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

Michael Buesch wrote:
> On Tuesday 02 March 2010 23:25:48 William Bourque wrote:
>> So if I get this right, this code is responsible for handling the b43
>> devices, as well as several other PCI-E devices, correct?
>
> Nah, this is a broadcom specific thing of the on-chip SSB bus.
>
Ok, sorry then :)

>> Because now that you mention this, the wired network card (Marvell Yukon,
>> with sky2 drivers) on this netbook also has tons of issues (doesn't
>> show in lspci on a clean boot, oops the kernel if network cable is
>> unplugged while in use, fails to load if the module is ever unloaded, ... )
>> I thought it was unrelated but from your comment, I feel like this could
>> be linked to the same PCI-E bugs as well.
>
> Uh, well. Are you sure your hardware is OK then?
>
I sure hope so. The laptop is very new and I never had trouble with it,
but to tell the truth, it is a refurbished model so can't say for sure.

I think the hardware is fine but there is _very weird_ stuff about the
laptop... I feel like their ACPI implementation is nowhere near standard
and that might cause the problems. It's like everything on this laptop
is under very aggressive power management that bypasses the OS and
confuses drivers. But again, it's just a feeling, I don't really have
many facts that back up this theory ;)

- William

2010-03-01 00:23:03

by Michael Büsch

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

On Monday 01 March 2010 00:38:16 Nathan Schulte wrote:
> 2010/2/28 Gábor Stefanik <[email protected]>:
> > The latest patch, which is a partial success according to some
> > testers, writes to core 1 (PCI-E) instead of core 0 (ChipCommon).
> Then either I am misinterpreting the logs, or the last patch in this
> thread is not the patch you are referring to.
>
> A successful write/read to PCI config register 0x80 indicates that any
> following MMIO read/writes will be done on that core, correct?
>
> With the lpphy-test.patch you posted earlier, I see the following
> output from b43:
> Wrote 0x18003000 to pos 0x80
> Read 0x18003000 from pos 0x80
> MAP 1 0xf4000000 0xffffc900225b8000 0x4000 0x0 0
> [snip some mmio read/writes and some PCI config read/writes]
> Wrote 0x18000000 to pos 0x80
> Read 0x18000000 from pos 0x80
> R 4 1 0xf400280a 0x6dbe 0x0 0
> W 4 1 0xf400280a 0xedbe 0x0 0
>
> This first maps core 3, does some read/writes with it, then maps core
> 0, and sets bit 0x8000, correct?
>
> Also, is the address space limited to the 4k range? wl maps core 1,
> but sets bit 0x8000 at address 0x280a, which when added to 0x18001000
> is 0x1800380a, right in the PCIE core's address space (for address
> 0x100a).

Well, you are confusing address spaces here.

On a PCI based SSB device all host-side MMIO transfers go into
the PCI device's address space first. The core-switching moves the window of
the SSB address space that is mapped into 0-0xFFF of the PCI address space.
So if you write to anything above 0xFFF on the PCI device, the write will
not (directly) map to the SSB bus or any device on it.
On the PCI device there is more stuff mapped on top of the SSB sliding
window. For example the SPROM is mapped right on top of it.

So it might be the case that on a PCI-E device the PCI-E-core's registers
are permanently mapped into 0x2000 of the PCI address space. This is to
avoid sliding the SSB address space window when accessing the PCI-E core.
This can have several reasons: For one speed (unlikely) and for another
to avoid concurrency and ugly races when we need to access the PCI-E core
while the wireless core is already running and generating interrupts.
Note that this is a GUESS, but it would make sense to me.
It would be cool if somebody could compare more registers of the PCI-E
core using the sliding window and the 0x2000 + reg method to check my theory.
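
The comparison suggested above could be prototyped against a toy model before trying it on hardware. The Python sketch below does not touch a real device: the `FakeDevice` class, its address layout, the register values, and the choice of which core index holds the PCI-E core are all hypothetical. It only shows the shape of the test, which is to read each register once through the sliding window and once through the guessed fixed mapping, then report any mismatch.

```python
# Toy model of the guess above. Nothing here touches hardware; the class,
# its layout and the register values are hypothetical.
WINDOW_SIZE = 0x1000
SSB_BASE = 0x18000000
PCIE_FIXED_BASE = 0x2000   # the guessed permanent mapping of the PCI-E core


class FakeDevice:
    def __init__(self, pcie_core_index):
        self.window = SSB_BASE                 # models PCI config register 0x80
        self.pcie_base = SSB_BASE + pcie_core_index * WINDOW_SIZE
        self.backplane = {}                    # SSB address -> 16-bit value

    def select_core(self, index):
        """Slide the window so offsets 0-0xFFF hit the given core."""
        self.window = SSB_BASE + index * WINDOW_SIZE

    def read16(self, offset):
        if offset < WINDOW_SIZE:               # access through the sliding window
            addr = self.window + offset
        elif PCIE_FIXED_BASE <= offset < PCIE_FIXED_BASE + WINDOW_SIZE:
            addr = self.pcie_base + (offset - PCIE_FIXED_BASE)
        else:
            raise ValueError("region not modeled")
        return self.backplane.get(addr, 0xffff)


def compare(dev, pcie_core_index, regs):
    """Read each register both ways and return the offsets that differ."""
    dev.select_core(pcie_core_index)
    return [r for r in regs if dev.read16(r) != dev.read16(PCIE_FIXED_BASE + r)]


dev = FakeDevice(pcie_core_index=1)
dev.backplane[dev.pcie_base + 0x80a] = 0xedbe
print("mismatching registers:", compare(dev, 1, [0x000, 0x80a]))
```

In this model the two access paths agree by construction, so an empty list is the expected result. On real hardware, a non-empty list for the PCI-E core would count against the guess.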

--
Greetings, Michael.

2010-03-02 22:11:14

by Larry Finger

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

On 03/02/2010 03:57 PM, Michael Buesch wrote:
>
> So what's the status on this? I think the fact that the testing patch showed some
> improvement is a clear indicator that something in the PCI-E core init is wrong.
> It's also not surprising that something is going wrong there. The whole PCI-E core
> code basically is undebugged. I wrote most of it a long time ago, but I still
> don't have a device that tests it (and probably won't get one anytime soon).
> So I'm really not surprised that there are bugs. There also are missing parts.
>
> A bug in the PCI-E core code is able to show such behavior, because all memory
> transfers (MMIO and DMA) from the PCI device to the wireless core are translated
> by the PCI-E core.
> I think the whole PCI-E core code has to be audited (also the specs, probably).

You are right about the audit of the PCIe code and specs. Some of the MMIO
sequences found for wl and missing in b43 come from the code described in
http://bcm-v4.sipsolutions.net/PCI-E; however, that code needs to be checked as
at least one routine is missing. I have not yet had a chance to go through it,
but I hope to soon.

Larry

2010-03-04 00:47:37

by Gábor Stefanik

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

On Thu, Mar 4, 2010 at 1:30 AM, Larry Finger <[email protected]> wrote:
> On 03/02/2010 03:57 PM, Michael Buesch wrote:
>
>> A bug in the PCI-E core code is able to show such behavior, because all memory
>> transfers (MMIO and DMA) from the PCI device to the wireless core are translated
>> by the PCI-E core.
>> I think the whole PCI-E core code has to be audited (also the specs, probably).
>
> I have nearly finished the update on the code section of the specs page at
> http://bcm-v4.sipsolutions.net/PCI-E. The part that is not done involves the
> sections that read an address from the SPROM and perform operations on that address.
>
> I found that the chip common registers

Do you mean the ChipCommon registers or the Backplane common registers?

> are mapped at 12K for newer cores on
> PCIe. This explains the 0x3XXX addresses. Similarly, the PCIe registers are
> mapped at 8K - the 0x2XXX addresses. The SPROM is shadowed at 4K or 0x1XXX.
>
> Larry
>
>

--
Vista: [V]iruses, [I]ntruders, [S]pyware, [T]rojans and [A]dware. :-)

2010-03-02 22:30:06

by Michael Büsch

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

On Tuesday 02 March 2010 23:25:48 William Bourque wrote:
> So if I get this right, this code is responsible for handling the b43
> devices, as well as several other PCI-E devices, correct?

Nah, this is a broadcom specific thing of the on-chip SSB bus.

>
> Because now that you mention this, the wired network card (Marvell Yukon,
> with sky2 drivers) on this netbook also has tons of issues (doesn't
> show in lspci on a clean boot, oops the kernel if network cable is
> unplugged while in use, fails to load if the module is ever unloaded, ... )
> I thought it was unrelated but from your comment, I feel like this could
> be linked to the same PCI-E bugs as well.

Uh, well. Are you sure your hardware is OK then?

--
Greetings, Michael.

2010-03-02 22:46:19

by William Bourque

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

Michael Buesch wrote:
> On Monday 01 March 2010 01:22:50 Michael Buesch wrote:
>> Well, you are confusing address spaces here.
>>
>> On a PCI based SSB device all host-side MMIO transfers go into
>> the PCI device's address space first. The core-switching moves the window of
>> the SSB address space that is mapped into 0-0xFFF of the PCI address space.
>> So if you write to anything above 0xFFF on the PCI device, the write will
>> not (directly) map to the SSB bus or any device on it.
>> On the PCI device there is more stuff mapped on top of the SSB sliding
>> window. For example the SPROM is mapped right on top of it.
>>
>> So it might be the case that on a PCI-E device the PCI-E-core's registers
>> are permanently mapped into 0x2000 of the PCI address space. This is to
>> avoid sliding the SSB address space window when accessing the PCI-E core.
>> This can have several reasons: For one speed (unlikely) and for another
>> to avoid concurrency and ugly races when we need to access the PCI-E core
>> while the wireless core is already running and generating interrupts.
>> Note that this is a GUESS, but it would make sense to me.
>> It would be cool if somebody could compare more registers of the PCI-E
>> core using the sliding window and the 0x2000 + reg method to check my theory.
>>
>
> So what's the status on this? I think the fact that the testing patch showed some
> improvement is a clear indicator that something in the PCI-E core init is wrong.
> It's also not surprising that something is going wrong there. The whole PCI-E core
> code basically is undebugged. I wrote most of it a long time ago, but I still
> don't have a device that tests it (and probably won't get one anytime soon).
> So I'm really not surprised that there are bugs. There also are missing parts.
>
> A bug in the PCI-E core code is able to show such behavior, because all memory
> transfers (MMIO and DMA) from the PCI device to the wireless core are translated
> by the PCI-E core.
> I think the whole PCI-E core code has to be audited (also the specs, probably).
>

So if I get this right, this code is responsible for handling the b43
devices, as well as several other PCI-E devices, correct?

Because now that you mention this, the wired network card (Marvell Yukon,
with sky2 drivers) on this netbook also has tons of issues (doesn't
show in lspci on a clean boot, oops the kernel if network cable is
unplugged while in use, fails to load if the module is ever unloaded, ... )
I thought it was unrelated but from your comment, I feel like this could
be linked to the same PCI-E bugs as well.

- William

2010-03-04 01:38:25

by Larry Finger

Subject: Re: LP-PHY Fatal DMA error 0x00000800 on non-ULV Core 2 Duo?!?!!??!

On 03/03/2010 06:47 PM, Gábor Stefanik wrote:
> On Thu, Mar 4, 2010 at 1:30 AM, Larry Finger <[email protected]> wrote:
>> On 03/02/2010 03:57 PM, Michael Buesch wrote:
>>
>>> A bug in the PCI-E core code is able to show such behavior, because all memory
>>> transfers (MMIO and DMA) from the PCI device to the wireless core are translated
>>> by the PCI-E core.
>>> I think the whole PCI-E core code has to be audited (also the specs, probably).
>>
>> I have nearly finished the update on the code section of the specs page at
>> http://bcm-v4.sipsolutions.net/PCI-E. The part that is not done involves the
>> sections that read an address from the SPROM and perform operations on that address.
>>
>> I found that the chip common registers
>
> Do you mean the ChipCommon registers or the Backplane common registers?

Definitely, it is the chipcommon registers.