2010-02-26 19:09:41

by Larry Finger

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On 02/26/2010 12:34 PM, Linus Torvalds wrote:
>
> This makes the b43 driver just automatically fall back to PIO mode when
> DMA doesn't work.
>
> The driver already told the user to do it, so rather than have the user
> reload the module with a new flag, just make the driver do it
> automatically. We keep the message as an indication that something is
> wrong, but now just automatically fall back to the hopefully working PIO
> case.
>
> Signed-off-by: Linus Torvalds <[email protected]>
> ---
>
> I've carried this around in my tree because it fixes a real device (Dell
> Inspiron 11) that happens to be a laptop that Patricia uses at school.
>
> In commit 214ac9a4ead Larry made the driver not reset continually (and use
> 99% CPU time while not working), he just made it say "use PIO instead"
> (and use 0% CPU time while STILL not working!).
>
> Which seems to be rather excessively stupid. Since we know what the fix
> is, we should just _do_ it, rather than tell the user to recompile the
> kernel and reboot with a new kernel command line flag.
>
> I really don't see what the logic is behind letting the user know what to
> do, but then not doing it youself.
>
> I'd rather see this go through the network layer, but quite frankly, since
> I have access to a machine with this issue and can test it, if I don't get
> this patch back through the proper channels (or some other patch that just
> fixes the DMA issue), I'm going to just apply it myself.
>
> I refuse to have a kernel that is too stupid to just do something that it
> tells the user to do. I also refuse to have a kernel that is so stupid
> that it needs hand-holding and special kernel command line options on a
> machine I can test.

In my defense, my device does not have this failure, thus I have no way to test
switching to PIO when DMA fails, which is why I made the change the way I did. I
could have forced the switch, but was not sure if the hardware state would have
followed.

As you have tested this code, it has my ACK and I request John and DaveM to push
it upstream so that it gets in the 2.6.34 merge. There is a small problem as it
does not apply cleanly to wireless-testing. John: Do you want me to fix that and
send you the new version, or let Linus just apply it to his tree? Please let me
know how to proceed.

It should also be pushed to stable. (GregKH added to Cc list.) It would apply to
2.6.33 - ATM I'm not sure about 2.6.32. I'll check if it applies to 2.6.32.Y.

Incidentally, we have learned today that the fault does not occur when running
MMIO traces, and that it is not SMP related.

Larry


2010-02-26 22:47:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, G?bor Stefanik wrote:
>
> Well, we have a report from someone with an Intel T7250, which is
> definitely neither a slow CPU nor a ULV. The machine has PhoenixBIOS
> (in the same rebranded form as yours). My Intel T7100 + InsydeH2O
> machine (almost the same CPU, but with Insyde EFI in BIOS emulation
> mode instead of PhoenixBIOS, or maybe TrustedCore) didn't reproduce
> the bug. The trigger still seems to be PhoenixBIOS.

Well, I'd love to give you a register dump. It _looks_ like it should be
possible write some script to dump all the registers by accessing

/sys/kernel/debug/b43/phy0/[mmio,shm][16,32]read

but that's a very odd interface and I don't know what the index ranges are
for them (and which registers need 16-bit vs 32-bit accesses).

But if you give me a script to dump all the relevant registers, I could
just dump them after a cold-boot, and _before_ loading the b43 driver fir
the first time. Maybe you can compare it to your state with your BIOS, in
the same situation, and see if there is some obvious difference that
strikes you?

Of course, maybe there is some hidden state that doesn't show up there
either?

Linus

2010-02-27 14:45:04

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Friday 26 February 2010 21:42:29 Linus Torvalds wrote:
>
> On Fri, 26 Feb 2010, Linus Torvalds wrote:
> >
> > So send me a patch. I'll try it. But I have no hardware docs, nor any
> > information about how that SSB bridge is supposed to work, or why DMA
> > might be failing.
>
> Btw, I also object to your argument that
>
> "Well, my original plan was to get rid of controller_restart and not add
> yet another user of it, because it is extremely broken and racy. The
> locking in the whole driver is completely braindead due to the mere
> existence of this function."
>
> is a reason to not apply the patch.

Blah, I never said that. I just said what my original plan was and that
the workaround is not as sound as it looks like. In fact, it kind of depends
on luck that it works at all. The controller_restart _never_ worked correctly,
because it does not notify the stack about the restart. So your code implicitly
relies on getting a DMA error before the stack does any significant negotiation
with the AP. (Or that the stack detects the device fuckup and starts over again).

> The fact that the driver locking is odd is _not_ a reason to not fix other
> issues that are totally unrelated to locking.

To say it again: I _never_ said that.

--
Greetings, Michael.

2010-02-27 18:53:19

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Saturday 27 February 2010 19:31:11 Linus Torvalds wrote:
>
> On Sat, 27 Feb 2010, Michael Buesch wrote:
> >
> > However, did you add the udelay to b43_write()? Please try to add it
> > further down in the callchain into the lowlevel SSB read and write functions.
> > drivers/ssb/pci.c
>
> Well, I don't know what that would help from a timing standpoint, since
> the delay gets done either way,

Well, but not for SSB-only accesses that access registers on the bus or other
devices on the bus.

> but looking at it, I do note that the
> locking for the writes is very very wrong.

Yeah that is known very well. But we don't care, as it is implicitely protected
by b43's mutex.

This is nontrivial to fix, so for now I don't care. We could simply
remove the lowlevel SSB lock, if it bothers you much.

> That said, I assume that 'mapped_device' almost never actually changes in
> practice, so the race presumably doesn't really matter.

Yes, that's the case.
There are no races in practice. Even if you remove the lock completely.
I was going to fix this at some point, but it's really nontrivial.

> I don't know how
> these "devices" even work - is there a separate "device" for different SSB
> functions (ie power management vs DMA vs wireless radio), or is there
> multiple devices only if you actually have multiple radio's?

Depends on the actual device.
The SSB is a bus like PCI. You can attach basically anything you want to it.

> So for all I know, it's possible that in my case there is always just one
> device, and the race really fundamentally cannot ever happen.

That's basically the case for a PCI device, yes.

> I can write a patch and test whether it matters, but I'd like to know if
> my chip even _has_ multiple cores behind the ssb bridge. If people tell me
> that there is only one core on that chip, and that the race can never
> happen, then I won't bother.

Nah, you really don't want to touch that lowlevel code. It is not broken as-is.
Simply remove the lock, if you really care that much.
I tried to fix this properly a few weeks ago, but I didn't really see a sane way
to do it so I finally gave up, because it wouldn't fix a real-life problem anyway.
The problem is that this code may basically run in any context. And (if that were not
enough), it must work on all host-busses that SSB supports. That being plain SSB, PCI,
PCMCIA and SDIO. Where SDIO is special in that it requires sleeping.

I assure you that this is not part of the DMA problem. 100% guaranteed. ;)

--
Greetings, Michael.

2010-02-26 20:50:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, G?bor Stefanik wrote:
>
> Does actually enabling mmiotrace fix it for you too?

I haven't seen the report. What do you need to do? Certainly just enabling
CONFIG_MMIOTRACE does nothing, but I'd not expect it to - I'd expect that
you have to actually do the tracing itself.

So if somebody has a recipe for me to test actual tracing of the module, I
can try to see if it makes a difference for me.

Linus

2010-02-27 18:53:19

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Friday 26 February 2010 21:33:08 Linus Torvalds wrote:
> So send me a patch. I'll try it.

Please check if the following patch makes any difference.
Also please show me the result of the printks.


Index: wireless-testing/drivers/ssb/driver_chipcommon_pmu.c
===================================================================
--- wireless-testing.orig/drivers/ssb/driver_chipcommon_pmu.c 2010-02-27 17:16:38.000000000 +0100
+++ wireless-testing/drivers/ssb/driver_chipcommon_pmu.c 2010-02-27 18:35:50.000000000 +0100
@@ -428,6 +428,9 @@ static void ssb_pmu_resources_init(struc
* min_msk = 0xCBB
* max_msk = 0x7FFFF
*/
+printk("PMU res init 4312/4322\n");
+min_msk = 0xFFFFFFFF;
+max_msk = 0xFFFFFFFF;
break;
case 0x4325:
/* Power OTP down later. */
@@ -442,6 +445,9 @@ static void ssb_pmu_resources_init(struc
updown_tab_size = ARRAY_SIZE(pmu_res_updown_tab_4325a0);
depend_tab = pmu_res_depend_tab_4325a0;
depend_tab_size = ARRAY_SIZE(pmu_res_depend_tab_4325a0);
+printk("PMU res init 4325\n");
+min_msk = 0xFFFFFFFF;
+max_msk = 0xFFFFFFFF;
break;
case 0x4328:
min_msk = (1 << SSB_PMURES_4328_EXT_SWITCHER_PWM) |
@@ -453,10 +459,16 @@ static void ssb_pmu_resources_init(struc
updown_tab_size = ARRAY_SIZE(pmu_res_updown_tab_4328a0);
depend_tab = pmu_res_depend_tab_4328a0;
depend_tab_size = ARRAY_SIZE(pmu_res_depend_tab_4328a0);
+printk("PMU res init 4328\n");
+min_msk = 0xFFFFFFFF;
+max_msk = 0xFFFFFFFF;
break;
case 0x5354:
/* The PLL may turn on, if it decides so. */
max_msk = 0xFFFFF;
+printk("PMU res init 5354\n");
+min_msk = 0xFFFFFFFF;
+max_msk = 0xFFFFFFFF;
break;
default:
ssb_printk(KERN_ERR PFX
@@ -538,6 +550,7 @@ void ssb_pmu_set_ldo_voltage(struct ssb_
switch (bus->chip_id) {
case 0x4328:
case 0x5354:
+printk("Set LDO 4328/5354\n");
switch (id) {
case LDO_VOLT1:
addr = 2;
@@ -565,6 +578,7 @@ void ssb_pmu_set_ldo_voltage(struct ssb_
}
break;
case 0x4312:
+printk("Set LDO 4312\n");
if (SSB_WARN_ON(id != LDO_PAREF))
return;
addr = 0;
@@ -584,6 +598,7 @@ void ssb_pmu_set_ldo_paref(struct ssb_ch
struct ssb_bus *bus = cc->dev->bus;
int ldo;

+printk("Set LDO paref\n");
switch (bus->chip_id) {
case 0x4312:
ldo = SSB_PMURES_4312_PA_REF_LDO;

--
Greetings, Michael.

2010-02-26 20:08:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, Michael Buesch wrote:
>
> Where is that patch? I can't find it in any list archives.

Here is the patch. I originally sent it to Larry directly (with John and
David cc'd).

Linus

---
drivers/net/wireless/b43/Kconfig | 6 +++---
drivers/net/wireless/b43/b43.h | 5 +++--
drivers/net/wireless/b43/main.c | 10 +++++++++-
3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/b43/Kconfig b/drivers/net/wireless/b43/Kconfig
index 64c12e1..e6a7d8e 100644
--- a/drivers/net/wireless/b43/Kconfig
+++ b/drivers/net/wireless/b43/Kconfig
@@ -78,11 +78,11 @@ config B43_SDIO

If unsure, say N.

-# Data transfers to the device via PIO
-# This is only needed on PCMCIA and SDIO devices. All others can do DMA properly.
+# Data transfers to the device via PIO. We want it as a fallback even
+# if we can do DMA.
config B43_PIO
bool
- depends on B43 && (B43_SDIO || B43_PCMCIA || B43_FORCE_PIO)
+ depends on B43
select SSB_BLOCKIO
default y

diff --git a/drivers/net/wireless/b43/b43.h b/drivers/net/wireless/b43/b43.h
index c484cc2..f2cfbc4 100644
--- a/drivers/net/wireless/b43/b43.h
+++ b/drivers/net/wireless/b43/b43.h
@@ -694,6 +694,7 @@ struct b43_wldev {
bool radio_hw_enable; /* saved state of radio hardware enabled state */
bool qos_enabled; /* TRUE, if QoS is used. */
bool hwcrypto_enabled; /* TRUE, if HW crypto acceleration is enabled. */
+ bool use_pio; /* TRUE if next init should use PIO */

/* PHY/Radio device. */
struct b43_phy phy;
@@ -885,9 +886,9 @@ static inline bool b43_using_pio_transfers(struct b43_wldev *dev)
}

#ifdef CONFIG_B43_FORCE_PIO
-# define B43_FORCE_PIO 1
+# define B43_PIO_DEFAULT 1
#else
-# define B43_FORCE_PIO 0
+# define B43_PIO_DEFAULT 0
#endif


diff --git a/drivers/net/wireless/b43/main.c b/drivers/net/wireless/b43/main.c
index 490fb45..d310ce7 100644
--- a/drivers/net/wireless/b43/main.c
+++ b/drivers/net/wireless/b43/main.c
@@ -98,6 +98,10 @@ static int modparam_btcoex = 1;
module_param_named(btcoex, modparam_btcoex, int, 0444);
MODULE_PARM_DESC(btcoex, "Enable Bluetooth coexistence (default on)");

+int b43_modparam_pio = B43_PIO_DEFAULT;
+module_param_named(pio, b43_modparam_pio, int, 0644);
+MODULE_PARM_DESC(pio, "Use PIO accesses by default: 0=DMA, 1=PIO");
+
int b43_modparam_verbose = B43_VERBOSITY_DEFAULT;
module_param_named(verbose, b43_modparam_verbose, int, 0644);
MODULE_PARM_DESC(verbose, "Log message verbosity: 0=error, 1=warn, 2=info(default), 3=debug");
@@ -1795,6 +1799,9 @@ static void b43_do_interrupt_thread(struct b43_wldev *dev)
"on your system. Please use PIO instead.\n");
b43err(dev->wl, "CONFIG_B43_FORCE_PIO must be set in "
"your kernel configuration.\n");
+ /* Fall back to PIO transfers if we get fatal DMA errors! */
+ dev->use_pio = 1;
+ b43_controller_restart(dev, "DMA error");
return;
}
if (merged_dma_reason & B43_DMAIRQ_NONFATALMASK) {
@@ -4360,7 +4367,7 @@ static int b43_wireless_core_init(struct b43_wldev *dev)

if ((dev->dev->bus->bustype == SSB_BUSTYPE_PCMCIA) ||
(dev->dev->bus->bustype == SSB_BUSTYPE_SDIO) ||
- B43_FORCE_PIO) {
+ dev->use_pio) {
dev->__using_pio_transfers = 1;
err = b43_pio_init(dev);
} else {
@@ -4830,6 +4837,7 @@ static int b43_one_core_attach(struct ssb_device *dev, struct b43_wl *wl)
if (!wldev)
goto out;

+ wldev->use_pio = b43_modparam_pio;
wldev->dev = dev;
wldev->wl = wl;
b43_set_status(wldev, B43_STAT_UNINIT);

2010-02-27 14:59:29

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Friday 26 February 2010 22:50:35 Linus Torvalds wrote:
>
> On Fri, 26 Feb 2010, Linus Torvalds wrote:
> >
> > so even if mmiotrace fixes it for others, it's not working for me.
>
> Side note: In case this is timing-related: the machine in question is a
> dual-core 1.3Ghz CPU (Core 2 Duo U1400). I think SMP has always been
> enabled, I don't think I ever tried UP.
>
> Also, in addition to the PIO fallback code, that particular mmiotrace test
> was done with the "udelay(10)" added to the b43_write*() functions. So it
> was even slower.
>
> Maybe the bug only happens on slow machines (and then the udelay obviously
> wouldn't help!),

I'm pretty sure it's triggered by some kind of BIOS background service.
Power saving or something like that. There must be some interference
that fucks up the memory link between machine->ssb->wireless.

However, did you add the udelay to b43_write()? Please try to add it
further down in the callchain into the lowlevel SSB read and write functions.
drivers/ssb/pci.c

I don't think it'll help at all, but it's worth a try.

--
Greetings, Michael.

2010-02-27 18:32:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Sat, 27 Feb 2010, Michael Buesch wrote:
>
> However, did you add the udelay to b43_write()? Please try to add it
> further down in the callchain into the lowlevel SSB read and write functions.
> drivers/ssb/pci.c

Well, I don't know what that would help from a timing standpoint, since
the delay gets done either way, but looking at it, I do note that the
locking for the writes is very very wrong.

For example, look at

static u32 ssb_pci_read32(struct ssb_device *dev, u16 offset)
{
struct ssb_bus *bus = dev->bus;

if (unlikely(ssb_pci_assert_buspower(bus)))
return 0xFFFFFFFF;
if (unlikely(bus->mapped_device != dev)) {
**** race 1 ****
if (unlikely(ssb_pci_switch_core(bus, dev)))
return 0xFFFFFFFF;
}
**** race 2 ****
return ioread32(bus->mmio + offset);
}

and notice how it's doing that 'mapped_device' read with no locks held.
The actual ssb_pci_switch_core() function will take a lock, but it's too
late by then: if you have these things happening in an interrupt, you
might have another access happening either before or after the
mapped_device has been read or written. So the mmio access could easily go
to the wrong core.

That said, I assume that 'mapped_device' almost never actually changes in
practice, so the race presumably doesn't really matter. I don't know how
these "devices" even work - is there a separate "device" for different SSB
functions (ie power management vs DMA vs wireless radio), or is there
multiple devices only if you actually have multiple radio's?

So for all I know, it's possible that in my case there is always just one
device, and the race really fundamentally cannot ever happen. Or maybe
there are multiple ones, but in practice you never have concurrent
accesses (perhaps due to locking at a higher level) so again the locking
bug is hidden.

Anyway, quite frankly, from a design standpoint it looks like any user of
ssb_pci_switch_core() is fundamentally buggered. The lock should be taken
by the caller, ie the locking _should_ have been around the whole
mapped_device test _and_ the actual ioread32() too, something like

u32 retval = 0xffffffff;

spin_lock_irqsave(&dev->core_lock, flags);
if (bus->mapped_device == dev || !ssb_pci_switch_core(bus, dev))
retval = ioread32(bus->mmio + offset);
spin_lock_irqrestore(&dev->core_lock, flags);

return retval;

and that assert_buspower thing should probably be done inside the core
switching (set "mapped_device" to NULL when powering it down, and either
refuse to switch cores or power it up dynamically).

I can write a patch and test whether it matters, but I'd like to know if
my chip even _has_ multiple cores behind the ssb bridge. If people tell me
that there is only one core on that chip, and that the race can never
happen, then I won't bother.

Linus

2010-02-26 19:14:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, Larry Finger wrote:
>
> Incidentally, we have learned today that the fault does not occur when running
> MMIO traces, and that it is not SMP related.

Timing? I guess MMIO tracing will slow down the actual command setup part
a lot.

Linus

2010-02-27 22:12:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Sat, 27 Feb 2010, Michael Buesch wrote:
>
> Thanks. What about this? (Remove old patch first).

Doesn't seem to make any difference for me.

Linus

2010-02-26 20:10:09

by Gábor Stefanik

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Fri, Feb 26, 2010 at 9:06 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Fri, 26 Feb 2010, Linus Torvalds wrote:
>> On Fri, 26 Feb 2010, Larry Finger wrote:
>> >
>> > Incidentally, we have learned today that the fault does not occur when running
>> > MMIO traces, and that it is not SMP related.
>>
>> Timing? I guess MMIO tracing will slow down the actual command setup part
>> a lot.
>
> Adding a "udelay(10)" to the "b43_write[16|32]()" functions doesn't make
> any difference on the machine I have access to, though.
>
> Ill happily test other ideas or patches. Automatically falling back to PIO
> makes things work, but I'd obviously _much_ rather have working DMA.
>
> ? ? ? ? ? ? ? ? ? ? ? ?Linus

Does actually enabling mmiotrace fix it for you too?

> --
> To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>



--
Vista: [V]iruses, [I]ntruders, [S]pyware, [T]rojans and [A]dware. :-)

2010-02-26 21:50:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, Linus Torvalds wrote:
>
> so even if mmiotrace fixes it for others, it's not working for me.

Side note: In case this is timing-related: the machine in question is a
dual-core 1.3Ghz CPU (Core 2 Duo U1400). I think SMP has always been
enabled, I don't think I ever tried UP.

Also, in addition to the PIO fallback code, that particular mmiotrace test
was done with the "udelay(10)" added to the b43_write*() functions. So it
was even slower.

Maybe the bug only happens on slow machines (and then the udelay obviously
wouldn't help!), but that would not explain why mmiotrace might then
_hide_ the bug for some people.

Linus

2010-02-27 14:49:36

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Friday 26 February 2010 21:33:08 Linus Torvalds wrote:
>
> On Fri, 26 Feb 2010, Michael Buesch wrote:
> >
> > I'd rather appreciate that somebody who owns such a device and can reproduce the bug
> > would do some _real_ debugging on the issue instead.
>
> I have the hardware, I can reproduce the bug 100% of the time.
>
> So send me a patch. I'll try it.

Well, unfortunately it is not as easy as sending a patch and testing it.
If that was the case, the bug would've never hit you. Some serious very basic
debugging work has to be done first. What we did for now basically was just
poking around in the dark. What we need to do is find out what triggers
the behavior. One thing that we know is that there are identical cards that
work fine in one machine and fail in another. So it is kind of machine dependent.
The first thing I would try is to start disabling features in the BIOS.
You can't really do this remotely via email, so I think there needs to be done
some work by somebody who knows what he does _and_ is able to reproduce the bug.

--
Greetings, Michael.

2010-02-26 20:07:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, Linus Torvalds wrote:
> On Fri, 26 Feb 2010, Larry Finger wrote:
> >
> > Incidentally, we have learned today that the fault does not occur when running
> > MMIO traces, and that it is not SMP related.
>
> Timing? I guess MMIO tracing will slow down the actual command setup part
> a lot.

Adding a "udelay(10)" to the "b43_write[16|32]()" functions doesn't make
any difference on the machine I have access to, though.

Ill happily test other ideas or patches. Automatically falling back to PIO
makes things work, but I'd obviously _much_ rather have working DMA.

Linus

2010-02-27 15:04:51

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Friday 26 February 2010 23:46:57 Linus Torvalds wrote:
> Well, I'd love to give you a register dump. It _looks_ like it should be
> possible write some script to dump all the registers by accessing
>
> /sys/kernel/debug/b43/phy0/[mmio,shm][16,32]read

It is easily possible using libb43.py
http://bu3sch.de/gitweb?p=b43-tools.git;a=blob;f=debug/libb43.py;h=e5c563823a5f67aea30587a22e0eabab063358fb;hb=HEAD
See class B43.

However, dumping registers in b43 is not as easy as it looks like.
There is a _lot_ of indirect addressing and it has the tendency to
lock up on certain machines and cards, if you access dangling registers.

--
Greetings, Michael.

2010-02-26 19:59:18

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Friday 26 February 2010 20:09:36 Larry Finger wrote:
> On 02/26/2010 12:34 PM, Linus Torvalds wrote:
> >
> > This makes the b43 driver just automatically fall back to PIO mode when
> > DMA doesn't work.
> >
> > The driver already told the user to do it, so rather than have the user
> > reload the module with a new flag, just make the driver do it
> > automatically. We keep the message as an indication that something is
> > wrong, but now just automatically fall back to the hopefully working PIO
> > case.
> >
> > Signed-off-by: Linus Torvalds <[email protected]>
> > ---

Where is that patch? I can't find it in any list archives.

--
Greetings, Michael.

2010-02-26 21:46:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, Larry Finger wrote:
>
> If you do the following (as root):
>
> modprobe -r b43
> echo 6400 > /sys/kernel/debug/tracing/buffer_size_kb
> echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
> cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
> modprobe b43
>
> do some network activity, then
>
> echo nop > /sys/kernel/debug/tracing/current_tracer
> sleep 1
> killproc cat
>
> That should be a good enough test. If your machine behaves the same as has been
> reported, you should see a DMA error at the point that tracing is turned off.

Nope. I get the DMA error immediately, even if I'm tracing. Here:

- turn on tracing:

Feb 26 13:35:45 localhost kernel: mmiotrace: Disabling non-boot CPUs...
Feb 26 13:35:45 localhost kernel: Broke affinity for irq 9
Feb 26 13:35:45 localhost kernel: Broke affinity for irq 12
Feb 26 13:35:45 localhost kernel: Broke affinity for irq 26
Feb 26 13:35:45 localhost kernel: CPU 1 is now offline
Feb 26 13:35:45 localhost kernel: SMP alternatives: switching to UP code
Feb 26 13:35:45 localhost kernel: mmiotrace: CPU1 is down.
Feb 26 13:35:45 localhost kernel: mmiotrace: enabled.

- insmod b43:

Feb 26 13:36:31 localhost kernel: cfg80211: Calling CRDA to update world regulatory domain
...
Feb 26 13:36:31 localhost kernel: b43 ssb0:0: firmware: requesting b43/lp0initvals15.fw
Feb 26 13:36:31 localhost kernel: b43 ssb0:0: firmware: requesting b43/lp0bsinitvals15.fw
Feb 26 13:36:32 localhost kernel: b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
Feb 26 13:36:34 localhost kernel: b43-phy0 ERROR: Fatal DMA error: 0x00000400, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
Feb 26 13:36:34 localhost kernel: b43-phy0 ERROR: This device does not support DMA on your system. Please use PIO instead.
Feb 26 13:36:34 localhost kernel: b43-phy0 ERROR: CONFIG_B43_FORCE_PIO must be set in your kernel configuration.
Feb 26 13:36:34 localhost kernel: b43-phy0: Controller RESET (DMA error) ...
Feb 26 13:36:34 localhost kernel: b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
..

- networking works (because this kernel had the PIO fallback patch)

- turn off tracing:

Feb 26 13:37:15 localhost kernel: mmiotrace: purging non-iounmapped trace @0xffffc90022560000, size 0x4000.
Feb 26 13:37:15 localhost kernel: mmiotrace: Re-enabling CPUs...
Feb 26 13:37:15 localhost kernel: SMP alternatives: switching to SMP code
Feb 26 13:37:15 localhost kernel: Booting Node 0 Processor 1 APIC 0x1
Feb 26 13:37:15 localhost kernel: mmiotrace: enabled CPU1.
Feb 26 13:37:15 localhost kernel: mmiotrace: disabled.

so even if mmiotrace fixes it for others, it's not working for me.

Linus

2010-02-27 21:44:25

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Saturday 27 February 2010 22:08:03 Linus Torvalds wrote:
> No difference.

Thanks. What about this? (Remove old patch first).


Index: wireless-testing/drivers/net/wireless/b43/b43.h
===================================================================
--- wireless-testing.orig/drivers/net/wireless/b43/b43.h 2010-02-27 17:16:35.000000000 +0100
+++ wireless-testing/drivers/net/wireless/b43/b43.h 2010-02-27 22:40:40.000000000 +0100
@@ -39,6 +39,10 @@
#define B43_MMIO_DMA4_IRQ_MASK 0x44
#define B43_MMIO_DMA5_REASON 0x48
#define B43_MMIO_DMA5_IRQ_MASK 0x4C
+#define B43_MMIO_DMA6_REASON 0x50
+#define B43_MMIO_DMA6_IRQ_MASK 0x54
+#define B43_MMIO_DMA7_REASON 0x58
+#define B43_MMIO_DMA7_IRQ_MASK 0x5C
#define B43_MMIO_MACCTL 0x120 /* MAC control */
#define B43_MMIO_MACCMD 0x124 /* MAC command */
#define B43_MMIO_GEN_IRQ_REASON 0x128
Index: wireless-testing/drivers/net/wireless/b43/main.c
===================================================================
--- wireless-testing.orig/drivers/net/wireless/b43/main.c 2010-02-27 17:16:35.000000000 +0100
+++ wireless-testing/drivers/net/wireless/b43/main.c 2010-02-27 22:42:27.000000000 +0100
@@ -2880,6 +2880,10 @@ static int b43_chip_init(struct b43_wlde
b43_write32(dev, B43_MMIO_DMA3_IRQ_MASK, 0x0001DC00);
b43_write32(dev, B43_MMIO_DMA4_IRQ_MASK, 0x0000DC00);
b43_write32(dev, B43_MMIO_DMA5_IRQ_MASK, 0x0000DC00);
+ b43_write32(dev, B43_MMIO_DMA6_IRQ_MASK, 0);
+ b43_write32(dev, B43_MMIO_DMA6_REASON, 0);
+ b43_write32(dev, B43_MMIO_DMA7_IRQ_MASK, 0);
+ b43_write32(dev, B43_MMIO_DMA7_REASON, 0);

value32 = ssb_read32(dev->dev, SSB_TMSLOW);
value32 |= 0x00100000;


--
Greetings, Michael.

2010-02-26 21:01:55

by Larry Finger

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On 02/26/2010 02:50 PM, Linus Torvalds wrote:
>
>
> On Fri, 26 Feb 2010, G?bor Stefanik wrote:
>>
>> Does actually enabling mmiotrace fix it for you too?
>
> I haven't seen the report. What do you need to do? Certainly just enabling
> CONFIG_MMIOTRACE does nothing, but I'd not expect it to - I'd expect that
> you have to actually do the tracing itself.
>
> So if somebody has a recipe for me to test actual tracing of the module, I
> can try to see if it makes a difference for me.

If you do the following (as root):

modprobe -r b43
echo 6400 > /sys/kernel/debug/tracing/buffer_size_kb
echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
modprobe b43

do some network activity, then

echo nop > /sys/kernel/debug/tracing/current_tracer
sleep 1
killproc cat

That should be a good enough test. If your machine behaves the same as has been
reported, you should see a DMA error at the point that tracing is turned off.

Larry


2010-02-26 20:42:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, Linus Torvalds wrote:
>
> So send me a patch. I'll try it. But I have no hardware docs, nor any
> information about how that SSB bridge is supposed to work, or why DMA
> might be failing.

Btw, I also object to your argument that

"Well, my original plan was to get rid of controller_restart and not add
yet another user of it, because it is extremely broken and racy. The
locking in the whole driver is completely braindead due to the mere
existence of this function."

is a reason to not apply the patch. The patch makes the driver _work_. Not
on some odd-ball hardware either, but a regular Dell laptop.

So if you have a patch to remove controller_restart, I'm sure I can modify
mine to work on top of such a new world order. But if you do _not_ have
such a patch, then that is no argument for keeping the driver in a
known-broken state.

The fact that the driver locking is odd is _not_ a reason to not fix other
issues that are totally unrelated to locking.

Linus


2010-02-27 21:08:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Sat, 27 Feb 2010, Michael Buesch wrote:
>
> Please check if the following patch makes any difference.

No difference.

> Also please show me the result of the printks.

I get

b43-pci-bridge 0000:08:00.0: PCI INT A -> GSI 17 (level, low) -> IRQ 17
b43-pci-bridge 0000:08:00.0: setting latency timer to 64
dcdbas dcdbas: Dell Systems Management Base Driver (version 5.6.0-3.2)
ssb: Core 0 found: ChipCommon (cc 0x800, rev 0x16, vendor 0x4243)
ssb: Core 1 found: IEEE 802.11 (cc 0x812, rev 0x0F, vendor 0x4243)
ssb: Core 2 found: PCMCIA (cc 0x80D, rev 0x0A, vendor 0x4243)
ssb: Core 3 found: PCI-E (cc 0x820, rev 0x09, vendor 0x4243)
ssb: Found rev 1 PMU (capabilities 0x02A62F01)
** PMU res init 4312/4322
ssb: SPROM revision 8 detected.
ssb: Sonics Silicon Backplane found on PCI device 0000:08:00.0
cfg80211: Calling CRDA to update world regulatory domain
alloc irq_desc for 22 on node -1
alloc kstat_irqs on node -1
...
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
** Set LDO 4312
** Set LDO paref
b43-phy0 debug: b2062: Using crystal tab entry 19200 kHz.
...
b43-phy0 debug: Adding Interface type 2
ADDRCONF(NETDEV_UP): wlan0: link is not ready
b43-phy0 ERROR: Fatal DMA error: 0x00000800, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
b43-phy0 ERROR: This device does not support DMA on your system. Please use PIO instead.
b43-phy0 ERROR: CONFIG_B43_FORCE_PIO must be set in your kernel configuration.
b43-phy0: Controller RESET (DMA error) ...
...
b43-phy0: Loading firmware version 478.104 (2008-07-01 00:50:23)
** Set LDO 4312
** Set LDO paref
b43-phy0 debug: b2062: Using crystal tab entry 19200 kHz.

for the new printks.

Linus

2010-02-26 20:33:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors



On Fri, 26 Feb 2010, Michael Buesch wrote:
>
> I'd rather appreciate that somebody who owns such a device and can reproduce the bug
> would do some _real_ debugging on the issue instead.

I have the hardware, I can reproduce the bug 100% of the time.

So send me a patch. I'll try it. But I have no hardware docs, nor any
information about how that SSB bridge is supposed to work, or why DMA
might be failing.

Linus

2010-02-26 20:20:48

by Michael Büsch

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Friday 26 February 2010 21:07:43 Linus Torvalds wrote:
>
> On Fri, 26 Feb 2010, Michael Buesch wrote:
> >
> > Where is that patch? I can't find it in any list archives.
>
> Here is the patch. I originally sent it to Larry directly (with John and
> David cc'd).
>
> Linus
>
> ---
> drivers/net/wireless/b43/Kconfig | 6 +++---
> drivers/net/wireless/b43/b43.h | 5 +++--
> drivers/net/wireless/b43/main.c | 10 +++++++++-
> 3 files changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/wireless/b43/Kconfig b/drivers/net/wireless/b43/Kconfig
> index 64c12e1..e6a7d8e 100644
> --- a/drivers/net/wireless/b43/Kconfig
> +++ b/drivers/net/wireless/b43/Kconfig
> @@ -78,11 +78,11 @@ config B43_SDIO
>
> If unsure, say N.
>
> -# Data transfers to the device via PIO
> -# This is only needed on PCMCIA and SDIO devices. All others can do DMA properly.
> +# Data transfers to the device via PIO. We want it as a fallback even
> +# if we can do DMA.
> config B43_PIO
> bool
> - depends on B43 && (B43_SDIO || B43_PCMCIA || B43_FORCE_PIO)
> + depends on B43
> select SSB_BLOCKIO
> default y
>
> diff --git a/drivers/net/wireless/b43/b43.h b/drivers/net/wireless/b43/b43.h
> index c484cc2..f2cfbc4 100644
> --- a/drivers/net/wireless/b43/b43.h
> +++ b/drivers/net/wireless/b43/b43.h
> @@ -694,6 +694,7 @@ struct b43_wldev {
> bool radio_hw_enable; /* saved state of radio hardware enabled state */
> bool qos_enabled; /* TRUE, if QoS is used. */
> bool hwcrypto_enabled; /* TRUE, if HW crypto acceleration is enabled. */
> + bool use_pio; /* TRUE if next init should use PIO */
>
> /* PHY/Radio device. */
> struct b43_phy phy;
> @@ -885,9 +886,9 @@ static inline bool b43_using_pio_transfers(struct b43_wldev *dev)
> }
>
> #ifdef CONFIG_B43_FORCE_PIO
> -# define B43_FORCE_PIO 1
> +# define B43_PIO_DEFAULT 1
> #else
> -# define B43_FORCE_PIO 0
> +# define B43_PIO_DEFAULT 0
> #endif
>
>
> diff --git a/drivers/net/wireless/b43/main.c b/drivers/net/wireless/b43/main.c
> index 490fb45..d310ce7 100644
> --- a/drivers/net/wireless/b43/main.c
> +++ b/drivers/net/wireless/b43/main.c
> @@ -98,6 +98,10 @@ static int modparam_btcoex = 1;
> module_param_named(btcoex, modparam_btcoex, int, 0444);
> MODULE_PARM_DESC(btcoex, "Enable Bluetooth coexistence (default on)");
>
> +int b43_modparam_pio = B43_PIO_DEFAULT;
> +module_param_named(pio, b43_modparam_pio, int, 0644);
> +MODULE_PARM_DESC(pio, "Use PIO accesses by default: 0=DMA, 1=PIO");
> +
> int b43_modparam_verbose = B43_VERBOSITY_DEFAULT;
> module_param_named(verbose, b43_modparam_verbose, int, 0644);
> MODULE_PARM_DESC(verbose, "Log message verbosity: 0=error, 1=warn, 2=info(default), 3=debug");
> @@ -1795,6 +1799,9 @@ static void b43_do_interrupt_thread(struct b43_wldev *dev)
> "on your system. Please use PIO instead.\n");
> b43err(dev->wl, "CONFIG_B43_FORCE_PIO must be set in "
> "your kernel configuration.\n");
> + /* Fall back to PIO transfers if we get fatal DMA errors! */
> + dev->use_pio = 1;
> + b43_controller_restart(dev, "DMA error");
> return;
> }
> if (merged_dma_reason & B43_DMAIRQ_NONFATALMASK) {
> @@ -4360,7 +4367,7 @@ static int b43_wireless_core_init(struct b43_wldev *dev)
>
> if ((dev->dev->bus->bustype == SSB_BUSTYPE_PCMCIA) ||
> (dev->dev->bus->bustype == SSB_BUSTYPE_SDIO) ||
> - B43_FORCE_PIO) {
> + dev->use_pio) {
> dev->__using_pio_transfers = 1;
> err = b43_pio_init(dev);
> } else {
> @@ -4830,6 +4837,7 @@ static int b43_one_core_attach(struct ssb_device *dev, struct b43_wl *wl)
> if (!wldev)
> goto out;
>
> + wldev->use_pio = b43_modparam_pio;
> wldev->dev = dev;
> wldev->wl = wl;
> b43_set_status(wldev, B43_STAT_UNINIT);

Well, my original plan was to get rid of controller_restart and not add yet
another user of it, because it is extremely broken and racy. The locking in the
whole driver is completely braindead due to the mere existence of this function.

I'd rather appreciate that somebody who owns such a device and can reproduce the bug
would do some _real_ debugging on the issue instead.

--
Greetings, Michael.

2010-02-26 22:54:49

by Gábor Stefanik

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

2010/2/26 Linus Torvalds <[email protected]>:
>
>
> On Fri, 26 Feb 2010, G?bor Stefanik wrote:
>>
>> Well, we have a report from someone with an Intel T7250, which is
>> definitely neither a slow CPU nor a ULV. The machine has PhoenixBIOS
>> (in the same rebranded form as yours). My Intel T7100 + InsydeH2O
>> machine (almost the same CPU, but with Insyde EFI in BIOS emulation
>> mode instead of PhoenixBIOS, or maybe TrustedCore) didn't reproduce
>> the bug. The trigger still seems to be PhoenixBIOS.
>
> Well, I'd love to give you a register dump. It _looks_ like it should be
> possible write some script to dump all the registers by accessing
>
> ? ? ? ?/sys/kernel/debug/b43/phy0/[mmio,shm][16,32]read
>
> but that's a very odd interface and I don't know what the index ranges are
> for them (and which registers need 16-bit vs 32-bit accesses).
>
> But if you give me a script to dump all the relevant registers, I could
> just dump them after a cold-boot, and _before_ loading the b43 driver fir
> the first time. Maybe you can compare it to your state with your BIOS, in
> the same situation, and see if there is some obvious difference that
> strikes you?
>
> Of course, maybe there is some hidden state that doesn't show up there
> either?
>
> ? ? ? ? ? ? ? ?Linus
>

Currently I am working on comparing mmiotraces from b43 and wl, and
there appear to be a few writes to the DMA control area of the card,
that are missing from b43. So, most likely no dump will be needed.

BTW I suspect that the difference between working & broken machines is
not in initial state - rather, the BIOS corrupts the state of the card
during startup. This would probably not show up in a register dump.

--G?bor

--
Vista: [V]iruses, [I]ntruders, [S]pyware, [T]rojans and [A]dware. :-)

2010-02-26 22:08:54

by Gábor Stefanik

[permalink] [raw]
Subject: Re: Make b43 driver fall back gracefully to PIO mode after fatal DMA errors

On Fri, Feb 26, 2010 at 10:50 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Fri, 26 Feb 2010, Linus Torvalds wrote:
>>
>> so even if mmiotrace fixes it for others, it's not working for me.
>
> Side note: In case this is timing-related: the machine in question is a
> dual-core 1.3Ghz CPU (Core 2 Duo U1400). I think SMP has always been
> enabled, I don't think I ever tried UP.
>
> Also, in addition to the PIO fallback code, that particular mmiotrace test
> was done with the "udelay(10)" added to the b43_write*() functions. So it
> was even slower.
>
> Maybe the bug only happens on slow machines (and then the udelay obviously
> wouldn't help!), but that would not explain why mmiotrace might then
> _hide_ the bug for some people.
>
> ? ? ? ? ? ? ? ? ? ? ? ?Linus
>

Well, we have a report from someone with an Intel T7250, which is
definitely neither a slow CPU nor a ULV. The machine has PhoenixBIOS
(in the same rebranded form as yours). My Intel T7100 + InsydeH2O
machine (almost the same CPU, but with Insyde EFI in BIOS emulation
mode instead of PhoenixBIOS, or maybe TrustedCore) didn't reproduce
the bug. The trigger still seems to be PhoenixBIOS.

Peace,
G?bor

--
Vista: [V]iruses, [I]ntruders, [S]pyware, [T]rojans and [A]dware. :-)