Hi.
Following bug exists in the ipw2100 driver/firmware for years and Intel
folks never responded to zillions bugzilla entries and forum notices in
the internet with some patch or firmware update (although did request
dmesg and debug info, and received them).
ipw2100: Fatal interrupt. Scheduling firmware restart.
I believe it is a firmware bug because after driver is unloaded and
loaded back again wireless adapter usually starts working (for small
amount of time though). My conspiracy feeling can suggest, that it may
be kind of a force to buy a new one, or trivial error in the firmware,
when it writes to the same place in the flash and essentially given cell
became dead or whatever else.
Intel folks, please fix this problem, I see no other way to force you to do
this than to mark ipw2100 driver as broken, since that is what it is.
Bug exists at least in .15 up to .24 kernels, just search for the above dmesg
line. I caught it with the 2.6.24-19-386 ubuntu kernel, 1.3 firmware
version. lspci:
02:04.0 Network controller: Intel Corporation PRO/Wireless LAN 2100 3B Mini PCI Adapter (rev 04)
	Subsystem: Intel Corporation Samsung X10/P30 integrated WLAN
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 64 (500ns min, 8500ns max), Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 11
	Region 0: Memory at 90080000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [dc] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=1 PME-
dmesg is pretty usual.
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
index 9931b5a..c24fc6a 100644
--- a/drivers/net/wireless/Kconfig
+++ b/drivers/net/wireless/Kconfig
@@ -125,7 +125,7 @@ config PCMCIA_RAYCS
 
 config IPW2100
 	tristate "Intel PRO/Wireless 2100 Network Connection"
-	depends on PCI && WLAN_80211
+	depends on PCI && WLAN_80211 && BROKEN
 	select WIRELESS_EXT
 	select FW_LOADER
 	select IEEE80211
--
Evgeniy Polyakov
On Sunday 21 September 2008 19:23:17 Evgeniy Polyakov wrote:
> Hi.
>
> Following bug exists in the ipw2100 driver/firmware for years and Intel
> folks never responded to zillions bugzilla entries and forum notices in
> the internet with some patch or firmware update (although did request
> dmesg and debug info, and received them).
>
> ipw2100: Fatal interrupt. Scheduling firmware restart.
>
> I believe it is a firmware bug because after driver is unloaded and
> loaded back again wireless adapter usually starts working (for small
> amount of time though). My conspiracy feeling can suggest, that it may
> be kind of a force to buy a new one, or trivial error in the firmware,
> when it writes to the same place in the flash and essentially given cell
> became dead or whatever else.
>
> Intel folks, please fix this problem, I see no other way to force you to do
> this than to mark ipw2100 driver as broken, since that is what it is.
You are pretty funny, actually. :)
I think the bug should be fixed, but what makes _you_ think you can _force_
anybody to do so?
> Signed-off-by: Evgeniy Polyakov <[email protected]>
>
> diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
> index 9931b5a..c24fc6a 100644
> --- a/drivers/net/wireless/Kconfig
> +++ b/drivers/net/wireless/Kconfig
> @@ -125,7 +125,7 @@ config PCMCIA_RAYCS
>
> config IPW2100
> tristate "Intel PRO/Wireless 2100 Network Connection"
> - depends on PCI && WLAN_80211
> + depends on PCI && WLAN_80211 && BROKEN
> select WIRELESS_EXT
> select FW_LOADER
> select IEEE80211
>
--
Greetings Michael.
[Evgeniy Polyakov - Sun, Sep 21, 2008 at 11:38:09PM +0400]
| Hi.
|
| On Sun, Sep 21, 2008 at 09:14:04PM +0200, Johannes Berg ([email protected]) wrote:
| > > Do you want me to implement ipw2100 driver as a big work structure
| > > which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
| > > And that will be the fix suggested by Intel? That would explain a lot.
| >
| > I think what Arjan is saying is that it would be better to put pressure
| > on the responsible folks (I don't think Arjan is anywhere near them at
|
| Both maintainers were added to the copy list.
|
| > all) if you'd put in a WARN_ON() for this error and that would make the
| > top entry on kerneloops.org all the time... And additionally put in a
| > workaround for yourself for now.
|
| As I pointed, I can rewrite the whole driver's initialization process,
| so that it looked like init/wait/exit loop, which can be processed at
| the module load and when fatal interrupt fires. Is this a fix? This is
| not even remotely a workaround. We can just add
| rmmod/modprobe/ifdown/ifup to the crontab job. Other users reported in
| bugzilla that they needed to reboot a machine to make the card work
| again. I'm not sure that user tried to do a rmmod/modprobe though.
|
| > And can we keep the flames off this list please? That comment from Wei
| > Weng was absolutely uncalled for, and inciting a flamewar (as you have
| > already blogged) was not really productive either.
|
| If we will keep silence, no one will notice that problem exists.
|
| I do hope this will result in a progress. Arjan, do you agree to add
| this patch to the current tree?
|
| diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
| index 19a401c..9a7b64c 100644
| --- a/drivers/net/wireless/ipw2100.c
| +++ b/drivers/net/wireless/ipw2100.c
| @@ -206,6 +206,8 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
|
| static u32 ipw2100_debug_level = IPW_DL_NONE;
|
| +static int ipw2100_max_fatal_ints = 10;
| +
| #ifdef CONFIG_IPW2100_DEBUG
| #define IPW_DEBUG(level, message...) \
| do { \
| @@ -3174,6 +3176,10 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
| if (inta & IPW2100_INTA_FATAL_ERROR) {
| printk(KERN_WARNING DRV_NAME
| ": Fatal interrupt. Scheduling firmware restart.\n");
| + WARN_ON(1);
| +
| + BUG_ON(ipw2100_max_fatal_ints-- <= 0);
| +
| priv->inta_other++;
| write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
|
|
|
| --
| Evgeniy Polyakov
|
Since it's that serious maybe we should change
IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
priv->net_dev->name, priv->fatal_error);
to printk(KERN_WARNING)? And here is why - as I see now we can't say what
exactly is wrong - Evgeniy said he has a suspicion about firmware so
these WARNs will be collected by Arjan thru kerneloops and we could not
ask users to change debug level and repost problem - oops will have it
by default - and if it really is a firmware problem - firmware engineers could
find this _additional_ info useful and resolve it (probably).
- Cyrill -
On Sun, 2008-09-21 at 23:00 +0400, Evgeniy Polyakov wrote:
> Do you want me to implement ipw2100 driver as a big work structure
> which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
> And that will be the fix suggested by Intel? That would explain a lot.
I think what Arjan is saying is that it would be better to put pressure
on the responsible folks (I don't think Arjan is anywhere near them at
all) if you'd put in a WARN_ON() for this error and that would make the
top entry on kerneloops.org all the time... And additionally put in a
workaround for yourself for now.
And can we keep the flames off this list please? That comment from Wei
Weng was absolutely uncalled for, and inciting a flamewar (as you have
already blogged) was not really productive either.
johannes
On Sun, Sep 21, 2008 at 01:27:53PM -0700, Arjan van de Ven ([email protected]) wrote:
> > Well, I actually wanted to have a bug there because of it, but now I
> > think that annoying repeated warning is enough to bring attention to
> > the problem by putting bug information into some magic special place
> > called kerneloops collection.
>
> are you more interested in bringing attention than finding something
> that makes the driver work ? I sort of am getting that impression and
> I'd be disappointed if that is the case.
I do think that it can not be fixed without serious intervention of the
Intel (hardware) folks, since the bug has existed for more than 4 years in two
firmwares and lots of very different driver versions and was reproduced
even on a 2.4 kernel.
I will experiment with reloading as Alan suggested and will
add/remove more surgery in the initialization process to be able to
'workaround' the issue, since it looks like no one else will.
But that's definitely not a fix and in my personal workaround's 10
degrees shit'o'meter this lies around 12.
> > Consider for inclusion in the upcoming kernel to get wider
> > notifications. Yes, it is not a bugfix, I know.
>
> still more complex than needed; a WARN_ON_ONCE() will be enough.
That allows to dump whatever number of warnings you want. The more we
have, the louder will be customers scream.
--
Evgeniy Polyakov
On Mon, 22 Sep 2008 00:20:57 +0400
Evgeniy Polyakov <[email protected]> wrote:
> On Sun, Sep 21, 2008 at 12:43:32PM -0700, Arjan van de Ven
> ([email protected]) wrote:
> > > @@ -3174,6 +3176,10 @@ static void ipw2100_irq_tasklet(struct
> > > ipw2100_priv *priv) if (inta & IPW2100_INTA_FATAL_ERROR) {
> > > printk(KERN_WARNING DRV_NAME
> > > ": Fatal interrupt. Scheduling firmware
> > > restart.\n");
> > > + WARN_ON(1);
> > > +
> > > + BUG_ON(ipw2100_max_fatal_ints-- <= 0);
> >
> > BUG_ON in interrupt context is just extremely hostile, since it
> > means the box is dead.
> >
> > also I would suggest using WARN_ON_ONCE()
>
> Well, I actually wanted to have a bug there because of it, but now I
> think that annoying repeated warning is enough to bring attention to
> the problem by putting bug information into some magic special place
> called kerneloops collection.
are you more interested in bringing attention than finding something
that makes the driver work ? I sort of am getting that impression and
I'd be disappointed if that is the case.
>
> Consider for inclusion in the upcoming kernel to get wider
> notifications. Yes, it is not a bugfix, I know.
still more complex than needed; a WARN_ON_ONCE() will be enough.
On Mon, Sep 22, 2008 at 12:05:18AM +0400, Cyrill Gorcunov ([email protected]) wrote:
> Since it's that serious maybe we should change
>
> IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
> priv->net_dev->name, priv->fatal_error);
>
> to printk(KERN_WARNING)? And here is why - as I see now we can't say what
> exactly is wrong - Evgeniy said he has a suspicion about firmware so
> these WARNs will be collected by Arjan thru kerneloops and we could not
> ask users to change debug level and repost problem - oops will have it
> by default - and if it really is a firmware problem - firmware engineers could
> find this _additional_ info useful and resolve it (probably).
The only reason for this change is to make a mark at the kerneloops.
I.e. users know, there is a bug. Developers know, there is a bug.
Everyone knows that there is a bug, but until it is at the special place
we look to each other just like there is no bug.
Here are dumps for example:
http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=245
Bug existed even with 1.2 firmware and .11 kernel.
Intel, that's a great marketing slogan: stability everywhere!
--
Evgeniy Polyakov
On Sun, 21 Sep 2008 22:28:38 +0400
Evgeniy Polyakov <[email protected]> wrote:
> On Sun, Sep 21, 2008 at 11:04:22AM -0700, Arjan van de Ven
> ([email protected]) wrote:
> > so now you go from an occasional burp to having nothing at all.
> > How about you run with this patch on your own machine only?
>
> And how else user can get attention to the problem which is not fixed
> by the vendor?
... or the community
> We close our eyes and there is no problem, since we do
> not see it. I just brought a lamp: no user can see that essentially
> driver is broken so he can run it on his own risk.
>
> > or.. since you say a reload of the driver fixes it.. why don't you
> > make a patch for the driver that does basically the actions of a
> > reload automatically when the driver detects the issue?
> > (and stick a WARN_ON in for good measure so that kerneloops.org can
> > start tracking these burps)
>
> It stops after several seconds (or packets?). Sometimes (but rarely)
> it works several minutes, sometimes it fires above dmesg line and
> continues to work, sometimes it fires it for a while and then stops
> writing it, although driver does not send or receive anything (at
> least ifconfig counters do not change).
again.. so how about you detect this condition and do, in the driver
code, the equivalent of rmmod/insmod to the hardware. I'm sure people
who have the hardware would appreciate that type of patch a lot more
than the one you sent out.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 21 Sep 2008 21:23:17 +0400
Evgeniy Polyakov <[email protected]> wrote:
> Hi.
>
> Following bug exists in the ipw2100 driver/firmware for years and
> Intel folks never responded to zillions bugzilla entries and forum
> notices in the internet with some patch or firmware update (although
> did request dmesg and debug info, and received them).
>
> ipw2100: Fatal interrupt. Scheduling firmware restart.
>
> I believe it is a firmware bug because after driver is unloaded and
> loaded back again wireless adapter usually starts working (for small
> amount of time though).
> diff --git a/drivers/net/wireless/Kconfig
> b/drivers/net/wireless/Kconfig index 9931b5a..c24fc6a 100644
> --- a/drivers/net/wireless/Kconfig
> +++ b/drivers/net/wireless/Kconfig
> @@ -125,7 +125,7 @@ config PCMCIA_RAYCS
>
> config IPW2100
> tristate "Intel PRO/Wireless 2100 Network Connection"
> - depends on PCI && WLAN_80211
> + depends on PCI && WLAN_80211 && BROKEN
> select WIRELESS_EXT
> select FW_LOADER
> select IEEE80211
>
so now you go from an occasional burp to having nothing at all.
How about you run with this patch on your own machine only?
or.. since you say a reload of the driver fixes it.. why don't you make
a patch for the driver that does basically the actions of a reload
automatically when the driver detects the issue?
(and stick a WARN_ON in for good measure so that kerneloops.org can
start tracking these burps)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, Sep 21, 2008 at 11:35:13AM -0700, Arjan van de Ven ([email protected]) wrote:
> > And how else user can get attention to the problem which is not fixed
> > by the vendor?
>
> ... or the community
Which does not have access to the firmware... Which IMO is failing and
not the driver itself.
> > It stops after several seconds (or packets?). Sometimes (but rarely)
> > it works several minutes, sometimes it fires above dmesg line and
> > continues to work, sometimes it fires it for a while and then stops
> > writing it, although driver does not send or receive anything (at
> > least ifconfig counters do not change).
>
> again.. so how about you detect this condition and do, in the driver
> code, the equivalent of rmmod/insmod to the hardware. I'm sure people
> who have the hardware would appreciate that type of patch a lot more
> than the one you sent out.
The reset task effectively does ipw2100_up(), so the difference is power
cycles over the PCI bus and enable/disable/request commands. Like this
stuff:
	/* We disable the RETRY_TIMEOUT register (0x41) to keep
	 * PCI Tx retries from interfering with C3 CPU state */
	pci_read_config_dword(pci_dev, 0x40, &val);
	if ((val & 0x0000ff00) != 0)
		pci_write_config_dword(pci_dev, 0x40, val & 0xffff00ff);
I do remember I had a tibet monk course of decoding ipw2100 PCI
config address space, just need to find my kimono.
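The masking in the RETRY_TIMEOUT snippet above can be checked in isolation.
What follows is a minimal userspace sketch, not driver code. It assumes the
dword at offset 0x40 is read as one 32-bit value, so that the register at
0x41 is bits 8..15, as in the quoted code. The names RETRY_TIMEOUT_MASK,
retry_timeout_enabled(), clear_retry_timeout() and retry_timeout_example()
are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: byte 1 (offset 0x41) of the dword read at 0x40. */
#define RETRY_TIMEOUT_MASK 0x0000ff00u

/* Non-zero when the RETRY_TIMEOUT byte is set, mirroring the driver's check. */
static int retry_timeout_enabled(uint32_t val)
{
	return (val & RETRY_TIMEOUT_MASK) != 0;
}

/* Value to write back: same as "val & 0xffff00ff" in the quoted code. */
static uint32_t clear_retry_timeout(uint32_t val)
{
	return val & ~RETRY_TIMEOUT_MASK;
}

/* Usage example with a made-up config dword. */
static void retry_timeout_example(void)
{
	uint32_t val = 0x12348056u;	/* made-up value, byte at 0x41 is 0x80 */

	assert(retry_timeout_enabled(val));
	val = clear_retry_timeout(val);
	assert(val == 0x12340056u);	/* only bits 8..15 were cleared */
	assert(!retry_timeout_enabled(val));
}
```

Only the second byte changes; the other three bytes of the dword are written
back untouched, which is why the driver reads the register first.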
Do you want me to implement ipw2100 driver as a big work structure
which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
And that will be the fix suggested by Intel? That would explain a lot.
P.S. And some people tell that asking for bug bisection is a hard
pressure on user. Vendor has to ask him to fix bug himself instead,
and that will be a solution!
Given the fact that rmmod/insmod does not always fix the problem (but
does most of the time, for a short period of time), I again want to point out
that it looks like a firmware problem related to some inner timings. You
ask me to fix the driver and do not even listen to what I said
previously and do not take that into account and analyze.
--
Evgeniy Polyakov
Arjan van de Ven wrote:
> On Sun, 21 Sep 2008 22:28:38 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
>> On Sun, Sep 21, 2008 at 11:04:22AM -0700, Arjan van de Ven
>> ([email protected]) wrote:
>>> so now you go from an occasional burp to having nothing at all.
>>> How about you run with this patch on your own machine only?
>> And how else user can get attention to the problem which is not fixed
>> by the vendor?
>
> ... or the community
>
>> We close our eyes and there is no problem, since we do
>> not see it. I just brought a lamp: no user can see that essentially
>> driver is broken so he can run it on his own risk.
>>
>>> or.. since you say a reload of the driver fixes it.. why don't you
>>> make a patch for the driver that does basically the actions of a
>>> reload automatically when the driver detects the issue?
>>> (and stick a WARN_ON in for good measure so that kerneloops.org can
>>> start tracking these burps)
>> It stops after several seconds (or packets?). Sometimes (but rarely)
>> it works several minutes, sometimes it fires above dmesg line and
>> continues to work, sometimes it fires it for a while and then stops
>> writing it, although driver does not send or receive anything (at
>> least ifconfig counters do not change).
>
> again.. so how about you detect this condition and do, in the driver
> code, the equivalent of rmmod/insmod to the hardware. I'm sure people
> who have the hardware would appreciate that type of patch a lot more
> than the one you sent out.
>
I guess it is your way of "middle finger" to all the IPW2100 customers who try
to use it on a Linux machine.
Thanks
Wei
On Sun, Sep 21, 2008 at 11:04:22AM -0700, Arjan van de Ven ([email protected]) wrote:
> so now you go from an occasional burp to having nothing at all.
> How about you run with this patch on your own machine only?
And how else user can get attention to the problem which is not fixed by
the vendor? We close our eyes and there is no problem, since we do not
see it. I just brought a lamp: no user can see that essentially driver
is broken so he can run it on his own risk.
> or.. since you say a reload of the driver fixes it.. why don't you make
> a patch for the driver that does basically the actions of a reload
> automatically when the driver detects the issue?
> (and stick a WARN_ON in for good measure so that kerneloops.org can
> start tracking these burps)
It stops after several seconds (or packets?). Sometimes (but rarely)
it works several minutes, sometimes it fires above dmesg line and
continues to work, sometimes it fires it for a while and then stops
writing it, although driver does not send or receive anything (at
least ifconfig counters do not change).
Actually, I do not think it is a driver problem, since what it does is
pretty much straightforward, but if you tell me how else we can fix
this issue, I will print it and glue it near the window so this gotcha
could be used with other problems. Or you can (as everyone else does)
just say that this is damn wrong and forget about the problem for the next
several years.
--
Evgeniy Polyakov
On Sun, Sep 21, 2008 at 07:36:00PM +0200, Michael Buesch ([email protected]) wrote:
> > Intel folks, please fix this problem, I see no other way to force you to do
> > this than to mark ipw2100 driver as broken, since that is what it is.
>
> You are pretty funny, actually. :)
>
> I think the bug should be fixed, but what makes _you_ think you can _force_
> anybody to do so?
Maybe because I bought that adapter and it stopped working and Intel
knows about this bug and does not fix it for years?
--
Evgeniy Polyakov
On Sun, Sep 21, 2008 at 12:43:32PM -0700, Arjan van de Ven ([email protected]) wrote:
> > @@ -3174,6 +3176,10 @@ static void ipw2100_irq_tasklet(struct
> > ipw2100_priv *priv) if (inta & IPW2100_INTA_FATAL_ERROR) {
> > printk(KERN_WARNING DRV_NAME
> > ": Fatal interrupt. Scheduling firmware
> > restart.\n");
> > + WARN_ON(1);
> > +
> > + BUG_ON(ipw2100_max_fatal_ints-- <= 0);
>
> BUG_ON in interrupt context is just extremely hostile, since it means
> the box is dead.
>
> also I would suggest using WARN_ON_ONCE()
Well, I actually wanted to have a bug there because of it, but now I
think that annoying repeated warning is enough to bring attention to the
problem by putting bug information into some magic special place called
kerneloops collection.
Consider for inclusion in the upcoming kernel to get wider
notifications. Yes, it is not a bugfix, I know.
diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..6599211 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -206,6 +206,9 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
static u32 ipw2100_debug_level = IPW_DL_NONE;
+static int ipw2100_max_fatal_ints = 10;
+module_param(ipw2100_max_fatal_ints, int, 0644);
+
#ifdef CONFIG_IPW2100_DEBUG
#define IPW_DEBUG(level, message...) \
do { \
@@ -3174,6 +3177,9 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
if (inta & IPW2100_INTA_FATAL_ERROR) {
printk(KERN_WARNING DRV_NAME
": Fatal interrupt. Scheduling firmware restart.\n");
+
+ WARN_ON(ipw2100_max_fatal_ints-- >= 0);
+
priv->inta_other++;
write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
--
Evgeniy Polyakov
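To compare the two warning styles argued about in this thread, here is a
minimal userspace model (not kernel code). warn_counted() mimics the
condition in WARN_ON(ipw2100_max_fatal_ints-- >= 0) from the patch above,
and warn_once() mimics the single-shot behaviour of WARN_ON_ONCE(). The
helper names are hypothetical. One detail the model makes visible: with an
initial value of 10 the counted form fires 11 times (for 10 down to 0).

```c
#include <assert.h>

/* Models WARN_ON(budget-- >= 0): returns 1 when the warning would fire.
 * The counter keeps decrementing on every call, as in the patch. */
static int warn_counted(int *budget)
{
	return (*budget)-- >= 0;
}

/* Models a WARN_ON_ONCE()-style check: fires on the first call only. */
static int warn_once(int *already_warned)
{
	if (*already_warned)
		return 0;
	*already_warned = 1;
	return 1;
}

/* How many of `events` fatal interrupts produce a counted warning. */
static int count_counted_warnings(int budget, int events)
{
	int fired = 0;

	assert(events >= 0);
	for (int i = 0; i < events; i++)
		fired += warn_counted(&budget);
	return fired;
}

/* How many of `events` fatal interrupts produce a one-shot warning. */
static int count_once_warnings(int events)
{
	int warned = 0, fired = 0;

	assert(events >= 0);
	for (int i = 0; i < events; i++)
		fired += warn_once(&warned);
	return fired;
}
```

Under these assumptions, 100 fatal interrupts give 11 counted warnings per
module load versus 1 for the one-shot form; either is enough for a report
to show up, the counted form just repeats the same backtrace.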
> Given the fact that rmmod/insmod does not always fix the problem (but
> does most of the time, for a short period of time), I again want to point out
> that it looks like a firmware problem related to some inner timings. You
> ask me to fix the driver and do not even listen to what I said
> previously and do not take that into account and analyze.
Try putting it into D3, counting to 10 and powering it back up. That's
about as close as you can get to pulling the plug when it hangs.
Alan
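One way the suggestions in this thread could be combined (soft restart
first, then a D3/D0 power cycle) is an escalation policy. Below is a rough
userspace sketch of the decision logic only, assuming the driver counts
consecutive failed recoveries. All names and limits (MAX_SOFT_RESTARTS,
MAX_POWER_CYCLES, next_recovery_step(), recovery_succeeded()) are
hypothetical, and the hardware steps themselves (for example
pci_set_power_state() to PCI_D3hot and back to PCI_D0) are not modelled.

```c
#include <assert.h>

/* Hypothetical limits, chosen only for illustration. */
#define MAX_SOFT_RESTARTS 3
#define MAX_POWER_CYCLES  2

enum recovery_step {
	RECOVER_SOFT_RESTART,	/* what the existing reset task does */
	RECOVER_POWER_CYCLE,	/* D3, wait, back to D0 */
	RECOVER_GIVE_UP		/* stop retrying, leave the device down */
};

struct recovery_state {
	int soft_restarts;	/* soft restarts since the last power cycle */
	int power_cycles;	/* power cycles since the last success */
};

/* Pick the next step after a fatal interrupt and update the counters. */
static enum recovery_step next_recovery_step(struct recovery_state *st)
{
	assert(st);
	if (st->soft_restarts < MAX_SOFT_RESTARTS) {
		st->soft_restarts++;
		return RECOVER_SOFT_RESTART;
	}
	if (st->power_cycles < MAX_POWER_CYCLES) {
		st->power_cycles++;
		st->soft_restarts = 0;	/* retry soft restarts after the cycle */
		return RECOVER_POWER_CYCLE;
	}
	return RECOVER_GIVE_UP;
}

/* Call once the device has been working again for a while. */
static void recovery_succeeded(struct recovery_state *st)
{
	assert(st);
	st->soft_restarts = 0;
	st->power_cycles = 0;
}
```

With these limits the fourth consecutive failure triggers a power cycle and
the twelfth gives up. Whether a power cycle helps on this hardware is
exactly what still has to be tested.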
On Sun, Sep 21, 2008 at 02:02:03PM -0700, Arjan van de Ven ([email protected]) wrote:
> > That allows to dump whatever number of warnings you want. The more we
> > have, the louder will be customers scream.
>
> artificially increasing numbers isn't going to do that; it just shows
> you're more interested in making a stink than in getting something
> improved ;(
As practice shows, I'm the only one who is interested in getting
something improved, and Intel, as we see right now, is not interested in
it at all, since you ask me not only decrease error verbosity, but also
do not work towards fixing the bug by trying to understand where it
lives.
--
Evgeniy Polyakov
On Sun, Sep 21, 2008 at 08:57:44PM +0100, Alan Cox ([email protected]) wrote:
> Try putting it into D3, counting to 10 and powering it back up. That's
> about as close as you can get to pulling the plug when it hangs.
I will experiment with this, thanks Alan.
Unfortunately my machine takes about 10 minutes to build just this
updated driver, so results will not appear too quickly.
I will start tests tomorrow.
--
Evgeniy Polyakov
On Mon, 22 Sep 2008 00:57:06 +0400
Evgeniy Polyakov <[email protected]> wrote:
> > > Consider for inclusion in the upcoming kernel to get wider
> > > notifications. Yes, it is not a bugfix, I know.
> >
> > still more complex than needed; a WARN_ON_ONCE() will be enough.
>
> That allows to dump whatever number of warnings you want. The more we
> have, the louder will be customers scream.
artificially increasing numbers isn't going to do that; it just shows
you're more interested in making a stink than in getting something
improved ;(
>
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 21 Sep 2008 14:52:37 -0400
Wei Weng <[email protected]> wrote:
> > again.. so how about you detect this condition and do, in the driver
> > code, the equivalent of rmmod/insmod to the hardware. I'm sure
> > people who have the hardware would appreciate that type of patch a
> > lot more than the one you sent out.
> >
>
> I guess it is your way of "middle finger" to all the IPW2100
> customers who try to use it on a Linux machine.
if suggesting a workaround is giving the middle finger in your mind,
then I don't think it's worth my time to discuss this further with you.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Hi Evgeniy,
> Following bug exists in the ipw2100 driver/firmware for years and Intel
> folks never responded to zillions bugzilla entries and forum notices in
> the internet with some patch or firmware update (although did request
> dmesg and debug info, and received them).
>
> ipw2100: Fatal interrupt. Scheduling firmware restart.
>
> I believe it is a firmware bug because after driver is unloaded and
> loaded back again wireless adapter usually starts working (for small
> amount of time though). My conspiracy feeling can suggest, that it may
> be kind of a force to buy a new one, or trivial error in the firmware,
> when it writes to the same place in the flash and essentially given cell
> became dead or whatever else.
I don't know if it is for this bug or a different one, but Matthew
Garrett seems to have some pending patches. At least that is what he told
me at PlumbersConf. Let's see if these patches help. And please follow
up on Arjan's suggestion and put a WARN_ON in the upstream code
instead of waving CONFIG_BROKEN around.
Regards
Marcel
[Evgeniy Polyakov - Mon, Sep 22, 2008 at 12:26:56AM +0400]
| On Mon, Sep 22, 2008 at 12:05:18AM +0400, Cyrill Gorcunov ([email protected]) wrote:
| > Since it's that serious maybe we should change
| >
| > IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
| > priv->net_dev->name, priv->fatal_error);
| >
| > to printk(KERN_WARNING)? And here is why - as I see now we can't say what
| > exactly is wrong - Evgeniy said he has a suspicion about firmware so
| > these WARNs will be collected by Arjan thru kerneloops and we could not
| > ask users to change debug level and repost problem - oops will have it
| > by default - and if it really is a firmware problem - firmware engineers could
| > find this _additional_ info useful and resolve it (probably).
|
| The only reason for this change is to make a mark at the kerneloops.
| I.e. users know, there is a bug. Developers know, there is a bug.
| Everyone knows that there is a bug, but until it is at the special place
| we look to each other just like there is no bug.
|
| Here are dumps for example:
| http://www.intellinuxwireless.org/bugzilla/show_bug.cgi?id=245
|
| Bug existed even with 1.2 firmware and .11 kernel.
| Intel, that's a great marketing slogan: stability everywhere!
|
| --
| Evgeniy Polyakov
|
From dump:
> Sep 25 11:31:39 suino wlan0: TX timed out. Scheduling firmware restart.
yes Evgeniy - all could know that but this register info could help
firmware engineers to distinguish problems (without additional effort
like asking users to pass a debug argument - kerneloops will have it
by default) if more than one exists. I mean I don't think anyone
would ever reject additional info about a problem :)
- Cyrill -
Hi.
On Sun, Sep 21, 2008 at 09:14:04PM +0200, Johannes Berg ([email protected]) wrote:
> > Do you want me to implement ipw2100 driver as a big work structure
> > which will run ipw2100_init()/wait/ipw2100_exit() in a loop?
> > And that will be the fix suggested by Intel? That would explain a lot.
>
> I think what Arjan is saying is that it would be better to put pressure
> on the responsible folks (I don't think Arjan is anywhere near them at
Both maintainers were added to the copy list.
> all) if you'd put in a WARN_ON() for this error and that would make the
> top entry on kerneloops.org all the time... And additionally put in a
> workaround for yourself for now.
As I pointed, I can rewrite the whole driver's initialization process,
so that it looked like init/wait/exit loop, which can be processed at
the module load and when fatal interrupt fires. Is this a fix? This is
not even remotely a workaround. We can just add
rmmod/modprobe/ifdown/ifup to the crontab job. Other users reported in
bugzilla that they needed to reboot a machine to make the card work
again. I'm not sure that user tried to do a rmmod/modprobe though.
> And can we keep the flames off this list please? That comment from Wei
> Weng was absolutely uncalled for, and inciting a flamewar (as you have
> already blogged) was not really productive either.
If we will keep silence, no one will notice that problem exists.
I do hope this will result in a progress. Arjan, do you agree to add
this patch to the current tree?
diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..9a7b64c 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -206,6 +206,8 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
static u32 ipw2100_debug_level = IPW_DL_NONE;
+static int ipw2100_max_fatal_ints = 10;
+
#ifdef CONFIG_IPW2100_DEBUG
#define IPW_DEBUG(level, message...) \
do { \
@@ -3174,6 +3176,10 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
if (inta & IPW2100_INTA_FATAL_ERROR) {
printk(KERN_WARNING DRV_NAME
": Fatal interrupt. Scheduling firmware restart.\n");
+ WARN_ON(1);
+
+ BUG_ON(ipw2100_max_fatal_ints-- <= 0);
+
priv->inta_other++;
write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
--
Evgeniy Polyakov
On Mon, Sep 22, 2008 at 12:35:03AM +0400, Cyrill Gorcunov ([email protected]) wrote:
> yes Evgeniy - all could know that but this register info could help
> firmware engineers to distinguish problems (without additional effort
> like asking users to pass a debug argument - kerneloops will have it
> by default) if more than one exists. I mean I don't think anyone
> would ever reject additional info about a problem :)
Agreed.
diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..36cdd57 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -206,6 +206,9 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
static u32 ipw2100_debug_level = IPW_DL_NONE;
+static int ipw2100_max_fatal_ints = 10;
+module_param(ipw2100_max_fatal_ints, int, 0644);
+
#ifdef CONFIG_IPW2100_DEBUG
#define IPW_DEBUG(level, message...) \
do { \
@@ -3174,16 +3177,21 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
if (inta & IPW2100_INTA_FATAL_ERROR) {
printk(KERN_WARNING DRV_NAME
": Fatal interrupt. Scheduling firmware restart.\n");
+
+ printk(KERN_WARNING DRV_NAME ": INTA: 0x%08lX\n",
+ (unsigned long)inta & IPW_INTERRUPT_MASK);
+
priv->inta_other++;
write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
read_nic_dword(dev, IPW_NIC_FATAL_ERROR, &priv->fatal_error);
- IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
+ printk(KERN_WARNING "%s: Fatal error value: 0x%08X\n",
priv->net_dev->name, priv->fatal_error);
read_nic_dword(dev, IPW_ERROR_ADDR(priv->fatal_error), &tmp);
- IPW_DEBUG_INFO("%s: Fatal error address value: 0x%08X\n",
+ printk(KERN_WARNING "%s: Fatal error address value: 0x%08X\n",
priv->net_dev->name, tmp);
+ WARN_ON(ipw2100_max_fatal_ints-- >= 0);
/* Wake up any sleeping jobs */
schedule_reset(priv);
--
Evgeniy Polyakov
On Sun, 21 Sep 2008 23:38:09 +0400
Evgeniy Polyakov <[email protected]> wrote:
>
> As I pointed out, I can rewrite the driver's whole initialization process
> so that it looks like an init/wait/exit loop, which can be run at
> module load and when a fatal interrupt fires.
yes please do this.
>
> I do hope this will result in progress. Arjan, do you agree to add
> this patch to the current tree?
>
> diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
> index 19a401c..9a7b64c 100644
> --- a/drivers/net/wireless/ipw2100.c
> +++ b/drivers/net/wireless/ipw2100.c
> @@ -206,6 +206,8 @@ MODULE_PARM_DESC(disable, "manually disable the radio (default 0 [radio on])");
> static u32 ipw2100_debug_level = IPW_DL_NONE;
>
> +static int ipw2100_max_fatal_ints = 10;
> +
> #ifdef CONFIG_IPW2100_DEBUG
> #define IPW_DEBUG(level, message...) \
> do { \
> @@ -3174,6 +3176,10 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
> if (inta & IPW2100_INTA_FATAL_ERROR) {
> printk(KERN_WARNING DRV_NAME
> ": Fatal interrupt. Scheduling firmware restart.\n");
> + WARN_ON(1);
> +
> + BUG_ON(ipw2100_max_fatal_ints-- <= 0);
BUG_ON in interrupt context is just extremely hostile, since it means
the box is dead.
also I would suggest using WARN_ON_ONCE()
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
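The reporting policies argued over here (one report per boot, or a bounded number of reports, and never halting the box from interrupt context) can be compared with a small userspace model. This is only a sketch, not kernel code: `warn_budget`, `warn_limited()` and `count_reports()` are hypothetical names invented for illustration. A budget of 1 is meant to behave like WARN_ON_ONCE(), a budget of 10 like the `ipw2100_max_fatal_ints` counter in the patch.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-call-site state: how many reports may still be printed. */
struct warn_budget {
	int remaining;
};

/*
 * Report a condition while the budget lasts, then stay silent.
 * Returns true only when a report was actually printed. Nothing here
 * stops the program, which is the point of warning instead of BUG_ON().
 */
bool warn_limited(struct warn_budget *b, bool cond, const char *what)
{
	if (!cond || b->remaining <= 0)
		return false;
	b->remaining--;
	fprintf(stderr, "WARNING: %s\n", what);
	return true;
}

/* Simulate `events` fatal interrupts and count how many get reported. */
int count_reports(int budget, int events)
{
	struct warn_budget b = { .remaining = budget };
	int printed = 0;

	for (int i = 0; i < events; i++)
		if (warn_limited(&b, true, "fatal interrupt"))
			printed++;
	return printed;
}
```

In this model, 25 simulated interrupts produce one report with a budget of 1 and ten reports with a budget of 10. The counter stops at zero instead of continuing to decrement.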
On Mon, Sep 22, 2008 at 01:15:59AM +0300, Denys Fedoryshchenko ([email protected]) wrote:
> > Just to be clear, did it take 4 years? :)
> Any bugzilla entry? I cannot find anything about this bug on
> http://www.intellinuxwireless.org/bugzilla/
>
> I submitted two reports in my case, one in the kernel bugzilla, one in the
> Intel Linux wireless project.
Lucky you :)
--
Evgeniy Polyakov
On Sun, Sep 21, 2008 at 11:38:19PM +0100, Alan Cox ([email protected]) wrote:
> > > still more complex than needed; a WARN_ON_ONCE() will be enough.
> >
> > That allows dumping as many warnings as you want. The more we
> > have, the louder customers will scream.
>
> But if Intel don't care then you can scream all you like 8)
That's what happens :)
> A WARN_ON_ONCE is sufficient to capture an idea of how many people it is
> affecting and maybe to figure out what the trigger is from their reports,
> at that point there is some chance to get it fixed (especially if it's
> remotely triggerable ;))
Well, the Red Hat, SUSE and Ubuntu bugzillas turned out not to be enough. Why
do you believe a single warning in a new place will be? Or a couple of dozen,
or whatever else? If it cares, it cares. If it does not...
I attracted the vendor's attention; the vendor told me to fix it myself and
to create a patch that fills an entry in another 'bugzilla', so that the
vendor could get results and probably decide to walk down from the cloud and
fix it.
So, if they do not care, I do not care about their care. That's the
deal. I will try to find a workaround, even if it is real crap;
hopefully other users will not hit this bug too frequently.
--
Evgeniy Polyakov
Try D3ing the chip in the firmware restart code. Yes, it's retarded.
--
Matthew Garrett | [email protected]
On Sun, Sep 21, 2008 at 08:57:44PM +0100, Alan Cox ([email protected]) wrote:
> > Given that rmmod/insmod does not always fix the problem (though most
> > of the time it does, for a short while), I want to point out again
> > that it looks like a firmware problem related to some internal timings.
> > You ask me to fix the driver, yet you do not even listen to what I said
> > previously, take it into account or analyze it.
>
> Try putting it into D3, counting to 10 and powering it back up. That's
> about as close as you can get to pulling the plug when it hangs.
I made several experiments with power states in the reset handler,
such as putting the device into D3 (hot), disabling it, and
saving/restoring state.
Fatal interrupts continue to fire at essentially the same rate.
The same error address does not always contain the same error value, but
the values frequently come from a small finite set.
Here is some data:
[41773.200686] ipw2100: Fatal interrupt. Scheduling firmware restart.
[41773.200707] eth1: Fatal error value: 0x500185B8, address: 0x08004501, inta: 0x40000000
[41773.200810] ipw2100 0000:02:04.0: PCI INT A disabled
[41773.203110] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.224446] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.245781] ipw2100: IRQ INTA == 0xFFFFFFFF
[41773.249360] ipw2100 0000:02:04.0: enabling device (0000 -> 0002)
[41773.249384] ipw2100 0000:02:04.0: PCI INT A -> Link[C0C8] -> GSI 11 (level, low) -> IRQ 11
[41773.249426] ipw2100 0000:02:04.0: restoring config space at offset 0x1 (was 0x2900002, writing 0x2900006)
That is quite harmless, since the interrupt handler just sees that the
device is disappearing. This made me think more about interrupt processing
(the irq handler and its tasklet), and I found races between the interrupt
tasklet, the ipw2100_wx_event_work() handler, the reset task and probably
others. Register access is protected by a lock in some cases (interrupt
handler) and not in others (all the rest). Every user first disables
interrupts, but an interrupt may be in the middle of being handled at that
moment, with the tasklet already scheduled. Also, the priv->status field is
frequently accessed and modified both with and without locks. This may be
harmless, but it is still a red flag.
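The concern about mixed locked and unlocked access can be illustrated with a userspace model in which every path touches the pending-interrupt word and the status word only through helpers that take one shared lock. This is a sketch under stated assumptions, not the driver's code: `model_dev`, the `MODEL_*` constants and all `model_*()` functions are hypothetical, and a pthread mutex stands in for whatever lock the kernel code would use (there it would more likely be a spinlock taken with interrupts disabled).

```c
#include <pthread.h>
#include <stdint.h>

#define MODEL_INTA_FATAL   0x40000000u
#define MODEL_STATUS_RESET 0x00000001u

/* Hypothetical stand-in for the device state shared between contexts. */
struct model_dev {
	pthread_mutex_t lock;	/* guards both fields below */
	uint32_t inta;		/* pending interrupt bits */
	uint32_t status;	/* driver status flags */
};

void model_dev_init(struct model_dev *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->inta = 0;
	d->status = 0;
}

/* Interrupt side: latch pending bits. */
void model_raise(struct model_dev *d, uint32_t bits)
{
	pthread_mutex_lock(&d->lock);
	d->inta |= bits;
	pthread_mutex_unlock(&d->lock);
}

/*
 * Tasklet side: read and acknowledge the pending bits and update the
 * status word in one critical section, so no other path can observe a
 * half-handled interrupt. Returns the bits that were pending.
 */
uint32_t model_handle(struct model_dev *d)
{
	uint32_t pending;

	pthread_mutex_lock(&d->lock);
	pending = d->inta;
	d->inta = 0;
	if (pending & MODEL_INTA_FATAL)
		d->status |= MODEL_STATUS_RESET;
	pthread_mutex_unlock(&d->lock);
	return pending;
}

/* Reset-task side: consume the reset request under the same lock. */
int model_take_reset(struct model_dev *d)
{
	int requested;

	pthread_mutex_lock(&d->lock);
	requested = (d->status & MODEL_STATUS_RESET) != 0;
	d->status &= ~MODEL_STATUS_RESET;
	pthread_mutex_unlock(&d->lock);
	return requested;
}
```

The design choice shown is simply that no field is ever read or written outside the lock, so a reset request is seen exactly once no matter which context gets there first.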
More data for the same failing address:
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x5000CEE4, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
eth1: Fatal error value: 0x50018584, address: 0x61C00000, inta: 0x40000000
The values are quite limited. At a different address I saw a wider set of
values, but still a limited one. Addresses and values of the fatal
interrupt do not follow any immediately obvious pattern, so this looks more
like the firmware losing its mind. The reason for this may actually be the
register access races described above. I will continue experimenting with
this.
--
Evgeniy Polyakov
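Counting how often each fatal error value shows up in such a log is easy to automate. The following is a minimal sketch that assumes the line format printed by the patch in this thread ("Fatal error value: 0x..., address: 0x..., inta: 0x..."). The names `value_count`, `parse_fatal_line()`, `tally_value()` and `tally_lines()` are hypothetical helpers, not part of the driver.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DISTINCT 16

/* One distinct "Fatal error value" and how often it was seen. */
struct value_count {
	unsigned int value;
	int count;
};

/*
 * Parse one log line. Returns 1 and fills value/address when the line
 * contains a fatal error report, 0 otherwise. Only the first two fields
 * are read, so a line whose "inta:" part was wrapped still matches.
 */
int parse_fatal_line(const char *line, unsigned int *value,
		     unsigned int *address)
{
	const char *p = strstr(line, "Fatal error value:");

	if (!p)
		return 0;
	return sscanf(p, "Fatal error value: 0x%x, address: 0x%x",
		      value, address) == 2;
}

/*
 * Record one occurrence of value in the table (linear scan, the table is
 * tiny). Returns the new number of distinct entries; values that do not
 * fit into MAX_DISTINCT slots are dropped.
 */
int tally_value(struct value_count *table, int n, unsigned int value)
{
	for (int i = 0; i < n; i++) {
		if (table[i].value == value) {
			table[i].count++;
			return n;
		}
	}
	if (n < MAX_DISTINCT) {
		table[n].value = value;
		table[n].count = 1;
		n++;
	}
	return n;
}

/* Tally all matching lines; returns the number of distinct values. */
int tally_lines(const char *const *lines, int nlines,
		struct value_count *table)
{
	unsigned int value, address;
	int n = 0;

	for (int i = 0; i < nlines; i++)
		if (parse_fatal_line(lines[i], &value, &address))
			n = tally_value(table, n, value);
	return n;
}
```

For the ten reports quoted above at address 0x61C00000, such a tally gives two distinct values: 0x50018584 seven times and 0x5000CEE4 three times.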
> Just to be clear, did it take 4 years? :)
Any bugzilla entry? I cannot find anything about this bug on
http://www.intellinuxwireless.org/bugzilla/
I submitted two reports in my case, one in the kernel bugzilla, one in the
Intel Linux wireless project.
Evgeniy, you're bordering on being an asshole, if not actually
being one.
If you behaved this way for a bug I was responsible for, I would
absolutely ignore you until you settled down and started to behave
more reasonably.
You're acting like a bomb which is about to explode, which is probably
why the actual Intel maintainers for this driver don't want to touch
you with a ten foot pole. You're being volatile and extremely
unpleasant to interact with about this issue.
The Intel folks replying to you right now are general Intel linux
folks who are trying to help you, not the driver maintainers who can
look into the firmware and attack that angle. So give them A FUCKING
BREAK!
Getting the OOPS to kerneloops.org is the way forward and will help
your cause, whether you believe it or not.
On Sun, Sep 21, 2008 at 04:48:29PM -0700, David Miller ([email protected]) wrote:
> Evgeniy, you're bordering on being an asshole, if not actually
> being one.
Out of curiosity, what's worse: being an asshole and pretending to be good,
or vice versa? Whatever... :)
> If you behaved this way for a bug I was responsible for, I would
> absolutely ignore you until you settled down and started to behave
> more reasonably.
That's the main point: 'until you started to behave more reasonably'.
For example, by filing another bug in the rh/suse/ubuntu bugzilla?
Put yourself in the user's place, and suddenly the picture changes
dramatically.
We got some progress on this bug: at least there is a direct suggestion
from Matthew about the power state. If it fixes the issue, I think it is
a good deal: one bug fix for a lot of users in exchange for mail in the
killfile and a worsened 'reputation'.
I provided a patch like Arjan wanted, and it can only change something
because of all these talks I started by being an asshole. In my opinion.
Maybe there were some other ways around, but it looks like being
provocative is the only way to get to the cloud. Who knows :)
--
Evgeniy Polyakov
On Mon, Sep 22, 2008 at 01:27:47AM +0200, Marcel Holtmann ([email protected]) wrote:
> as Arjan and Alan pointed out already, WARN_ON_ONCE is enough and I
> agree with them. Just to make this perfectly clear, this is with my
> community hat on.
>
> Please send a proper patch with a simple WARN_ON_ONCE and I am happy to
> sign off on it.
I really do not care whether there is a warning at all, I just want that
bug to be fixed. And as we can see, something has started to change, and
that's probably a good sign. I am glad there is a result. I will check D3
states tomorrow. Patch attached, in case you think it is still needed.
diff --git a/drivers/net/wireless/ipw2100.c b/drivers/net/wireless/ipw2100.c
index 19a401c..637dc05 100644
--- a/drivers/net/wireless/ipw2100.c
+++ b/drivers/net/wireless/ipw2100.c
@@ -3174,16 +3174,18 @@ static void ipw2100_irq_tasklet(struct ipw2100_priv *priv)
if (inta & IPW2100_INTA_FATAL_ERROR) {
printk(KERN_WARNING DRV_NAME
": Fatal interrupt. Scheduling firmware restart.\n");
+
priv->inta_other++;
write_register(dev, IPW_REG_INTA, IPW2100_INTA_FATAL_ERROR);
read_nic_dword(dev, IPW_NIC_FATAL_ERROR, &priv->fatal_error);
- IPW_DEBUG_INFO("%s: Fatal error value: 0x%08X\n",
- priv->net_dev->name, priv->fatal_error);
-
read_nic_dword(dev, IPW_ERROR_ADDR(priv->fatal_error), &tmp);
- IPW_DEBUG_INFO("%s: Fatal error address value: 0x%08X\n",
- priv->net_dev->name, tmp);
+
+ printk(KERN_WARNING "%s: Fatal error value: 0x%08X, "
+ "address: 0x%08X, inta: 0x%08lX\n",
+ priv->net_dev->name, priv->fatal_error, tmp,
+ (unsigned long)inta & IPW_INTERRUPT_MASK);
+ WARN_ON_ONCE(1);
/* Wake up any sleeping jobs */
schedule_reset(priv);
--
Evgeniy Polyakov
On Sun, Sep 21, 2008 at 09:35:31PM +0200, Marcel Holtmann wrote:
> I don't know if it is for this bug or a different one, but Matthew
> Garrett seems to have some pending patches. At least that is what he told
> me at PlumbersConf. Let's see if these patches do help. And please follow
> up with Arjan's suggestion and put a WARN_ON in the upstream code
> instead of waving CONFIG_BROKEN around.
The fix I had for this was actually for ipw2200, but it ought to be
applicable for 2100 as well. The ideal fix is probably to ensure that
ipw*_down D3s the card and *_up D0s it, which brings enhanced runtime
power saving and also has the nice side effect of actually resetting the
damned POS in error cases.
--
Matthew Garrett | [email protected]
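The idea of tying the power state to the down/up paths, so that the error path becomes a full power cycle, can be sketched as a small userspace state model. Everything here is an illustrative assumption: `model_card`, `model_down()`, `model_up()` and `model_restart()` are hypothetical names, and the model simply assumes that the card resets itself and loses its configuration on the D3hot to D0 transition. A real driver would go through the kernel's PCI power management helpers, and whether this particular hardware actually resets is not something the model can show.

```c
#include <stdint.h>

enum model_power { MODEL_D0, MODEL_D3HOT };

/* Hypothetical card: power state, one config word, firmware health. */
struct model_card {
	enum model_power power;
	uint32_t config;	/* stands in for PCI config space */
	uint32_t saved_config;	/* copy taken before powering down */
	int fw_running;		/* 1 while the firmware is alive */
	int resets;		/* internal resets performed so far */
};

/* "down": stop the firmware, save the config, drop to D3hot. */
void model_down(struct model_card *c)
{
	c->fw_running = 0;
	c->saved_config = c->config;
	c->power = MODEL_D3HOT;
}

/*
 * "up": return to D0. The model assumes the D3hot -> D0 transition resets
 * the card and wipes its config, so the saved copy is written back before
 * the firmware is started again.
 */
void model_up(struct model_card *c)
{
	if (c->power == MODEL_D3HOT) {
		c->config = 0;		/* assumed: context lost in D3hot */
		c->resets++;
		c->power = MODEL_D0;
		c->config = c->saved_config;
	}
	c->fw_running = 1;
}

/* Error path: handle a fatal interrupt as a full down/up cycle. */
void model_restart(struct model_card *c)
{
	model_down(c);
	model_up(c);
}
```

With this structure there is no separate "firmware restart" path to maintain: the recovery code is the same code that runs on every interface down and up.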
> > still more complex than needed; a WARN_ON_ONCE() will be enough.
>
> That allows dumping as many warnings as you want. The more we
> have, the louder customers will scream.
But if Intel don't care then you can scream all you like 8)
A WARN_ON_ONCE is sufficient to capture an idea of how many people it is
affecting and maybe to figure out what the trigger is from their reports,
at that point there is some chance to get it fixed (especially if it's
remotely triggerable ;))
Alan
On Sun, Sep 21, 2008 at 11:42:10PM +0100, Matthew Garrett ([email protected]) wrote:
> Try D3ing the chip in the firmware restart code. Yes, it's retarded.
Thank you, I will start tests tomorrow.
--
Evgeniy Polyakov
On Monday 22 September 2008, Evgeniy Polyakov wrote:
> On Sun, Sep 21, 2008 at 02:02:03PM -0700, Arjan van de Ven
> ([email protected]) wrote:
> > > That allows dumping as many warnings as you want. The more we
> > > have, the louder customers will scream.
> >
> > artificially increasing numbers isn't going to do that; it just shows
> > you're more interested in making a stink than in getting something
> > improved ;(
>
> As practice shows, I'm the only one who is interested in getting
> something improved, and Intel, as we see right now, is not interested in
> it at all, since you ask me not only decrease error verbosity, but also
> do not work towards fixing the bug by trying to understand where it
> lives.
You are not right. I had a totally dysfunctional Intel driver on two laptops
and reported the issue to Intel. Yes, it took time: they took all the debug
output and went into coma mode (or so I thought), but suddenly I got mail
from them, and the next kernel/firmware release worked flawlessly for me.
So they did a perfect job.
Don't be negative and prepare yourself for giving long debug outputs. Patience
and only patience.
I gotta admit, those "firmware restarts" were pretty annoying, and I'd
always wondered why Intel themselves couldn't be bothered to fix 'em.
-Kenny
--
Kenneth R. Crudup Sr. SW Engineer, Scott County Consulting, Los Angeles
O: 3630 S. Sepulveda Blvd. #138, L.A., CA 90034-6809 (888) 454-8181
Hi Marcel.
On Sun, Sep 21, 2008 at 09:35:31PM +0200, Marcel Holtmann ([email protected]) wrote:
> I don't know if it is for this bug or a different one, but Matthew
> Garrett seems to have some pending patches. At least that is what he told
> me at PlumbersConf. Let's see if these patches do help. And please follow
> up with Arjan's suggestion and put a WARN_ON in the upstream code
> instead of waving CONFIG_BROKEN around.
I expect it is something new, since this bug has existed at least since
the 1.2 firmware version and the .11 kernel. It was also reproduced (long
ago, though) on 2.4.
--
Evgeniy Polyakov
Hi Evgeniy,
> > > That allows dumping as many warnings as you want. The more we
> > > have, the louder customers will scream.
> >
> > artificially increasing numbers isn't going to do that; it just shows
> > you're more interested in making a stink than in getting something
> > improved ;(
>
> As practice shows, I'm the only one who is interested in getting
> something improved, and Intel, as we see right now, is not interested in
> it at all, since you ask me not only decrease error verbosity, but also
> do not work towards fixing the bug by trying to understand where it
> lives.
as Arjan and Alan pointed out already, WARN_ON_ONCE is enough and I
agree with them. Just to make this perfectly clear, this is with my
community hat on.
Please send a proper patch with a simple WARN_ON_ONCE and I am happy to
sign off on it.
Regards
Marcel
On Mon, Sep 22, 2008 at 12:43:07AM +0300, Denys Fedoryshchenko ([email protected]) wrote:
> You are not right. I had a totally dysfunctional Intel driver on two laptops
> and reported the issue to Intel. Yes, it took time: they took all the debug
> output and went into coma mode (or so I thought), but suddenly I got mail
> from them, and the next kernel/firmware release worked flawlessly for me.
> So they did a perfect job.
I'm talking about the current situation.
> Don't be negative and prepare yourself for giving long debug outputs. Patience
> and only patience.
Just to be clear, did it take 4 years? :)
Anyway, I have already drawn my conclusions, as others probably have: I will
experiment with different 'workarounds' for this bug. Maybe I will
succeed, maybe Intel will decide to fix it, maybe the LHC will crash the
world. A verbose warning about the bug was frowned upon, so it's up to users
to make progress here...
--
Evgeniy Polyakov
On Mon, 22 Sep 2008 01:05:55 +0400
Evgeniy Polyakov <[email protected]> wrote:
> On Sun, Sep 21, 2008 at 02:02:03PM -0700, Arjan van de Ven
> ([email protected]) wrote:
> > > That allows dumping as many warnings as you want. The
> > > more we have, the louder customers will scream.
> >
> > artificially increasing numbers isn't going to do that; it just
> > shows you're more interested in making a stink than in getting
> > something improved ;(
>
> As practice shows, I'm the only one who is interested in getting
> something improved, and Intel, as we see right now, is not interested
> in it at all, since you ask me not only decrease error verbosity, but
> also do not work towards fixing the bug by trying to understand where
> it lives.
I did no such thing and you know it.
I'm sorry, I'm not going to waste time on this if you keep acting
this dishonest; welcome to my mail filter...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org