2010-08-18 22:03:25

by Rafael J. Wysocki

Subject: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

Hi,

While testing 2.6.36-rc1 (with a couple of fixes on top) I noticed that the ath9k
driver didn't work after resume from suspend to RAM. An attempt to unload the
driver using rmmod caused the BUG_ON() in kernel/workqueue.c:2844 to trigger.

I wonder if that regression is a result of the recent workqueue changes?

Thanks,
Rafael


2010-08-19 08:15:51

by Tejun Heo

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

Hello, Rafael.

On 08/19/2010 12:01 AM, Rafael J. Wysocki wrote:
> While testing 2.6.36-rc1 (with a couple of fixes on top) I noticed
> that the ath9k driver didn't work after resume from suspend to RAM.
> An attempt to unload the driver using rmmod caused the BUG_ON() in
> kernel/workqueue.c:2844 to trigger.

That BUG_ON() triggers if destroy_workqueue() is called while work
items are still pending on the workqueue. Can you please trigger
stack traces after resume and post it?

Thanks.

--
tejun
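
The condition described above can be pictured with a toy userspace model in C. This is only a sketch under assumptions: the toy_* names, the fixed-size queue and the self-requeueing poll_fn are hypothetical and are not the kernel's workqueue code. It shows how a work item that requeues itself can leave work pending after a flush, which is the state the BUG_ON() is said to catch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
#define TOY_WQ_MAX 8

struct toy_wq;

struct toy_work {
    void (*fn)(struct toy_wq *wq, struct toy_work *work);
    bool pending;
};

struct toy_wq {
    struct toy_work *items[TOY_WQ_MAX];
    int nr_pending;
    bool stopping;   /* set by the "driver" before teardown */
};

/* Queue a work item unless it is already pending or the queue is full. */
bool toy_queue(struct toy_wq *wq, struct toy_work *work)
{
    assert(wq != NULL && work != NULL);
    if (work->pending || wq->nr_pending == TOY_WQ_MAX)
        return false;
    work->pending = true;
    wq->items[wq->nr_pending++] = work;
    return true;
}

/* Run only the items that were pending when the flush started.
 * Anything queued while they run stays pending afterwards. */
void toy_flush(struct toy_wq *wq)
{
    struct toy_work *batch[TOY_WQ_MAX];
    int n = wq->nr_pending;

    for (int i = 0; i < n; i++)
        batch[i] = wq->items[i];
    wq->nr_pending = 0;

    for (int i = 0; i < n; i++) {
        batch[i]->pending = false;
        batch[i]->fn(wq, batch[i]);
    }
}

/* Flush, then report whether the queue is really empty.
 * A false return marks the spot where this model would hit its BUG_ON(). */
bool toy_destroy(struct toy_wq *wq)
{
    toy_flush(wq);
    return wq->nr_pending == 0;
}

/* Hypothetical periodic work: requeues itself until told to stop. */
void poll_fn(struct toy_wq *wq, struct toy_work *work)
{
    if (!wq->stopping)
        toy_queue(wq, work);
}

/* Tear the queue down with or without stopping the requeue first. */
bool run_teardown(bool stop_first)
{
    struct toy_wq wq = { .nr_pending = 0, .stopping = false };
    struct toy_work poll = { .fn = poll_fn, .pending = false };

    toy_queue(&wq, &poll);
    if (stop_first)
        wq.stopping = true;
    return toy_destroy(&wq);
}
```

In this model run_teardown(false) returns false, because poll_fn puts itself back on the queue during the flush, while run_teardown(true) returns true.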

2010-08-19 13:56:54

by Rafael J. Wysocki

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

On Thursday, August 19, 2010, Luis R. Rodriguez wrote:
> On Wed, Aug 18, 2010 at 3:01 PM, Rafael J. Wysocki <[email protected]> wrote:
> > Hi,
> >
> > While testing 2.6.36-rc1 (with a couple of fixes on top)
>
> Which couple of fixes?

AMD boot fix, HID suspend fix and shmem fix (two of them have already been
merged).

> > I noticed that the ath9k
> > driver didn't work after resume from suspend to RAM.
>
> To rule out whether it's an ath9k issue you can try
> compat-wireless-2.6.36-rc1 from here:
>
> http://wireless.kernel.org/en/users/Download/stable/
>
> and install it on an older kernel, you can use ./scripts/driver-select
> to only enable ath9k to compile.

Well, that sounds a bit complicated and even if I know it's not ath9k,
that's not going to help me find the real source of the problem.

> I've been using pm-suspend on this release for a few days now without
> any issue but I am using an AR9003 chipset. What chipset are you
> using? Can you provide the dmesg output upon module load?

Sure.

[ 9.680128] ath9k 0000:09:00.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 9.689487] ath9k 0000:09:00.0: setting latency timer to 64
[ 9.706383] HDA Intel 0000:00:14.2: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 9.805043] hda_codec: ALC272X: BIOS auto-probing.
[ 9.819389] input: HDA Digital PCBeep as /devices/pci0000:00/0000:00:14.2/input/input8
[ 10.155052] ath: EEPROM regdomain: 0x65
[ 10.155058] ath: EEPROM indicates we should expect a direct regpair map
[ 10.155066] ath: Country alpha2 being used: 00
[ 10.155070] ath: Regpair used: 0x65
[ 10.179762] phy0: Selected rate control algorithm 'ath9k_rate_control'
[ 10.181657] Registered led device: ath9k-phy0::radio
[ 10.181890] Registered led device: ath9k-phy0::assoc
[ 10.182139] Registered led device: ath9k-phy0::tx
[ 10.182339] Registered led device: ath9k-phy0::rx

lspci says it's:

09:00.0 Network controller: Atheros Communications Inc. AR928X Wireless Network Adapter (PCI-Express) (rev 01)

Thanks,
Rafael

2010-08-19 14:06:09

by Rafael J. Wysocki

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

On Thursday, August 19, 2010, Tejun Heo wrote:
> Hello, Rafael.
>
> On 08/19/2010 12:01 AM, Rafael J. Wysocki wrote:
> > While testing 2.6.36-rc1 (with a couple of fixes on top) I noticed
> > that the ath9k driver didn't work after resume from suspend to RAM.
> > An attempt to unload the driver using rmmod caused the BUG_ON() in
> > kernel/workqueue.c:2844 to trigger.
>
> That BUG_ON() triggers if destroy_workqueue() is called while work
> items are still pending on the workqueue. Can you please trigger
> stack traces after resume and post it?

Do you mean sysrq-t?

Rafael

2010-08-19 14:23:39

by Tejun Heo

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

Hello,

On 08/19/2010 04:05 PM, Rafael J. Wysocki wrote:
> On Thursday, August 19, 2010, Tejun Heo wrote:
>> Hello, Rafael.
>>
>> On 08/19/2010 12:01 AM, Rafael J. Wysocki wrote:
>>> While testing 2.6.36-rc1 (with a couple of fixes on top) I noticed
>>> that the ath9k driver didn't work after resume from suspend to RAM.
>>> An attempt to unload the driver using rmmod caused the BUG_ON() in
>>> kernel/workqueue.c:2844 to trigger.
>>
>> That BUG_ON() triggers if destroy_workqueue() is called while work
>> items are still pending on the workqueue. Can you please trigger
>> stack traces after resume and post it?
>
> Do you mean sysrq-t?

Yeah, I'm a bit confused regarding what's going on. I thought the
most likely cause is thawing failing to kick a frozen workqueue into
working state but then flush_workqueue() which is called from
destroy_workqueue() should have hung too, that is, unless
flush_workqueue() is broken too. If flush_workqueue() is not broken,
then it could be that workqueue itself isn't at fault and works are
being scheduled and executed fine for the workqueue ath9k is using but
the driver doesn't work for another reason.

Also, the BUG_ON() being triggered means either flush_workqueue() is
broken or the driver is failing to stop works on the workqueue from
being requeued before calling destroy_workqueue(). So, finding out
the following would be great:

* While the driver isn't working, do a sysrq-t and see whether any
worker is executing a work for ath9k.

* Repeat it several times and see whether the work is stuck or making
progress and/or executing on different workers.

Thanks.

--
tejun
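
The sysrq-t step above does not need a keyboard: on kernels built with CONFIG_MAGIC_SYSRQ, root can write the command character to /proc/sysrq-trigger, and the task dump then shows up in the kernel log. A minimal C sketch, assuming that interface is available on the test machine; the function names sysrq_write and dump_task_states are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Write one SysRq command character to the given trigger file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
int sysrq_write(const char *path, char cmd)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputc(cmd, f) != EOF);
    if (fclose(f) != 0)   /* the write may only fail at close time */
        ok = 0;
    return ok ? 0 : -1;
}

/* 't' asks the kernel to dump the state of all tasks (needs root). */
int dump_task_states(void)
{
    return sysrq_write("/proc/sysrq-trigger", 't');
}
```

Calling dump_task_states() a few times, some seconds apart, and comparing the resulting dmesg output is one way to see whether a worker is stuck or making progress.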

2010-08-19 14:26:56

by Tejun Heo

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

Oh, can you also please attach log of the BUG()?

Thanks.

--
tejun

2010-08-19 14:35:41

by Luis R. Rodriguez

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

On Wed, Aug 18, 2010 at 3:01 PM, Rafael J. Wysocki <[email protected]> wrote:
> Hi,
>
> While testing 2.6.36-rc1 (with a couple of fixes on top)

Which couple of fixes?

> I noticed that the ath9k
> driver didn't work after resume from suspend to RAM.

To rule out whether it's an ath9k issue you can try
compat-wireless-2.6.36-rc1 from here:

http://wireless.kernel.org/en/users/Download/stable/

and install it on an older kernel, you can use ./scripts/driver-select
to only enable ath9k to compile.

I've been using pm-suspend on this release for a few days now without
any issue but I am using an AR9003 chipset. What chipset are you
using? Can you provide the dmesg output upon module load?

>  An attempt to unload the
> driver using rmmod caused the BUG_ON() in kernel/workqueue.c:2844 to trigger.

That's a bug, a regression likely.

> I wonder if that regression is a result of the recent workqueue changes?

Yeah very likely.

Luis

2010-08-19 20:19:00

by Rafael J. Wysocki

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

On Thursday, August 19, 2010, Tejun Heo wrote:
> Oh, can you also please attach log of the BUG()?

That's difficult, because I have no way to collect it after it's happened.

I can try to convert it to WARN_ON or rewrite the call trace by hand.

Or I may try to make a photo. :-)

Thanks,
Rafael

2010-08-19 20:32:46

by Rafael J. Wysocki

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

On Thursday, August 19, 2010, Tejun Heo wrote:
> Hello,
>
> On 08/19/2010 04:05 PM, Rafael J. Wysocki wrote:
> > On Thursday, August 19, 2010, Tejun Heo wrote:
> >> Hello, Rafael.
> >>
> >> On 08/19/2010 12:01 AM, Rafael J. Wysocki wrote:
> >>> While testing 2.6.36-rc1 (with a couple of fixes on top) I noticed
> >>> that the ath9k driver didn't work after resume from suspend to RAM.
> >>> An attempt to unload the driver using rmmod caused the BUG_ON() in
> >>> kernel/workqueue.c:2844 to trigger.
> >>
> >> That BUG_ON() triggers if destroy_workqueue() is called while work
> >> items are still pending on the workqueue. Can you please trigger
> >> stack traces after resume and post it?
> >
> > Do you mean sysrq-t?
>
> Yeah, I'm a bit confused regarding what's going on. I thought the
> most likely cause is thawing failing to kick a frozen workqueue into
> working state but then flush_workqueue() which is called from
> destroy_workqueue() should have hung too, that is, unless
> flush_workqueue() is broken too. If flush_workqueue() is not broken,
> then it could be that workqueue itself isn't at fault and works are
> being scheduled and executed fine for the workqueue ath9k is using but
> the driver doesn't work for another reason.
>
> Also, the BUG_ON() being triggered means either flush_workqueue() is
> broken or the driver is failing to stop works on the workqueue from
> being requeued before calling destroy_workqueue(). So, finding out
> the following would be great:
>
> * While the driver isn't working, do a sysrq-t and see whether any
> worker is executing a work for ath9k.
>
> * Repeat it several times and see whether the work is stuck or making
> progress and/or executing on different workers.

Actually, I'm unable to reproduce the resume issue with current mainline
(HEAD = 763008c4357b73c8d18396dfd8d79dc58fa3f99d), so I guess it either is
a race (or another timing issue), or it's been fixed by one of the patches on
top of -rc1.

I'll let you know if I see it again.

Thanks,
Rafael

2010-08-19 20:42:27

by Luis R. Rodriguez

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

On Thu, Aug 19, 2010 at 01:31:01PM -0700, Rafael J. Wysocki wrote:
> On Thursday, August 19, 2010, Tejun Heo wrote:
> > Hello,
> >
> > On 08/19/2010 04:05 PM, Rafael J. Wysocki wrote:
> > > On Thursday, August 19, 2010, Tejun Heo wrote:
> > >> Hello, Rafael.
> > >>
> > >> On 08/19/2010 12:01 AM, Rafael J. Wysocki wrote:
> > >>> While testing 2.6.36-rc1 (with a couple of fixes on top) I noticed
> > >>> that the ath9k driver didn't work after resume from suspend to RAM.
> > >>> An attempt to unload the driver using rmmod caused the BUG_ON() in
> > >>> kernel/workqueue.c:2844 to trigger.
> > >>
> > >> That BUG_ON() triggers if destroy_workqueue() is called while work
> > >> items are still pending on the workqueue. Can you please trigger
> > >> stack traces after resume and post it?
> > >
> > > Do you mean sysrq-t?
> >
> > Yeah, I'm a bit confused regarding what's going on. I thought the
> > most likely cause is thawing failing to kick a frozen workqueue into
> > working state but then flush_workqueue() which is called from
> > destroy_workqueue() should have hung too, that is, unless
> > flush_workqueue() is broken too. If flush_workqueue() is not broken,
> > then it could be that workqueue itself isn't at fault and works are
> > being scheduled and executed fine for the workqueue ath9k is using but
> > the driver doesn't work for another reason.
> >
> > Also, the BUG_ON() being triggered means either flush_workqueue() is
> > broken or the driver is failing to stop works on the workqueue from
> > being requeued before calling destroy_workqueue(). So, finding out
> > the following would be great:
> >
> > * While the driver isn't working, do a sysrq-t and see whether any
> > worker is executing a work for ath9k.
> >
> > * Repeat it several times and see whether the work is stuck or making
> > progress and/or executing on different workers.
>
> Actually, I'm unable to reproduce the resume issue with current mainline
> (HEAD = 763008c4357b73c8d18396dfd8d79dc58fa3f99d), so I guess it either is
> a race (or another timing issue), or it's been fixed by one of the patches on
> top of -rc1.
>
> I'll let you know if I see it again.

To be clear, this is a non-issue now until further notice, ACK?

Luis

2010-08-19 21:09:39

by Rafael J. Wysocki

Subject: Re: [Regression, 2.6.36-rc1] ath9k resume problem on Acer Ferrari One

On Thursday, August 19, 2010, Luis R. Rodriguez wrote:
> On Thu, Aug 19, 2010 at 01:31:01PM -0700, Rafael J. Wysocki wrote:
> > On Thursday, August 19, 2010, Tejun Heo wrote:
> > > Hello,
> > >
> > > On 08/19/2010 04:05 PM, Rafael J. Wysocki wrote:
> > > > On Thursday, August 19, 2010, Tejun Heo wrote:
> > > >> Hello, Rafael.
> > > >>
> > > >> On 08/19/2010 12:01 AM, Rafael J. Wysocki wrote:
> > > >>> While testing 2.6.36-rc1 (with a couple of fixes on top) I noticed
> > > >>> that the ath9k driver didn't work after resume from suspend to RAM.
> > > >>> An attempt to unload the driver using rmmod caused the BUG_ON() in
> > > >>> kernel/workqueue.c:2844 to trigger.
> > > >>
> > > >> That BUG_ON() triggers if destroy_workqueue() is called while work
> > > >> items are still pending on the workqueue. Can you please trigger
> > > >> stack traces after resume and post it?
> > > >
> > > > Do you mean sysrq-t?
> > >
> > > Yeah, I'm a bit confused regarding what's going on. I thought the
> > > most likely cause is thawing failing to kick a frozen workqueue into
> > > working state but then flush_workqueue() which is called from
> > > destroy_workqueue() should have hung too, that is, unless
> > > flush_workqueue() is broken too. If flush_workqueue() is not broken,
> > > then it could be that workqueue itself isn't at fault and works are
> > > being scheduled and executed fine for the workqueue ath9k is using but
> > > the driver doesn't work for another reason.
> > >
> > > Also, the BUG_ON() being triggered means either flush_workqueue() is
> > > broken or the driver is failing to stop works on the workqueue from
> > > being requeued before calling destroy_workqueue(). So, finding out
> > > the following would be great:
> > >
> > > * While the driver isn't working, do a sysrq-t and see whether any
> > > worker is executing a work for ath9k.
> > >
> > > * Repeat it several times and see whether the work is stuck or making
> > > progress and/or executing on different workers.
> >
> > Actually, I'm unable to reproduce the resume issue with current mainline
> > (HEAD = 763008c4357b73c8d18396dfd8d79dc58fa3f99d), so I guess it either is
> > a race (or another timing issue), or it's been fixed by one of the patches on
> > top of -rc1.
> >
> > I'll let you know if I see it again.
>
> To be clear, this is a non-issue now until further notice, ACK?

Yep.

Rafael