2008-08-04 17:08:17

by Arkadiusz Miśkiewicz

[permalink] [raw]
Subject: BUG: scheduling while atomic: ip/23212/0x00000102


git kernel, pulled around Sat Aug 2 13:32:05 CEST,
i686, thinkpad z60m notebook

[119353.611650] usb 4-2: Manufacturer: STMicroelectronics
[119357.998202] BUG: scheduling while atomic: ip/23212/0x00000102
[119357.998209] Modules linked in: uhci_hcd ehci_hcd pl2303 usbserial nls_utf8 cifs radeon drm binfmt_misc deflate zlib_deflate zlib_inflate ctr twofish twofish_common camellia serpent blowfish des_generic cbc aes_i586
aes_generic xcbc rmd160 sha1_generic crypto_null crypto_blkcipher af_key bridge stp llc bnep hidp nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ipv6 sch_sfq fuse lcd mmc_block rfcomm l2cap bluetooth ircomm_tty ircomm
nsc_ircc irda crc_ccitt cpufreq_powersave cpufreq_userspace cpufreq_stats acpi_cpufreq freq_table hdaps input_polldev fan container tun pcmcia usbhid hid ff_memless joydev battery ac video output thinkpad_acpi rfkill backlight
intel_agp agpgart led_class nvram yenta_socket rsrc_nonstatic sdhci_pci ohci1394 sdhci mmc_core ieee1394 thermal button pcmcia_core processor psmouse snd_hda_intel ipw2200 snd_pcm ieee80211 ieee80211_crypt snd_timer
rtc_cmos rtc_core evdev i2c_i801 tg3 libphy snd_page_alloc snd_hwdep rtc_lib pcspkr i2c_core snd soundcore serio_raw usbcore iTCO_wdt iTCO_vendor_support sg sr_mod cdrom xfs scsi_wait_scan sd_mod ata_piix libata dock
scsi_mod [last unloaded: uhci_hcd]
[119357.998401] Pid: 23212, comm: ip Not tainted 2.6.27-rc1 #29
[119357.998407] [<c01171a3>] __schedule_bug+0x47/0x4c
[119357.998418] [<c02ad09d>] schedule+0x72/0x440
[119357.998426] [<c0124b84>] ? lock_timer_base+0x2a/0x5e
[119357.998436] [<c02ad8bf>] schedule_timeout+0x73/0x91
[119357.998443] [<c01250ea>] ? process_timeout+0x0/0xa
[119357.998450] [<c02ad8f1>] schedule_timeout_uninterruptible+0x14/0x16
[119357.998458] [<c01250b3>] msleep+0x10/0x16
[119357.998465] [<c01e30ef>] pci_set_power_state+0x1be/0x295
[119357.998481] [<f89b2ae9>] tg3_set_power_state+0x4e/0x7f8 [tg3]
[119357.998504] [<f89bab37>] tg3_open+0x34/0x5e2 [tg3]
[119357.998520] [<c024961b>] dev_open+0x69/0xa2
[119357.998528] [<c024906a>] dev_change_flags+0x9c/0x14f
[119357.998535] [<c024fef7>] do_setlink+0x242/0x2fe
[119357.998542] [<c01172b7>] ? finish_task_switch+0x2b/0x91
[119357.998551] [<c025101a>] rtnl_newlink+0x292/0x3fc
[119357.998558] [<c0250de2>] ? rtnl_newlink+0x5a/0x3fc
[119357.998566] [<c0250e22>] ? rtnl_newlink+0x9a/0x3fc
[119357.998575] [<c0112203>] ? do_page_fault+0x0/0x609
[119357.998583] [<c0250d88>] ? rtnl_newlink+0x0/0x3fc
[119357.998590] [<c0250d6e>] rtnetlink_rcv_msg+0x196/0x1b0
[119357.998598] [<c0250bd8>] ? rtnetlink_rcv_msg+0x0/0x1b0
[119357.998606] [<c025d14e>] netlink_rcv_skb+0x30/0x76
[119357.998614] [<c0250bd0>] rtnetlink_rcv+0x1c/0x24
[119357.998621] [<c025ccb9>] netlink_unicast+0x1c6/0x226
[119357.998628] [<c025cf59>] netlink_sendmsg+0x240/0x24d
[119357.998635] [<c023e3db>] sock_sendmsg+0xdd/0xf8
[119357.998642] [<c012db65>] ? autoremove_wake_function+0x0/0x33
[119357.998651] [<c012db65>] ? autoremove_wake_function+0x0/0x33
[119357.998659] [<c015ee5b>] ? __do_fault+0x288/0x2c0
[119357.998668] [<c0244a4d>] ? verify_iovec+0x40/0x6f
[119357.998677] [<c023e535>] sys_sendmsg+0x13f/0x192
[119357.998683] [<c023f048>] ? sys_recvmsg+0x116/0x17b
[119357.998691] [<c023ef1a>] ? move_addr_to_user+0x56/0x6e
[119357.998699] [<c023f250>] ? sys_getsockname+0x59/0x76
[119357.998707] [<c0240307>] ? lock_sock_nested+0xb5/0xbd
[119357.998714] [<c0121cd5>] ? local_bh_enable+0x73/0x87
[119357.998722] [<c024024a>] ? release_sock+0xa8/0xb0
[119357.998730] [<c016296b>] ? __vma_link+0x58/0x5c
[119357.998738] [<c01629c1>] ? vma_link+0x52/0xf6
[119357.998746] [<c023f67d>] sys_socketcall+0x168/0x19f
[119357.998753] [<c016399e>] ? sys_brk+0xd6/0xde
[119357.998761] [<c0103871>] sysenter_do_call+0x12/0x21
[119357.998768] [<c02a0000>] ? unix_inflight+0x21/0xc4
[119357.998776] =======================
[119358.240113] ADDRCONF(NETDEV_UP): eth0: link is not ready


--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/


2008-08-04 21:37:10

by David Miller

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

From: Arkadiusz Miskiewicz <[email protected]>
Date: Mon, 4 Aug 2008 18:45:10 +0200

>
> git kernel, pulled around Sat Aug 2 13:32:05 CEST,
> i686, thinkpad z60m notebook

Obviously the tg3 PCI power-management changes were totally untested.

And unfortunately this is usually par for the course for driver
changes of this kind that I end up applying directly from Andrew's
patch bombs. :-/

Please try reverting this patch:

commit 12dac0756d357325b107fe6ec24921ec38661839
Author: Rafael J. Wysocki <[email protected]>
Date: Wed Jul 30 16:37:33 2008 -0700

tg3: adapt tg3 to use reworked PCI PM code

Adapt the tg3 driver to use the reworked PCI PM and make it use the
exported PCI PM core functions instead of accessing the PCI PM registers
directly by itself.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 633c128..26aa37a 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -1982,8 +1982,6 @@ static void tg3_power_down_phy(struct tg3 *tp)
static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
{
u32 misc_host_ctrl;
- u16 power_control, power_caps;
- int pm = tp->pm_cap;

/* Make sure register accesses (indirect or otherwise)
* will function correctly.
@@ -1992,18 +1990,10 @@ static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
TG3PCI_MISC_HOST_CTRL,
tp->misc_host_ctrl);

- pci_read_config_word(tp->pdev,
- pm + PCI_PM_CTRL,
- &power_control);
- power_control |= PCI_PM_CTRL_PME_STATUS;
- power_control &= ~(PCI_PM_CTRL_STATE_MASK);
switch (state) {
case PCI_D0:
- power_control |= 0;
- pci_write_config_word(tp->pdev,
- pm + PCI_PM_CTRL,
- power_control);
- udelay(100); /* Delay after power state change */
+ pci_enable_wake(tp->pdev, state, false);
+ pci_set_power_state(tp->pdev, PCI_D0);

/* Switch out of Vaux if it is a NIC */
if (tp->tg3_flags2 & TG3_FLG2_IS_NIC)
@@ -2012,26 +2002,15 @@ static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
return 0;

case PCI_D1:
- power_control |= 1;
- break;
-
case PCI_D2:
- power_control |= 2;
- break;
-
case PCI_D3hot:
- power_control |= 3;
break;

default:
- printk(KERN_WARNING PFX "%s: Invalid power state (%d) "
- "requested.\n",
- tp->dev->name, state);
+ printk(KERN_ERR PFX "%s: Invalid power state (D%d) requested\n",
+ tp->dev->name, state);
return -EINVAL;
}
-
- power_control |= PCI_PM_CTRL_PME_ENABLE;
-
misc_host_ctrl = tr32(TG3PCI_MISC_HOST_CTRL);
tw32(TG3PCI_MISC_HOST_CTRL,
misc_host_ctrl | MISC_HOST_CTRL_MASK_PCI_INT);
@@ -2109,8 +2088,6 @@ static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
WOL_DRV_WOL |
WOL_SET_MAGIC_PKT);

- pci_read_config_word(tp->pdev, pm + PCI_PM_PMC, &power_caps);
-
if (tp->tg3_flags & TG3_FLAG_WOL_ENABLE) {
u32 mac_mode;

@@ -2143,8 +2120,8 @@ static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
if (!(tp->tg3_flags2 & TG3_FLG2_5750_PLUS))
tw32(MAC_LED_CTRL, tp->led_ctrl);

- if (((power_caps & PCI_PM_CAP_PME_D3cold) &&
- (tp->tg3_flags & TG3_FLAG_WOL_ENABLE)))
+ if (pci_pme_capable(tp->pdev, state) &&
+ (tp->tg3_flags & TG3_FLAG_WOL_ENABLE))
mac_mode |= MAC_MODE_MAGIC_PKT_ENABLE;

tw32_f(MAC_MODE, mac_mode);
@@ -2236,9 +2213,11 @@ static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)

tg3_write_sig_post_reset(tp, RESET_KIND_SHUTDOWN);

+ if (tp->tg3_flags & TG3_FLAG_WOL_ENABLE)
+ pci_enable_wake(tp->pdev, state, true);
+
/* Finally, set the new power state. */
- pci_write_config_word(tp->pdev, pm + PCI_PM_CTRL, power_control);
- udelay(100); /* Delay after power state change */
+ pci_set_power_state(tp->pdev, state);

return 0;
}
@@ -9065,7 +9044,8 @@ static void tg3_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
{
struct tg3 *tp = netdev_priv(dev);

- if (tp->tg3_flags & TG3_FLAG_WOL_CAP)
+ if ((tp->tg3_flags & TG3_FLAG_WOL_CAP) &&
+ device_can_wakeup(&tp->pdev->dev))
wol->supported = WAKE_MAGIC;
else
wol->supported = 0;
@@ -9078,18 +9058,22 @@ static void tg3_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
static int tg3_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
{
struct tg3 *tp = netdev_priv(dev);
+ struct device *dp = &tp->pdev->dev;

if (wol->wolopts & ~WAKE_MAGIC)
return -EINVAL;
if ((wol->wolopts & WAKE_MAGIC) &&
- !(tp->tg3_flags & TG3_FLAG_WOL_CAP))
+ !((tp->tg3_flags & TG3_FLAG_WOL_CAP) && device_can_wakeup(dp)))
return -EINVAL;

spin_lock_bh(&tp->lock);
- if (wol->wolopts & WAKE_MAGIC)
+ if (wol->wolopts & WAKE_MAGIC) {
tp->tg3_flags |= TG3_FLAG_WOL_ENABLE;
- else
+ device_set_wakeup_enable(dp, true);
+ } else {
tp->tg3_flags &= ~TG3_FLAG_WOL_ENABLE;
+ device_set_wakeup_enable(dp, false);
+ }
spin_unlock_bh(&tp->lock);

return 0;
@@ -11296,7 +11280,8 @@ static void __devinit tg3_get_eeprom_hw_cfg(struct tg3 *tp)
if (val & VCPU_CFGSHDW_ASPM_DBNC)
tp->tg3_flags |= TG3_FLAG_ASPM_WORKAROUND;
if ((val & VCPU_CFGSHDW_WOL_ENABLE) &&
- (val & VCPU_CFGSHDW_WOL_MAGPKT))
+ (val & VCPU_CFGSHDW_WOL_MAGPKT) &&
+ device_may_wakeup(&tp->pdev->dev))
tp->tg3_flags |= TG3_FLAG_WOL_ENABLE;
return;
}
@@ -11426,8 +11411,9 @@ static void __devinit tg3_get_eeprom_hw_cfg(struct tg3 *tp)
!(nic_cfg & NIC_SRAM_DATA_CFG_FIBER_WOL))
tp->tg3_flags &= ~TG3_FLAG_WOL_CAP;

- if (tp->tg3_flags & TG3_FLAG_WOL_CAP &&
- nic_cfg & NIC_SRAM_DATA_CFG_WOL_ENABLE)
+ if ((tp->tg3_flags & TG3_FLAG_WOL_CAP) &&
+ (nic_cfg & NIC_SRAM_DATA_CFG_WOL_ENABLE) &&
+ device_may_wakeup(&tp->pdev->dev))
tp->tg3_flags |= TG3_FLAG_WOL_ENABLE;

if (cfg2 & (1 << 17))
@@ -13613,6 +13599,7 @@ static int tg3_suspend(struct pci_dev *pdev, pm_message_t state)
{
struct net_device *dev = pci_get_drvdata(pdev);
struct tg3 *tp = netdev_priv(dev);
+ pci_power_t target_state;
int err;

/* PCI register 4 needs to be saved whether netif_running() or not.
@@ -13641,7 +13628,9 @@ static int tg3_suspend(struct pci_dev *pdev, pm_message_t state)
tp->tg3_flags &= ~TG3_FLAG_INIT_COMPLETE;
tg3_full_unlock(tp);

- err = tg3_set_power_state(tp, pci_choose_state(pdev, state));
+ target_state = pdev->pm_cap ? pci_target_state(pdev) : PCI_D3hot;
+
+ err = tg3_set_power_state(tp, target_state);
if (err) {
int err2;

2008-08-04 22:08:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

On Monday, 4 of August 2008, David Miller wrote:
> From: Arkadiusz Miskiewicz <[email protected]>
> Date: Mon, 4 Aug 2008 18:45:10 +0200
>
> >
> > git kernel, pulled around Sat Aug 2 13:32:05 CEST,
> > i686, thinkpad z60m notebook
>
> Obviously the tg3 PCI power-management changes were totally untested.

If you mean the commit below, actually they were tested.

I run this code on a regular basis on my production box and I have never seen
this stack trace. How exactly can I make it appear?

> And unfortunately this is usually par for the course for driver
> changes of this kind that I end up applying directly from Andrew's
> patch bombs. :-/

Sorry for causing the trouble, but had I seen the stack trace, I obviously
wouldn't have posted the patch.

> commit 12dac0756d357325b107fe6ec24921ec38661839
> Author: Rafael J. Wysocki <[email protected]>
> Date: Wed Jul 30 16:37:33 2008 -0700
>
> tg3: adapt tg3 to use reworked PCI PM code
>
> Adapt the tg3 driver to use the reworked PCI PM and make it use the
> exported PCI PM core functions instead of accessing the PCI PM registers
> directly by itself.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
>
> diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
> index 633c128..26aa37a 100644
> --- a/drivers/net/tg3.c
> +++ b/drivers/net/tg3.c
> @@ -1982,8 +1982,6 @@ static void tg3_power_down_phy(struct tg3 *tp)
> static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
> {
> u32 misc_host_ctrl;
> - u16 power_control, power_caps;
> - int pm = tp->pm_cap;
>
> /* Make sure register accesses (indirect or otherwise)
> * will function correctly.
> @@ -1992,18 +1990,10 @@ static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
> TG3PCI_MISC_HOST_CTRL,
> tp->misc_host_ctrl);
>
> - pci_read_config_word(tp->pdev,
> - pm + PCI_PM_CTRL,
> - &power_control);
> - power_control |= PCI_PM_CTRL_PME_STATUS;
> - power_control &= ~(PCI_PM_CTRL_STATE_MASK);
> switch (state) {
> case PCI_D0:
> - power_control |= 0;
> - pci_write_config_word(tp->pdev,
> - pm + PCI_PM_CTRL,
> - power_control);
> - udelay(100); /* Delay after power state change */
> + pci_enable_wake(tp->pdev, state, false);
> + pci_set_power_state(tp->pdev, PCI_D0);

Still, I don't think drivers should access the standard PCI PM registers
directly, so perhaps there should be a version of pci_set_power_state()
using udelay() instead of msleep() or we can just replace the msleep()
in pci_set_power_state() with udelay()?

Rafael

2008-08-04 22:37:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

On Tuesday, 5 of August 2008, Rafael J. Wysocki wrote:
> On Monday, 4 of August 2008, David Miller wrote:
> > From: Arkadiusz Miskiewicz <[email protected]>
> > Date: Mon, 4 Aug 2008 18:45:10 +0200
> >
> > >
> > > git kernel, pulled around Sat Aug 2 13:32:05 CEST,
> > > i686, thinkpad z60m notebook
> >
> > Obviously the tg3 PCI power-management changes were totally untested.
>
> If you mean the commit below, actually they were tested.
>
> I run this code on a regular basis on my production box and I have never seen
> this stack trace. How exactly can I make it appear?

OK, I see. I must have overlooked it. :-/ Sorry again.

Rafael

2008-08-04 23:05:04

by Jesse Barnes

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

On Monday, August 04, 2008 3:10 pm Rafael J. Wysocki wrote:
> > - pci_read_config_word(tp->pdev,
> > - pm + PCI_PM_CTRL,
> > - &power_control);
> > - power_control |= PCI_PM_CTRL_PME_STATUS;
> > - power_control &= ~(PCI_PM_CTRL_STATE_MASK);
> > switch (state) {
> > case PCI_D0:
> > - power_control |= 0;
> > - pci_write_config_word(tp->pdev,
> > - pm + PCI_PM_CTRL,
> > - power_control);
> > - udelay(100); /* Delay after power state change */
> > + pci_enable_wake(tp->pdev, state, false);
> > + pci_set_power_state(tp->pdev, PCI_D0);
>
> Still, I don't think drivers should access the standard PCI PM registers
> directly, so perhaps there should be a version of pci_set_power_state()
> using udelay() instead of msleep() or we can just replace the msleep()
> in pci_set_power_state() with udelay()?

I think we should get rid of the open coded PCI PM state management, since
otherwise platform related bugs like the Intel PCIe PM quirk that sets
pci_pm_d3_delay to 120ms would have to be duplicated around the tree.

That said, waiting for 120ms with a busy wait seems a bit absurd if we can
avoid it. Either we need to find a way to make drivers only change states
(which can be very slow) in non-atomic context or we'll need to add a busy
wait variant of the power state call...

Jesse

2008-08-04 23:50:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

On Tuesday, 5 of August 2008, Jesse Barnes wrote:
> On Monday, August 04, 2008 3:10 pm Rafael J. Wysocki wrote:
> > > - pci_read_config_word(tp->pdev,
> > > - pm + PCI_PM_CTRL,
> > > - &power_control);
> > > - power_control |= PCI_PM_CTRL_PME_STATUS;
> > > - power_control &= ~(PCI_PM_CTRL_STATE_MASK);
> > > switch (state) {
> > > case PCI_D0:
> > > - power_control |= 0;
> > > - pci_write_config_word(tp->pdev,
> > > - pm + PCI_PM_CTRL,
> > > - power_control);
> > > - udelay(100); /* Delay after power state change */
> > > + pci_enable_wake(tp->pdev, state, false);
> > > + pci_set_power_state(tp->pdev, PCI_D0);
> >
> > Still, I don't think drivers should access the standard PCI PM registers
> > directly, so perhaps there should be a version of pci_set_power_state()
> > using udelay() instead of msleep() or we can just replace the msleep()
> > in pci_set_power_state() with udelay()?
>
> I think we should get rid of the open coded PCI PM state management, since
> otherwise platform related bugs like the Intel PCIe PM quirk that sets
> pci_pm_d3_delay to 120ms would have to be duplicated around the tree.
>
> That said, waiting for 120ms with a busy wait seems a bit absurd if we can
> avoid it. Either we need to find a way to make drivers only change states
> (which can be very slow) in non-atomic context or we'll need to add a busy
> wait variant of the power state call...

What about this?

It fixes the tg3 issue for me.

---
drivers/net/tg3.c | 19 +++++++++++++++++--
drivers/pci/pci.c | 24 +++++++++++++++---------
include/linux/pci.h | 2 ++
3 files changed, 34 insertions(+), 11 deletions(-)

Index: linux-2.6/drivers/pci/pci.c
===================================================================
--- linux-2.6.orig/drivers/pci/pci.c
+++ linux-2.6/drivers/pci/pci.c
@@ -421,6 +421,7 @@ static inline int platform_pci_sleep_wak
* given PCI device
* @dev: PCI device to handle.
* @state: PCI power state (D0, D1, D2, D3hot) to put the device into.
+ * @delay: if set, time to wait for the device to change the state, in microseconds
*
* RETURN VALUE:
* -EINVAL if the requested state is invalid.
@@ -429,8 +430,8 @@ static inline int platform_pci_sleep_wak
* 0 if device already is in the requested state.
* 0 if device's power state has been successfully changed.
*/
-static int
-pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
+int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state,
+ unsigned int delay)
{
u16 pmcsr;
bool need_restore = false;
@@ -486,12 +487,16 @@ pci_raw_set_power_state(struct pci_dev *
/* enter specified state */
pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);

- /* Mandatory power management transition delays */
- /* see PCI PM 1.1 5.6.1 table 18 */
- if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
- msleep(pci_pm_d3_delay);
- else if (state == PCI_D2 || dev->current_state == PCI_D2)
- udelay(200);
+ if (delay) {
+ udelay(delay);
+ } else {
+ /* Mandatory power management transition delays */
+ /* see PCI PM 1.1 5.6.1 table 18 */
+ if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
+ msleep(pci_pm_d3_delay);
+ else if (state == PCI_D2 || dev->current_state == PCI_D2)
+ udelay(200);
+ }

dev->current_state = state;

@@ -577,7 +582,7 @@ int pci_set_power_state(struct pci_dev *
if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
return 0;

- error = pci_raw_set_power_state(dev, state);
+ error = pci_raw_set_power_state(dev, state, 0);

if (state > PCI_D0 && platform_pci_power_manageable(dev)) {
/* Allow the platform to finalize the transition */
@@ -1941,6 +1946,7 @@ EXPORT_SYMBOL(pci_assign_resource);
EXPORT_SYMBOL(pci_find_parent_resource);
EXPORT_SYMBOL(pci_select_bars);

+EXPORT_SYMBOL(pci_raw_set_power_state);
EXPORT_SYMBOL(pci_set_power_state);
EXPORT_SYMBOL(pci_save_state);
EXPORT_SYMBOL(pci_restore_state);
Index: linux-2.6/include/linux/pci.h
===================================================================
--- linux-2.6.orig/include/linux/pci.h
+++ linux-2.6/include/linux/pci.h
@@ -645,6 +645,8 @@ size_t pci_get_rom_size(void __iomem *ro
/* Power management related routines */
int pci_save_state(struct pci_dev *dev);
int pci_restore_state(struct pci_dev *dev);
+int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state,
+ unsigned int delay);
int pci_set_power_state(struct pci_dev *dev, pci_power_t state);
pci_power_t pci_choose_state(struct pci_dev *dev, pm_message_t state);
bool pci_pme_capable(struct pci_dev *dev, pci_power_t state);
Index: linux-2.6/drivers/net/tg3.c
===================================================================
--- linux-2.6.orig/drivers/net/tg3.c
+++ linux-2.6/drivers/net/tg3.c
@@ -1979,6 +1979,21 @@ static void tg3_power_down_phy(struct tg
tg3_writephy(tp, MII_BMCR, BMCR_PDOWN);
}

+static int tg3_force_power_state_d0(struct tg3 *tp)
+{
+ int error;
+
+ error = pci_raw_set_power_state(tp->pdev, PCI_D0, 100);
+ if (error)
+ return error;
+
+ /* Switch out of Vaux if it is a NIC */
+ if (tp->tg3_flags2 & TG3_FLG2_IS_NIC)
+ tw32_wait_f(GRC_LOCAL_CTRL, tp->grc_local_ctrl, 100);
+
+ return 0;
+}
+
static int tg3_set_power_state(struct tg3 *tp, pci_power_t state)
{
u32 misc_host_ctrl;
@@ -7690,7 +7705,7 @@ static int tg3_init_hw(struct tg3 *tp, i
int err;

/* Force the chip into D0. */
- err = tg3_set_power_state(tp, PCI_D0);
+ err = tg3_force_power_state_d0(tp);
if (err)
goto out;

@@ -8018,7 +8033,7 @@ static int tg3_open(struct net_device *d

tg3_full_lock(tp, 0);

- err = tg3_set_power_state(tp, PCI_D0);
+ err = tg3_force_power_state_d0(tp);
if (err) {
tg3_full_unlock(tp);
return err;

2008-08-05 00:06:00

by Jesse Barnes

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

On Monday, August 04, 2008 4:53 pm Rafael J. Wysocki wrote:
> On Tuesday, 5 of August 2008, Jesse Barnes wrote:
> > On Monday, August 04, 2008 3:10 pm Rafael J. Wysocki wrote:
> > > > - pci_read_config_word(tp->pdev,
> > > > - pm + PCI_PM_CTRL,
> > > > - &power_control);
> > > > - power_control |= PCI_PM_CTRL_PME_STATUS;
> > > > - power_control &= ~(PCI_PM_CTRL_STATE_MASK);
> > > > switch (state) {
> > > > case PCI_D0:
> > > > - power_control |= 0;
> > > > - pci_write_config_word(tp->pdev,
> > > > - pm + PCI_PM_CTRL,
> > > > - power_control);
> > > > - udelay(100); /* Delay after power state change */
> > > > + pci_enable_wake(tp->pdev, state, false);
> > > > + pci_set_power_state(tp->pdev, PCI_D0);
> > >
> > > Still, I don't think drivers should access the standard PCI PM
> > > registers directly, so perhaps there should be a version of
> > > pci_set_power_state() using udelay() instead of msleep() or we can just
> > > replace the msleep() in pci_set_power_state() with udelay()?
> >
> > I think we should get rid of the open coded PCI PM state management,
> > since otherwise platform related bugs like the Intel PCIe PM quirk that
> > sets pci_pm_d3_delay to 120ms would have to be duplicated around the
> > tree.
> >
> > That said, waiting for 120ms with a busy wait seems a bit absurd if we
> > can avoid it. Either we need to find a way to make drivers only change
> > states (which can be very slow) in non-atomic context or we'll need to
> > add a busy wait variant of the power state call...
>
> What about this?
>
> It fixes the tg3 issue for me.
>
> ---
> drivers/net/tg3.c | 19 +++++++++++++++++--
> drivers/pci/pci.c | 24 +++++++++++++++---------
> include/linux/pci.h | 2 ++
> 3 files changed, 34 insertions(+), 11 deletions(-)
>
> Index: linux-2.6/drivers/pci/pci.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/pci.c
> +++ linux-2.6/drivers/pci/pci.c
> @@ -421,6 +421,7 @@ static inline int platform_pci_sleep_wak
> * given PCI device
> * @dev: PCI device to handle.
> * @state: PCI power state (D0, D1, D2, D3hot) to put the device into.
> + * @delay: if set, time to wait for the device to change the state, in
> microseconds *
> * RETURN VALUE:
> * -EINVAL if the requested state is invalid.
> @@ -429,8 +430,8 @@ static inline int platform_pci_sleep_wak
> * 0 if device already is in the requested state.
> * 0 if device's power state has been successfully changed.
> */
> -static int
> -pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> +int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state,
> + unsigned int delay)
> {
> u16 pmcsr;
> bool need_restore = false;
> @@ -486,12 +487,16 @@ pci_raw_set_power_state(struct pci_dev *
> /* enter specified state */
> pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);
>
> - /* Mandatory power management transition delays */
> - /* see PCI PM 1.1 5.6.1 table 18 */
> - if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
> - msleep(pci_pm_d3_delay);
> - else if (state == PCI_D2 || dev->current_state == PCI_D2)
> - udelay(200);
> + if (delay) {
> + udelay(delay);
> + } else {
> + /* Mandatory power management transition delays */
> + /* see PCI PM 1.1 5.6.1 table 18 */
> + if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
> + msleep(pci_pm_d3_delay);
> + else if (state == PCI_D2 || dev->current_state == PCI_D2)
> + udelay(200);
> + }

I think this has the issue I mentioned above. We want to honor platform D3
transition delays or we'll see the bugs they're intended to work around.
With this API, a driver could pass in a delay of 5us and void both the
platform bugs that pci_pm_d3_delay works around and the D2 transition time of
200us that the code already has...

Jesse

2008-08-05 09:15:11

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

On Tuesday, 5 of August 2008, Jesse Barnes wrote:
> On Monday, August 04, 2008 4:53 pm Rafael J. Wysocki wrote:
> > On Tuesday, 5 of August 2008, Jesse Barnes wrote:
> > > On Monday, August 04, 2008 3:10 pm Rafael J. Wysocki wrote:
> > > > > - pci_read_config_word(tp->pdev,
> > > > > - pm + PCI_PM_CTRL,
> > > > > - &power_control);
> > > > > - power_control |= PCI_PM_CTRL_PME_STATUS;
> > > > > - power_control &= ~(PCI_PM_CTRL_STATE_MASK);
> > > > > switch (state) {
> > > > > case PCI_D0:
> > > > > - power_control |= 0;
> > > > > - pci_write_config_word(tp->pdev,
> > > > > - pm + PCI_PM_CTRL,
> > > > > - power_control);
> > > > > - udelay(100); /* Delay after power state change */
> > > > > + pci_enable_wake(tp->pdev, state, false);
> > > > > + pci_set_power_state(tp->pdev, PCI_D0);
> > > >
> > > > Still, I don't think drivers should access the standard PCI PM
> > > > registers directly, so perhaps there should be a version of
> > > > pci_set_power_state() using udelay() instead of msleep() or we can just
> > > > replace the msleep() in pci_set_power_state() with udelay()?
> > >
> > > I think we should get rid of the open coded PCI PM state management,
> > > since otherwise platform related bugs like the Intel PCIe PM quirk that
> > > sets pci_pm_d3_delay to 120ms would have to be duplicated around the
> > > tree.
> > >
> > > That said, waiting for 120ms with a busy wait seems a bit absurd if we
> > > can avoid it. Either we need to find a way to make drivers only change
> > > states (which can be very slow) in non-atomic context or we'll need to
> > > add a busy wait variant of the power state call...
> >
> > What about this?
> >
> > It fixes the tg3 issue for me.
> >
> > ---
> > drivers/net/tg3.c | 19 +++++++++++++++++--
> > drivers/pci/pci.c | 24 +++++++++++++++---------
> > include/linux/pci.h | 2 ++
> > 3 files changed, 34 insertions(+), 11 deletions(-)
> >
> > Index: linux-2.6/drivers/pci/pci.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/pci/pci.c
> > +++ linux-2.6/drivers/pci/pci.c
> > @@ -421,6 +421,7 @@ static inline int platform_pci_sleep_wak
> > * given PCI device
> > * @dev: PCI device to handle.
> > * @state: PCI power state (D0, D1, D2, D3hot) to put the device into.
> > + * @delay: if set, time to wait for the device to change the state, in
> > microseconds *
> > * RETURN VALUE:
> > * -EINVAL if the requested state is invalid.
> > @@ -429,8 +430,8 @@ static inline int platform_pci_sleep_wak
> > * 0 if device already is in the requested state.
> > * 0 if device's power state has been successfully changed.
> > */
> > -static int
> > -pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state)
> > +int pci_raw_set_power_state(struct pci_dev *dev, pci_power_t state,
> > + unsigned int delay)
> > {
> > u16 pmcsr;
> > bool need_restore = false;
> > @@ -486,12 +487,16 @@ pci_raw_set_power_state(struct pci_dev *
> > /* enter specified state */
> > pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);
> >
> > - /* Mandatory power management transition delays */
> > - /* see PCI PM 1.1 5.6.1 table 18 */
> > - if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
> > - msleep(pci_pm_d3_delay);
> > - else if (state == PCI_D2 || dev->current_state == PCI_D2)
> > - udelay(200);
> > + if (delay) {
> > + udelay(delay);
> > + } else {
> > + /* Mandatory power management transition delays */
> > + /* see PCI PM 1.1 5.6.1 table 18 */
> > + if (state == PCI_D3hot || dev->current_state == PCI_D3hot)
> > + msleep(pci_pm_d3_delay);
> > + else if (state == PCI_D2 || dev->current_state == PCI_D2)
> > + udelay(200);
> > + }
>
> I think this has the issue I mentioned above. We want to honor platform D3
> transition delays or we'll see the bugs they're intended to work around.
> With this API, a driver could pass in a delay of 5us and void both the
> platform bugs that pci_pm_d3_delay works around and the D2 transition time of
> 200us that the code already has...

Okay, but tg3 evidently wants to set the power state to D0 under a spinlock
(without the delay required by the PCI PM spec, but well).

So, perhaps we should just change the msleep(pci_pm_d3_delay), in
pci_raw_set_power_state(), to mdelay(pci_pm_d3_delay). As far as
suspend/resume is concerned, this won't matter, because they both are done
in one thread. I'm not really sure if there are any other cases in which that
could matter, though.

Thanks,
Rafael

2008-08-05 09:37:42

by David Miller

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

From: "Rafael J. Wysocki" <[email protected]>
Date: Tue, 5 Aug 2008 11:17:28 +0200

> Okay, but tg3 evidently wants to set the power state to D0 under a spinlock
> (without the delay required by the PCI PM spec, but well).

Please read the rest of your inbox, we found a way to fix
tg3 to not need to call this under a lock.

2008-08-05 12:18:37

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: BUG: scheduling while atomic: ip/23212/0x00000102

On Tuesday, 5 of August 2008, David Miller wrote:
> From: "Rafael J. Wysocki" <[email protected]>
> Date: Tue, 5 Aug 2008 11:17:28 +0200
>
> > Okay, but tg3 evidently wants to set the power state to D0 under a spinlock
> > (without the delay required by the PCI PM spec, but well).
>
> Please read the rest of your inbox, we found a way to fix
> tg3 to not need to call this under a lock.

Thanks a lot, I've just read the message.

[For the information of the people who are not subscribed to netdev, the fix
patch is at http://marc.info/?l=linux-netdev&m=121791046925274&w=2]

Rafael