2001-04-19 08:26:12

by Jeff Garzik

Subject: PCI power management

Index: drivers/pci/pci.c
===================================================================
RCS file: /cvsroot/gkernel/linux_2_4/drivers/pci/pci.c,v
retrieving revision 1.1.1.32
retrieving revision 1.1.1.32.2.1
diff -u -r1.1.1.32 -r1.1.1.32.2.1
--- drivers/pci/pci.c 2001/04/18 01:19:31 1.1.1.32
+++ drivers/pci/pci.c 2001/04/18 03:39:02 1.1.1.32.2.1
@@ -228,49 +228,157 @@
}

/**
- * pci_set_power_state - Set power management state of a device.
- * @dev: PCI device for which PM is set
- * @new_state: new power management statement (0 == D0, 3 == D3, etc.)
+ * pci_power_on - Wake up a PCI device
+ * @dev: PCI device to which power is to be applied
*
- * Set power management state of a device. For transitions from state D3
- * it isn't as straightforward as one could assume since many devices forget
- * their configuration space during wakeup. Returns old power state.
+ * Bring the given PCI device @dev up to full power,
+ * using standard PCI PM techniques. Any saved context
+ * is restored after device power-up.
+ *
+ * RETURN VALUE: Zero is returned upon successful completion
+ * of the wake-up operation.
*/
+
int
-pci_set_power_state(struct pci_dev *dev, int new_state)
+pci_power_on(struct pci_dev *dev)
{
- u32 base[5], romaddr;
- u16 pci_command, pwr_command;
- u8 pci_latency, pci_cacheline;
- int i, old_state;
- int pm = pci_find_capability(dev, PCI_CAP_ID_PM);
+ u16 pwr_command;
+ int pm_d_state, pm, i;
+
+ /* find PCI PM capability in list */
+ pm = pci_find_capability(dev, PCI_CAP_ID_PM);
+ if (!pm) return 0; /* assume no PM == poweron success */

- if (!pm)
- return 0;
+ /* make sure we aren't already in D0 state */
pci_read_config_word(dev, pm + PCI_PM_CTRL, &pwr_command);
- old_state = pwr_command & PCI_PM_CTRL_STATE_MASK;
- if (old_state == new_state)
- return old_state;
- DBG("PCI: %s goes from D%d to D%d\n", dev->slot_name, old_state, new_state);
- if (old_state == 3) {
- pci_read_config_word(dev, PCI_COMMAND, &pci_command);
- pci_write_config_word(dev, PCI_COMMAND, pci_command & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY));
- for (i = 0; i < 5; i++)
- pci_read_config_dword(dev, PCI_BASE_ADDRESS_0 + i*4, &base[i]);
- pci_read_config_dword(dev, PCI_ROM_ADDRESS, &romaddr);
- pci_read_config_byte(dev, PCI_LATENCY_TIMER, &pci_latency);
- pci_read_config_byte(dev, PCI_CACHE_LINE_SIZE, &pci_cacheline);
- pci_write_config_word(dev, pm + PCI_PM_CTRL, new_state);
- for (i = 0; i < 5; i++)
- pci_write_config_dword(dev, PCI_BASE_ADDRESS_0 + i*4, base[i]);
- pci_write_config_dword(dev, PCI_ROM_ADDRESS, romaddr);
+ pm_d_state = pwr_command & PCI_PM_CTRL_STATE_MASK;
+ if (pm_d_state == 0) return 0;
+
+ /* go to D0 */
+ /* XXX: should we enable function's ability to assert
+ * PME# here (bit 8) too?
+ */
+ pci_write_config_word(dev, pm + PCI_PM_CTRL, 0);
+
+ /*
+ * restore context, if saved
+ */
+ if (dev->saved_context) {
+ /* XXX: 100% dword access ok here? */
+ for (i = 0; i < dev->saved_context->n_dwords; i++)
+ pci_write_config_dword(dev, i * 4,
+ dev->saved_context->cfg_hdr[i]);
+
+ kfree(dev->saved_context);
+ dev->saved_context = NULL;
+ }
+
+ /*
+ * otherwise, write the context information we know from bootup.
+ * This works around a problem where warm-booting from Windows
+ * combined with a D3(hot)->D0 transition causes PCI config
+ * header data to be forgotten.
+ */
+ else {
+ for (i = 0; i < 6; i ++)
+ pci_write_config_dword(dev,
+ PCI_BASE_ADDRESS_0 + (i * 4),
+ dev->resource[i].start);
pci_write_config_byte(dev, PCI_INTERRUPT_LINE, dev->irq);
- pci_write_config_byte(dev, PCI_CACHE_LINE_SIZE, pci_cacheline);
- pci_write_config_byte(dev, PCI_LATENCY_TIMER, pci_latency);
- pci_write_config_word(dev, PCI_COMMAND, pci_command);
- } else
- pci_write_config_word(dev, pm + PCI_PM_CTRL, (pwr_command & ~PCI_PM_CTRL_STATE_MASK) | new_state);
- return old_state;
+ }
+
+ return 0;
+}
+
+/**
+ * pci_power_off - Suspend a PCI device
+ * @dev: PCI device to be suspended
+ * @context_size: Number of PCI config bytes to save
+ *
+ * Remove power from a PCI device, saving PCI context
+ * before fully transitioning to the D3 state.
+ *
+ * The @context_size argument can be -1, which indicates
+ * that only the standard PCI 2.2 configuration header
+ * is to be saved. @context_size can be zero, which indicates
+ * no context is to be saved. Or, @context_size can be a
+ * specific length, indicating the number of bytes to be saved
+ * before poweroff. @context_size is always rounded up to the nearest
+ * dword boundary.
+ *
+ * RETURN VALUE: If the PCI device
+ * does not support PCI PM, %EIO is returned. If memory
+ * is not available to store the PCI context requested,
+ * %ENOMEM is returned. Otherwise, zero (success) is returned.
+ */
+
+int
+pci_power_off(struct pci_dev *dev, int context_size)
+{
+ u16 pwr_command, tmp, newtmp;
+ int pm_d_state, pm, i;
+ void *mem;
+
+ /* find PCI PM capability in list */
+ pm = pci_find_capability(dev, PCI_CAP_ID_PM);
+ if (!pm) return -EIO; /* this device cannot poweroff */
+
+ /* make sure we aren't already in D3 state */
+ /* XXX: reliable/superfluous test? */
+ pci_read_config_word(dev, pm + PCI_PM_CTRL, &pwr_command);
+ pm_d_state = pwr_command & PCI_PM_CTRL_STATE_MASK;
+ if (pm_d_state == 3) return 0;
+
+ /* programmer error... */
+ if (dev->saved_context)
+ BUG();
+
+ /*
+ * save context
+ */
+ if (context_size == -1) /* save only standard PCI config header */
+ context_size = 15 * sizeof(u32);
+ if (context_size > 0) {
+ /* convert bytes to dwords, with rounding */
+ if (context_size % 4 == 0)
+ context_size >>= 2;
+ else
+ context_size = (context_size >> 2) + 1;
+
+ mem = kmalloc(sizeof(struct pci_dev_context) +
+ (context_size * sizeof(u32)), GFP_KERNEL);
+ if (!mem)
+ return -ENOMEM;
+ dev->saved_context = mem;
+ dev->saved_context->n_dwords = context_size;
+ dev->saved_context->cfg_hdr = mem + sizeof(struct pci_dev_context);
+
+ /* XXX: 100% dword access ok here? */
+ for (i = 0; i < dev->saved_context->n_dwords; i++)
+ pci_read_config_dword(dev, i * 4,
+ &dev->saved_context->cfg_hdr[i]);
+ }
+
+ /* _PCI System Arch._ sez "disable device's ability to act as
+ * a master and a target." Interpreted as clearing the
+ * master, MEM decode and IO decode bits
+ */
+ pci_read_config_word(dev, PCI_COMMAND, &tmp);
+ newtmp = tmp & ~(PCI_COMMAND_IO|PCI_COMMAND_MEMORY|PCI_COMMAND_MASTER);
+ if (tmp != newtmp)
+ pci_write_config_word(dev, PCI_COMMAND, newtmp);
+
+ /* just for the sake of sanity and pessimism, pause for a bit,
+ * then clear any status conditions. PCI status register
+ * is nicely designed so we can clear it thusly..
+ */
+ pci_read_config_word(dev, PCI_STATUS, &tmp);
+ pci_write_config_word(dev, PCI_STATUS, tmp);
+
+ /* go to D3 */
+ pci_write_config_word(dev, pm + PCI_PM_CTRL, 3);
+
+ return 0;
}

/**
@@ -285,10 +393,13 @@
pci_enable_device(struct pci_dev *dev)
{
int err;
+
+ err = pci_power_on(dev);
+ if (err) return err;
+
+ err = pcibios_enable_device(dev);
+ if (err < 0) return err;

- if ((err = pcibios_enable_device(dev)) < 0)
- return err;
- pci_set_power_state(dev, 0);
return 0;
}

@@ -1390,7 +1501,8 @@
EXPORT_SYMBOL(pci_find_subsys);
EXPORT_SYMBOL(pci_set_master);
EXPORT_SYMBOL(pci_set_dma_mask);
-EXPORT_SYMBOL(pci_set_power_state);
+EXPORT_SYMBOL(pci_power_on);
+EXPORT_SYMBOL(pci_power_off);
EXPORT_SYMBOL(pci_assign_resource);
EXPORT_SYMBOL(pci_register_driver);
EXPORT_SYMBOL(pci_unregister_driver);
Index: include/linux/pci.h
===================================================================
RCS file: /cvsroot/gkernel/linux_2_4/include/linux/pci.h,v
retrieving revision 1.1.1.39
retrieving revision 1.1.1.39.2.1
diff -u -r1.1.1.39 -r1.1.1.39.2.1
--- include/linux/pci.h 2001/04/18 01:11:14 1.1.1.39
+++ include/linux/pci.h 2001/04/18 03:44:33 1.1.1.39.2.1
@@ -308,6 +308,11 @@
#define pci_for_each_dev_reverse(dev) \
for(dev = pci_dev_g(pci_devices.prev); dev != pci_dev_g(&pci_devices); dev = pci_dev_g(dev->global_list.prev))

+struct pci_dev_context {
+ int n_dwords;
+ u32 *cfg_hdr;
+};
+
/*
* The pci_dev structure is used to describe both PCI and ISAPnP devices.
*/
@@ -330,6 +335,11 @@
u8 rom_base_reg; /* which config register controls the ROM */

struct pci_driver *driver; /* which driver has allocated this device */
+
+ struct pci_dev_context *saved_context;
+ /* PCI config header, when suspended.
+ NULL when active */
+
void *driver_data; /* data private to the driver */
dma_addr_t dma_mask; /* Mask of the bits of bus address this
device implements. Normally this is
@@ -528,7 +538,8 @@
int pci_enable_device(struct pci_dev *dev);
void pci_set_master(struct pci_dev *dev);
int pci_set_dma_mask(struct pci_dev *dev, dma_addr_t mask);
-int pci_set_power_state(struct pci_dev *dev, int state);
+int pci_power_on(struct pci_dev *dev);
+int pci_power_off(struct pci_dev *dev, int context_size);
int pci_assign_resource(struct pci_dev *dev, int i);

/* Helper functions for low-level code (drivers/pci/setup-[bus,res].c) */


Attachments:
pcipm.patch (8.93 kB)

2001-04-19 10:08:57

by Benjamin Herrenschmidt

Subject: Re: PCI power management

Hi! Glad to see things moving around power management ;)

>This was originally a private reply to Patrick Mochel, but the e-mail
>kept getting longer and longer :)

Note: we have set up a list for PM issues:

http://lists.sourceforge.net/lists/listinfo/linux-pm-devel

It is not used much yet, but I, at least, plan to spam it with all
sorts of things we need for PowerBook PM... I'm forwarding your
message there and I suggest we continue the discussion there as well.

>The current state of PCI PM is this:
>
>pci_enable_device (1) enables IO and mem decoding, (2) assigns/routes
>the PCI IRQ, and (3) brings the device to D0 using pci_set_power_state.
>Linus believes the power state transition should occur before (1) and
>(2), and I agree.
>
>pci_set_power_state brings a device to a new D state. If the D state
>transition is D3->D0, then we (1) save key PCI config registers, (2) go
>to D0, and (3) restore saved PCI config registers. This originally
>comes from Donald Becker's acpi_wake function, which is used only for
>the case of device enabling (where he had no problems), not for the case
>of returning-from-suspend (where we see problems).

I believe the current scheme is not enough. Here are some of my own
thoughts about this:

- Some devices won't properly give you their config space when
in D3 state. You shouldn't read the configuration while in D3 and
restore it after switching to D0; you must save it before
putting the device into D3 in the first place.

- There need to be some arch "hooks" in this mechanism. Some machines
have the ability (from the arch-specific code, by tweaking ASIC bits)
to remove clock and/or power from selected devices. That means power
management can be done even with devices not supporting PCI PM, provided
that the driver can recover them from a power-on reset.

- Some devices just can't be brought back to life from D3 state without
a PCI reset (ATI Rage M3 for example), and that requires some arch-specific
support (when it's possible at all).

- The current scheme provides no way for the kernel to "know" if a
driver can handle recovering the device from a power-on reset. Some
drivers can, some can't (the video drivers usually can't as they
require the board's PLL to be properly set up by the BIOS). Some
advanced PM modes we use on pmacs will cause the motherboard ASIC to
turn off power to PCI & AGP cards when putting the machine to sleep.
We need a way to prevent/allow this "deep sleep" mode depending
on what the card supports.

- Ordering of power management may matter. On PowerBooks, we run
through all notifiers first with a "sleep request" message. None of
the drivers will actually put anything to sleep at this point, but
they will allocate all the memory they might need for doing so (saving
state, saving a framebuffer in some cases, etc...). Once all devices
have accepted the request (they can refuse it), I then send a
"sleep now" message. This way, I can make sure all memory allocations
have been performed and disks properly sync'ed before putting the swap
devices to sleep and such things.

- On SMP, we need some way to stop other CPUs in the scheduler
while running the last round of sleep (putting devices to sleep) at least
until all IO layers in Linux can properly handle blocking of IO queues
while the device sleeps.

- We need a generic way (not dependent on x86 APM or ACPI) of including
userland processes that request it in the loop. Some userland processes
that bang hardware directly (X, but not only X) need to be properly
suspended (and the kernel has to wait for an ack from them before
continuing with device sleep).
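
The two-step sleep sequence from the ordering item above can be
sketched roughly as follows. This is a minimal, userspace-testable
sketch, not the actual PowerBook code: the pm_msg values,
struct pm_client, pm_sleep_all() and the two demo handlers are all
hypothetical names made up for illustration, and the "reject" message
sent to clients that had already accepted is an assumption about how
a refusal would be unwound.

```c
#include <stddef.h>

/* Hypothetical message types; the names are illustrative only. */
enum pm_msg { PM_SLEEP_REQUEST, PM_SLEEP_NOW, PM_SLEEP_REJECT };

/* A handler returns 0 to accept a message, nonzero to refuse it. */
typedef int (*pm_handler)(enum pm_msg msg, void *data);

struct pm_client {
	pm_handler handler;
	void *data;
};

/*
 * Phase 1: ask every client, so each can allocate what it needs or
 * refuse.  Phase 2: only if all accepted, tell them to sleep.
 * On a refusal, clients that already accepted get PM_SLEEP_REJECT so
 * they can free what they allocated.
 * Returns 0 on success, or (index + 1) of the refusing client.
 */
int pm_sleep_all(struct pm_client *c, size_t n)
{
	size_t i, j;

	for (i = 0; i < n; i++) {
		if (c[i].handler(PM_SLEEP_REQUEST, c[i].data) != 0) {
			for (j = 0; j < i; j++)
				c[j].handler(PM_SLEEP_REJECT, c[j].data);
			return (int)i + 1;
		}
	}
	for (i = 0; i < n; i++)
		c[i].handler(PM_SLEEP_NOW, c[i].data);
	return 0;
}

/* Demo handler: records the last message in *data, always accepts. */
int pm_demo_accept(enum pm_msg msg, void *data)
{
	*(int *)data = (int)msg;
	return 0;
}

/* Demo handler: records the last message, refuses sleep requests. */
int pm_demo_refuse(enum pm_msg msg, void *data)
{
	*(int *)data = (int)msg;
	return msg == PM_SLEEP_REQUEST;
}
```

Nothing is powered down until every client has answered the request,
so all memory allocation happens while the system is still fully
running.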

>"apm -s" causes the apm driver to map all suspends to the ACPI D3
>state. An apm suspend triggers a pm_send_all call, which in turns
>triggers pci_pm_suspend. This code [from Linus iirc] walks the root
>buses, recursively suspending downstream buses and then attached
>devices. The resume code does the exact opposite. The PCI core
>suspend/resume code has this comment, and we note the current
>requirement that -all- drivers should export suspend/resume somehow, in
>order for a sane PM system to work here.

Yup. They should also be able to return an error (fail outright, or just
limit the transition to a shallower state such as D2). They should also be
able to tell the kernel if they support recovering from a power down.

>It is up to the drivers to implement ::suspend() and ::resume(), and few
>do. The few that do, even fewer work well in practice.

I would have preferred that a PM node be created for each PCI node and
have the PM nodes organised as a tree structure. That way, arch fixup
hooks can re-arrange the tree, since the PCI bus->child dependency may
not match the real power dependency. On some portables, some ASICs located
on the PCI bus are not dependent on their parent host bridge's power plane.
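
A rough sketch of such a tree, under stated assumptions: struct
pm_node and the pm_node_*() / pm_suspend_order() helpers are
hypothetical and exist nowhere in the kernel, the fixed-size child
array is only there to keep the example short, and the "children
before parent" suspend order is an assumption about what a tree walk
would want.

```c
#include <stddef.h>

#define PM_MAX_CHILDREN 8

/* Hypothetical PM node; not an existing kernel structure. */
struct pm_node {
	const char *name;
	struct pm_node *parent;
	struct pm_node *child[PM_MAX_CHILDREN];
	int n_children;
};

/* Attach node under parent.  Returns 0, or -1 if parent is full. */
int pm_node_attach(struct pm_node *node, struct pm_node *parent)
{
	if (parent->n_children >= PM_MAX_CHILDREN)
		return -1;
	parent->child[parent->n_children++] = node;
	node->parent = parent;
	return 0;
}

/* Detach node from its current parent, if it has one. */
void pm_node_detach(struct pm_node *node)
{
	struct pm_node *p = node->parent;
	int i;

	if (!p)
		return;
	for (i = 0; i < p->n_children; i++) {
		if (p->child[i] == node) {
			/* fill the hole with the last child */
			p->child[i] = p->child[--p->n_children];
			break;
		}
	}
	node->parent = NULL;
}

/* What an arch fixup might do: move a node to another power plane. */
int pm_node_move(struct pm_node *node, struct pm_node *new_parent)
{
	if (new_parent->n_children >= PM_MAX_CHILDREN)
		return -1;	/* leave the node where it is */
	pm_node_detach(node);
	return pm_node_attach(node, new_parent);
}

/*
 * Record the suspend order for the subtree at node: children first,
 * then the node itself.  Names are written to out[] starting at pos
 * (up to max entries); the next free position is returned.
 */
int pm_suspend_order(const struct pm_node *node, const char **out,
		     int pos, int max)
{
	int i;

	for (i = 0; i < node->n_children; i++)
		pos = pm_suspend_order(node->child[i], out, pos, max);
	if (pos < max)
		out[pos] = node->name;
	return pos + 1;
}
```

After pm_node_move() takes an ASIC out from under its PCI bridge, the
walk no longer ties the ASIC's suspend to that bridge.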

>That's the current state of things. I do not think the system -- at the
>PCI core level -- is poorly designed. I think it just takes a lot of
>grunt work with drivers at this point, plus maybe a few new pci helper
>functions.
>
>So here's a random list of notes and issues on Linux PCI PM.
>
>1) pci_enable_device needs to power up the device before enabling it.

Right. And we need an equivalent power down function. For example, some
drivers may improve power management by powering down the device when
its /dev node is not open (or when the device has been idle for some
time). However, those power up/down functions have to be arch-dependent;
you can't rely on PCI power management being the only PM scheme.

>2) AFAICT, it is safe to turn off a PCI device's bus-mastering bit and
>take the device to D3, if it exports the PCI PM capability. My
>previously-submitted pci_disable_function function turns off the
>bus-mastering bit, and should probably take the device to D3 too.

No, D3 is not safe on all devices. However, if pci_disable_function() is
under driver control, then the driver may decide not to call it. In some
cases, D2 is the only acceptable mode. In other cases, the device doesn't
support PM but the motherboard has ways to shut down the clock or the
power supply.

>3) The current pci_set_power_state implementation is non-spec, and even
>though it works for some cases it does not appear like the right thing
>to do.

Definitely.

>4) Because of #2, I have create pci_power_on and pci_power_off.
>pci_power_off saves ALL the PCI config registers, turns off
>busmastering, and goes to D3. pci_power_on takes the device to D0, then
>blasts the stored PCI config register data back to the hardware.

That's better. I would however separate the config save/restore and
other housekeeping from the actual D-state change. As I wrote earlier,
D3 is not always a good solution and some motherboard-specific mechanism
may be used here. I would have liked a bitmask of "options" of what
the driver allows (D1, D2, D3, static (no clock), power down, ...).
Then, the arch-specific implementation of power down can pick the
best mode supported by the driver.
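
As a minimal sketch of that bitmask idea: the PM_MODE_* bits and
pm_pick_mode() below are hypothetical, and treating a higher bit as
"deeper" (so the best mode is the highest bit both sides allow) is
an assumption. Real platforms may rank these modes differently.

```c
/* Hypothetical option bits, ordered from shallowest to deepest. */
#define PM_MODE_D1        (1u << 0)
#define PM_MODE_D2        (1u << 1)
#define PM_MODE_D3        (1u << 2)
#define PM_MODE_NO_CLOCK  (1u << 3)	/* static, clock stopped */
#define PM_MODE_POWER_OFF (1u << 4)	/* power removed */

/*
 * Return the deepest mode (a single bit) that both the driver and
 * the platform allow, or 0 if they have no mode in common.
 */
unsigned int pm_pick_mode(unsigned int driver_ok, unsigned int arch_ok)
{
	unsigned int common = driver_ok & arch_ok;
	unsigned int bit;

	for (bit = PM_MODE_POWER_OFF; bit != 0; bit >>= 1)
		if (common & bit)
			return bit;
	return 0;
}
```

A driver that cannot recover from a power-on reset would simply leave
PM_MODE_POWER_OFF out of its mask, and the platform code would then
fall back to the next mode both sides accept.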

>5) In testing, this works sometimes, but other times it causes the
>upstream bridge of the device being resumed to stop decoding the device.

I believe the bridge has to be power managed too (save/restore). The
G4 PowerMacs, when going to sleep, will cut power to the PCI<->PCI
bridge as well. The host bridge is another matter and is fully
arch-dependent.

>6) One solution to #4 is to save and restore the PCI bridge registers
>too. This comes partially from a Linus suggestion, and partially from
>an end user who solved their eepro100 suspend/resume problems with a
>setpci command to their PCI bridge (not to the eepro100 device). In my
>own testing this solution works 100%, but (a) it might not be right, and
>thus (b) it might cause problems. I am -very- interested in feedback on
>this solution, or a better one.

Well, I believe we should indeed save & restore PCI<->PCI bridges.
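
To illustrate the ordering only, here is a small simulation rather
than real PCI code. struct fake_pci_dev and the fake_*() helpers are
hypothetical; config space is just an array, and the assumption is
that a bridge must be restored before the devices behind it so they
become reachable again.

```c
#include <stdint.h>
#include <string.h>

#define CFG_DWORDS 16	/* 64-byte standard config header */

/* Stand-in for a PCI function; cfg[] simulates its config space. */
struct fake_pci_dev {
	struct fake_pci_dev *bridge;	/* upstream bridge, or NULL */
	uint32_t cfg[CFG_DWORDS];
	uint32_t saved[CFG_DWORDS];
	int has_saved;
};

/* Snapshot the header while the device is still fully powered. */
void fake_save(struct fake_pci_dev *dev)
{
	memcpy(dev->saved, dev->cfg, sizeof(dev->saved));
	dev->has_saved = 1;
}

/* Simulate a power cycle that makes the device forget its config. */
void fake_power_cycle(struct fake_pci_dev *dev)
{
	memset(dev->cfg, 0, sizeof(dev->cfg));
}

/*
 * Restore upstream bridges first, then the device itself.
 * Returns the number of functions whose config was written back.
 */
int fake_restore(struct fake_pci_dev *dev)
{
	int n = 0;

	if (dev->bridge)
		n = fake_restore(dev->bridge);
	if (dev->has_saved) {
		memcpy(dev->cfg, dev->saved, sizeof(dev->cfg));
		dev->has_saved = 0;
		n++;
	}
	return n;
}
```

Because has_saved is cleared on restore, a bridge shared by several
devices is only written back once.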

I still believe as well that instead of having PCI-specific
suspend/resume functions, we should have a real PM node per PCI node.
That way, we can add additional power notifications.

>7) Due to #5 an open issue is to re-read the bridge and PCI PM specs.
>Some portions of the spec imply that the bridge should never be touched
>during device suspend or resume :)

PCI PM specs should be cached :)

>8) Who can predict what a laptop's AML tables want to do with the PCI
>bus, and if Linux will be interfering with ACPI suspend, or if ACPI will
>be interfering with Linux resume, etc.
>
>9) A truly green driver should register itself then disable its
>hardware. It is wasting power otherwise. That implies waking up
>hardware on dev->open and sleeping on dev->release. Some net drivers do
>this already. This further implies problems down the road with stuff
>like char drivers, where applications often open and close the device
>node very rapidly. This happens in OSS audio land when some audio apps
>start up, for example. Maybe an inactivity timer would work here, to
>power down the device after time passes with no open(2) calls.

I believe it's up to each driver to handle that. Maybe some "framework"
for this can be provided with the generic PM nodes...
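
One possible shape for such a helper, sketched for the inactivity
timer from item 9. Everything here is hypothetical (struct idle_pm,
the idle_pm_*() functions, time counted in arbitrary ticks); a real
driver would call its own power up/down routines where this sketch
only flips the powered flag.

```c
/* Hypothetical per-device bookkeeping; time is in arbitrary ticks. */
struct idle_pm {
	int open_count;			/* current number of openers */
	int powered;			/* 1 while the device is awake */
	unsigned long last_close;	/* tick of the last final close */
	unsigned long idle_timeout;	/* ticks to wait before power down */
};

/* open(): wake the device if needed and count the opener. */
void idle_pm_open(struct idle_pm *pm)
{
	pm->open_count++;
	pm->powered = 1;
}

/* release(): do NOT power down yet, just note when we went idle. */
void idle_pm_release(struct idle_pm *pm, unsigned long now)
{
	if (pm->open_count > 0 && --pm->open_count == 0)
		pm->last_close = now;
}

/*
 * Periodic check: power down only if nobody has the device open and
 * it has stayed closed for idle_timeout ticks.
 * Returns 1 if this call powered the device down, 0 otherwise.
 */
int idle_pm_tick(struct idle_pm *pm, unsigned long now)
{
	if (!pm->powered || pm->open_count > 0)
		return 0;
	if (now - pm->last_close < pm->idle_timeout)
		return 0;
	pm->powered = 0;
	return 1;
}
```

An application that opens and closes the node in quick succession
keeps resetting last_close, so the device stays powered until it has
really been left alone.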

>10) We might wind up needing northbridge, southbridge, and/or PCI bridge
>drivers. They will likely be small, but I think eventually they will
>need to exist in order to provide complete power management coverage.

Indeed, but those are on the arch side. We definitely need to have a way
for the arch to hook deeply into the sleep process. There can be some
weird dependencies on portable motherboards or embedded
devices, like an ethernet device also being used to provide reset
signals to another PCI device, etc...

That's why I prefer the idea of having the PM nodes in a tree and
a node for each PCI device. The arch would then "hook" on the
pci_register_pm_node() or whatever we call it and have the ability
to move the node elsewhere in the tree depending on motherboard
details.

>11) Hard drives. Our IDE and SCSI subsystems stink when it comes to
>working with the PCI PM framework. Andre has spoken of plans to use
>pci_driver in 2.5, and turn the IDE subsystem "inside out" so that PCI
>drivers call out to registration functions, etc., instead of the current
>system. The same thing needs to happen for SCSI.

Right, along with the problem of properly blocking IO queues. A similar
issue exists with all "bus" drivers. USB drivers should be properly notified
of sleep before the host controller gets suspended (I had some weird crashes
that were due to the OHCI controller getting mad because it was being
"tapped" while suspended by some drivers still feeding it with requests).
Also, some devices don't properly handle the errors they may get because
requests were interrupted by sleep.
So buses like USB, FireWire, etc... need a similar "tree" architecture. I
strongly believe generalizing the PM node is the way to go. Being
a "notifier"-like mechanism, it allows adding specific messages if needed
(for example, USB bus suspend is different from machine sleep; that could
be an additional PM message sent by the host controller to USB drivers,
etc...)

>12) Continuing #11, there needs to be a general notion of when the
>system should -not- write stuff to disk. This is mainly a userspace
>issue, ie. low-priority syslog messages should not prevent the system
>from idling the hard drive and spinning it down. BUT.. the kernel may
>need to be the central arbiter if only to have a single place which says
>"hard drive is idle now"...

Yup. That's what I've more or less worked around on PowerBooks with
the 2-step sleep process described above. When the disks and IDE controller
are actually put to sleep (powered off in my case), I will have already
allocated all the memory I need (backup storage for the framebuffer
etc...), have sync'ed the disks, waited a bit, and will no longer schedule
explicitly (and I should no longer schedule implicitly either, but
that can be difficult to achieve as some of the current PM hooks will
cause schedules to happen). I also have a priority-based ordering
mechanism so that I can "fix" some of these issues by putting
things like display & disk to sleep last, but that's a workaround.
>

Regards,
Ben.



2001-04-19 10:20:20

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PCI power management

Hi ! Glad to see things moving around Power Management ;)

>This was originally a private reply to Patrick Mochel, but the e-mail
>kept getting longer and longer :)

Note: we have setup a list for PM issues

http://lists.sourceforge.net/lists/listinfo/linux-pm-devel

Not very much used yet, but I, at least, plan to spam it with all
sort of things we need for PowerBook PM... I'm forwarding your
message there and I suggest we continue that discussion there as well.

>The current state of PCI PM is this:
>
>pci_enable_device (1) enables IO and mem decoding, (2) assigns/routes
>the PCI IRQ, and (3) brings the device to D0 using pci_set_power_state.
>Linus believes the power state transition should occur before (1) and
>(2), and I agree.
>
>pci_set_power_state brings a device to a new D state. If the D state
>transition is D3->D0, then we (1) save key PCI config registers, (2) go
>to D0, and (3) restore saved PCI config registers. This originally
>comes from Donald Becker's acpi_wake function, which is used only for
>the case of device enabling (where he had no problems), not for the case
>of returning-from-suspend (where we see problems).

I beleive the current scheme is not enough. Here are some of my own
thoughts about this:

- Some devices won't properly give you their config space when
in D3 state. You shouldn't save the configuration when in D3 to restore
it after switching to D0, but you must have previously saved it before
originally putting the device into D3 state.

- There need to be some arch "hooks" in this mecanism. Some machines
have the ability (from the arch specific code, by tweaking ASIC bits)
to remove clock and/or power from selected devices. That mean power
management can be done even with devices not supporting PCI PM provided
that the driver can recover them from a PowerOn reset.

- Some devices just can't be brought back to life from D3 state without
a PCI reset (ATI Rage M3 for example) and that require some arch specific
support (when it's possible at all).

- The current scheme provide no way for the kernel to "know" if a
driver can handle recovering the device from a PowerOn reset. Some
drivers can, some can't (the video drivers usually can't as they
require the board's PLL to be properly setup by the BIOS). Some
advanced PM modes we use on pmacs will cause the motherboard ASIC to
turn off power to PCI & AGP cards when putting the machine to sleep.
We need a way to prevent/allow this "deep sleep" mode depending
on what the card supports.

- Ordering of power management may matter. On PowerBooks, we run
through all notifiers first with a "sleep request" message. None of
the drivers will actually put anything to sleep at this point, but
they will allocate all the memory the might need for doing so (saving
state, saving a framebuffer in some cases, etc...). Once all devices
have accepted the request (they can refuse it), I then send a
"sleep now" message. This way, I can make sure all memory allocations
have been performed and disks properly sync'ed before putting the swap
devices to sleep and such things.

- On SMP, we need some way to stop other CPUs in the scheduler
while running the last round of sleep (putting devices to sleep) at least
until all IO layers in Linux can properly handle blocking of IO queues
while the device sleeps.

- We need a generic (non-x86 APM or ACPI dependant) way of including
userland process that request it in the loop. Some userland process
that bang hardware directly (X, but not only X) need to be properly
suspended (and the kernel has to wait for ack from them before continuing
with devices sleep).

>"apm -s" causes the apm driver to map all suspends to the ACPI D3
>state. An apm suspend triggers a pm_send_all call, which in turns
>triggers pci_pm_suspend. This code [from Linus iirc] walks the root
>buses, recursively suspending downstream buses and then attached
>devices. The resume code does the exact opposite. The PCI core
>suspend/resume code has this comment, and we note the current
>requirement that -all- drivers should export suspend/resume somehow, in
>order for a sane PM system to work here.

Yup. They should also be able to return an error (fail or just limit
to a higher level like D2). They should also be able to tell the kernel
if they support recovering from a power down.

>It is up to the drivers to implement ::suspend() and ::resume(), and few
>do. The few that do, even fewer work well in practice.

I would have preferred that a PM node be created for each PCI node and
have the PM nodes organised as a tree structure. That way, arch fixup
hooks can re-arrange the tree as the PCI bus->child dependency may not
be true. On some portables, some ASICs located on the PCI bus are not
dependent on their parent host bridge power plane.

>That's the current state of things. I do not think the system -- at the
>PCI core level -- is poorly designed. I think it just takes a lot of
>grunt work with drivers at this point, plus maybe a few new pci helper
>functions.
>
>So here's a random list of notes and issues on Linux PCI PM.
>
>1) pci_enable_device needs to power up the device before enabling it.

Right. And we need a equivalent power down function. For example, some
drivers may improve power management by powering down the device when
it's /dev node is not opened (or when the device have been idle for some
time). However, those power up/down functions have to be arch-dependant,
you can't rely on the PCI power management to be the only PM scheme.

>2) AFAICT, it is safe to turn off a PCI device's bus-mastering bit and
>take the device to D3, if it exports the PCI PM capability. My
>previously-submitted pci_disable_function function turns off the
>bus-mastering bit, and should probably take the device to D3 too.

No, D3 is not safe on all devices. However, if pci_disable_function() is
under driver control, then the driver may decide not to call it. In some
case, D2 is the only acceptable mode. In other cases, the device doesn't
support PM but the motherboard has ways to shut the clock down or the
power supply.

>3) The current pci_set_power_state implementation is non-spec, and even
>though it works for some cases it does not appear like the right thing
>to do.

Definitely.

>4) Because of #2, I have create pci_power_on and pci_power_off.
>pci_power_off saves ALL the PCI config registers, turns off
>busmastering, and goes to D3. pci_power_on takes the device to D0, then
>blasts the stored PCI config register data back to the hardware.

That's better. I would however separate the config save/backup and
other housekeeping from the actual D-state change. As I wrote earlier,
D3 is not always a good solution and some motherboard specific mecanism
may be used here. I would have liked a bitmask of "options" of what
the driver allows (D1, D2, D3, static (no clock), power down, ...).
Then, the arch-specific implementation of power down can pick the
best mode supported by the driver.

>5) In testing, this works sometimes, but other times it causes the
>upstream bridge of the device being resumed to stop decoding the device.

I beleive the bridge has to be power managed too (save/restore). The
G4 PowerMacs, when going to sleep, will cut power off the PCI<->PCI
bridge as well. The host bridge is another matter and is fully arch
dependant.

>6) One solution to #4 is to save and restore the PCI bridge registers
>too. This comes partially from a Linus suggestion, and partially from
>an end user who solved their eepro100 suspend/resume problems with a
>setpci command to their PCI bridge (not to the eepro100 device). In my
>own testing this solution works 100%, but (a) it might not be right, and
>thus (b) it might cause problems. I am -very- interested in feedback on
>this solution, or a better one.

Well, I beleive we should indeed save & restore PCI<->PCI bridges.

I still beleive as well that instead of having PCI-specific
suspend/resume functions, we should have a real PM node per PCI node.
That way, we can add additional power notifications.

>7) Due to #5 an open issue is to re-read the bridge and PCI PM specs.
>Some portions of the spec imply that the bridge should never be touched
>during device suspend or resume :)

PCI PM specs should be cached :)

>8) Who can predict what a laptop's AML tables want to do with the PCI
>bus, and if Linux will be interfering with ACPI suspend, or if ACPI will
>be interfering with Linux resume, etc.
>
>9) A truly green driver should register itself then disable its
>hardware. It is wasting power otherwise. That implies waking up
>hardware on dev->open and sleeping on dev->release. Some net drivers do
>this already. This further implies problems down the road with stuff
>like char drivers, where applications often open and close the device
>node very rapidly. This happens in OSS audio land when some audio apps
>start up, for example. Maybe an inactivity timer would work here, to
>power down the device after time passes with no open(2) calls.

I beleive it's up to each driver to handle that. Maybe some "framework"
for this can be provided with the generic PM nodes...

>10) We might wind up needing northbridge, southbridge, and/or PCI bridge
>drivers. They will likely be small, but I think eventually they will
>need to exist in order to provide complete power management coverage.

Indeed, but those are in the arch side. We definitely to have a way
for the arch to hook deeply into the sleep process. There can be some
weird dependencies going on on portable motherboards or embedded
devices, like an ethernet device beeing also used to provide reset
signals to another PCI device, etc...

That's why I prefer the idea of having the PM nodes in a tree and
a node for each PCI device. The arch would then "hook" on the
pci_register_pm_node() or whatever we call it and have the ability
to move the node elsewhere in the tree depending on motherboard
details.

>11) Hard drives. Our IDE and SCSI subsystems stink when it comes to
>working with the PCI PM framework. Andre has spoken of plans to use
>pci_driver in 2.5, and turn the IDE subsystem "inside out" so that PCI
>drivers call out to registration functions, etc., instead of the current
>system. The same thing needs to happen for SCSI.

Right, along with the problem of properly blocking IO queues. A similar
issue exists with all "bus" drivers. USB drivers should be properly notified
of sleep before the host controller gets suspended (I had some weird crashes
that were due to the OHCI controller getting mad because it was being
"tapped" while suspended by some drivers still feeding it with requests).
Also, some devices don't properly handle the errors they may get because
requests were interrupted by sleep.
So busses like USB, FireWire, etc... need a similar "tree" architecture. I
strongly believe generalizing the PM node is the way to go. Being
a "notifier"-like mechanism, it allows adding specific messages if needed
(for example, USB bus suspend is different from machine sleep; that could
be an additional PM message sent by the host controller to USB drivers,
etc...)

>12) Continuing #11, there needs to be a general notion of when the
>system should -not- write stuff to disk. This is mainly a userspace
>issue, ie. low-priority syslog messages should not prevent the system
>from idling the hard drive and spinning it down. BUT.. the kernel may
>need to be the central arbiter if only to have a single place which says
>"hard drive is idle now"...

Yup. That's what I've more or less worked around on powerbooks with
the 2-step sleep process described above. When disks and IDE controller
are actually put to sleep (powered off in my case), I will have already
allocated all the memory I need (backup storage for the framebuffer
etc...), have sync'ed the disks, waited a bit, and will no longer schedule
explicitly (and I should no longer schedule implicitly either, but
that can be difficult to achieve as some of the current PM hooks will
cause schedules to happen). I also have a priority-based ordering
mechanism so that I can "fix" some of these issues by putting
things like display & disk to sleep last, but that's a workaround.
>

Regards,
Ben.



2001-04-19 10:39:31

by CaT

[permalink] [raw]
Subject: Re: PCI power management

On Thu, Apr 19, 2001 at 11:19:31AM +0100, Benjamin Herrenschmidt wrote:
> Hi ! Glad to see things moving around Power Management ;)
>
> >This was originally a private reply to Patrick Mochel, but the e-mail
> >kept getting longer and longer :)
>
> Note: we have setup a list for PM issues
>
> http://lists.sourceforge.net/lists/listinfo/linux-pm-devel

Oooooo....

*tries to subscribe*

Doh! The silly thing is trying to use the From_ header on the confirm
rather than the From: header and so I can't subscribe. Can this get fixed?

--
CaT ([email protected]) *** Jenna has joined the channel.
<cat> speaking of mental giants..
<Jenna> me, a giant, bullshit
<Jenna> And i'm not mental
- An IRC session, 20/12/2000

2001-04-19 12:12:05

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PCI power management

>On Thu, Apr 19, 2001 at 11:19:31AM +0100, Benjamin Herrenschmidt wrote:
>> Hi ! Glad to see things moving around Power Management ;)
>>
>> >This was originally a private reply to Patrick Mochel, but the e-mail
>> >kept getting longer and longer :)
>>
>> Note: we have setup a list for PM issues
>>
>> http://lists.sourceforge.net/lists/listinfo/linux-pm-devel
>
>Oooooo....
>
>*tries to subscribe*
>
>Doh! The silly thing is trying to use the From_ header on the confirm
>rather than the From: header and so I can't subscribe. Can this get fixed?

Dunno, it's the standard sourceforge/geocrawler list stuff..

Ben.



2001-04-19 12:57:12

by Alan

[permalink] [raw]
Subject: Re: PCI power management

> - Some devices just can't be brought back to life from D3 state without
> a PCI reset (ATI Rage M3 for example) and that require some arch specific
> support (when it's possible at all).

Putting on a driver author hat what I want is

pci_power_on_generic
pci_power_off_generic
pci_power_on_null
pci_power_off_null

At which point most driver writers are having to do no thinking at all about
their device. The PCI layer just requires they pick a function and stick it
in the struct pci_device.

> - On SMP, we need some way to stop other CPUs in the scheduler
> while running the last round of sleep (putting devices to sleep) at least
> until all IO layers in Linux can properly handle blocking of IO queues
> while the device sleeps.

This doesn't help you. You need device-specific support in each case where
bus mastering is occurring and a bus master error could be fatal if missed.
For example on i2o I can easily have 4Mbytes of outstanding I/O between the
message layer and disk, all of which is bus mastering. Only the driver actually
knows when it's idle.

> that bang hardware directly (X, but not only X) need to be properly
> suspended (and the kernel has to wait for ack from them before continuing
> with devices sleep).

X has hooks for this in XFree 4


2001-04-19 13:16:04

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PCI power management

>> - Some devices just can't be brought back to life from D3 state without
>> a PCI reset (ATI Rage M3 for example) and that require some arch specific
>> support (when it's possible at all).
>
>Putting on a driver author hat what I want is
>
> pci_power_on_generic
> pci_power_off_generic
> pci_power_on_null
> pci_power_off_null
>
>At which point most driver writers are having to do no thinking at all about
>their device. The PCI layer just requires they pick a function and stick it
>in the struct pci_device.

Could you elaborate about the difference between generic and null
functions ? I'm not sure I understand what you mean...

Note that in the case of chips like the Rage M3, the driver is the only
one to know if it will be able to bring back the card from a power off
state or not. It's the only one to know if it can reconfigure the card
completely without having a BIOS run before it.

I would suggest a call that looks like

pci_power_off(uint mask);

where mask is

PCI_POWER_MASK_D1 = 0x00000001
PCI_POWER_MASK_D2 = 0x00000002
PCI_POWER_MASK_D3 = 0x00000004
PCI_POWER_MASK_NOCLOCK = 0x00000008
PCI_POWER_MASK_NOPOWER = 0x00000010

The driver sets the mask to whatever states it supports bringing the card
back from. We can #define a PCI_POWER_MASK_STD (that would be D1+D2+D3) for
"generic" drivers that don't really know anything but to follow the HW
PCI power management capabilities.

This function would be routed to an arch function, that will in turn
either call the lower-level PCI code to set D1, D2 or D3 mode (the best
supported) or will suspend the card's clock or power if it can and the
driver accepts it.

Typically, on a PowerMac, this function could keep track of which cards
are in D2 or D3 mode (or which drivers allowed for clock suspend) and
would stop the PCI clock once they all asked for it.
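
A minimal sketch of how such an arch function might choose a state from
the mask, written as plain userspace C so it can be compiled on its own.
The mask values are the ones proposed above; pick_deepest_state(),
PCI_POWER_MASK_ALL and the rule that a higher bit means a deeper sleep
are assumptions made for illustration, not an existing kernel interface.

```c
#include <assert.h>
#include <stdint.h>

/* Mask bits as proposed in this mail. */
#define PCI_POWER_MASK_D1      0x00000001u
#define PCI_POWER_MASK_D2      0x00000002u
#define PCI_POWER_MASK_D3      0x00000004u
#define PCI_POWER_MASK_NOCLOCK 0x00000008u
#define PCI_POWER_MASK_NOPOWER 0x00000010u
#define PCI_POWER_MASK_STD \
	(PCI_POWER_MASK_D1 | PCI_POWER_MASK_D2 | PCI_POWER_MASK_D3)
/* Hypothetical helper constant: every bit defined above. */
#define PCI_POWER_MASK_ALL \
	(PCI_POWER_MASK_STD | PCI_POWER_MASK_NOCLOCK | PCI_POWER_MASK_NOPOWER)

/*
 * Hypothetical arch helper: return the single deepest state bit that both
 * the driver and the platform allow, or 0 if they have nothing in common.
 * Assumption: a higher bit value means a deeper sleep.
 */
static uint32_t pick_deepest_state(uint32_t driver_mask, uint32_t arch_mask)
{
	uint32_t common = driver_mask & arch_mask;
	uint32_t bit;

	/* Reject bits that this sketch does not know about. */
	assert(((driver_mask | arch_mask) & ~PCI_POWER_MASK_ALL) == 0);

	for (bit = PCI_POWER_MASK_NOPOWER; bit != 0; bit >>= 1)
		if (common & bit)
			return bit;
	return 0;
}
```

A "generic" driver would pass PCI_POWER_MASK_STD here, while a driver that
cannot bring its card back from D3 would pass only the D1 and D2 bits.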

>This doesn't help you. You need device-specific support in each case where
>bus mastering is occurring and a bus master error could be fatal if missed.
>For example on i2o I can easily have 4Mbytes of outstanding I/O between the
>message layer and disk, all of which is bus mastering. Only the driver
>actually knows when it's idle.

Right. That's a driver issue. The problem would go away if all drivers
properly blocked their IO queues and waited for all IO to complete when
notified of sleep.

>X has hooks for this in XFree 4

The last time I looked at it, those were rather APM-specific. But well, I
guess it's easy to update them. What I'm thinking about is the kernel
side, that is, a generic, non-APM and non-ACPI specific way of notifying
userland processes that request it. Some kind of interface allowing
userland to register PM notifiers and have the kernel PM thread be
blocked until the userland code has "acked" the message.

Well, maybe there is already something I missed...

Ben.



2001-04-19 13:25:58

by Jeff Garzik

[permalink] [raw]
Subject: Re: PCI power management

Benjamin Herrenschmidt wrote:
> - On SMP, we need some way to stop other CPUs in the scheduler
> while running the last round of sleep (putting devices to sleep) at least
> until all IO layers in Linux can properly handle blocking of IO queues
> while the device sleeps.

I think either Rusty or Anton wrote code to enable and disable CPUs...

CPU hotplugging but it would be useful for PM too.

--
Jeff Garzik | "The universe is like a safe to which there is a
Building 1024 | combination -- but the combination is locked up
MandrakeSoft | in the safe." -- Peter DeVries

2001-04-19 13:32:38

by Alan

[permalink] [raw]
Subject: Re: PCI power management

> > pci_power_on_generic
> > pci_power_off_generic
> > pci_power_on_null
> > pci_power_off_null
> >
> >At which point most driver writers are having to do no thinking at all about
> >their device. The PCI layer just requires they pick a function and stick it
> >in the struct pci_device.
>
> Could you elaborate about the difference between generic and null
> functions ? I'm not sure I understand what you mean...

null = 'do absolutely nothing'
generic = 'do D3 as per the specification'

The idea being the PM layer would go around calling

dev->power_off(dev);

as a default notifier for PCI devices.

> one to know if it will be able to bring back the card from a power off
> state or not. It's the only one to know if it can reconfigure the card
> completely without having a BIOS run before it.

And in the case of the cards like that you would need a custom mask. So you'd
do
pci_set_power_handler(dev, atyfb_power_on, atyfb_power_off)

to get a custom function. For most authors however they can call the power
handler setup just using prerolled functions that do the right thing and know
about any architecture horrors they don't.
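
A rough sketch of the null/generic/custom handler idea, as standalone C
rather than kernel code. struct fake_pci_dev is a stand-in invented for
this example, the handler bodies only update a field instead of touching
hardware, and the signature of pci_set_power_handler() is a guess based on
the call shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct pci_dev; only the fields this sketch needs. */
struct fake_pci_dev {
	int d_state;	/* 0 == D0 ... 3 == D3 */
	int (*power_on)(struct fake_pci_dev *dev);
	int (*power_off)(struct fake_pci_dev *dev);
};

/* null = do absolutely nothing */
static int pci_power_on_null(struct fake_pci_dev *dev)
{
	(void)dev;
	return 0;
}

static int pci_power_off_null(struct fake_pci_dev *dev)
{
	(void)dev;
	return 0;
}

/* generic = do D3 as per the specification (modelled as a field update) */
static int pci_power_on_generic(struct fake_pci_dev *dev)
{
	dev->d_state = 0;
	return 0;
}

static int pci_power_off_generic(struct fake_pci_dev *dev)
{
	dev->d_state = 3;
	return 0;
}

/* Hypothetical registration helper, named after the call in this mail. */
static void pci_set_power_handler(struct fake_pci_dev *dev,
				  int (*on)(struct fake_pci_dev *),
				  int (*off)(struct fake_pci_dev *))
{
	assert(dev != NULL && on != NULL && off != NULL);
	dev->power_on = on;
	dev->power_off = off;
}
```

The PM layer would then only ever call dev->power_off(dev) and
dev->power_on(dev), whichever pair the driver registered.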

> pci_power_off(uint mask);
>
> where mask is
>
> PCI_POWER_MASK_D1 = 0x00000001
> PCI_POWER_MASK_D2 = 0x00000002
> PCI_POWER_MASK_D3 = 0x00000004
> PCI_POWER_MASK_NOCLOCK = 0x00000008
> PCI_POWER_MASK_NOPOWER = 0x00000010

I'd rather

pci_dev->powerstate

or similar as a set of flags in the device.

Alan

2001-04-19 13:44:50

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PCI power management

>null = 'do absolutely nothing'
>generic = 'do D3 as per the specification'
>
>The idea being the PM layer would go around calling
>
> dev->power_off(dev);
>
>as a default notifier for PCI devices.

Ok, I see. I didn't understand that the functions you were talking about
would be defaults to put directly in the pci_dev structure.

>And in the case of the cards like that you would need a custom mask. So you'd
>do
> pci_set_power_handler(dev, atyfb_power_on, atyfb_power_off)
>
>to get a custom function. For most authors however they can call the power
>handler setup just using prerolled functions that do the right thing and know
>about any architecture horrors they don't.

Right. However, rare are the drivers that don't need at least to know
that a power management sequence is going on. All bus mastering drivers,
at least, must stop bus mastering (and clearing the bit in the command
register is not enough on a bunch of them). Most drivers have to cleanly
stop ongoing operations, refuse (or block) requests while the driver is
sleeping, etc... and finally configure things back once waking up. I
don't see many cases where a simple "default" function would work.

My current scheme on powerbooks doesn't do half of that... it still sorta
works since I manage to stop all scheduling and shut things down in the
proper order, but it's neither a clean nor a safe way to do things.

>I'd rather
>
> pci_dev->powerstate
>
>or similar as a set of flags in the device.

Ok, agree with that one.

I still consider, however, that the current suspend/resume callbacks in
the pci_dev structure are not the best way to do things. I would have
really preferred that each pci_dev embed a pm notifier structure. In some
cases, we want to pass more than simple suspend/resume messages (suspend
request, suspend now, suspend cancel, and resume are the 4 messages I use
on powerbooks).

Also, this can be generalized to other types of drivers (USB, IEEE1394,
...), possibly passing bus-specific messages.

Ben.



2001-04-19 23:04:34

by Patrick Mochel

[permalink] [raw]
Subject: Re: PCI power management


> pci_enable_device (1) enables IO and mem decoding, (2) assigns/routes
> the PCI IRQ, and (3) brings the device to D0 using pci_set_power_state.
> Linus believes the power state transition should occur before (1) and
> (2), and I agree.

Yes, that's correct. The device should have power before it obtains any
resources. And, when coming back from suspend, obtaining those values
simply requires plugging in the values stored in struct pci_dev.

> pci_set_power_state brings a device to a new D state. If the D state
> transition is D3->D0, then we (1) save key PCI config registers, (2) go
> to D0, and (3) restore saved PCI config registers. This originally
> comes from Donald Becker's acpi_wake function, which is used only for
> the case of device enabling (where he had no problems), not for the case
> of returning-from-suspend (where we see problems).

That is bad. There are two D3 states - D3[cold] and D3[hot]. The former is
supported by all devices by default - it is when Vcc has been cut to the
device. The latter is supported by most modern PCI devices, I think. Under
this state, software access to the register space will work correctly.
This may work on power up or coming back from suspend for some wacky
reason, but it shouldn't. Both STR and STD cut power to the PCI bus, so we
shouldn't get anything coherent by reading from the config space.

Is it possible that it works on boot because of any initialization that
the BIOS has done to PCI?

> > * We do not touch devices that don't have a driver that exports
> > * a suspend/resume function. That is just too dangerous. If the default
> > * PCI suspend/resume functions work for a device, the driver can
> > * easily implement them (ie just have a suspend function that calls
> > * the pci_set_power_state() function).

So what happens to devices that don't implement those functions when we
resume? Are they completely reinitialized? If so, you could just be rude
about it and cut power to them, which is essentially what the BIOS is
doing anyway.

Seriously, though, all drivers should implement those functions.

> That's the current state of things. I do not think the system -- at the
> PCI core level -- is poorly designed. I think it just takes a lot of
> grunt work with drivers at this point, plus maybe a few new pci helper
> functions.

Yep. But there is also the case of non-PCI devices - USB, PCMCIA, system
devices attached to the motherboard, ISA devices. It has been mentioned
that we just create a PCI struct for each one, but it seems like overkill.
Plus that would imply a new root bus and a need for the infrastructure to
support non-PCI devices masquerading as PCI devices.

> 1) pci_enable_device needs to power up the device before enabling it.

Yes. Easy enough.

> 2) AFAICT, it is safe to turn off a PCI device's bus-mastering bit and
> take the device to D3, if it exports the PCI PM capability. My
> previously-submitted pci_disable_function function turns off the
> bus-mastering bit, and should probably take the device to D3 too.

You must disable I/O and memory space as well as bus-mastering before
entering any suspend state. For D3, every device supports that, whether it
means just writing to the PowerState field in the Capability or just
cutting power.

> 3) The current pci_set_power_state implementation is non-spec, and even
> though it works for some cases it does not appear like the right thing
> to do.

Hmm. Maybe it should do only what it advertises - set the power state.
Leave the saving/restoring of the state to the driver. There can be a
generic function that does what pci_set_power_state does now wrt the saved
state, and leave pci_set_power_state to only write the bits to the
PowerState field.
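
To illustrate "only write the bits to the PowerState field", here is a
small standalone sketch of the read-modify-write on the PM control/status
word. PCI_PM_CTRL_STATE_MASK has the value used in pci.h;
pmcsr_with_state() is a hypothetical helper, and the config space access
itself (pci_read_config_word / pci_write_config_word) is left out so the
snippet compiles outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_PM_CTRL_STATE_MASK 0x0003	/* PowerState field, as in pci.h */

/*
 * Compute the new control/status value for a D-state change: replace only
 * the two PowerState bits and keep every other bit as it was read.
 */
static uint16_t pmcsr_with_state(uint16_t pmcsr, unsigned int d_state)
{
	assert(d_state <= 3);
	return (uint16_t)((pmcsr & ~PCI_PM_CTRL_STATE_MASK) | d_state);
}
```

A pci_set_power_state reduced to this would read the word at
pm + PCI_PM_CTRL, pass it through the helper and write the result back,
with no saving or restoring of other config registers.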

> 4) Because of #2, I have create pci_power_on and pci_power_off.
> pci_power_off saves ALL the PCI config registers, turns off
> busmastering, and goes to D3. pci_power_on takes the device to D0, then
> blasts the stored PCI config register data back to the hardware.
>
> 5) In testing, this works sometimes, but other times it causes the
> upstream bridge of the device being resumed to stop decoding the device.

Alright, this is where your expertise takes over. The PCI PM Spec says:

"When attempting to place a PCI function in a low power state D1-D3, it is
the operating system's responsibility to ensure that the function has no
pending (host initiated) transactions, or in the case of a bridge device,
that there are no PCI functions behind the bridge that require the bridge
to be in the fully operational D0 state." (Section 8.2, p. 66)

It also goes on to say that it must ensure that there are no peer-to-peer
transfers to the target function during sleep.

So somehow, all knowledge of the PCI function must be disavowed, and I am
pretty clueless as to how to guarantee this.

> 6) One solution to #4 is to save and restore the PCI bridge registers
> too. This comes partially from a Linus suggestion, and partially from
> an end user who solved their eepro100 suspend/resume problems with a
> setpci command to their PCI bridge (not to the eepro100 device). In my
> own testing this solution works 100%, but (a) it might not be right, and
> thus (b) it might cause problems. I am -very- interested in feedback on
> this solution, or a better one.
>
> 7) Due to #5 an open issue is to re-read the bridge and PCI PM specs.
> Some portions of the spec imply that the bridge should never be touched
> during device suspend or resume :)

The PCI PM spec has specific information wrt bridges, but I haven't
looked at it myself.

> 8) Who can predict what a laptop's AML tables want to do with the PCI
> bus, and if Linux will be interfering with ACPI suspend, or if ACPI will
> be interfering with Linux resume, etc.

Laptops _should_ have it down pretty well, though I won't put my lunch
money on it. I would be more worried about desktop systems. One more thing
to blame on broken BIOSes...

> 9) A truly green driver should register itself then disable its
> hardware. It is wasting power otherwise. That implies waking up
> hardware on dev->open and sleeping on dev->release. Some net drivers do
> this already. This further implies problems down the road with stuff
> like char drivers, where applications often open and close the device
> node very rapidly. This happens in OSS audio land when some audio apps
> start up, for example. Maybe an inactivity timer would work here, to
> power down the device after time passes with no open(2) calls.

This is a little farther down the road, but yes, all devices should enter
some sleep state during inactivity. At the very least, the device could
enter D1 initially, and gradually move to D2 and D3[hot] during longer
periods of inactivity.
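
As a sketch of that gradual step-down, in standalone C: the thresholds,
struct idle_dev and both function names are made up for illustration, and
a real driver would drive this from a timer rather than explicit calls.

```c
#include <assert.h>
#include <stddef.h>

enum d_state { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3 };

/* Illustrative idle thresholds in seconds; assumptions, not spec values. */
#define IDLE_SECS_D1      5u
#define IDLE_SECS_D2     30u
#define IDLE_SECS_D3HOT 120u

struct idle_dev {
	unsigned int idle_secs;	/* time since the last activity */
	enum d_state state;
};

/* Account for more idle time and deepen the state once a threshold passes. */
static void idle_tick(struct idle_dev *dev, unsigned int secs)
{
	assert(dev != NULL);
	dev->idle_secs += secs;
	if (dev->idle_secs >= IDLE_SECS_D3HOT)
		dev->state = D3HOT;
	else if (dev->idle_secs >= IDLE_SECS_D2)
		dev->state = D2;
	else if (dev->idle_secs >= IDLE_SECS_D1)
		dev->state = D1;
}

/* Any activity (for example an open(2)) brings the device back to D0. */
static void idle_activity(struct idle_dev *dev)
{
	assert(dev != NULL);
	dev->idle_secs = 0;
	dev->state = D0;
}
```

With these numbers a device that is opened and closed every second or two
never leaves D0, which is the rapid open/close case raised in point 9.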

> 10) We might wind up needing northbridge, southbridge, and/or PCI bridge
> drivers. They will likely be small, but I think eventually they will
> need to exist in order to provide complete power management coverage.

The only thing that they should need are the suspend/resume functions,
right?

> 12) Continuing #11, there needs to be a general notion of when the
> system should -not- write stuff to disk. This is mainly a userspace
> issue, ie. low-priority syslog messages should not prevent the system
> from idling the hard drive and spinning it down. BUT.. the kernel may
> need to be the central arbiter if only to have a single place which says
> "hard drive is idle now"...

There should be a general notion of when a device should not do I/O. Not
just for runtime power management, but also for suspend. What do we do
when we want to go to sleep, but there are continuous requests to write to
disk or to some network card? There has to be a point when a driver cuts
off the I/O, queueing it up (or dropping it). It seems this is best left
up to the driver, since all devices may have a different mechanism for
determining when and where the cutoff point is. Maybe a serialize()
function for struct pci_driver?
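
One way to picture the cutoff, as a standalone sketch: a gate in front of
the hardware that passes requests while open and queues (or, when full,
drops) them while closed. struct io_gate, GATE_DEPTH and the gate_*()
names are all hypothetical; this is not a proposal for what serialize()
would actually look like.

```c
#include <assert.h>
#include <stddef.h>

#define GATE_DEPTH 8	/* arbitrary queue size for the sketch */

struct io_gate {
	int closed;		/* nonzero while the device is suspended */
	int held[GATE_DEPTH];	/* request ids queued while closed */
	size_t nheld;
	int issued;		/* requests passed on to the hardware */
};

/* Returns 0 if issued, 1 if queued, -1 if dropped because the queue is full. */
static int gate_submit(struct io_gate *g, int req_id)
{
	assert(g != NULL);
	if (!g->closed) {
		g->issued++;
		return 0;
	}
	if (g->nheld == GATE_DEPTH)
		return -1;
	g->held[g->nheld++] = req_id;
	return 1;
}

/* Called on suspend: from here on nothing reaches the hardware. */
static void gate_close(struct io_gate *g)
{
	assert(g != NULL);
	g->closed = 1;
}

/* Called on resume: reopen and replay what was queued; returns the count. */
static size_t gate_open(struct io_gate *g)
{
	size_t n;

	assert(g != NULL);
	n = g->nheld;
	g->closed = 0;
	g->issued += (int)n;
	g->nheld = 0;
	return n;
}
```

Whether a full queue should drop, block the caller or return an error is
exactly the per-driver decision described above.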

-pat

2001-04-20 00:12:35

by Patrick Mochel

[permalink] [raw]
Subject: Re: [Linux-pm-devel] Re: PCI power management


> > - On SMP, we need some way to stop other CPUs in the scheduler
> > while running the last round of sleep (putting devices to sleep) at least
> > until all IO layers in Linux can properly handle blocking of IO queues
> > while the device sleeps.
>
> I think either Rusty or Anton wrote code to enable and disable CPUs...
>
> CPU hotplugging but it would be useful for PM too.

There's more than that, too. The ACPI spec says that the system must be
able to handle complete dynamic reconfiguration of the system during
suspend/resume. Basically an ideal solution would assume that any device
could have been added or removed while the system was asleep, so it must
account for it by initializing the device and allocating system resources.

Granted CPU hotplugging is a different ballpark, but it's the same league.

-pat

2001-04-20 00:19:55

by Patrick Mochel

[permalink] [raw]
Subject: Re: PCI power management


> - There need to be some arch "hooks" in this mecanism. Some machines
> have the ability (from the arch specific code, by tweaking ASIC bits)
> to remove clock and/or power from selected devices. That mean power
> management can be done even with devices not supporting PCI PM provided
> that the driver can recover them from a PowerOn reset.

All devices should handle having power removed from them. And, all of the
drivers should as well, since that is the only way we're going to get
power management out of legacy devices and other things on the board. This
involves saving the current context on suspend, and reinitializing the
device, and restoring the context as much as possible when we resume. It
should behave almost identically to the boot-time init code.

> - Some devices just can't be brought back to life from D3 state without
> a PCI reset (ATI Rage M3 for example) and that require some arch specific
> support (when it's possible at all).

When a device comes out of D3[hot], the equivalent of a soft reset is
performed. From D3[cold], PCI RST# is asserted, and the device must be
completely reinitialized.

> - The current scheme provide no way for the kernel to "know" if a
> driver can handle recovering the device from a PowerOn reset. Some
> drivers can, some can't (the video drivers usually can't as they
> require the board's PLL to be properly setup by the BIOS). Some
> advanced PM modes we use on pmacs will cause the motherboard ASIC to
> turn off power to PCI & AGP cards when putting the machine to sleep.
> We need a way to prevent/allow this "deep sleep" mode depending
> on what the card supports.

It's not about what the device supports, it's about what the driver
supports. STR and STD imply that all devices will lose power. The drivers
are responsible for reinitializing the devices, regardless of what that
may involve.

> - Ordering of power management may matter. On PowerBooks, we run
> through all notifiers first with a "sleep request" message. None of
> the drivers will actually put anything to sleep at this point, but
> they will allocate all the memory the might need for doing so (saving
> state, saving a framebuffer in some cases, etc...). Once all devices
> have accepted the request (they can refuse it), I then send a
> "sleep now" message. This way, I can make sure all memory allocations
> have been performed and disks properly sync'ed before putting the swap
> devices to sleep and such things.

Hmm. How about doing two walks of the device tree - the first calls a
save_state() function for each device, which gives it the opportunity to
allocate memory and save appropriate registers, etc. The second actually
places the device in a low power state.

This could give the kernel the chance to disable swap, or for the action
to be cancelled before anything is actually put to sleep.
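
The two walks could be sketched like this, in standalone C. struct pm_dev
and every function name here are invented for the example, the device
"tree" is flattened to an array, and a refusal is modelled by a flag.

```c
#include <assert.h>
#include <stddef.h>

struct pm_dev {
	int can_save;	/* 0 makes save_state() fail, to model a refusal */
	int saved;	/* state has been saved, memory allocated */
	int asleep;	/* device is in a low power state */
};

static int save_state(struct pm_dev *d)
{
	if (!d->can_save)
		return -1;
	d->saved = 1;
	return 0;
}

static void cancel_save(struct pm_dev *d)
{
	d->saved = 0;
}

static void enter_sleep(struct pm_dev *d)
{
	d->asleep = 1;
}

/*
 * Walk 1 saves state on every device and may still be cancelled.
 * Walk 2 only runs if walk 1 succeeded everywhere.
 * Returns 0 on success, -1 if a device refused (nothing is put to sleep).
 */
static int suspend_all(struct pm_dev *devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (save_state(&devs[i]) != 0) {
			/* Undo the saves done so far, newest first. */
			while (i > 0)
				cancel_save(&devs[--i]);
			return -1;
		}
	}

	for (i = 0; i < n; i++)
		enter_sleep(&devs[i]);
	return 0;
}
```

Disabling swap or syncing disks would fit between the two loops, once all
allocations are known to have succeeded.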

> - On SMP, we need some way to stop other CPUs in the scheduler
> while running the last round of sleep (putting devices to sleep) at least
> until all IO layers in Linux can properly handle blocking of IO queues
> while the device sleeps.

Ugh. SMP. Not yet.

> - We need a generic (non-x86 APM or ACPI dependant) way of including
> userland process that request it in the loop. Some userland process
> that bang hardware directly (X, but not only X) need to be properly
> suspended (and the kernel has to wait for ack from them before continuing
> with devices sleep).

Hmm. Like init?

> Yup. They should also be able to return an error (fail or just limit
> to a higher level like D2). They should also be able to tell the kernel
> if they support recovering from a power down.

Another sleep level is not acceptable when entering a system sleep state,
except for S2, but I've never seen a system that supports that. Power will
be cut to all devices, and there is no getting around it. If the driver
can't support reinitializing the device, it should return an error and the
sleep request be cancelled.

The PCI PM Capabilities can be read from a device's config space. The PCI
PM Spec has register descriptions. There are also #defines for the fields
in pci.h. So a driver can know exactly what is expected of it.

> >It is up to the drivers to implement ::suspend() and ::resume(), and few
> >do. The few that do, even fewer work well in practice.
>
> I would have preferred that a PM node be created for each PCI node and
> have the PM nodes organised as a tree structure. That way, arch fixup
> hooks can re-arrange the tree as the PCI bus->child dependency may not
> be true. On some portables, some ASICs located on the PCI bus are not
> dependent on their parent host bridge power plane.

I favor the idea of having a tree view of _all_ devices in the system, but
that's another story, and something I discussed in a post to the
linux-power list.

The PCI bus-child dependency and ordering should always be true, AFAIK.
Some PCI functions may have another source of power, but that should only be
to support the generation of wake events when the device is in D3[cold] -
it must maintain some of its capability state.

> Right. And we need a equivalent power down function. For example, some
> drivers may improve power management by powering down the device when
> it's /dev node is not opened (or when the device have been idle for some
> time). However, those power up/down functions have to be arch-dependant,
> you can't rely on the PCI power management to be the only PM scheme.

Possibly a better term is bus-dependent?

> >2) AFAICT, it is safe to turn off a PCI device's bus-mastering bit and
> >take the device to D3, if it exports the PCI PM capability. My
> >previously-submitted pci_disable_function function turns off the
> >bus-mastering bit, and should probably take the device to D3 too.
>
> No, D3 is not safe on all devices. However, if pci_disable_function() is
> under driver control, then the driver may decide not to call it. In some
> case, D2 is the only acceptable mode. In other cases, the device doesn't
> support PM but the motherboard has ways to shut the clock down or the
> power supply.

How is D3 not safe on all devices? You mean to tell me that I cannot turn
my workstation off because it is not safe to cut power to some device?
Every device supports that state. When placing the system in a sleep
state, you have no choice. D0-D2 are not an option. It's the _driver_ that
has problems and must be fixed if it can't recover from D3.

> I believe it's up to each driver to handle that. Maybe some "framework"
> for this can be provided with the generic PM nodes...

You mean a ... policy?

Yes, it is definitely needed, and should be able to be genericized for all
PM schemes and all types (buses).

> Indeed, but those are on the arch side. We definitely need to have a way
> for the arch to hook deeply into the sleep process. There can be some
> weird dependencies going on on portable motherboards or embedded
> devices, like an ethernet device being also used to provide reset
> signals to another PCI device, etc...

It is the responsibility of the PM layer to ensure that this doesn't
happen. This is not the fault of the device or the driver, but must be
disabled.

> That's why I prefer the idea of having the PM nodes in a tree and
> a node for each PCI device. The arch would then "hook" on the
> pci_register_pm_node() or whatever we call it and have the ability
> to move the node elsewhere in the tree depending on motherboard
> details.

I don't understand why you would want to change the parent of a device. A
device will always sit behind a bridge, logically if nothing else. It
should adhere to the semantics of the bus on which it resides. This could
just be fanciful idealism, but damn it, it makes sense.

Though, I can see the need for a driver to have multiple nodes in the
device tree. If it were a PCI card, it would have one that was a child of
the root PCI bus. But it could also implement some logical ACPI object,
such as a wake-enabled device, in which case another node would be a child
of the root bus. Maybe.

> So busses like USB, FireWire, etc... need a similar "tree" architecture. I
> strongly believe generalizing the PM node is the way to go. Being
> a "notifier"-like mechanism, it allows adding specific messages if needed
> (for example, USB bus suspend is different from machine sleep; that could
> be an additional PM message sent by the host controller to USB drivers,
> etc...)

What about considering just the USB root host or Firewire equivalent as
nodes in the tree? When they are put to sleep, they handle walking the
device scheme that lies behind them, much in the same manner that PCI does
it now. This way, a bus-specific implementation could be achieved,
depending on what is needed.

There are a couple of things that I wanted to respond to. First, it is
evident that a PM scheme must be implemented for the bridges. They support
various power states, as well as have state that must be preserved across
suspend.

A tree view of all the devices in the system is needed to support
proper ordering when suspending and resuming. At the moment, it's not
necessary to modify anything to obtain one, at least for PCI. PCI handles
walking its own device tree, which is not a bad model for the rest of
the buses present on the system.
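
For what "proper ordering" means on a tree, a small standalone sketch:
children are suspended before their parent, and the parent is resumed
before its children. struct pm_node, struct pm_log and the function names
are hypothetical, and the log only records the visit order in place of
real suspend/resume calls.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define MAX_LOG 16

struct pm_node {
	int id;
	struct pm_node *child[MAX_CHILDREN];
	size_t nchild;
};

/* Records the order in which nodes were visited. */
struct pm_log {
	int id[MAX_LOG];
	size_t n;
};

static void log_visit(struct pm_log *log, int id)
{
	assert(log != NULL && log->n < MAX_LOG);
	log->id[log->n++] = id;
}

/* Suspend: post-order, so every child sleeps before its parent bridge. */
static void suspend_tree(const struct pm_node *node, struct pm_log *log)
{
	size_t i;

	for (i = 0; i < node->nchild; i++)
		suspend_tree(node->child[i], log);
	log_visit(log, node->id);
}

/* Resume: pre-order, so the parent bridge is awake before its children. */
static void resume_tree(const struct pm_node *node, struct pm_log *log)
{
	size_t i;

	log_visit(log, node->id);
	for (i = 0; i < node->nchild; i++)
		resume_tree(node->child[i], log);
}
```

A USB or FireWire host controller could be one node here and run the same
kind of walk over the devices behind it.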

But, I also can see a benefit in a two-stage approach, where a call is
made to save the state of each device, then another is called to put the
device to sleep. In this case, a complete tree view almost seems
necessary. Or at least like we would only have to implement the interface
once, instead of n times.

-pat

p.s. Every device supports D3. It must. The drivers must be fixed to do so
as well. It's absolutely necessary in order to support system sleep
states.

2001-04-20 12:35:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: PCI power management

>All devices should handle having power removed from them. And, all of the
>drivers should as well, since that is the only way we're going to get
>power management out of legacy devices and other things on the board. This
>involves saving the current context on suspend, and reinitializing the
>device, and restoring the context as much as possible when we resume. It
>should behave almost identically to the boot-time init code.

Right. In fact, at the driver level, power management involves
2 different things:

- Handling context save & restore of the device state

- Blocking of "user" (I mean user of the driver, which can be
a kernel service) requests properly. In some cases, this latter
thing can be done by returning errors, provided that upper-level
drivers are ready to handle them. For example, the IDE layer should
probably just block the IO queues while the IDE subsystem is powered
off (not talking about disk sleep, but complete power off of the
controller), while a USB host controller should probably return
errors to URBs sent by drivers to a sleeping controller since those
upper-level drivers should have been put to sleep before the host
controller.
That part is almost completely overlooked right now.

>> - Some devices just can't be brought back to life from D3 state without
>> a PCI reset (ATI Rage M3 for example) and that require some arch specific
>> support (when it's possible at all).
>
>When a device comes out of D3[hot], the equivalent of a soft reset is
>performed. From D3[cold], PCI RST# is asserted, and the device must be
>completely reinitialized.

Some devices (bad bad HW designers ;) just can't do it themselves. The
Rage M3 requires the host to assert PCI RST#, and some motherboards
provide no documented facility for that (it might be possible with Apple
ASICs for example, it's just not documented).

Also, still in the case of the Rage M3, we just can't bring it out of
D3 for the same reason we can't bring the r128 in the AGP slot of a
Cube Mac out of PowerOff: the complete init sequence of those chips
is dependent on the chip revision, requires information about
undocumented registers that we don't have (at least that's my understanding
from talks with ATI) and so can basically only be done by a BIOS (or
OpenFirmware driver in my case), and we can't run that on wakeup (OF
is dead on macs once the kernel takes over). So we have to limit
ourselves to D2 mode on machines that don't remove power from the
slots (powerbooks, ibooks & imacs) and we can't do deep sleep at all
on machines that remove power from the slot (Cube, G4s, ...), at least
until we figure out the proper init sequence for those cards.

So the point here, as far as the kernel is concerned, is that drivers
should have a way to let the kernel know the min/max power state they
support.

>It's not about what the device supports, it's about what the driver
>supports. STR and STD imply that all devices will lose power. The drivers
>are responsible for reinitializing the devices, regardless of what that
>may involve.

Right. I'm typing too fast, but that's what I meant.

>Hmm. How about doing two walks of the device tree - the first calls a
>save_state() function for each device, which gives it the opportunity to
>allocate memory and save appropriate registers, etc. The second actually
>places the device in a low power state.
>
>This could give the kernel the chance to disable swap, or for the action
>to be cancelled before anything is actually put to sleep.

Yup. That's approximately what I do with the PPC-specific
"sleep notifiers" we are using. The only difference is that the real
save state is done on the "sleep now" (latest) request, not on the
"sleep request" (earlier) request.

The basic idea here is that the first pass will do all of the memory
allocation (or whatever requires all system resources to be available;
that can be sending a special power management message to the device,
like enabling remote wakeup on USB, etc.). So this first pass
requires system services (all other drivers if you prefer, especially
the swap device) to be fully alive.

The second pass will do the actual IO blocking, state save, and eventually
enter device suspend mode for cases where it's controlled by the driver.
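
The two passes can be sketched roughly as follows. This is userspace
C with hypothetical names (pm_msg, pm_client, pm_suspend_all,
toy_notify); it is not the PPC sleep notifier code, only an
illustration of the idea that any client may veto during the first
pass, while everything is still alive, and that nothing is put to
sleep until every client has agreed. The PM_SLEEP_REJECT message is
an assumption added so that clients can release what they allocated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message set for the two passes. */
enum pm_msg { PM_SLEEP_REQUEST, PM_SLEEP_REJECT, PM_SLEEP_NOW };

/* A notifier returns 0 to accept, nonzero to veto a sleep request. */
typedef int (*pm_notify_fn)(void *dev, enum pm_msg msg);

struct pm_client {
    pm_notify_fn notify;
    void *dev;
};

/* Returns 0 if every client went to sleep, -1 if one vetoed in pass 1. */
int pm_suspend_all(struct pm_client *c, size_t n)
{
    size_t i;

    /* Pass 1: system fully alive; clients may allocate or refuse. */
    for (i = 0; i < n; i++) {
        if (c[i].notify(c[i].dev, PM_SLEEP_REQUEST) != 0) {
            /* Undo: tell the clients that had already agreed. */
            while (i-- > 0)
                c[i].notify(c[i].dev, PM_SLEEP_REJECT);
            return -1;
        }
    }

    /* Pass 2: block IO, save state, enter the low power state. */
    for (i = 0; i < n; i++)
        c[i].notify(c[i].dev, PM_SLEEP_NOW);
    return 0;
}

/* Example client used to exercise the walk. */
struct toy_state { int veto; int prepared; int asleep; };

int toy_notify(void *dev, enum pm_msg msg)
{
    struct toy_state *s = dev;

    switch (msg) {
    case PM_SLEEP_REQUEST:
        if (s->veto)
            return -1;
        s->prepared = 1;    /* e.g. save buffer allocated here */
        return 0;
    case PM_SLEEP_REJECT:
        s->prepared = 0;    /* release what pass 1 set up */
        return 0;
    case PM_SLEEP_NOW:
        s->asleep = 1;
        return 0;
    default:
        return 0;
    }
}
```

If any client refuses, no client ever receives PM_SLEEP_NOW, so the
request is cancelled before anything is actually put to sleep.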

>> - On SMP, we need some way to stop other CPUs in the scheduler
>> while running the last round of sleep (putting devices to sleep) at least
>> until all IO layers in Linux can properly handle blocking of IO queues
>> while the device sleeps.
>
>Ugh. SMP. Not yet.

Well, if all drivers properly handle blocking of IOs, the SMP issue will
be easy to handle. Having the other CPUs run is not a problem as long as
any IO triggered by processes on those is properly blocked by sleeping
drivers. All that is needed is a cross-CPU function call to force the
other CPU into an idle loop (or an idle/sleep loop on PPC) on the very
last step of entering suspend mode.

>> - We need a generic (non-x86 APM or ACPI dependant) way of including
>> userland process that request it in the loop. Some userland process
>> that bang hardware directly (X, but not only X) need to be properly
>> suspended (and the kernel has to wait for ack from them before continuing
>> with devices sleep).
>
>Hmm. Like init?

Maybe. I have to study what init does a bit more closely. What I had
in mind was a kind of ioctl that would allocate a PM node in the kernel
tied to a given file descriptor. The PM thread would call it as part of
the normal chain of PM notifiers. This notifier would then signal (or
complete the ioctl or whatever) and block the PM thread (possibly with
a timeout) until the userland process acks the state change with
another ioctl, or the fd gets closed.

>Another sleep level is not acceptable when entering a system sleep state,
>except for S2, but I've never seen a system that supports that. Power will
>be cut to all devices, and there is no getting around it. If the driver
>can't support reinitializing the device, it should return an error and the
>sleep request be cancelled.

Why? Some boards support various power levels. On PowerBooks, I know
precisely (well, almost...) what a given motherboard will do when
entering deep sleep. On embedded systems, you know exactly what you
are doing, and in some cases, the sleep process is completely controlled
by the kernel, so you can do whatever you want.

For example, on Apple PowerBooks, iMacs and iBooks, the video chip
is put in "static" mode (unclocked). Additionally, the RageM3 in the
PowerBook and iBook can be put to D2 mode by the driver before that
(which is supposed to have the effect of properly shutting down the
LVDS transmitter).

>The PCI PM Capabilities can be read from a device's config space. The PCI
>PM Spec has register descriptions. There are also #defines for the fields
>in pci.h. So a driver can know exactly what is expected of it.

Well, that depends... some devices lie in their config space. In some
cases, the device _can_ do D3, but the driver can't revive it out of D3
(but can revive it out of D2).

Again, this is a matter of arch policy and depends on what the motherboard
supports. All we should do on the driver side is advertise which states
we can bring the device back from. Then, the arch-specific code will do
whatever it can depending on what the drivers say.

>I favor the idea of having a tree view of _all_ devices in the system, but
>that's another story, and something I discussed in a post to the
>linux-power list.

Hehe, right ;) we somewhat have the OF device tree on pmacs, but we don't
use it very much (mostly for initial retrieval of interrupt routing, and
for probing of device cells inside Apple combo-ASIC).

>The PCI bus-child dependency and ordering should always be true, AFAIK.
>Some PCI functions may have another source of power, but should only be
>to support the generation of wake events when the device is in D3[cold] -
>it must maintain some of its capability state.

Well... you are optimistic about what HW engineers can invent... I think
the PM tree should, by default, be built with the same hierarchy as the
PCI tree. But the arch should have a way to re-arrange it. The simplest
way I see to do that is to have the generic PCI code "insert" the PM
nodes in the PM tree when probing devices using a function
(pci_add_pm_node() for example) that can be hooked by the arch.

>Possibly a better term is bus-dependent?

Right. That's why I prefer the notifier mechanism, which allows you to
easily define additional messages while keeping an overall coherency
in the design.

>How is D3 not safe on all devices? You mean to tell me that I cannot turn
>my workstation off because it is not safe to cut power to some device?
>Every device supports that state. When placing the system in a sleep
>state, you have no choice. D0-D2 are not an option. It's the _driver_ that
>has problems and must be fixed if it can't recover from D3.

Ever heard about bogus hardware? Some devices require an external assert
of PCI RST# to get out of D3, and some motherboards can't provide it
without a complete reboot of the machine (which is _not_ what happens
when putting a powerbook to sleep, for example).

I'm also not sure that all ethernet controllers can do wake-on-lan in D3
mode. They might (unconfirmed) be able to do it in D1 or D2.

You are right on one point: in most cases, it's the driver that has
problems recovering the device. Mostly because of a lack of
documentation. That _does_ happen.

Let's build a mechanism flexible enough for drivers to tell what they
can do and for the arch to then decide what to do.

>It is the responsibility of the PM layer to ensure that this doesn't
>happen. This is not the fault of the device or the driver, but must be
>disabled.

It's the responsibility of the motherboard-specific (arch) hooks in
the PM layer to know about that, yup.

>I don't understand why you would want to change the parent of a device. A
>device will always sit behind a bridge, logically if nothing else. It
>should adhere to the semantics to the bus on which it resides. This could
>just be fanciful idealism, but damn it, it makes sense.
>
>Though, I can see the need for a driver to have multiple nodes in the
>device tree. If it were a PCI card, it would have one that was a child of
>the root PCI bus. But it could also implement some logical ACPI object,
>such as a wake-enabled device, in which case another node would be a child
>of the root bus. Maybe.

My idea is not so much about changing the parent, but changing the ordering
at the same level of the tree... That could eventually be done with a
"priority" like field in the PM node. You always have to revive the
parent bridge first of course, as you can't access the device without it.
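
Roughly what I mean by a priority field, as a userspace sketch. The
names (pm_node, pm_add_child, pm_resume_order) and the convention
that a lower priority value wakes earlier are assumptions made up for
the illustration. Siblings are kept sorted by priority when they are
inserted, and the resume walk always visits the parent before its
children.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical PM tree node; lower priority wakes earlier among siblings. */
struct pm_node {
    const char *name;
    int priority;
    struct pm_node *child;    /* first child, sorted by priority */
    struct pm_node *sibling;  /* next sibling at the same level */
};

/* Insert n under parent, keeping the sibling list in ascending priority. */
void pm_add_child(struct pm_node *parent, struct pm_node *n)
{
    struct pm_node **p = &parent->child;

    while (*p && (*p)->priority <= n->priority)
        p = &(*p)->sibling;
    n->sibling = *p;
    *p = n;
}

/*
 * Record the resume order into out[] starting at index pos: the parent
 * first (it must be alive before its children can be accessed), then
 * the children in priority order. Returns the next free index.
 */
size_t pm_resume_order(const struct pm_node *n, const char **out, size_t pos)
{
    const struct pm_node *c;

    out[pos++] = n->name;
    for (c = n->child; c; c = c->sibling)
        pos = pm_resume_order(c, out, pos);
    return pos;
}
```

An arch hook could then simply give a lower priority value to a node
that has to come up before its siblings; the suspend walk would be the
same list in reverse.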

>What about considering just the USB root host or Firewire equivalent as
>nodes in the tree. When they are put to sleep, they handle walking the
>device scheme that lies behind them, much in the same manner that PCI does
>it now. This way, a bus-specific implementation could be achieved,
>depending on what is needed.

Well, that could be done that way too. I like the idea of the generic PM
notifier interface providing the tree structure, priority value, and
notifier function. Then, the messages passed to the notifier function can
be bus dependent. This would also make it possible to broadcast some
"system wide" messages that might or might not be handled by some
individual devices, or to define new messages if a given bus supports
more than one power state.
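
As an illustration only, with made-up names and message numbers
(PM_MSG_BUS_BASE, pm_listener, pm_broadcast), the message space could
be split into a generic range and a bus-specific range, and a
listener that does not understand a message just reports that it
ignored it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical split: values below the base are system wide. */
#define PM_MSG_BUS_BASE 0x100

enum {
    PM_MSG_SYSTEM_SLEEP = 1,                    /* generic message */
    PM_MSG_PCI_ENTER_D2 = PM_MSG_BUS_BASE + 2   /* bus-specific message */
};

#define PM_HANDLED 0
#define PM_IGNORED 1

struct pm_listener {
    int (*notify)(struct pm_listener *self, int msg);
    int last_msg;   /* last message this listener acted on */
};

/* Send msg to every listener; returns how many of them handled it. */
int pm_broadcast(struct pm_listener **l, size_t n, int msg)
{
    size_t i;
    int handled = 0;

    for (i = 0; i < n; i++)
        if (l[i]->notify(l[i], msg) == PM_HANDLED)
            handled++;
    return handled;
}

/* A listener that only understands the generic messages. */
int generic_notify(struct pm_listener *self, int msg)
{
    if (msg >= PM_MSG_BUS_BASE)
        return PM_IGNORED;
    self->last_msg = msg;
    return PM_HANDLED;
}

/* A listener on a bus that also defines its own messages. */
int pci_notify(struct pm_listener *self, int msg)
{
    self->last_msg = msg;
    return PM_HANDLED;
}
```

A new bus-level message then costs one more constant above the base,
and existing listeners keep working because they ignore it.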

>There are a couple of things that I wanted to respond to. First, it is
>evident that a PM scheme must be implemented for the bridges. They support
>various power states, as well as have state that must be preserved across
>suspend.

Right.

>A tree view of the all the devices in the system is needed to support
>proper ordering when suspending and resuming. At the moment, it's not
>necessary to modify anything to obtain, at least for PCI. PCI handles
>walking its own device tree, which is not a bad model for the rest of
>the buses present on the system.

Except that I believe we need a way to handle ordering at a given level
of the tree because of possible dependencies introduced by the
motherboard. On macs, for example, the mac-io ASIC must be woken up
first as it provides clocks & reset signals to other devices and
handles the interface to the power management microcontroller.

>But, I also can see a benefit in a two-stage approach, where a call is
>made to save the state of each device, then another is called to put the
>device to sleep. In this case, a complete tree view almost seems
>necessary. Or at least like we would only have to implement the interface
>once, instead of n times.

Well, generalizing the notifier approach makes it easy to define new
messages.

>p.s. Every device supports D3. It must. The drivers must be fixed to do so
>as well. It's absolutely necessary in order to support system sleep
>states.

Why? My PowerBook is pretty happily sleeping with its video controller
in D2 state and clock removed...

Ben.



2001-04-20 12:41:19

by Jeff Garzik

Subject: Re: PCI power management

Benjamin Herrenschmidt wrote:
> >When a device comes out of D3[hot], the equivalent of a soft reset is
> >performed. From D3[cold], PCI RST# is asserted, and the device must be
> >completely reinitialized.
>
> Some devices (bad bad HW designers ;) just can't do it themselves. The
> Rage M3 requires the host to assert PCI RST#, and some motherboards
> provide no documented facility for that (it might be possible with Apple
> ASICs for example, it's just not documented).

Why should we support such a non-spec device? Tell ATI to fix their
hardware, and tell users (a) not to use the hardware, or (b) use the
hardware with the knowledge that you are screwed when it comes to Power
Management.

Unless there are more cases like this, this should not factor at all
into the modifications to the PCI and PM code...

--
Jeff Garzik | The difference between America and England is that
Building 1024 | the English think 100 miles is a long distance and
MandrakeSoft | the Americans think 100 years is a long time.
| (random fortune)

2001-04-20 12:57:20

by Benjamin Herrenschmidt

Subject: Re: PCI power management

>> Some devices (bad bad HW designers ;) just can't do it themselves. The
>> Rage M3 requires the host to assert PCI RST#, and some motherboards
>> provide no documented facility for that (it might be possible with Apple
>> ASICs for example, it's just not documented).
>
>Why should we support such a non-spec device? Tell ATI to fix their
>hardware, and tell users (a) not to use the hardware, or (b) use the
>hardware with the knowledge that you are screwed when it comes to Power
>Management.
>
>Unless there are more cases like this, this should not factor at all
>into the modifications to the PCI and PM code...

Well, I can tell all PowerBook and iBook users to forget about sleep...

Also, that would not be the first time we have to deal with poorly
documented hardware. I don't think we should refuse to handle any
hardware that is out of spec... it would be like saying Linux doesn't
support any x86 with a broken BIOS...

It's not so complicated to have the minimum flexibility for the driver
to tell its maximum supported power level, and I don't see why it would
be a problem to use D2 instead of D3 when we don't support D3 for a given
device (either because the HW is broken, undocumented, or because our
driver just doesn't know how to bring the chip back to life).

If the motherboard _requires_ it (because it will cut power from the chip),
then we can refuse to enter sleep when one driver can't do it (instead of
letting the user crash the box badly).

In any case, I believe you are focusing on a point of detail. All
I'm asking for (in this specific case) is a simple mask of flags set
by the driver to tell what it can handle. It's also useful for
devices that don't support PM on machines whose motherboards provide
a facility to turn OFF power on selected cards. It would allow us to
turn off cards whose drivers can handle recovering.
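
Something along these lines, as a userspace sketch. The flag names
and pm_pick_state() are hypothetical; in particular PM_CAN_D3 is
taken here to mean "the driver can reinitialize the device after
power has been cut", which is an assumption of the sketch and not an
existing kernel flag.

```c
#include <assert.h>

/* Hypothetical per-driver mask: states the driver can recover from. */
#define PM_CAN_D1 (1u << 1)
#define PM_CAN_D2 (1u << 2)
#define PM_CAN_D3 (1u << 3)

#define PM_REFUSE (-1)

/*
 * Arch-side decision for one device. If the board cuts power to the
 * slot, only a driver that can recover from D3 is acceptable and the
 * sleep request is refused otherwise. If the slot stays powered, pick
 * the deepest state the driver says it can come back from.
 */
int pm_pick_state(unsigned int driver_mask, int slot_loses_power)
{
    int state;

    /* Only the D1..D3 bits are meaningful in this sketch. */
    assert((driver_mask & ~(PM_CAN_D1 | PM_CAN_D2 | PM_CAN_D3)) == 0);

    if (slot_loses_power)
        return (driver_mask & PM_CAN_D3) ? 3 : PM_REFUSE;

    for (state = 3; state >= 1; state--)
        if (driver_mask & (1u << state))
            return state;
    return 0;   /* nothing deeper supported: leave the device in D0 */
}
```

A driver that sets PM_CAN_D1 | PM_CAN_D2 would get D2 on a machine
that keeps the slot powered, and would make the sleep request fail
cleanly on a machine that removes power from the slot.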

Also, I don't think the problem of powering the chip back up and
re-initing it from scratch is specific to those ATI chips. Look at
XFree, it has to run a BIOS emulator to soft boot video chips. On
PCs, I believe you have the BIOS that re-inits them when waking up
from an APM or ACPI suspend. On non-PCs, where suspend is not handled
by the firmware but directly by the kernel, that's not the case.

Ben.



2001-04-21 09:09:44

by Russell King

Subject: Re: PCI power management

On Fri, Apr 20, 2001 at 02:56:15PM +0200, Benjamin Herrenschmidt wrote:
> It's not so complicated to have the minimum flexibility for the driver
> to tell it's maximum supported power level, and I don't see why it would
> be a problem to use D2 instead of D3 when we don't support D3 for a given
> device (either because the HW is broken, undocumented, or because our
> driver just don't know how to bring back the chip to life).

Umm, isn't it true that most VGA cards will have this problem? Are we
going to put an x86 emulator into the kernel so we can run the BIOS on
non-x86 hardware, just so that we can re-initialise the chip? ;|

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html