2001-10-18 00:07:53

by Patrick Mochel

Subject: [RFC] New Driver Model for 2.5


One July afternoon, while hacking on the pm_dev layer for the purpose of
system-wide power management support, I decided that I was quite tired of
trying to make this layer look like a tree and feel like a tree, but not
have any real integration with the actual device drivers..

I had read the accounts of what the goals were for 2.5. And, after some
conversations with Linus and the (gasp) ACPI guys, I realized that I had a
good chunk of the infrastructural code written; it was a matter of working
out a few crucial details and massaging it in nicely.

I have had the chance this week (after moving and vacationing) to update
the (read: write some) documentation for it. I will not go into details,
and will let the document speak for itself.

With all luck, this should go into the early stages of 2.5, and allow a
significant cleanup of many drivers. Such a model will also allow for neat
tricks like full device power management support, and Plug N Play
capabilities.


In order to support the new driver model, I have written a small in-memory
filesystem, called ddfs, to export a unified interface to userland. It is
mentioned in the doc, and is pretty self-explanatory. More information
will be available soon.


There is code available for the model and ddfs at:

http://kernel.org/pub/linux/kernel/people/mochel/device/

but there are some fairly large caveats concerning it.

First, I feel comfortable with the device layer code and the ddfs
code. The PCI code, though, is still a work in progress. I am still working
out some of the finer details concerning it.

Next is the environment under which I developed it all. It was on an ia32
box, with only PCI support, and using ACPI. The latter didn't have too
much of an effect on the development, but there are a few items explicitly
inspired by it..

I am hoping both the PCI code and the structure in general can be
further improved based on the input of the driver maintainers.


This model is not final, and may be way off from what most people actually
want. It has gotten tentative blessing from all those that have seen it,
though they number but a few. It's definitely not the only solution...

That said, enjoy; and have at it.

-pat


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The (New) Linux Kernel Driver Model

Version 0.01

17 October 2001


Overview
~~~~~~~~

This driver model is a unification of all the disparate driver models that
are currently in the kernel. It is intended to augment the
bus-specific drivers for bridges and devices by consolidating a set of data
and operations into globally accessible data structures.

Current driver models implement some sort of tree-like structure (sometimes
just a list) for the devices they control. But, there is no linkage between
the different bus types.

A common data structure can provide this linkage with little overhead: when a
bus driver discovers a particular device, it can insert it into the global
tree as well as its local tree. In fact, the local tree becomes just a subset
of the global tree.

Common data fields can also be moved out of the local bus models into the
global model. Some of the manipulation of these fields can also be
consolidated. Most likely, manipulation functions will become a set
of helper functions, which the bus drivers wrap around to include any
bus-specific items.

The common device and bridge interface currently reflects the goals of the
modern PC: namely the ability to do seamless Plug and Play, power management,
and hot plug. (The model dictated by Intel and Microsoft (read: ACPI) ensures
us that any device in the system may fit any of these criteria.)

In reality, not every bus will be able to support such operations. But, most
buses will support a majority of those operations, and all future buses will.
In other words, a bus that doesn't support an operation is the exception,
instead of the other way around.


Drivers
~~~~~~~

The callbacks for bridges and devices are intended to be singular for a
particular type of bus. For each type of bus that has support compiled in the
kernel, there should be one statically allocated structure with the
appropriate callbacks that each device (or bridge) of that type shares.

Each bus layer should implement the callbacks for these drivers. It then
forwards the calls on to the device-specific callbacks. This means that
device-specific drivers must still implement callbacks for each operation.
But, they are not called from the top level driver layer.

This does add another layer of indirection for calling one of these functions,
but there are benefits that are believed to outweigh this slowdown.

First, it prevents device-specific drivers from having to know about the
global device layer. This speeds up integration time incredibly. It also
allows drivers to be more portable across kernel versions. Note that the
former was intentional, the latter is an added bonus.

Second, this added indirection allows the bus to perform any additional logic
necessary for its child devices. A bus layer may add additional information to
the call, or translate it into something meaningful for its children.

This could be done in the driver, but if it happens for every object of a
particular type, it is best done at a higher level.
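
The indirection described above can be sketched roughly as follows. This is
only an illustration under stated assumptions: the "foo" bus, struct foo_dev,
struct foo_driver and every foo_* function are hypothetical names, and the
two structures are cut down to the fields the example needs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the global structures in this document. */
struct device;

struct device_driver {
	int (*suspend)(struct device *dev, unsigned int state);
};

struct device {
	struct device_driver *driver;	/* one shared instance per bus type */
	unsigned int current_state;
};

/* Hypothetical bus-specific device and driver types. */
struct foo_dev;

struct foo_driver {
	int (*suspend)(struct foo_dev *fdev, unsigned int state);
};

struct foo_dev {
	struct foo_driver *driver;	/* device-specific callbacks */
	int suspend_calls;		/* bookkeeping for the example only */
	struct device device;		/* embedded global device */
};

/* Recover the bus-specific object from the embedded global one. */
static struct foo_dev *to_foo_dev(struct device *dev)
{
	assert(dev != NULL);
	return (struct foo_dev *)((char *)dev - offsetof(struct foo_dev, device));
}

/* Bus-layer callback: the only suspend() the global layer ever calls. */
static int foo_bus_suspend(struct device *dev, unsigned int state)
{
	struct foo_dev *fdev = to_foo_dev(dev);
	int ret = 0;

	/* Bus-wide logic lives here; this bus only knows states 0-3. */
	if (state > 3)
		return -1;

	/* Forward the call to the device-specific driver, if it has one. */
	if (fdev->driver && fdev->driver->suspend)
		ret = fdev->driver->suspend(fdev, state);
	if (ret == 0)
		dev->current_state = state;
	return ret;
}

/* The single, statically allocated driver for this bus type. */
static struct device_driver foo_bus_driver = {
	.suspend = foo_bus_suspend,
};

/* A device-specific callback that never sees struct device. */
static int foo_nic_suspend(struct foo_dev *fdev, unsigned int state)
{
	(void)state;
	fdev->suspend_calls++;
	return 0;
}

static struct foo_driver foo_nic_driver = {
	.suspend = foo_nic_suspend,
};
```

The global layer only ever calls dev->driver->suspend(); the state check is
an example of logic that applies to every device on the bus and so sits in
the bus layer rather than in each driver.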

Recap
~~~~~

Instances of devices and bridges are allocated dynamically as the system
discovers their existence. Their fields describe the individual object.
Drivers - in the global sense - are statically allocated and singular for a
particular type of bus. They describe a set of operations that every type of
bus could implement, the implementation following the bus's semantics.


Downstream Access
~~~~~~~~~~~~~~~~~

Common data fields have been moved out of individual bus layers into a common
data structure. But, these fields must still be accessed by the bus layers,
and sometimes by the device-specific drivers.

Other bus layers are encouraged to do what has been done for the PCI layer.
struct pci_dev now looks like this:

struct pci_dev {
	...

	struct device device;
};

Note first that it is statically allocated: struct device is embedded in
struct pci_dev rather than referenced through a pointer. This means only one
allocation on device discovery. Note also that it is at the _end_ of struct
pci_dev. This is to make people think about what they're doing when switching
between the bus driver and the global driver, and to guard against mindless
casts between the two.

The PCI bus layer freely accesses the fields of struct device. It knows about
the structure of struct pci_dev, and it should know the structure of struct
device. PCI devices that have been converted generally do not touch the fields
of struct device. More precisely, device-specific drivers should not touch
fields of struct device unless there is a strong compelling reason to do so.

This abstraction prevents unnecessary pain during transitional phases.
If a field of struct device is renamed or removed, then every downstream
driver that touches it will break. On the other hand, if only the bus layer
(and not the device layer) accesses struct device, only the bus layer needs
to change.
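
One way a bus layer could keep device-specific drivers away from the fields
of struct device is to hand them small accessors instead. The sketch below
assumes a reduced struct pci_dev with two illustrative fields; the names
device_id, to_pci_dev, example_set_drvdata and example_get_drvdata are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced global device: only the field this example uses. */
struct device {
	void *driver_data;
};

/* Reduced bus-specific device with the global one embedded at the end. */
struct pci_dev {
	unsigned short vendor;
	unsigned short device_id;
	struct device device;
};

/*
 * Because the embedded struct is not the first member, a plain cast
 * between the two types would be wrong; the offset must be applied.
 */
static struct pci_dev *to_pci_dev(struct device *dev)
{
	assert(dev != NULL);
	return (struct pci_dev *)((char *)dev - offsetof(struct pci_dev, device));
}

/* Accessors owned by the bus layer; drivers call these, not the field. */
static void example_set_drvdata(struct pci_dev *pdev, void *data)
{
	pdev->device.driver_data = data;
}

static void *example_get_drvdata(struct pci_dev *pdev)
{
	return pdev->device.driver_data;
}
```

If driver_data were later renamed or moved, only the two accessors would
need to change, not every driver that stores private data.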


User Interface
~~~~~~~~~~~~~~

By virtue of having a complete hierarchical view of all the devices in the
system, exporting a complete hierarchical view to userspace becomes relatively
easy. Whenever a device is inserted into the tree, a file or directory can be
created for it.

In this model, a directory is created for each bridge and each device. When it
is created, it is populated with a set of default files, first at the global
layer, then at the bus layer. The device layer may then add its own files.

These files export data about the driver and can be used to modify behavior of
the driver or even device.

For example, at the global layer, a file named 'status' is created for each
device. When read, it reports to the user the name of the device, its bus ID,
its current power state, and the name of the driver it's using.

By writing to this file, you can have control over the device. By writing
"suspend 3" to this file, one could place the device into power state "3".
Basically, by writing to this file, the user has access to the operations
defined in struct device_driver.
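
As a rough illustration of the write side, the sketch below parses a command
such as "suspend 3" and dispatches it to the matching callback. The function
status_write, the accepted command set and the demo_* callbacks are all
assumptions made for this example; the actual ddfs write callback and its
signature are not shown in this document.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct device;

struct device_driver {
	int (*suspend)(struct device *dev, unsigned int state);
	int (*resume)(struct device *dev);
};

struct device {
	struct device_driver *driver;
	unsigned int current_state;
};

/*
 * Handle one command written to the 'status' file.
 * Returns the callback's result, or -1 for an unknown or malformed command.
 */
static int status_write(struct device *dev, const char *buf)
{
	char cmd[16];
	unsigned int state = 0;
	int n;

	assert(dev != NULL && buf != NULL);

	/* n counts how many of "<command> <state>" were parsed. */
	n = sscanf(buf, "%15s %u", cmd, &state);
	if (n < 1 || !dev->driver)
		return -1;

	if (strcmp(cmd, "suspend") == 0) {
		if (n != 2 || !dev->driver->suspend)
			return -1;	/* "suspend" needs a target state */
		return dev->driver->suspend(dev, state);
	}
	if (strcmp(cmd, "resume") == 0) {
		if (!dev->driver->resume)
			return -1;
		return dev->driver->resume(dev);
	}
	return -1;
}

/* Trivial callbacks so the dispatch can be exercised. */
static int demo_suspend(struct device *dev, unsigned int state)
{
	dev->current_state = state;
	return 0;
}

static int demo_resume(struct device *dev)
{
	dev->current_state = 0;
	return 0;
}

static struct device_driver demo_driver = {
	.suspend = demo_suspend,
	.resume = demo_resume,
};
```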

The PCI layer also adds default files. For devices, it adds a "resource" file
and a "wake" file. The former reports the BAR information for the device; the
latter reports the wake capabilities of the device.

The device layer could also add files for device-specific data reporting and
control.

The dentry to the device's directory is kept in struct device. It also keeps a
linked list of all the files in the directory, with pointers to their read and
write callbacks. This allows the driver layer to maintain full control of its
destiny. If it desired to override the default behavior of a file, or simply
remove it, it could easily do so. (It is assumed that the files added upstream
will always be a known quantity.)

These features were initially implemented using procfs. However, after one
conversation with Linus, a new filesystem - ddfs - was created to implement
these features. It is an in-memory filesystem, based heavily off of ramfs,
though it uses procfs as inspiration for its callback functionality.


Device Structures
~~~~~~~~~~~~~~~~~

struct device {
	struct list_head bus_list;
	struct io_bus *parent;
	struct io_bus *subordinate;

	char name[DEVICE_NAME_SIZE];
	char bus_id[BUS_ID_SIZE];

	struct dentry *dentry;
	struct list_head files;

	struct semaphore lock;

	struct device_driver *driver;
	void *driver_data;
	void *platform_data;

	u32 current_state;
	unsigned char *saved_state;
};

bus_list:
List of all devices on a particular bus; i.e. the device's siblings

parent:
The parent bridge for the device.

subordinate:
If the device is a bridge itself, this points to the struct io_bus that is
created for it.

name:
Human readable (descriptive) name of device. E.g. "Intel EEPro 100"

bus_id:
Parsable (yet ASCII) bus id. E.g. "00:04.00" (PCI Bus 0, Device 4, Function
0). It is necessary to have a searchable bus id for each device; making it
ASCII allows us to use it for its directory name without translating it.

dentry:
Pointer to driver's ddfs directory.

files:
Linked list of all the files that a driver has in its ddfs directory.

lock:
Driver specific lock.

driver:
Pointer to a struct device_driver, the common operations for each device. See
next section.

driver_data:
Private data for the driver.
Much like the PCI implementation of this field, this allows device-specific
drivers to keep a pointer to device-specific data.

platform_data:
Data that the platform (firmware) provides about the device.
For example, the ACPI BIOS or EFI may have additional information about the
device that is not directly mappable to any existing kernel data structure.
It also allows the platform driver (e.g. ACPI) to pass information to a driver
without the driver having to have explicit knowledge of (atrocities like) ACPI.


current_state:
Current power state of the device. For PCI and other modern devices, this is
0-3, though it's not necessarily limited to those values.

saved_state:
Pointer to driver-specific set of saved state.
Having it here allows modules to be unloaded on system suspend and reloaded
on resume and maintain state across transitions.
It also allows generic drivers to maintain state across system state
transitions.
(I've implemented a generic PCI driver for devices that don't have a
device-specific driver. Instead of managing some vector of saved state
for each device the generic driver supports, it can simply store it here.)



struct device_driver {
	int (*probe)(struct device *dev);
	int (*remove)(struct device *dev);

	int (*init)(struct device *dev);
	int (*shutdown)(struct device *dev);

	int (*save_state)(struct device *dev, u32 state);
	int (*restore_state)(struct device *dev);

	int (*suspend)(struct device *dev, u32 state);
	int (*resume)(struct device *dev);
};


probe:
Check for device existence and associate driver with it.

remove:
Dissociate driver from device. Releases device so that it could be used by
another driver. Also, if it is a hotplug device (hotplug PCI, Cardbus), an
ejection event could take place here.

init:
Initialise the device - allocate resources, irqs, etc.

shutdown:
"De-initialise" the device - release resources, free memory, etc.

save_state:
Save current device state before entering suspend state.

restore_state:
Restore device state, after coming back from suspend state.

suspend:
Physically enter suspend state.

resume:
Physically leave suspend state and re-initialise hardware.


Initially, the probe/remove sequence followed the PCI semantics exactly, but
it has since been broken up into a four-stage process: probe(), init(),
shutdown(), and remove().

While it's not entirely necessary in all environments, breaking them up so
each routine does only one thing makes sense.

Hot-pluggable devices may also benefit from this model, especially ones that
can be subjected to surprise removals - only the remove function would be
called, and the driver could easily know if there was still hardware there
to shut down.

Drivers that are controlling failing, or buggy, hardware may benefit as well,
by allowing the user to trigger a removal of the driver from userspace without
trying to shut down the device.

In each case that remove() is called without a shutdown(), it's important to
note that resources will still need to be freed; it's only the hardware that
cannot be assumed to be present.
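
A minimal sketch of that rule, under assumptions of my own choosing: struct
demo_dev, its bookkeeping fields and the demo_* functions are hypothetical,
and "hardware access" is simulated with a counter. The point is only that
remove() frees whatever init() allocated when shutdown() was skipped, while
never touching the hardware itself.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_dev {
	int hw_present;		/* cleared by a (simulated) surprise removal */
	int hw_accesses;	/* counts simulated register accesses */
	int initialised;	/* init() ran and shutdown() has not */
	char *buffer;		/* a resource owned by the driver */
};

/* probe: check that the device exists. */
static int demo_probe(struct demo_dev *d)
{
	return d->hw_present ? 0 : -1;
}

/* init: allocate resources and program the hardware. */
static int demo_init(struct demo_dev *d)
{
	d->buffer = malloc(64);
	if (!d->buffer)
		return -1;
	d->hw_accesses++;
	d->initialised = 1;
	return 0;
}

/* Shared by shutdown() and remove(): release software resources only. */
static void demo_release(struct demo_dev *d)
{
	free(d->buffer);
	d->buffer = NULL;
	d->initialised = 0;
}

/* shutdown: quiesce the hardware, then release resources. */
static int demo_shutdown(struct demo_dev *d)
{
	assert(d->hw_present);	/* only valid while the hardware is there */
	d->hw_accesses++;
	demo_release(d);
	return 0;
}

/* remove: dissociate from the device without assuming it still exists. */
static int demo_remove(struct demo_dev *d)
{
	if (d->initialised)	/* shutdown() was skipped */
		demo_release(d);
	return 0;
}
```

In the normal path the sequence is probe, init, shutdown, remove; after a
surprise removal it is probe, init, remove, and the hardware access counter
stays where init() left it.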


Suspend/resume transitions are broken into four stages as well to provide
graceful recovery from a failed suspend attempt; and to ensure that state gets
stored in a non-volatile location before the system (and its devices) are
suspended.

When a suspend transition is triggered, the device tree is walked first to
save the state of all the devices in the system. Once this is complete, the
saved state, now residing in memory, can be written to some non-volatile
location, like a disk partition or network location.

The device tree is then walked again to suspend all of the devices. This
guarantees that the device controlling the location to write the state is
still powered on while you have a snapshot of the system state.

If a device is in a critical I/O transaction, or for some other reason cannot
stand to be suspended, it can notify the kernel by failing in the save state
step. At this point, state can either be restored, or dropped, for all the
devices that had already been touched, and execution may resume. No
devices will have been powered off at this point, making it much easier to
recover.

The resume transition is broken up into two steps mainly to stress the
singularity of each step: resume() powers on the device and reinitialises it;
restore_state() restores the device and bus-specific registers of the device.
resume() will happen with interrupts disabled; restore_state() with them
enabled.
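
The two-pass suspend sequence might look something like the sketch below. It
makes several simplifying assumptions: the device tree is flattened into an
array, the busy and state_saved fields are example bookkeeping that is not
part of struct device, restore_state() is reused to drop state on rollback,
and system_suspend and the demo_* callbacks are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct device;

struct device_driver {
	int (*save_state)(struct device *dev, unsigned int state);
	int (*restore_state)(struct device *dev);
	int (*suspend)(struct device *dev, unsigned int state);
};

struct device {
	struct device_driver *driver;
	unsigned int current_state;
	int busy;		/* example only: device refuses to suspend */
	int state_saved;	/* example only: save_state() has run */
};

static int system_suspend(struct device **devs, size_t n, unsigned int state)
{
	size_t i;

	assert(devs != NULL);

	/* Pass 1: save state everywhere; nothing is powered off yet. */
	for (i = 0; i < n; i++) {
		if (devs[i]->driver->save_state(devs[i], state) != 0) {
			/* Undo the devices already touched, newest first. */
			while (i-- > 0)
				devs[i]->driver->restore_state(devs[i]);
			return -1;
		}
	}

	/* The saved state could be written to non-volatile storage here. */

	/* Pass 2: actually power the devices down. */
	for (i = 0; i < n; i++) {
		if (devs[i]->driver->suspend(devs[i], state) != 0)
			return -1;	/* recovery from this is not covered here */
	}
	return 0;
}

static int demo_save_state(struct device *dev, unsigned int state)
{
	(void)state;
	if (dev->busy)
		return -1;
	dev->state_saved = 1;
	return 0;
}

static int demo_restore_state(struct device *dev)
{
	dev->state_saved = 0;
	return 0;
}

static int demo_suspend(struct device *dev, unsigned int state)
{
	dev->current_state = state;
	return 0;
}

static struct device_driver demo_driver = {
	.save_state = demo_save_state,
	.restore_state = demo_restore_state,
	.suspend = demo_suspend,
};
```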


Bus Structures
~~~~~~~~~~~~~~

struct io_bus {
	struct list_head node;
	struct io_bus *parent;
	struct list_head children;
	struct list_head devices;

	struct list_head bus_list;

	struct device *self;
	struct dentry *dentry;
	struct list_head files;

	char name[DEVICE_NAME_SIZE];
	char bus_id[BUS_ID_SIZE];

	struct bus_driver *driver;
};

node:
Bus's node in sibling list (its parent's list of child buses).

parent:
Pointer to parent bridge.

children:
List of subordinate buses.
In the children, this correlates to their 'node' field.

devices:
List of devices on the bus this bridge controls.
This field corresponds to the 'bus_list' field in each child device.

bus_list:
Each type of bus keeps a list of all bridges that it finds. This is the
bridge's entry in that list.

self:
Pointer to the struct device for this bridge.

dentry:
Every bus also gets a ddfs directory to which files can be added, as well as
child device directories. Actually, every bridge will have two directories -
one for the bridge device, and one for the subordinate device.

files:
Each bus also gets a list of the files that are in the ddfs directory, for
the same reasons as the devices - to have explicit control over the behavior
and easy access to each file that any higher layers may have added.

name:
Human readable ASCII name of bus.

bus_id:
Machine readable (though ASCII) description of position on parent bus.

driver:
Pointer to operations for bus.


struct bus_driver {
	char name[16];
	struct list_head node;
	int (*scan)(struct io_bus*);
	int (*rescan)(struct io_bus*);
	int (*add_device)(struct io_bus*, char*);
	int (*remove_device)(struct io_bus*, struct device*);
	int (*add_bus)(struct io_bus*, char*);
	int (*remove_bus)(struct io_bus*, struct io_bus*);
};

name:
ASCII name of bus.

node:
List of buses of this type in system.

scan:
Search the bus for devices. This is meant to be done only once - when the
bridge is initially discovered.

rescan:
Search the bus again and look for changes. I.e. check for device insertion or
removal.

add_device:
Trigger a device insertion at a particular location.

remove_device:
Trigger the removal of a particular device.

add_bus:
Trigger insertion of a new bridge device (and child bus) at a particular
location on the bus.

remove_bus:
Remove a particular bridge and subordinate bus.
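
To make the relationship between a bus's 'devices' list and each device's
'bus_list' node concrete, here is a userspace sketch. It assumes cut-down
versions of struct device and struct io_bus and a minimal re-implementation
of a circular doubly linked list; bus_init, bus_add_device and
bus_find_device are hypothetical helpers, not the callbacks in struct
bus_driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Minimal circular doubly linked list, in the style of the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

struct io_bus;

/* Cut-down versions of the structures described above. */
struct device {
	struct list_head bus_list;	/* node in parent->devices */
	struct io_bus *parent;
	char bus_id[16];
};

struct io_bus {
	struct list_head devices;	/* head of the child device list */
	char name[16];
};

static void bus_init(struct io_bus *bus, const char *name)
{
	list_init(&bus->devices);
	snprintf(bus->name, sizeof(bus->name), "%s", name);
}

/* Link a newly discovered device into its parent bus. */
static void bus_add_device(struct io_bus *bus, struct device *dev,
			   const char *bus_id)
{
	assert(bus != NULL && dev != NULL && bus_id != NULL);
	dev->parent = bus;
	snprintf(dev->bus_id, sizeof(dev->bus_id), "%s", bus_id);
	list_add_tail(&dev->bus_list, &bus->devices);
}

/* Look a device up by its ASCII bus id; returns NULL if not found. */
static struct device *bus_find_device(struct io_bus *bus, const char *bus_id)
{
	struct list_head *p;

	for (p = bus->devices.next; p != &bus->devices; p = p->next) {
		struct device *dev = (struct device *)
			((char *)p - offsetof(struct device, bus_list));

		if (strcmp(dev->bus_id, bus_id) == 0)
			return dev;
	}
	return NULL;
}
```

A scan() implementation would presumably call something like
bus_add_device() for each device it finds, and the same ASCII bus id could
then double as the device's directory name.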







2001-10-18 06:23:06

by Jeff Garzik

Subject: Re: [RFC] New Driver Model for 2.5

Patrick Mochel wrote:
> Initially, the probe/remove sequence followed the PCI semantics exactly, but
> it has since been broken up into a four-stage process: probe(), init(),
> shutdown(), and remove().
>
> While it's not entirely necessary in all environments, breaking them up so
> each routine does only one thing makes sense.
>
> Hot-pluggable devices may also benefit from this model, especially ones that
> can be subjected to surprise removals - only the remove function would be
> called, and the driver could easily know if there was still hardware there
> to shut down.
>
> Drivers that are controlling failing, or buggy, hardware may benefit as well,
> by allowing the user to trigger a removal of the driver from userspace without
> trying to shut down the device.
>
> In each case that remove() is called without a shutdown(), it's important to
> note that resources will still need to be freed; it's only the hardware that
> cannot be assumed to be present.

So, remove() might be called without a shutdown(), and then asked to
perform the duties normally performed by shutdown()? That sounds like
API dain bramage. :)

Your proposal sounds ok, my one objection is separating probe/remove
further into init/shutdown. Can you give real-life cases where this
will be useful? I don't see it causing much except headache.

The preferred way of doing things (IMHO) is to do some simple sanity
checking of the h/w device at probe time, and then perform lots of
initialization and such at device/interface open time. You ideally want
a device driver lifecycle to look like

probe:
register interface
sanity check h/w to make sure it's there and alive
stop DMA/interrupts/etc., just in case
start timer to powerdown h/w in N seconds

dev_open:
wake up device, if necessary
init device

dev_close:
stop DMA/interrupts/etc.
start timer to powerdown h/w in N seconds

With that in mind, init -really- happens at device open, and in
addition is driven more through normal user interaction via standard
APIs than through the PCI and PM subsystems.
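
The lifecycle above can be sketched in plain userspace C. This is only an
illustration: struct toy_dev, its fields, and the toy_* functions are
hypothetical names invented here, not part of any kernel API, and the
powerdown timer is modelled as a simple countdown value.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a device; not a kernel structure. */
struct toy_dev {
    bool present;      /* hardware answers the sanity check */
    bool registered;   /* interface (e.g. a char dev) is registered */
    bool powered;      /* hardware is awake */
    bool initialised;  /* full initialisation has been done */
    bool dma_running;  /* DMA/interrupts are active */
    int  powerdown_in; /* seconds until idle powerdown; 0 = no timer armed */
};

enum { IDLE_POWERDOWN_SECS = 5 };

/* probe: cheap checks only. Returns 0 on success, -1 if no hardware. */
int toy_probe(struct toy_dev *d)
{
    d->registered = true;                  /* register interface */
    if (!d->present) {                     /* sanity check failed: undo */
        d->registered = false;
        return -1;
    }
    d->dma_running = false;                /* stop DMA/interrupts, just in case */
    d->powerdown_in = IDLE_POWERDOWN_SECS; /* arm the idle powerdown timer */
    return 0;
}

/* dev_open: the real initialisation happens here. */
int toy_open(struct toy_dev *d)
{
    if (!d->registered)
        return -1;
    d->powerdown_in = 0;   /* cancel any pending powerdown */
    d->powered = true;     /* wake up device, if necessary */
    d->initialised = true; /* init device */
    d->dma_running = true;
    return 0;
}

/* dev_close: quiesce and let the idle timer power the device down. */
void toy_close(struct toy_dev *d)
{
    d->dma_running = false;
    d->initialised = false;
    d->powerdown_in = IDLE_POWERDOWN_SECS;
}
```

In this model probe never touches anything beyond the presence check, so a
device that is probed but never opened costs nothing but the timer.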

--
Jeff Garzik | "Mind if I drive?" -Sam
Building 1024 | "Not if you don't mind me clawing at the dash
MandrakeSoft | and shrieking like a cheerleader." -Max

2001-10-18 12:13:15

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>>
>> struct device {
>> struct list_head bus_list;
>> struct io_bus *parent;
>> struct io_bus *subordinate;
>>
>> char name[DEVICE_NAME_SIZE];
>> char bus_id[BUS_ID_SIZE];
>>
>> struct dentry *dentry;
>> struct list_head files;
>>
>> struct semaphore lock;
>>
>> struct device_driver *driver;
>> void *driver_data;
>> void *platform_data;
>>
>> u32 current_state;
>> unsigned char *saved_state;
>> };

Hi Patrick ! Nice to see this happening ;)

I would add to the generic structure device, a "uuid" string field.
This field would contain a "munged" unique identifier composed of
the bus type followed by whatever bus-specific unique ID is
provided by the driver. If the driver doesn't provide one, it defaults
to a copy of the busID.

What I have in mind here is to have a common place to look for the
best possible unique identification for a device. A typical example is
ieee1394 hard disks, which do have a unique ID and so can be properly
tracked between insertion and removal.

Also, I'd like to see a simple ability for the arch code to add
entries to the exposed device filesystem nodes. The main reason for
this is that on machines like PPC with OpenFirmware or Sun with
OpenBoot, it makes a lot of sense (and is very useful for bootloader
configuration among others) to be able to know the firmware "path"
corresponding to a given device. On PPC, the generic PCI code can
do the conversion between an Open Firmware device node and a PCI
device in the kernel, but doing so from userland is a lot more
tricky. The device filesystem is a very good way to fix that problem
once and for all.

>The preferred way of doing things (IMHO) is to do some simple sanity
>checking of the h/w device at probe time, and then perform lots of
>initialization and such at device/interface open time. You ideally want
>a device driver lifecycle to look like
>
>probe:
> register interface
> sanity check h/w to make sure it's there and alive
> stop DMA/interrupts/etc., just in case
> start timer to powerdown h/w in N seconds
>
>dev_open:
> wake up device, if necessary
> init device
>
>dev_close:
> stop DMA/interrupts/etc.

I completely agree there as well. In some cases, the suspend (or powerdown)
of the device can even be triggered on an open device with an idle timer.
Good candidates are hard disks and sound (which is often kept open
all the time by the userland mixer).

However, there is another important point about power management I
discovered the "hard way", which is memory allocation vs. turning
off of swapping devices (that is the swap device itself or any device
on which you may have mmap'ed files).

For "transient" power management (that is dynamically putting a
subsystem to sleep when idle until it gets a new request), there
is no real problem provided that the driver can do the wakeup without
allocating memory.

For system power sleep, where you actually shut down everything,
the problem happens when you start shutting down those "swap" devices.
Once done, you may be in a situation where another device, to be shut
down or to wake up properly, needs to allocate memory (see for example
USB devices that need to allocate URBs). This may cause requests
to swap_out which will block indefinitely if trying to swap out
pages to an already sleeping device.

I "work around" this in the PowerBook sleep code in a somewhat dumb way
which works in 99% of cases but is probably broken as well if you
are really near OOM. Basically, instead of calling only the "suspend"
callbacks of devices, I have an additional "suspend requested"
one that is sent to every driver using my specific PM scheme _before_
starting the real round of "suspend" callbacks. Drivers that need
a significant amount of backup memory (like some framebuffers) will
do the necessary allocations from this early callback.

Another issue with suspend and resume is with interrupt sharing and
some bad devices that unconditionally assert their interrupt line
when put to any PM state. On the contrary, some drivers, in order
to properly block any new request in its queues and wait for any
pending one to complete, may need to operate with interrupts still
enabled. I discussed that a bit with Alan, and it seems that we really
need 2 rounds of "suspend" callbacks in this case (at least for
system suspend), one with interrupts still enabled, one with interrupts
disabled.

Finally, I have another need for which I'm not sure how to react
with either the current scheme or the new scheme. On "desktop"
Apple systems (at least all the recent G4 ones), the PCI bus will
be effectively powered down during system sleep. That means that
we must (at least that's what both MacOS and MacOS X do) prevent
the complete system sleep when at least one PCI slot contains a
card for which the driver can't properly restore the state after
a complete shutdown. This frequently happens, for example, with
video cards that rely on some initial chip & pll configuration
to be done by the firmware. We may be able to fall back to some
kind of "light suspend" where we suspend any device we can but
not the motherboard, but that means that the "main" PM code has
to know about the problem and needs some way to know if a given
node in the device tree can or cannot be revived from a given
power state (in this case, we might consider being powered down
as equivalent to D3 state). My current solution is to not allow
system sleep at all on those desktop machines.

Regards,
Ben.


2001-10-18 15:32:25

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5




> So, remove() might be called without a shutdown(), and then asked to
> perform the duties normally performed by shutdown()? That sounds like
> API dain bramage. :)

:) I cannot disagree. I probably shouldn't have actually stated that in
the document, since the cases in which that could happen are very rare -
hotplug devices that can survive a surprise removal. Those drivers are
special and (should they ever exist) will have to know to do
that. Consider it removed.

> Your proposal sounds ok, my one objection is separating probe/remove
> further into init/shutdown. Can you give real-life cases where this
> will be useful? I don't see it causing much except headache.
>
> The preferred way of doing things (IMHO) is to do some simple sanity
> checking of the h/w device at probe time, and then perform lots of
> initialization and such at device/interface open time. You ideally want
> a device driver lifecycle to look like
>
> probe:
> register interface
> sanity check h/w to make sure it's there and alive
> stop DMA/interrupts/etc., just in case
> start timer to powerdown h/w in N seconds
>
> dev_open:
> wake up device, if necessary
> init device
>
> dev_close:
> stop DMA/interrupts/etc.
> start timer to powerdown h/w in N seconds
>
> With that in mind, init -really- happens at device open, and in
> addition is driven more through normal user interaction via standard
> APIs than through the PCI and PM subsystems.

I agree. My main goal was to change probe() to be a simple answer to the
question "Hey, are you there?", and move the init features out of it.
In devices that support power management, that would happen anyway,
so that resume() could re-init the device.

I will update the code and the document to note that.

Thanks,

-pat

2001-10-18 16:07:39

by Taral

Subject: Re: [RFC] New Driver Model for 2.5

On Wed, Oct 17, 2001 at 04:52:29PM -0700, Patrick Mochel wrote:
> When a suspend transition is triggered, the device tree is walked first to
> save the state of all the devices in the system. Once this is complete, the
> saved state, now residing in memory, can be written to some non-volatile
> location, like a disk partition or network location.
>
> The device tree is then walked again to suspend all of the devices. This
> guarantees that the device controlling the location to write the state is
> still powered on while you have a snapshot of the system state.

Aha! A much nicer solution to the problem the ACPI people are having
with suspend/resume (ordering problems).

--
Taral <[email protected]>
This message is digitally signed. Please PGP encrypt mail to me.
"Any technology, no matter how primitive, is magic to those who don't
understand it." -- Florence Ambrose



2001-10-18 16:34:52

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5


Hi there.

> I would add to the generic structure device, a "uuid" string field.
> This field would contain a "munged" unique identifier composed of
> the bus type followed by whatever bus-specific unique ID is
> provided by the driver. If the driver doesn't provide one, it defaults
> to a copy of the busID.
>
> What I have in mind here is to have a common place to look for the
> best possible unique identification for a device. A typical example is
> ieee1394 hard disks, which do have a unique ID and so can be properly
> tracked between insertion and removal.

Hmm. So, this would be a device ID, much like the Vendor/Device ID pair in
PCI space? Does this need to happen at the top layer, or can it work at
the bus layer?

> Also, I'd like to see a simple ability for the arch code to add
> entries to the exposed device filesystem nodes. The main reason for
> this is that on machines like PPC with OpenFirmware or Sun with
> OpenBoot, it makes a lot of sense (and is very useful for bootloader
> configuration among others) to be able to know the firmware "path"
> corresponding to a given device. On PPC, the generic PCI code can
> do the conversion between an Open Firmware device node and a PCI
> device in the kernel, but doing so from userland is a lot more
> tricky. The device filesystem is a very good way to fix that problem
> once and for all.

That shouldn't be too hard. ACPI wants to do something like that as well -
they will be able to ascertain information about some devices that we
otherwise wouldn't know, and will want to export that to userspace.

The idea was to make a call to platform_notify() on each device
registration, so the platform/firmware/arch could do things like that.

> However, there is another important point about power management I
> discovered the "hard way", which is memory allocation vs. turning
> off of swapping devices (that is the swap device itself or any device
> on which you may have mmap'ed files).
>
> For "transient" power management (that is dynamically putting a
> subsystem to sleep when idle until it gets a new request), there
> is no real problem provided that the driver can do the wakeup without
> allocating memory.
>
> For system power sleep, where you actually shut down everything,
> the problem happens when you start shutting down those "swap" devices.
> Once done, you may be in a situation where another device, to be shut
> down or to wake up properly, needs to allocate memory (see for example
> USB devices that need to allocate URBs). This may cause requests
> to swap_out which will block indefinitely if trying to swap out
> pages to an already sleeping device.
>
> I "work around" this in the PowerBook sleep code in a somewhat dumb way
> which works in 99% of cases but is probably broken as well if you
> are really near OOM. Basically, instead of calling only the "suspend"
> callbacks of devices, I have an additional "suspend requested"
> one that is sent to every driver using my specific PM scheme _before_
> starting the real round of "suspend" callbacks. Drivers that need
> a significant amount of backup memory (like some framebuffers) will
> do the necessary allocations from this early callback.

I think this can be solved in the suspend transition that I described:

- save_state
- suspend
- resume
- restore_state

The save_state() call is the notification that the device will be
suspended. It is in here that the driver allocates memory to save
state. But, no devices are actually put to sleep until the entire tree has
been walked to save state.

Then, we make a rule that says "Thou shall not allocate memory in
suspend() or resume()" and let them be damned if they do.
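
A minimal userspace sketch of that two-pass walk, for illustration only.
struct toy_node, toy_step and the toy_* functions are hypothetical names,
not the code in the patch, and the children-before-parent ordering is an
assumption made here. The point it shows is that every save_state step
completes before the first suspend step, and that a failed save leaves
every device still powered.

```c
#include <stddef.h>

/* Hypothetical device-tree node; not the struct device from the proposal. */
struct toy_node {
    struct toy_node *child;   /* first child */
    struct toy_node *sibling; /* next sibling */
    int fail_save;            /* test hook: make save_state fail here */
    int saved_at;             /* step at which state was saved, 0 = never */
    int suspended_at;         /* step at which node was suspended, 0 = never */
};

static int toy_step; /* global step counter, used only to record ordering */

/* Pass 1: save state, children before parent. Allocation is allowed here. */
static int walk_save(struct toy_node *n)
{
    for (struct toy_node *c = n->child; c != NULL; c = c->sibling) {
        int err = walk_save(c);
        if (err)
            return err;
    }
    if (n->fail_save)
        return -1;
    n->saved_at = ++toy_step;
    return 0;
}

/* Pass 2: power devices off. By the rule above, no allocation here. */
static void walk_suspend(struct toy_node *n)
{
    for (struct toy_node *c = n->child; c != NULL; c = c->sibling)
        walk_suspend(c);
    n->suspended_at = ++toy_step;
}

/* Returns 0 on success, -1 if any save failed (nothing is suspended then). */
int toy_system_suspend(struct toy_node *root)
{
    if (walk_save(root) != 0)
        return -1;
    /* A real system could write the snapshot to disk or network here. */
    walk_suspend(root);
    return 0;
}
```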

> Another issue with suspend and resume is with interrupt sharing and
> some bad devices that unconditionally assert their interrupt line
> when put to any PM state. On the contrary, some drivers, in order
> to properly block any new request in its queues and wait for any
> pending one to complete, may need to operate with interrupts still
> enabled. I discussed that a bit with Alan, and it seems that we really
> need 2 rounds of "suspend" callbacks in this case (at least for
> system suspend), one with interrupts still enabled, one with interrupts
> disabled.

I remember that discussion, and I think the above transition should fix
that as well - have save_state() and restore_state() operate with
interrupts enabled, while suspend() and resume() execute with interrupts
disabled.

> Finally, I have another need for which I'm not sure how to react
> with either the current scheme or the new scheme. On "desktop"
> Apple systems (at least all the recent G4 ones), the PCI bus will
> be effectively powered down during system sleep. That means that
> we must (at least that's what both MacOS and MacOS X do) prevent
> the complete system sleep when at least one PCI slot contains a
> card for which the driver can't properly restore the state after
> a complete shutdown. This frequently happens, for example, with
> video cards that rely on some initial chip & pll configuration
> to be done by the firmware. We may be able to fall back to some
> kind of "light suspend" where we suspend any device we can but
> not the motherboard, but that means that the "main" PM code has
> to know about the problem and needs some way to know if a given
> node in the device tree can or cannot be revived from a given
> power state (in this case, we might consider being powered down
> as equivalent to D3 state). My current solution is to not allow
> system sleep at all on those desktop machines.

Yes, I remember these discussions as well. Oh, and what a nightmare that
is. The bus layer needs to have logic to know what power state to enter
based on the power state of all its children; a PCI bridge cannot enter a
state lower than the lowest state of all its children. The PM layer should
then take this into account and react appropriately.
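
As a rough sketch of that rule, reading "lower state" as "deeper sleep"
(D3 deeper than D0), which is an interpretation and not something spelled
out above: the deepest state a bridge may enter is the shallowest state
among its children. The enum and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical D-state encoding: a larger value means a deeper sleep. */
enum toy_dstate { TOY_D0 = 0, TOY_D1, TOY_D2, TOY_D3 };

/*
 * Deepest state a bridge may enter given the states of its children.
 * With no children there is no constraint, so D3 is allowed.
 */
enum toy_dstate toy_bridge_max_state(const enum toy_dstate *children, size_t n)
{
    enum toy_dstate deepest_allowed = TOY_D3;

    for (size_t i = 0; i < n; i++) {
        if (children[i] < deepest_allowed)
            deepest_allowed = children[i];
    }
    return deepest_allowed;
}
```

A single child that can only reach D2 therefore holds the whole bridge at
D2, which is the information the PM layer would need before deciding on a
full system sleep.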

Ideally, we want some way to reinit all devices. Most should be possible,
with one glaring exception: video. In order to reinit video, we need:

- a framebuffer driver that knows the innards of all the cards it supports

or

- make something else do it, like X.

The latter seems most plausible, since it knows about most cards. And, for
initialisation, at least on x86, it can run the BIOS routines.

Of course, that does nothing for you on PPC, but I am hoping something
similar can be accomplished. Can X run the OFW routines in the video ROM?

-pat

2001-10-18 16:53:12

by Jonathan Lundell

Subject: Re: [RFC] New Driver Model for 2.5

At 11:08 AM -0500 10/18/01, Taral wrote:
>On Wed, Oct 17, 2001 at 04:52:29PM -0700, Patrick Mochel wrote:
>> When a suspend transition is triggered, the device tree is walked first to
>> save the state of all the devices in the system. Once this is complete, the
>> saved state, now residing in memory, can be written to some non-volatile
>> location, like a disk partition or network location.
>>
>> The device tree is then walked again to suspend all of the devices. This
>> guarantees that the device controlling the location to write the state is
>> still powered on while you have a snapshot of the system state.
>
>Aha! A much nicer solution to the problem the ACPI people are having
>with suspend/resume (ordering problems).

What happens to state changes between the first and second traversal
of the device tree?
--
/Jonathan Lundell.

2001-10-18 17:05:02

by Jonathan Corbet

Subject: Re: [RFC] New Driver Model for 2.5

> The (New) Linux Kernel Driver Model

It looks like a good start - a lot of things will be cleaner afterward.

A question...

In struct device_driver:

> probe:
> Check for device existence and associate driver with it.

What, exactly, does "associate driver" mean? Filling in the struct device
field, perhaps? Calling register_chrdev (or register_whatever)? Creation
of a ddfs entry? As a driver writer I can understand that the probe
routine should check for the existence of some device, and perhaps set up
an internal data structure. What else happens?

jon

Jonathan Corbet
Executive editor, LWN.net
[email protected]

2001-10-18 17:35:22

by Tim Jansen

Subject: Re: [RFC] New Driver Model for 2.5

On Thursday 18 October 2001 18:19, Patrick Mochel wrote:
> Hmm. So, this would be a device ID, much like the Vendor/Device ID pair in
> PCI space? Does this need to happen at the top layer, or can it work at
> the bus layer?

Probably both. See this discussion on device ids (from linux-hotplug-devel):

http://www.geocrawler.com/archives/3/9005/2001/9/0/6716219/
http://www.geocrawler.com/mail/thread.php3?subject=IDs+%28was+Re%3A+Hotplugging+for+the+input+subsystem%29&list=9005

bye...

2001-10-18 17:48:52

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5


> > probe:
> > Check for device existence and associate driver with it.
>
> What, exactly, does "associate driver" mean? Filling in the struct device
> field, perhaps? Calling register_chrdev (or register_whatever)? Creation
> of a ddfs entry? As a driver writer I can understand that the probe
> routine should check for the existence of some device, and perhaps set up
> an internal data structure. What else happens?

That's basically it. The bus should have already known about the existence
of the device, filled in the fields of struct device and registered it in
the global tree.

As Jeff Garzik suggested:

probe:
register interface
sanity check h/w to make sure it's there and alive
stop DMA/interrupts/etc., just in case
start timer to powerdown h/w in N seconds

in which interface would be your device node (char dev, devfs node, etc).


-pat



2001-10-18 17:53:32

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5




On Thu, 18 Oct 2001, Jonathan Lundell wrote:

> At 11:08 AM -0500 10/18/01, Taral wrote:
> >On Wed, Oct 17, 2001 at 04:52:29PM -0700, Patrick Mochel wrote:
> >> When a suspend transition is triggered, the device tree is walked first to
> >> save the state of all the devices in the system. Once this is complete, the
> >> saved state, now residing in memory, can be written to some non-volatile
> >> location, like a disk partition or network location.
> >>
> >> The device tree is then walked again to suspend all of the devices. This
> >> guarantees that the device controlling the location to write the state is
> >> still powered on while you have a snapshot of the system state.
> >
> >Aha! A much nicer solution to the problem the ACPI people are having
> >with suspend/resume (ordering problems).
>
> What happens to state changes between the first and second traversal
> of the device tree?

State changes of what?

After the first walk (save_state), you essentially have a snapshot of the
system in memory which can be written to disk, memory, etc.

Once that is done, you disable interrupts and walk the tree again to power
off devices.

-pat

2001-10-18 17:57:02

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5


> After the first walk (save_state), you essentially have a snapshot of the
> system in memory which can be written to disk, memory, etc.

Sorry: written to disk, network, etc. (since it's already in memory ;)

-pat


2001-10-18 18:28:11

by Jonathan Lundell

Subject: Re: [RFC] New Driver Model for 2.5

At 10:38 AM -0700 10/18/01, Patrick Mochel wrote:
>On Thu, 18 Oct 2001, Jonathan Lundell wrote:
>
>> At 11:08 AM -0500 10/18/01, Taral wrote:
>> >On Wed, Oct 17, 2001 at 04:52:29PM -0700, Patrick Mochel wrote:
>> >> When a suspend transition is triggered, the device tree is walked first to
>> >> save the state of all the devices in the system. Once this is complete, the
>> >> saved state, now residing in memory, can be written to some non-volatile
>> >> location, like a disk partition or network location.
>> >>...
>> What happens to state changes between the first and second traversal
>> of the device tree?
>
>State changes of what?
>
>After the first walk (save_state), you essentially have a snapshot of the
>system in memory which can be written to disk, memory, etc.
>
>Once that is done, you disable interrupts and walk the tree again to power
>off devices.

The "state of all the devices in the system". Presumably, while you
walk the tree the first time (to save state) interrupts are enabled,
and devices are active. Operations (including interrupts) on the
device can, presumably, change the state of the device after its
state has been saved.

To take a crude example, suppose you save the state of an Ethernet
NIC, then change its MAC address, and then suspend the device. The
saved state now has the wrong MAC address.

In this particular case, of course, the driver can keep a soft copy
of the current MAC address and restore from that, but that means
making special cases of special things.

Look at it another way. Why not save the state at the beginning of
time (say when the device is first initialized) instead of walking
the tree at suspend time? Presumably because there's some difference
between the state then, and the state at suspend time. How did that
difference happen, and why couldn't it happen after the save-state
tree-walk but before the actual device suspend?
--
/Jonathan Lundell.

2001-10-18 20:04:22

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5


> The "state of all the devices in the system". Presumably, while you
> walk the tree the first time (to save state) interrupts are enabled,
> and devices are active. Operations (including interrupts) on the
> device can, presumably, change the state of the device after its
> state has been saved.

Ya, I'm an idiot sometimes. I realized this just as I was leaving for
lunch. I almost turned around to come back and answer.

This is what I had in mind; If someone could give me a thumbs-up or
thumbs-down on whether or not this would work:

When the driver gets a save_state request, that is its notification that
it is going to sleep. It should then stop/finish all I/O requests. It
should then prevent itself from taking any more - by setting a flag or
whatever. Then, save device state.

From that point on, it should know not to take any requests, theoretically
preserving state.

When it gets the restore_state() call, it should first restore device
state. Once it does that, it knows that it can take I/O requests again.

That should work, right?

The only thing that that won't work for is the device to which we're
saving state, like the disk. At some point, though, we have to accept that
the state that we saved was some checkpoint in the past, and it won't
reflect the state that changed in the process of writing the system state.
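
A small userspace sketch of that quiesce-then-save idea, using the MAC
address example from earlier in the thread. struct toy_nic, the "frozen"
flag and the toy_* functions are all hypothetical; a real driver would
wait for outstanding I/O instead of just zeroing a counter.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical NIC model; the fields stand in for real hardware state. */
struct toy_nic {
    unsigned char mac[6];       /* live "hardware" state */
    unsigned char saved_mac[6]; /* copy taken by save_state */
    int  inflight;              /* outstanding I/O requests */
    bool frozen;                /* set between save_state and restore_state */
};

/* Any state-changing request is refused while the device is frozen. */
int toy_set_mac(struct toy_nic *n, const unsigned char mac[6])
{
    if (n->frozen)
        return -1;
    memcpy(n->mac, mac, 6);
    return 0;
}

/* save_state: finish I/O, refuse new requests, then snapshot the state. */
void toy_save_state(struct toy_nic *n)
{
    n->inflight = 0;   /* stands in for waiting until pending I/O completes */
    n->frozen = true;  /* from here on, no request may change device state */
    memcpy(n->saved_mac, n->mac, 6);
}

/* restore_state: put the state back first, then accept requests again. */
void toy_restore_state(struct toy_nic *n)
{
    memcpy(n->mac, n->saved_mac, 6);
    n->frozen = false;
}
```

Because the flag is set before the snapshot is taken, the saved copy
cannot go stale between the save_state walk and the suspend walk.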

-pat



2001-10-18 20:40:08

by Jeff Garzik

Subject: Re: [RFC] New Driver Model for 2.5

Patrick Mochel wrote:
> When the driver gets a save_state request, that is its notification that
> it is going to sleep. It should then stop/finish all I/O requests. It
> should then prevent itself from taking any more - by setting a flag or
> whatever. Then, save device state.
>
> From that point on, it should know not to take any requests, theoretically
> preserving state.
>
> When it gets the restore_state() call, it should first restore device
> state. Once it does that, it knows that it can take I/O requests again.
>
> That should work, right?

Seems reasonable. If a save_state is refused, I assume you
restore_state for all other devices and bring the system from a
half-working state [at the time the suspend was rejected] to a
full-working state?

Consider that it will take some amount of time to stop pending I/O
requests. You might want to walk the tree, and tell devices "start
saving", and then walk the tree again and say "finish saving."

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-18 21:32:12

by John Alvord

Subject: Re: [RFC] New Driver Model for 2.5

On Thu, 18 Oct 2001 12:49:01 -0700 (PDT), Patrick Mochel
<[email protected]> wrote:

>
>> The "state of all the devices in the system". Presumably, while you
>> walk the tree the first time (to save state) interrupts are enabled,
>> and devices are active. Operations (including interrupts) on the
>> device can, presumably, change the state of the device after its
>> state has been saved.
>
>Ya, I'm an idiot sometimes. I realized this just as I was leaving for
>lunch. I almost turned around to come back and answer.
>
>This is what I had in mind; If someone could give me a thumbs-up or
>thumbs-down on whether or not this would work:
>
>When the driver gets a save_state request, that is its notification that
>it is going to sleep. It should then stop/finish all I/O requests. It
>should then prevent itself from taking any more - by setting a flag or
>whatever. Then, save device state.
>
>From that point on, it should know not to take any requests, theoretically
>preserving state.
>
>When it gets the restore_state() call, it should first restore device
>state. Once it does that, it knows that it can take I/O requests again.
>
>That should work, right?
>
>The only thing that that won't work for is the device to which we're
>saving state, like the disk. At some point, though, we have to accept that
>the state that we saved was some checkpoint in the past, and it won't
>reflect the state that changed in the process of writing the system state.

Maybe each driver could pass back a value indicating

1) all done
2) N milliseconds more, please

and you could keep calling until every driver says all done. The
all-done drivers would ignore any new interrupts. The Not-Yet drivers
could get the last few interrupts they need to complete. Of course
there would need to be an overall timeout. That would leave most of
the responsibility with the drivers... who know most of the true
requirements.
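
Something like the following loop, sketched in userspace C with simulated
time. All names (struct toy_drv, toy_poll, toy_elapse, toy_quiesce_all)
are made up for the illustration, and "waiting" only advances a counter
instead of sleeping.

```c
#include <stddef.h>

/* Hypothetical driver model: how much longer it needs before it is done. */
struct toy_drv {
    int remaining_ms;
};

/* Ask a driver: returns 0 for "all done", else "N milliseconds more". */
static int toy_poll(const struct toy_drv *d)
{
    return d->remaining_ms;
}

/* Simulate the passage of time for one driver. */
static void toy_elapse(struct toy_drv *d, int ms)
{
    d->remaining_ms = d->remaining_ms > ms ? d->remaining_ms - ms : 0;
}

/*
 * Keep polling until every driver reports all done.
 * Returns the total simulated wait in ms, or -1 if the overall
 * timeout would be exceeded.
 */
int toy_quiesce_all(struct toy_drv *drv, size_t n, int timeout_ms)
{
    int waited = 0;

    for (;;) {
        int shortest = 0; /* smallest non-zero request in this round */

        for (size_t i = 0; i < n; i++) {
            int more = toy_poll(&drv[i]);
            if (more > 0 && (shortest == 0 || more < shortest))
                shortest = more;
        }
        if (shortest == 0)
            return waited;              /* every driver said all done */
        if (waited + shortest > timeout_ms)
            return -1;                  /* overall timeout */
        for (size_t i = 0; i < n; i++)
            toy_elapse(&drv[i], shortest);
        waited += shortest;
    }
}
```

Waiting only for the shortest outstanding request means drivers are
re-polled as soon as any of them could have finished.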

john alvord

2001-10-18 22:06:16

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>Hmm. So, this would be a device ID, much like the Vendor/Device ID pair in
>PCI space? Does this need to happen at the top layer, or can it work at
>the bus layer?

No, the idea is to have a unique identifier for a given device "instance".
A VendorID/DeviceID pair isn't unique as two PCI cards can perfectly well have
the same one.

Some devices, and I think this is mandatory in the ieee1394 spec, provide
a per-device unique ID, similar to an ethernet address or a serial number.

The goal here is to provide, in a common location for any kind of device,
the best approximation of a unique identifier that device can provide.

I believe the "bus" driver would set up a "default" one for a given device
(like a munge of pci slot/vendorid/deviceId for a PCI device that doesn't
provide anything better), and let the device driver override that with
something with better "uniqueness" if available.

However, I didn't explain myself correctly when writing about appending
that id to a "bus type". It's not actually a bus type I was thinking
about, but a type related to the uuid itself to avoid name space
collisions between uuid's of different device types. In the case of
ethernet hardware, the MAC address seems to be the best type of uuid
available, so it would be something like "ethaddr,xx:xx:xx:xx:xx:xx",
FireWire has a generic uuid allocation scheme as well, it could be
"ieee1394,xxxxxx...", etc...

The goal is to help userland, especially configuration tools, to
keep track of hardware. In the case of block devices like sbp2 disks,
it can help make sure that if a device is unplugged before being
properly unmounted, when plugged back, it can be identified as being
the same device.
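
To make the "type,id" format concrete, here is a small userspace sketch.
The function names and the choice to reject a comma in the type prefix are
assumptions made for the example; only the "ethaddr,xx:xx:xx:xx:xx:xx"
shape comes from the description above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build "<type>,<id>" into buf. Returns 0 on success, -1 if the type
 * contains a comma (the prefix would be ambiguous) or buf is too small.
 */
int toy_format_uuid(char *buf, size_t len, const char *type, const char *id)
{
    int n;

    if (strchr(type, ',') != NULL)
        return -1;
    n = snprintf(buf, len, "%s,%s", type, id);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Ethernet case: the MAC address is used as the unique part. */
int toy_uuid_from_mac(char *buf, size_t len, const unsigned char mac[6])
{
    char id[18]; /* "xx:xx:xx:xx:xx:xx" plus the terminating NUL */

    snprintf(id, sizeof id, "%02x:%02x:%02x:%02x:%02x:%02x",
             mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    return toy_format_uuid(buf, len, "ethaddr", id);
}
```

A bus driver could fill in a default with the same helper (for instance a
munge of slot and vendor/device IDs) and let the device driver overwrite
it when something more unique is available.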

>That shouldn't be too hard. ACPI wants to do something like that as well -
>they will be able to ascertain information about some devices that we
>otherwise wouldn't know, and will want to export that to userspace.
>
>The idea was to make a call to platform_notify() on each device
>registration, so the platform/firmware/arch could do things like that.

Ok, nice. Maybe provide some room (for a pointer) in the device
structure to be used by the platform.

Thinking a bit more about it, why not simply call the node's
parent with a kind of child_notify() callback? The default
behaviour would be to call the parent until it ends up at the
motherboard/arch level. That way, special bus types can add
bus-related information to devices more easily.
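
Roughly what I have in mind, as a userspace sketch. struct toy_bus, the
hook signature and toy_notify_up() are hypothetical and only show the
upward walk; a bus without a hook simply passes the notification on to
its parent.

```c
#include <stddef.h>

/* Hypothetical bus node; not the struct io_bus from the proposal. */
struct toy_bus {
    struct toy_bus *parent;
    /* Optional hook, called when a device is registered below this bus. */
    void (*child_notify)(struct toy_bus *self, const char *dev_name);
    int seen; /* number of notifications this bus has handled */
};

/* Example hook: just count the notification. */
void toy_count_hook(struct toy_bus *self, const char *dev_name)
{
    (void)dev_name;
    self->seen++;
}

/*
 * Walk from the device's bus up to the top-level (motherboard/arch) node,
 * calling every hook on the way. Returns the number of hooks invoked.
 */
int toy_notify_up(struct toy_bus *bus, const char *dev_name)
{
    int called = 0;

    for (; bus != NULL; bus = bus->parent) {
        if (bus->child_notify != NULL) {
            bus->child_notify(bus, dev_name);
            called++;
        }
    }
    return called;
}
```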

>I think this can be solved in the suspend transition that I described:
>
>- save_state
>- suspend
>- resume
>- restore_state
>
>The save_state() call is the notification that the device will be
>suspended. It is in here that the driver allocates memory to save
>state. But, no devices are actually put to sleep until the entire tree has
>been walked to save state.

Well, I remember when you first implemented this save state mechanism. I'm
not sure I like the save_state and restore_state semantics much. Since the
device can take additional requests after save_state (typically, it's a
block device and another driver is causing swap out from its save_state
routine), then your state gets changed. You are not really saving the device
state at this point, you are allocating room to save state.

>Then, we make a rule that says "Thou shall not allocate memory in
>suspend() or resume()" and let them be damned if they do.

Ok. That means that things like USB must make sure they pre-allocate
any URBs that may be needed. Sounds fine to me.

>I remember that discussion, and I think the above transition should fix
>that as well - have save_state() and restore_state() operate with
>interrupts enabled, while suspend() and resume() execute with interrupts
>disabled.

I don't agree there. As I said, since the device can still take requests
after save_state, it can't really save its state nor block its IO queues
if any. So that has to happen within suspend itself.

I've turned that problem in every possible direction ;) I think there are
really 3 required suspend steps, even if most drivers will only need to
really implement one or two.

>Yes, I remember these discussions as well. Oh, and what a nightmare that
>is. The bus layer needs to have logic to know what power state to enter
>based on the power state of all its children; a PCI bridge cannot enter a
>state lower than the lowest state of all its children. The PM layer should
>then take this into account and react appropriately.

Ok. So if we implement a toplevel "motherboard" node that is the father
of all PCI busses (and anything else), we can have the arch handle that.
If the loop over all its devices doesn't return D3 but something less, then
the motherboard won't be put to suspend. That's fine... for me ;) But
what if not being able to set a given PCI bus to D3 is not a real
issue ? (for example, the motherboard can be told not to shut down that
specific PCI bus). Well, I believe in this case, the motherboard can
have more intimate knowledge of its children and which ones are
really "mandatory" for sleep or not.
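
A minimal sketch of the rule discussed above (a parent may go no deeper
than its shallowest child), with a per-child flag for the "mandatory"
idea. The enum, struct, and function names are made up for illustration
and are not taken from the patch.

```c
#include <stddef.h>

/* PCI-style device power states: D0 is fully on, D3 is the deepest sleep. */
enum pm_state { PM_D0 = 0, PM_D1, PM_D2, PM_D3 };

/* Hypothetical child record, for illustration only. */
struct pm_child {
	enum pm_state deepest;  /* deepest state this child can enter */
	int mandatory;          /* nonzero if the parent must honour this child */
};

/*
 * Return the deepest state the parent may enter: the shallowest
 * "deepest" value among its mandatory children, or D3 if none object.
 */
enum pm_state parent_deepest_state(const struct pm_child *kids, size_t n)
{
	enum pm_state limit = PM_D3;
	size_t i;

	for (i = 0; i < n; i++) {
		if (kids[i].mandatory && kids[i].deepest < limit)
			limit = kids[i].deepest;
	}
	return limit;
}
```

A child marked non-mandatory is simply ignored, which corresponds to the
motherboard deciding it can leave that bus powered while the rest sleeps.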

>Ideally, we want some way to reinit all devices. Most should be possible,
>with one glaring exception: video. In order to reinit video, we need:

Are you sure we know how to reinit all sorts of SCSI cards out there
as well ?

>- a framebuffer driver that knows the innards of all the cards it supports
>
>or
>
>- make something else do it, like X.
>
>The latter seems most plausible, since it knows about most cards. And, for
>initialisation, at least on x86, it can run the BIOS routines.

Well, both may work. It's a matter of motherboard policy I believe.
We might add a special case to the fbdev layer to be told by X "hey,
don't bother about sleep, I'll handle it", but X is only allowed to
touch hardware when frontmost, and this would require more linux-specific
cruft in X which would be difficult to get accepted.

I believe the way to get back the video card will have to be dealt with on
a per card basis anyway. If we have the infrastructure for drivers to
say "I can't deal with shutdown", it's enough for now. I'm looking
into some ways, on PPC, to re-run the card's firmware with a small
forth interpreter ;) (reminds you of ACPI ? heh ;) For now, it's
not an important issue as putting desktop machines to sleep is not
as important as putting laptops to sleep, and fortunately, so far,
Apple laptops don't power off the AGP slot during sleep (we use D2).


>Of course, that does nothing for you on PPC, but I am hoping something
>similar can be accomplished. Can X run the OFW routines in the video ROM?

It can't (well, it could probably run the BIOS of an x86 card), but
there might be other ways. Re-initing the card completely after figuring
out all register values for a given model may be a working solution for
macs as they almost all use a limited range of ATI hardware.
Emulating OF might be a solution as well. One last would be a wrapper
to run Apple MacOS drivers (which are in card's ROMs most of the time,
this is more or less already what Apple does with MacOS X).

Ben.



2001-10-18 22:24:27

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

>Maybe each driver could pass back a value indicating
>
>1) all done
>2) N milliseconds more, please
>
>and you could keep calling until every driver says all done. The
>all-done drivers would ignore any new interrupts. The Not-Yet drivers
>could get the last few interrupts they need to complete. Of course
>there would need to be an overall timeout. That would leave most of
>the responsibility with the drivers... who know most of the true
>requirements.

Hrm... The interesting thing with this scheme is that it allows
you to first block your queue, then let other drivers do the same
while your async IO completes, and then come back. Well... this
could be an option to step "2" of my earlier proposal.
This requires the device structure to keep track of which driver
still wants to be called. It would only go to step 3 once all
drivers have ack'ed step 2.
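
Roughly, the "all done / N milliseconds more" loop with an overall
timeout could look like the sketch below. Everything here is
hypothetical: the hook type, the entry struct, and the caller-supplied
delay function are stand-ins, not an existing kernel interface.

```c
#include <stddef.h>

/* Hypothetical per-driver hook: returns 0 (or less) for "all done", or a
 * positive number of milliseconds the driver wants before being asked again. */
typedef int (*quiesce_fn)(void *ctx);

struct quiesce_entry {
	quiesce_fn poll;
	void *ctx;
	int done;   /* set once the driver has acked; it is not called again */
};

/*
 * Keep polling the not-yet-done drivers until all of them ack or the
 * overall budget is used up. wait_ms is a caller-supplied delay function.
 * Returns 0 when every driver is done, -1 on timeout.
 */
int quiesce_all(struct quiesce_entry *e, size_t n, int budget_ms,
		void (*wait_ms)(int ms))
{
	for (;;) {
		int pending = 0, wait = 0;
		size_t i;

		for (i = 0; i < n; i++) {
			int r;

			if (e[i].done)
				continue;
			r = e[i].poll(e[i].ctx);
			if (r <= 0) {
				e[i].done = 1;
			} else {
				pending++;
				if (r > wait)
					wait = r;   /* wait for the slowest driver */
			}
		}
		if (!pending)
			return 0;
		if (wait > budget_ms)
			return -1;          /* overall timeout */
		budget_ms -= wait;
		wait_ms(wait);
	}
}

/* Example hook for experimenting: the driver needs *ctx more rounds,
 * asking for 10 ms each time. */
int countdown_poll(void *ctx)
{
	int *left = ctx;

	if (*left == 0)
		return 0;
	(*left)--;
	return 10;
}

/* Delay stub that does not actually sleep. */
void no_wait(int ms)
{
	(void)ms;
}
```

The `done` flag is the bookkeeping mentioned above: once a driver has
acked, it is skipped on later rounds, and only when nothing is pending
would the core move on to the next step.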

Ben.


2001-10-18 22:18:58

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

>
>
>When the driver gets a save_state request, that is its notification that
>it is going to sleep. It should then stop/finish all I/O requests. It
>should then prevent itself from taking any more - by setting a flag or
>whatever. Then, device save state.

Ok, that's what I call more or less "blocking IO queues"..

>From that point in, it should know not to take any requests, theoretically
>preserving state.

But that's my problem :) Imagine another driver next in the loop needs
memory and causes a swap-out to happen. Problem: you have just blocked the
IO queue of your swap device.

>When it gets the restore_state() call, it should first restore device
>state. Once it does that, it knows that it can take I/O requests again.
>
>That should work, right?
>
>The only thing that that won't work for is the device to which we're
>saving state, like the disk. At some point, though we have to accept that
>the state that we saved was some checkpoint in the past, and it won't
>reflect the state that changed in the process of writing the system state.

That's why I prefer more explicit semantics:

- Prepare sleep: Allocate enough memory to save state. For most
devices, it will be a fixed quantity. In the case of devices that need
per-request allocation, like USB or FireWire, just allocate a limited
pool. That means that you will eventually cause serialisation to
happen when not needed and hurt perfs, but nobody will care at this
point ;)

- Suspend activity: There you lock your IO queues, set your busy flag
or whatever, and wait for any pending IO to be completed. Interrupts
are enabled, scheduling as well (and other CPUs). Each driver is
responsible to properly block a process issuing a request (which should
not be a problem to implement for most of them, a single semaphore
is enough for simple drivers, drivers with IO queues just need to
leave requests in the queues, etc...)

- Set power state: Here you shut your device down for real. Interrupts
are disabled. Only one CPU is still active (the others can be put in
whatever state your arch allows, like a sleep loop or whatever...).

The actual state save can be in step 2 or 3, we don't really care,
it depends mostly on what is more convenient for the driver writer.

The resume process only needs 2 states imho:

- Set power state: You power back on your device, re-configure the
hardware properly, and make sure it won't send spurious interrupts.
System interrupts are disabled. One CPU is running.

- Resume activity: System interrupts are back, scheduling too, you
start handling pending requests.
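
To make the phases concrete, here is a small sketch of how a core could
walk a flat list of devices through the three suspend steps and the two
resume steps. The struct and function names are invented for this
example, error unwinding is left out, and walking the list in reverse on
resume is my assumption, not something settled in this thread.

```c
#include <stddef.h>

/* Hypothetical callback table. Any callback may be NULL if the driver
 * has nothing to do in that phase. */
struct pm_ops {
	int  (*prepare_sleep)(void *dev);     /* step 1: allocate memory */
	int  (*suspend_activity)(void *dev);  /* step 2: block queues, drain IO */
	void (*power_down)(void *dev);        /* step 3: interrupts disabled */
	void (*power_up)(void *dev);          /* resume 1: interrupts disabled */
	void (*resume_activity)(void *dev);   /* resume 2: interrupts enabled */
};

struct pm_dev {
	const struct pm_ops *ops;
	void *dev;
};

/* Run each phase over every device before starting the next phase. */
int pm_suspend_all(struct pm_dev *d, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (d[i].ops->prepare_sleep && d[i].ops->prepare_sleep(d[i].dev))
			return -1;
	for (i = 0; i < n; i++)
		if (d[i].ops->suspend_activity && d[i].ops->suspend_activity(d[i].dev))
			return -1;
	/* A real core would disable interrupts and park the other CPUs here. */
	for (i = 0; i < n; i++)
		if (d[i].ops->power_down)
			d[i].ops->power_down(d[i].dev);
	return 0;
}

void pm_resume_all(struct pm_dev *d, size_t n)
{
	size_t i;

	for (i = n; i-- > 0; )
		if (d[i].ops->power_up)
			d[i].ops->power_up(d[i].dev);
	/* A real core would re-enable interrupts and scheduling here. */
	for (i = n; i-- > 0; )
		if (d[i].ops->resume_activity)
			d[i].ops->resume_activity(d[i].dev);
}

/* Example driver that records the last phase it saw and needs no
 * allocation, so it leaves step 1 as NULL. */
struct demo_dev { int phase; };

static int  demo_suspend_activity(void *p) { ((struct demo_dev *)p)->phase = 2; return 0; }
static void demo_power_down(void *p)       { ((struct demo_dev *)p)->phase = 3; }
static void demo_power_up(void *p)         { ((struct demo_dev *)p)->phase = 4; }
static void demo_resume_activity(void *p)  { ((struct demo_dev *)p)->phase = 5; }

const struct pm_ops demo_ops = {
	.prepare_sleep    = NULL,
	.suspend_activity = demo_suspend_activity,
	.power_down       = demo_power_down,
	.power_up         = demo_power_up,
	.resume_activity  = demo_resume_activity,
};
```

The point of running a whole phase across all devices before the next
one is that no driver has blocked its queues while another one may
still be allocating.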


Regards,
Ben.


2001-10-18 22:25:28

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

John Alvord wrote:
>
> On Thu, 18 Oct 2001 12:49:01 -0700 (PDT), Patrick Mochel
> <[email protected]> wrote:
>
> >
> >> The "state of all the devices in the system". Presumably, while you
> >> walk the tree the first time (to save state) interrupts are enabled,
> >> and devices are active. Operations (including interrupts) on the
> >> device can, presumably, change the state of the device after its
> >> state has been saved.
> >
> >Ya, I'm an idiot sometimes. I realized this just as I was leaving for
> >lunch. I almost turned around to come back and answer..
> >
> >This is what I had in mind; If someone could give me a thumbs-up or
> >thumbs-down on whether or not this would work:
> >
> >When the driver gets a save_state request, that is its notification that
> >it is going to sleep. It should then stop/finish all I/O requests. It
> >should then prevent itself from taking any more - by setting a flag or
> >whatever. Then, device save state.
> >
> >From that point in, it should know not to take any requests, theoretically
> >preserving state.
> >
> >When it gets the restore_state() call, it should first restore device
> >state. Once it does that, it knows that it can take I/O requests again.
> >
> >That should work, right?
> >
> >The only thing that that won't work for is the device to which we're
> >saving state, like the disk. At some point, though we have to accept that
> >the state that we saved was some checkpoint in the past, and it won't
> >reflect the state that changed in the process of writing the system state.
>
> Maybe each driver could pass back a value indicating
>
> 1) all done
> 2) N milliseconds more, please

It seems far less complex to simply let the driver do what it needs to
do, in the time it needs to do it. The probe step in current drivers
for example could take anywhere from less than a second to several
seconds, depending on what needs to be done.

Though like I mentioned in a previous mail, if you have a two-stage
save-state step, a lot of those delays can be parallelized: in the
first save-state the driver stops the hardware from accepting further
transactions, and initiates I/O request completion (where possible). The
second save-state cleans up any outstanding transactions and shuts the
rest of the hardware down.

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-18 23:29:56

by Patrick Mochel

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5


> That's why I prefer more explicit semantics:
>
> - Prepare sleep: Allocate enough memory to save state. For most
> devices, it will be a fixed quantity. In the case of devices that need
> per-request allocation, like USB or FireWire, just allocate a limited
> pool. That means that you will eventually cause serialisation to
> happen when not needed and hurt perfs, but nobody will care at this
> point ;)
>
> - Suspend activity: There you lock your IO queues, set your busy flag
> or whatever, and wait for any pending IO to be completed. Interrupts
> are enabled, scheduling as well (and other CPUs). Each driver is
> responsible to properly block a process issuing a request (which should
> not be a problem to implement for most of them, a single semaphore
> is enough for simple drivers, drivers with IO queues just need to
> leave requests in the queues, etc...)
>
> - Set power state: Here you shut your device down for real. Interrupts
> are disabled. Only one CPU is still active (the others can be put in
> whatever state your arch allows, like a sleep loop or whatever...).

Ok, so we need another walk before we go to sleep.

But, first a question - does the swap device need to absolutely be the
last thing to stop taking requests? Or, can it stop after everything is
done allocating memory?

> The actual state save can be in step 2 or 3, we don't really care,
> it depends mostly on what is more convenient for the driver writer.

For most devices, it seems it could happen in the first, as well. They
should be fine with stopping I/O requests early on. It's only special
cases like swap and maybe one or two others that need an extra step,
right?

-pat

2001-10-18 23:44:46

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

>
>Ok, so we need another walk before we go to sleep.
>
>But, first a question - does the swap device need to absolutely be the
>last thing to stop taking requests? Or, can it stop after everything is
>done allocating memory?

The problem with VM is that you don't really have one swap device.

You can have swap on files from several devices, you can have mmap'ed
files from any mounted filesystem on any block device, you can have
NFS, etc...

That's why we must completely separate allocation from blocking of
activity. If we do so, we don't need to care about any ordering rule
between drivers (at least not because of this problem, other issues
may require ordering rules, but it's an arch matter).

>> The actual state save can be in step 2 or 3, we don't really care,
>> it depends mostly on what is more convenient for the driver writer.
>
>For most devices, it seems it could happen in the first, as well. They
>should be fine with stopping I/O requests early on. It's only special
>cases like swap and maybe one or two others that need an extra step,
>right?

Well, you may think it's ok to do it, let's say, for a serial port, in
step 1. But... what about NFS over PPP over that serial port ? :)

If a device doesn't need to allocate memory and can do the save_state
and shutdown in one step, then it only needs to respond to step 2. It
will skip step 1 and step 3.

Ben.


2001-10-18 23:52:17

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Benjamin Herrenschmidt wrote:
> Well, you may think it's ok to do it, let's say, for a serial port, in
> step 1. But... what about NFS over PPP over that serial port ? :)

In fact, I have done that to connect a former roommate's Amiga to my
own. He accessed my files across NFS using SLIP and a null modem
cable... :)

--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-19 00:19:47

by kaih

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

[email protected] (Patrick Mochel) wrote on 18.10.01 in <Pine.LNX.4.21.0110180826240.16868-100000@marty.infinity.powertie.org>:

> > I would add to the generic structure device, a "uuid" string field.
> > This field would contain a "munged" unique identifier composed of
> > the bus type followed by whatever bus-specific unique ID is
> > provided by the driver. If the driver doesn't provide one, it defaults
> > to a copy of the busID.
> >
> > What I have in mind here is to have a common place to look for the
> > best possible unique identification for a device. Typical example are
> > ieee1394 hard disks which do have a unique ID, and so can be properly
> > tracked between insertion and removal.
>
> Hmm. So, this would be a device ID, much like the Vendor/Device ID pair in
> PCI space?

Except for the fact that the Vendor/Device ID pair is a device *class*
identifier, and the uuid is a device *instance* identifier.


MfG Kai

Subject: Re: [RFC] New Driver Model for 2.5

Benjamin Herrenschmidt <[email protected]> writes:

>>> struct device {
>>> struct list_head bus_list;
>>> struct io_bus *parent;
>>> struct io_bus *subordinate;
>>>
>>> char name[DEVICE_NAME_SIZE];
>>> char bus_id[BUS_ID_SIZE];
>>>
>>> struct dentry *dentry;
>>> struct list_head files;
>>>
>>> struct semaphore lock;
>>>
>>> struct device_driver *driver;
>>> void *driver_data;
>>> void *platform_data;
>>>
>>> u32 current_state;
>>> unsigned char *saved_state;
>>> };

>Hi Patrick ! Nice to see this happening ;)

>I would add to the generic structure device, a "uuid" string field.

And a version field! Please add a version field right to the
beginning. This would make supporting legacy drivers in later versions
_much_ easier.

Ciao
Henning

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2001-10-19 08:08:33

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

"Henning P. Schmiedehausen" wrote:
> And a version field! Please add a version field right to the
> beginning. This would make supporting legacy drivers in later versions
> _much_ easier.

That's something to be done in a Windows driver, not a Linux one.

This is not a structure that is directly exposed to userspace, so it
doesn't need to be versioned.

We don't really support legacy drivers in the first place, much less
take up space with versions in structs everywhere..

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-19 08:31:46

by Keith Owens

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Fri, 19 Oct 2001 04:09:05 -0400,
Jeff Garzik <[email protected]> wrote:
>"Henning P. Schmiedehausen" wrote:
>> And a version field! Please add a version field right to the
>> beginning. This would make supporting legacy drivers in later versions
>> _much_ easier.
>
>This is not a structure that is directly exposed to userspace, so it
>doesn't need to be versioned.

Will you want modutils support for this new struct? If so it needs
a version field.

2001-10-19 08:43:00

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Keith Owens wrote:
> Will you want modutils support for this new struct? If so it needs
> a version field.

For struct device? Um, no, we don't need modutils support for it.

Jeff


--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-19 15:19:58

by Taral

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Thu, Oct 18, 2001 at 02:13:18PM +0200, Benjamin Herrenschmidt wrote:
> I would add to the generic structure device, a "uuid" string field.
> This field would contain a "munged" unique identifier composed of
> the bus type followed by whatever bus-specific unique ID is
> provided by the driver. If the driver doesn't provide one, it defaults
> to a copy of the busID.
>
> What I have in mind here is to have a common place to look for the
> best possible unique identification for a device. Typical example are
> ieee1394 hard disks which do have a unique ID, and so can be properly
> tracked between insertion and removal.

Actually, if this field were to be added, I think it would be far better
to have it be NULL in the case where there is no ID which can be
expected to remain the same on insert/remove. Otherwise we might have
people getting very confused when someone removes device A and adds
device B and they end up with the same "unique id" because neither one
has a real unique id.
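
Something like the following sketch would do it: build the "bus type
plus bus-specific ID" string when the bus driver has a stable ID, and
hand back NULL otherwise instead of falling back to the bus ID. The
function name and its arguments are hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Build "<bus>,<id>" into buf and return buf, or return NULL when the
 * bus driver supplied no identifier that survives unplug/replug.
 */
const char *make_uuid(char *buf, size_t len, const char *bus,
		      const char *stable_id)
{
	int n;

	if (!stable_id || !*stable_id)
		return NULL;    /* no real unique ID: do not fake one */
	n = snprintf(buf, len, "%s,%s", bus, stable_id);
	if (n < 0 || (size_t)n >= len)
		return NULL;    /* would not fit; treat as "no usable ID" */
	return buf;
}
```

Callers then have to check for NULL, which makes "this device cannot be
tracked across removal" an explicit case rather than a silent collision.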

--
Taral <[email protected]>
This message is digitally signed. Please PGP encrypt mail to me.
"Any technology, no matter how primitive, is magic to those who don't
understand it." -- Florence Ambrose

2001-10-19 17:02:32

by Kevin Easton

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Hi,

Am I correct in thinking that the current "state of play" after these recent
discussions is a 3 step suspend process, following an algorithm similar to:

    if (suspend_prepare(device_list) == failed) {
        suspend_cancel(device_list);
        return failed;
    }

    if (suspend_save_state(device_list) == failed) {
        suspend_cancel(device_list);
        return failed;
    }

    write_out_state();
    suspend_now(device_list);

Where these operations on the drivers are defined as:

suspend_prepare:
Allocate any memory needed for saving of state, suspending & resuming
device. LAST CHANCE TO ALLOCATE MEMORY.

suspend_save_state:
Stop accepting requests.
Save state of device.

suspend_now:
Turn off device.

suspend_cancel:
Free any memory that may have been allocated for saving of state.
Resume normal operation.

...and write_out_state() somehow stores the saved (in memory) state of the
devices to nonvolatile storage.
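
For reference, a compilable version of the sequence above might look
like this. It is only a sketch: the struct names, the flat device array,
and the toy driver at the end are invented for illustration, and cancel
is assumed to be safe to call on a device that never got as far as
saving state.

```c
#include <stddef.h>

/* Hypothetical per-device hooks for the four operations named above.
 * The int-returning hooks return 0 on success, nonzero on failure. */
struct susp_ops {
	int  (*prepare)(void *dev);
	int  (*save_state)(void *dev);
	void (*now)(void *dev);
	void (*cancel)(void *dev);
};

struct susp_dev {
	const struct susp_ops *ops;
	void *dev;
};

/* Cancel is applied to the whole list, as in the pseudocode above. */
static void cancel_all(struct susp_dev *d, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		d[i].ops->cancel(d[i].dev);
}

int suspend_system(struct susp_dev *d, size_t n, void (*write_out_state)(void))
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (d[i].ops->prepare(d[i].dev)) {
			cancel_all(d, n);
			return -1;
		}
	}
	for (i = 0; i < n; i++) {
		if (d[i].ops->save_state(d[i].dev)) {
			cancel_all(d, n);
			return -1;
		}
	}
	if (write_out_state)
		write_out_state();
	for (i = 0; i < n; i++)
		d[i].ops->now(d[i].dev);
	return 0;
}

/* Toy driver used to exercise the sequence. */
struct toy_dev { int fail_save; int has_buffer; int accepting; int powered; };

static int toy_prepare(void *p)
{
	((struct toy_dev *)p)->has_buffer = 1;  /* stands in for an allocation */
	return 0;
}

static int toy_save_state(void *p)
{
	struct toy_dev *t = p;

	if (t->fail_save)
		return -1;
	t->accepting = 0;                       /* stop accepting requests */
	return 0;
}

static void toy_now(void *p)
{
	((struct toy_dev *)p)->powered = 0;
}

static void toy_cancel(void *p)
{
	struct toy_dev *t = p;

	t->has_buffer = 0;                      /* free the saved-state buffer */
	t->accepting = 1;                       /* resume normal operation */
}

const struct susp_ops toy_ops = { toy_prepare, toy_save_state, toy_now, toy_cancel };
```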

If this is approximately the right idea, then how will write_out_state work if
the device(s) that this operation uses aren't accepting requests anymore
(because they've done suspend_save_state)? Is it that "Stop accepting
requests" is actually "Stop accepting requests that will cause a change in the
device state"? In that case, devices that can have the state written out to
them will be limited to those where the act of writing it out will never cause
such a request, right?

- Kevin.

2001-10-19 17:13:14

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

At 4:19 PM -0400 10/18/01, Jeff Garzik wrote:
>> In this particular case, of course, the driver can keep a soft copy
>> of the current MAC address and restore from that, but that means
>> making special cases of special things.
>
>For that specific case, NIC drivers should read a copy of the MAC
>address at probe time, and store it in dev->dev_addr. Each power-up+if
>open cycle, the MAC address is programmed onto the NIC. So that is not
>a special case but the normal case, keeping a soft copy of the MAC.
>
>Just being picky :)

No, you're right, and it's especially true of NIC drivers. Partly, I
assume, because it's SOP in NIC drivers to routinely reinitialize the
hardware after various errors. And Ethernet makes it easy,
because we're allowed to silently discard packets.

My own inclination would be to always keep enough information in
driver structures to reinitialize the device, though I'd hesitate to
assert that this is always possible, or practical.

WRT the suspend/resume sequence, I'd like to see the process
extensible. So, for example, a single suspend entry point with an
argument specifying the current action. The stuff I'm working on
requires a kind of "suspend with extreme prejudice" in which the
driver can't decline to suspend, as well as a suspend that comes
*after* a bus (therefore device) reset (which would explain my urge
to keep device state in soft structures). This is generally simple
enough for Ethernet drivers, but a little trickier for other devices.

BTW (and excuse me for not searching this out, if it's available), is
2.5 intended to have a real device tree? There's a related issue for
suspend/resume, namely the hierarchical relationship of some devices
(eg md->sd->adapter or whatever).
--
/Jonathan Lundell.

2001-10-19 17:33:06

by kaih

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

[email protected] (Benjamin Herrenschmidt) wrote on 19.10.01 in <[email protected]>:

> collisions between uuid's of different devices types. In the case of
> ethernet hardware, the MAC address seems to be the best type of uuid
> available, so it would be something like "ethaddr,xx:xx:xx:xx:xx:xx",
> FireWire has a generic uuid allocation scheme as well, it could be
> "ieee1394,xxxxxx...", etc...

I have no idea what Firewire uses, but there are two generic kinds of
numbers that the IEEE allocates (actually, they're two different views on
a single id space).

Those are the MAC-48 address used by ethernet, fddi, and various other
protocols, and the EUI-64 used by more modern designs (and referenced by
IPv6; in fact, there's an algorithm that lets you create an EUI-64 from a
MAC-48 via bit stuffing).

Both of these depend on a 24 bit id called company_id or OUI which you can
buy from the IEEE for US$1.250,00 (for 16 million MAC-48's or 1 trillion
EUI-64's).

The list of public OUIs is at <URL:http://standards.ieee.org/regauth/oui/oui.txt>
(there are "unlisted numbers" in that namespace, too).

Ah, I see IEEE 1394 *does* use OUIs. Not at all surprising, of course.

So, the namespace should be used, not the application. In fact, given the
standard conversion from MAC-48 to EUI-64, we should probably just use one
namespace for both: my current ethernet card 00:50:FC:0C:63:69 would thus
be named "eui-64,00:50:fc:ff:ff:0c:63:69".
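
A small sketch of that conversion, following the example above: the
3-byte OUI is kept, ff:ff is inserted, and the remaining three bytes
follow. The function names are made up. Note that IPv6 interface
identifiers insert ff:fe instead, so this mapping is specific to the
naming scheme suggested here.

```c
#include <stdio.h>
#include <stddef.h>

/* Expand a 6-byte MAC-48 into an 8-byte EUI-64 by inserting ff:ff
 * after the 3-byte OUI. */
void mac48_to_eui64(const unsigned char mac[6], unsigned char eui[8])
{
	eui[0] = mac[0];
	eui[1] = mac[1];
	eui[2] = mac[2];
	eui[3] = 0xff;
	eui[4] = 0xff;
	eui[5] = mac[3];
	eui[6] = mac[4];
	eui[7] = mac[5];
}

/* Format as "eui-64,xx:xx:..."; returns buf, or NULL if it does not fit. */
const char *eui64_name(const unsigned char eui[8], char *buf, size_t len)
{
	int n = snprintf(buf, len,
			 "eui-64,%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x",
			 eui[0], eui[1], eui[2], eui[3],
			 eui[4], eui[5], eui[6], eui[7]);

	if (n < 0 || (size_t)n >= len)
		return NULL;
	return buf;
}
```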

More than you ever wanted to know about this stuff:
<URL:http://standards.ieee.org/regauth/oui/>.

Of course, there *are* other namespaces.

MfG Kai

2001-10-19 18:26:13

by Patrick Mochel

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5


> > Hmm. So, this would be a device ID, much like the Vendor/Device ID pair in
> > PCI space?
>
> Except for the fact that the Vendor/Device ID pair is a device *class*
> identifier, and the uuid is a device *instance* identifier.

Actually, the Vendor/Device ID pair is a unique identifier for the device
model. There are a Base Class and Subclass IDs, as well as a subsystem
vendor ID.

There are equivalents in USB. But, neither of them are globally unique
identifiers for the device. That doesn't necessarily mean that one
couldn't be ascertained from the device; ethernet cards do have MAC
addresses. But, I don't think that many will have an ID/serial number.

And this leads to inconsistency. You'll have PCI devices that have a
Vendor/Device/Class ID, and some that have a device-specific ID. Then
you'll have USB devices with the same. And what about legacy devices?

Which leads me to the question: what real benefit does this have? Why
would you ever want to do a global search in kernel space for a particular
device? The bus structure can keep (and likely already does keep) this
information. It can export it to userland on its own; the top layer
doesn't need to do that.

Yes, the formats of each file will be different, but they would be anyway.
The names may be different for different buses, but we can encourage the
bus layers to all export a file of the same name ("ID") if we want.

I don't think the UUID belongs in the top level structure. It belongs in
whatever structure dictates it - the bus structure or the class structure.

-pat


2001-10-19 18:40:04

by Patrick Mochel

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5


> Am I correct in thinking that the current "state of play" after these recent
> discussions is a 3 step suspend process, following an algorithm similar to:

Yes. After some discussion, I think we need a 3-step process. I will be
updating the docs today.

> If this is approximately the right idea, then how will write_out_state work if
> the device(s) that this operation uses aren't accepting requests anymore
> (because they've done suspend_save_state)? Is it that "Stop accepting
> requests" is actually "Stop accepting requests that will cause a change in the
> device state"? In that case, devices that can have the state written out to
> them will be limited to those where the act of writing it out will never cause
> such a request, right?

That's an interesting question, and one that depends on the answer to
several questions.

The mechanism for going to sleep is dependent first on the architecture
and secondly on the power management scheme. It is up to the scheme to work
out the finer details concerning it.

(That's not a copout; we're just not likely to have a generic suspend
routine. Even if every implementation is using the same mechanism, I don't
know if it could ever be consolidated into one singular body of code.)

So then, how do we do suspend to disk? All the progress in that area has
been made by swsusp. I don't know the finer details of how it works, so
I'm not about to comment on how to make it work or modify it to better fit
our needs. Maybe someone from that camp could comment on whether or not
the 3-stage model would completely screw them or not? Or, how to make it
work under this model? Or if it even matters?

-pat




2001-10-19 18:48:14

by Tim Jansen

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Friday 19 October 2001 09:57, Henning P. Schmiedehausen wrote:
> Benjamin Herrenschmidt <[email protected]> writes:
> >>> struct device {
> And a version field! Please add a version field right to the
> beginning. This would make supporting legacy drivers in later versions
> _much_ easier.

IMHO it would be a good idea to create and fill those structs using a
function, instead of letting the driver code create the struct or using
versions.

In the current version of the patch struct device is allocated using

struct device *device_alloc_dev(void);

and then later registered using

int device_register_dev(struct device *dev);


In other words the fields are set by the bus driver. The problem is that when
somebody adds a new, required field then existing code will silently break.
So I would propose to think about using something like

struct device *device_create_dev(const char *name,
                                 const char *bus_id,
                                 struct device_driver *driver,
                                 void *driver_data,
                                 void *platform_data,
                                 u32 current_state);

The advantage is that when you add a new field the old code won't compile
before it has been fixed. It also allows you to do large changes in the
underlying code without breaking source compatibility.
The disadvantage is that you cannot add a field that should be specified by
the caller without either adding a new function or destroying source
compatibility.
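
A userspace sketch of such a constructor, to show the idea. The struct
below is a cut-down stand-in for the struct device quoted earlier in the
thread, the two size macros are placeholder values, and calloc() stands
in for whatever allocator the real code would use.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t u32;

/* Placeholder sizes; the real values are whatever the patch defines. */
#define DEVICE_NAME_SIZE 80
#define BUS_ID_SIZE      16

struct device_driver;   /* opaque in this sketch */

struct device {
	char name[DEVICE_NAME_SIZE];
	char bus_id[BUS_ID_SIZE];
	struct device_driver *driver;
	void *driver_data;
	void *platform_data;
	u32 current_state;
};

/*
 * Allocate and fill a device in one call. If a new required field is
 * added later, it becomes a new parameter, and every caller that has
 * not been updated fails to compile instead of breaking silently.
 */
struct device *device_create_dev(const char *name,
				 const char *bus_id,
				 struct device_driver *driver,
				 void *driver_data,
				 void *platform_data,
				 u32 current_state)
{
	struct device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	/* snprintf truncates over-long strings and always NUL-terminates. */
	snprintf(dev->name, sizeof(dev->name), "%s", name);
	snprintf(dev->bus_id, sizeof(dev->bus_id), "%s", bus_id);
	dev->driver = driver;
	dev->driver_data = driver_data;
	dev->platform_data = platform_data;
	dev->current_state = current_state;
	return dev;
}
```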

bye...

2001-10-19 18:59:24

by Tim Jansen

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Friday 19 October 2001 20:26, Patrick Mochel wrote:
> There are equivalents in USB. But, neither of them are globally unique
> identifiers for the device. That doesn't necessarily mean that one
> couldn't be ascertained from the device; ethernet cards do have MAC
> addresses. But, I don't think that many will have an ID/serial number.
> [...]
> Which leads me to the question: what real benefit does this have? Why
> would you ever want to do a global search in kernel space for a particular
> device?

For example for harddisks. You usually want them to be mounted in the same
directory. Or if you have several printers of the same type connected to your
computer you need a way of identifying them. Or for ethernet adapters:
because each is connected to a different network, so you need to assign
different IP addresses to them.

Actually most USB harddisks, printers and network adapters have unique serial
number (you have to be careful though as some claim to have a serial number,
but it is not unique).

bye...

2001-10-19 19:20:57

by Mike Fedyk

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Fri, Oct 19, 2001 at 09:02:09PM +0200, Tim Jansen wrote:
> On Friday 19 October 2001 20:26, Patrick Mochel wrote:
> > There are equivalents in USB. But, neither of them are globally unique
> > identifiers for the device. That doesn't necessarily mean that one
> > couldn't be ascertained from the device; ethernet cards do have MAC
> > addresses. But, I don't think that many will have an ID/serial number.
> > [...]
> > Which leads me to the question: what real benefit does this have? Why
> > would you ever want to do a global search in kernel space for a particular
> > device?
>
> For example for harddisks. You usually want them to be mounted in the same
> directory.

When is /etc/fstab going to support this?

> Or if you have several printers of the same type connected to your
> computer you need a way of identifying them.

/dev/ttyS0 and /dev/lp0

>Or for ethernet adapters:
> because each is connected to a different network, so you need to assign
> different IP addresses to them.
>

I haven't seen anything assign ethX in a certain order, except for
ordered module loading, and then if there are multiple devices with the same
driver, the order is chosen by bus scanning order, or module option.

> Actually most USB harddisks, printers and network adapters have unique serial
> number (you have to be careful though as some claim to have a serial number,
> but it is not unique).
>

How different do you expect this new driver model to be?

Does anyone know if devfs will, or has any plans to support any of the above
features?

2001-10-19 20:05:10

by Tim Jansen

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Friday 19 October 2001 21:21, you wrote:
> > For example for harddisks. You usually want them to be mounted in the
> > same directory.
> When is /etc/fstab going to support this?

You can use the device ids to provide stable symlinks, then /etc/fstab
shouldn't be a problem. Or you rewrite mount to support it. Or you do it in
the kernel with a user-space helper: when a new device is connected its ID is
sent to some user-space app, and the user-space app then assigns a minor
number and devfs name to the node.

IMHO using the path of a file in /dev to identify a device node does not work
in a hotplugging environment. You need this to support existing apps, but the
only way to be sure that you always get the same device is to use device IDs.
You could encode that device id in the node's path or use the path as a
moniker for the device id (the symlink solution does this), but you need to
have more information about the device than its minor number (the X in
/dev/lpX).


> >Or for ethernet adapters:
> > because each is connected to a different network, so you need to assign
> > different IP addresses to them.
> I haven't seen anything assign ethX in a certain order, except for
> ordered module loading, and then if there are multiple devices with the
> same driver, the order is chosen by bus scanning order, or module option.

Ok, but I think no one doubts that it is a bad idea to assign ethX
semi-randomly. Basically this is the same problem as with device files, only
in a different namespace.


> Does anyone know if devfs will, or has any plans to support any of the
> above features?

The device registry (http://www.tjansen.de/devreg) patches devfs to allow the
things described above, though.

bye...

2001-10-19 20:24:12

by Mike Fedyk

Subject: Re: [RFC] New Driver Model for 2.5

On Fri, Oct 19, 2001 at 10:07:39PM +0200, Tim Jansen wrote:
> On Friday 19 October 2001 21:21, you wrote:
> > > For example for harddisks. You usually want them to be mounted in the
> > > same directory.
> > When is /etc/fstab going to support this?
>
> You can use the device ids to provide stable symlinks, then /etc/fstab
> shouldn't be a problem.

Sounds good.

>Or you rewrite mount to support it. Or you do it in
> the kernel with a user-space helper: when a new device is connected its ID is
> sent to some user-space app, and the user-space app then assigns a minor
> number and devfs name to the node.
>

Or, just use autofs, it does pretty much what you're describing.

> IMHO using the path of a file in /dev to identify a device node does not work
> in a hotplugging environment. You need this to support existing apps, but the
> only way to be sure that you always get the same device is to use device IDs.

Actually, I don't have a hotplug environment, but that's not the only place
it would be useful. Does ide/scsi have reliably unique device IDs? If so,
once devfs gets rid of those races it would be very useful in a large raid
setup. Hmm, I guess that could be hot-pluggable with high end hardware.

> You could encode that device id in the node's path or use the path as a
> moniker for the device id (the symlink solution does this), but you need to
> have more information about the device than it's minor number (the X in
> /dev/lpX).
>

What does devfs do now?

>
> > >Or for ethernet adapters:
> > > because each is connected to a different network, so you need to assign
> > > different IP addresses to them.
> > I haven't seen anything assign ethX assign a certain order, except for
> > ordered module loading, and then if there are multiple devices with the
> > same driver, the order is chosen by bus scanning order, or module option.
>
> Ok, but I think no one doubts that it is a bad idea to assign ethX
> semi-randomly. Basically this is the same problem as with device files, only
> in a different namespace.
>

So is that in favor of changing the current ethX naming convention or not?

>
> > Does anyone know if devfs will, or has any plans to support any of the
> > above features?
>
> The device registry (http://www.tjansen.de/devreg) patches devfs to allow the things
> described above though.
>

Everything, with all of the ids? What about scsi/ide?

2001-10-19 21:43:32

by Andrew Grover

Subject: RE: [RFC] New Driver Model for 2.5

My impression was that, while a device tree (with or without a fs to expose
it to userland) may help with the infamous Linux naming issues in the long
term, the first go-round should avoid this issue entirely and focus on just
enabling the global device tree itself. (I just want to suspend/wake my
laptop's devices in the right order!)

Everyone pretty much agrees that the device tree and device power management
are good. My hope is we don't let other contentious issues hinder its
implementation.

Regards -- Andy


> From: Mike Fedyk [mailto:[email protected]]
> Sent: Friday, October 19, 2001 1:24 PM
> To: Tim Jansen
> Cc: [email protected]; Patrick Mochel
> Subject: Re: [RFC] New Driver Model for 2.5
>
>
> On Fri, Oct 19, 2001 at 10:07:39PM +0200, Tim Jansen wrote:
> > On Friday 19 October 2001 21:21, you wrote:
> > > > For example for harddisks. You usually want them to be mounted in the
> > > > same directory.
> > > When is /etc/fstab going to support this?
> >
> > You can use the device ids to provide stable symlinks, then /etc/fstab
> > shouldn't be a problem.
>
> Sounds good.
>
> >Or you rewrite mount to support it. Or you do it in
> > the kernel with a user-space helper: when a new device is connected its ID is
> > sent to some user-space app, and the user-space app then assigns a minor
> > number and devfs name to the node.
> >
>
> Or, just use autofs, it does pretty much what you're describing.
>
> > IMHO using the path of a file in /dev to identify a device node does not work
> > in a hotplugging environment. You need this to support existing apps, but the
> > only way to be sure that you always get the same device is to use device IDs.
>
> Actually, I don't have a hotplug envoronment, but that's not the only place
> it would be useful. Does ide/scsi have reliably unique device IDs? If so,
> once devfs gets rid of those races it would be very useful in a large raid
> setup. Hmm, I guess that could be hot-pluggable with high end hardware.
>
> > You could encode that device id in the node's path or use the path as a
> > moniker for the device id (the symlink solution does this), but you need to
> > have more information about the device than it's minor number (the X in
> > /dev/lpX).
> >
>
> What does devfs do now?
>
> >
> > > >Or for ethernet adapters:
> > > > because each is connected to a different network, so you need to assign
> > > > different IP addresses to them.
> > > I haven't seen anything assign ethX assign a certain order, except for
> > > ordered module loading, and then if there are multiple devices with the
> > > same driver, the order is chosen by bus scanning order, or module option.
> >
> > Ok, but I think no one doubts that it is a bad idea to assign ethX
> > semi-randomly. Basically this is the same problem as with device files, only
> > in a different namespace.
> >
>
> So is that in favor of changing the current ethX naming convention or not?
>
> >
> > > Does anyone know if devfs will, or has any plans to support any of the
> > > above features?
> >
> > The device registry (http://www.tjansen.de/devreg) patches devfs to allow the things
> > described above though.
> >
>
> Everything, with all of the ids? What about scsi/ide?

2001-10-19 22:23:20

by Tim Jansen

Subject: Re: [RFC] New Driver Model for 2.5

On Friday 19 October 2001 22:24, you wrote:
> > You could encode that device id in the node's path or use the path as a
> > moniker for the device id (the symlink solution does this), but you need
> > to have more information about the device than it's minor number (the X
> > in /dev/lpX).
> What does devfs do now?

Gets the name from the device driver, usually X is the minor number (or the
minor number + some constant, if several drivers share a major number). The
names are only constant if the devices are discovered in the same order.


> > Ok, but I think no one doubts that it is a bad idea to assign ethX
> > semi-randomly. Basically this is the same problem as with device files,
> > only in a different namespace.
> So is that in favor of changing the current ethX naming convention or not?

I don't know. You don't need a device file for networking, but if there is
some mechanism to allow stable names it would certainly be good to use it for
network, too.


> > The device registry (http://www.tjansen.de/devreg) patches devfs to allow the
> > things described above though.
> Everything, with all of the ids? What about scsi/ide?

Only SCSI/sd, PCI and USB.

bye...

2001-10-19 23:36:39

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

Reading about the suspend to disk issue, and thinking about
some of my needs, I still tend to think we have overlooked
that issue. We should probably add a couple of list_heads
to define a second tree in parallel to the device-tree, which
is the power tree. A device is by default inserted in both
trees as a child of its bus controller. But the arch must be
able to move it elsewhere. I believe we have a way around the
VM related ordering issues, but we do have other kinds of
ordering constraints.

It may be me not reading well, but I think you didn't define
the fact that io_bus is a superset of device. In fact, it's
just a device that has children, and this should probably be
more generically viewed in struct device itself. Any device
should be able to have children, so we really have 2 interleaved
trees of devices, the bus tree and the power tree. In fact,
to be complete, we could even define the interrupt tree with
one more set of links as it's really not related to the bus
tree on many archs/machines, and having a tree definition
is really useful when you deal with cascaded controllers.

What do you think ?

Ben.


2001-10-19 23:36:39

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

Reading about the suspend to disk issue, and thinking about
some of pmac needs, I still tend to think we have overlooked
that ordering issue.

We should probably add a couple of list_heads to define a second
tree in parallel to the device-tree, which is the power tree.
A device is by default inserted in both trees as a child of its bus
controller.
But the arch must be able to move it elsewhere. I believe we have
a way around the VM related ordering issues, but we do have other
kinds of ordering constraints that have to be dealt with when
we start broadcasting the callbacks.

Also, I think you didn't state that io_bus is a superset of device.
In fact, it's just a device that has children, and this should
probably be more generically viewed in struct device itself.

Any device should be able to have children, so we really have 2
interleaved trees of devices, the bus tree and the power tree.
In fact, to be complete, we could even define the interrupt tree
with one more set of links as it's really not related to the bus
tree on many archs/machines, and having a tree definition is really
useful when you deal with cascaded controllers.
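
To make the two-tree idea concrete, here is a minimal sketch in plain C. It is an illustration only, not code from the proposal: struct dev_node and the three helper functions are hypothetical names, and simple parent pointers stand in for the list_heads mentioned above.

```c
#include <stddef.h>

/* Hypothetical, simplified device node; this is NOT the real struct device. */
struct dev_node {
    const char *name;
    struct dev_node *bus_parent;    /* parent in the bus tree */
    struct dev_node *power_parent;  /* parent in the power tree */
};

/* By default a device joins both trees as a child of its bus controller. */
void dev_attach(struct dev_node *dev, struct dev_node *bus_ctrl)
{
    dev->bus_parent = bus_ctrl;
    dev->power_parent = bus_ctrl;
}

/* Arch code may move a device within the power tree only. */
void dev_set_power_parent(struct dev_node *dev, struct dev_node *parent)
{
    dev->power_parent = parent;
}

/* Number of ancestors in the power tree (0 for the root). */
int dev_power_depth(const struct dev_node *dev)
{
    int depth = 0;

    while (dev->power_parent != NULL) {
        dev = dev->power_parent;
        depth++;
    }
    return depth;
}
```

With this layout, a chip that sits directly under the root bus can still be placed under an ASIC in the power tree, so a power-ordered walk visits it before the ASIC that controls its clock.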

What do you think ?

Ben.


2001-10-19 23:57:32

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>Reading about the suspend to disk issue, and thinking about
>some of my needs, I tend to stil think we have overlooked
>that issue. We should probably add a couple of list_heads
>to define a second tree in parallell to the device-tree, which
>is the power tree. A device is by default inserted in both
>tree as a child of it's bus controller. But the arch must be
>able to move it elsewhere. I beleive we have a way around the
>VM related ordering issues, but we do have other kind of
>ordering constraints.
>
> .../...

Argh, I'm too tired, I sent the draft ! sorry ;)

Ben.


2001-10-20 00:11:36

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Sat, 20 Oct 2001, Benjamin Herrenschmidt wrote:
>
> Reading about the suspend to disk issue, and thinking about
> some of pmac needs, I tend to stil think we have overlooked
> that ordering issue.

Why?

If there is some ordering inherent in the bus, that has to be shown in the
bus structure. Why would you EVER care about order between devices that
are independent?

Linus

2001-10-20 01:41:54

by john slee

Subject: Re: [RFC] New Driver Model for 2.5

On Fri, Oct 19, 2001 at 12:21:01PM -0700, Mike Fedyk wrote:
> When is /etc/fstab going to support this?

it does; at least on my debian system:

# e2label /dev/hda1

# e2label /dev/hda1 foo
# e2label /dev/hda1
foo
# mount LABEL=foo /mnt
#

you can use the same LABEL=foo syntax in /etc/fstab...
according to my fstab(5) manpage this also works with xfs, although i've
not tried it.

surely i am not deluded and this is a not-debian-specific feature?
having used nothing but debian for some years now i really can't be
sure...

j.

--
R N G G "Well, there it goes again... And we just sit
I G G G here without opposable thumbs." -- gary larson

2001-10-20 09:28:51

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>Why?
>
>If there is some ordering inherent in the bus, that has to be shown in the
>bus structure. Why would you EVER care about order between devices that
>are independent?

The power tree layout isn't necessarily identical to the bus tree layout.
On some macs, for example, we have some ASICs that can control some other
chip's clock and power lines, without having a direct parent relationship.

So I like having the ability to reorder the power tree layout from
arch code. But I can work around if this ability is not provided by
the struct device. In fact, my main issue here is with Apple's big
"mac-io" ASIC (combo of several devices along with various IO lines
and clocks), and I believe I will have to handle it as a special
case anyway for other reasons (it must really be shut down last as
once down, I can't even talk to the power manager chip ;) I think
all other devices I have to deal with follow the physical bus
ordering.

The problem of suspend-to-disk, which requires, I believe, that the
device used for the memory backup be state-saved last, is still
a problem I don't know how to solve. Maybe using flags in the device
structure indicating it's deferred. That would cause its parents
to be deferred as well. The presence of the flag would prevent the
actual "suspend" state from being entered during step 3. Once all devices
are suspended, we then know that the bus path to the disk used for
suspend-to-disk is still powered, and can perform the actual
suspend-to-disk operation.

However, I believe that requires using non-generic IO functions (as
IO queues for the controller have already been blocked by step 2, and
the driver would have to deal with its own saved state carefully as
it obviously can't save state to RAM after it has been used to back up
the RAM itself). Maybe that could be a separate message (suspend_to_disk)
sent instead of step 3 (suspend) to this device.

That would give us the following scenario:

- The device for suspend-to-disk is identified and a flag is set
in its device structure. This flag (or a different one to make
things clear eventually) is "broadcast" all the way up the tree
so its parent bridges/controllers are marked as well.
- All devices get "suspend_prepare".
- All devices get "suspend_save_state" and block normal IOs
- All devices not marked above get "suspend"
- Last housecleaning is done by the kernel.
- The device marked above gets a special "suspend_to_disk" message
during which it can perform the actual memory backup and suspend
itself.
- The machine is put to sleep.
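
The first step of the scenario above (marking the suspend-to-disk device and every bridge or controller above it) could be sketched roughly as follows. This is only an illustration under assumptions: struct pm_dev, its deferred flag and both functions are hypothetical names, and a single parent pointer stands in for the real tree links.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device record used only for this sketch. */
struct pm_dev {
    const char *name;
    struct pm_dev *parent;  /* NULL at the root of the tree */
    bool deferred;          /* skip the normal "suspend" step */
};

/* Mark the suspend-to-disk device and all of its parents. */
void pm_mark_deferred(struct pm_dev *dev)
{
    for (; dev != NULL; dev = dev->parent)
        dev->deferred = true;
}

/* Count the devices that would receive the normal "suspend" message. */
size_t pm_count_suspendable(struct pm_dev **devs, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!devs[i]->deferred)
            count++;
    }
    return count;
}
```

After pm_mark_deferred() is called on the disk, only devices that are off the path from the disk to the root are left for the ordinary "suspend" pass.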

I currently don't implement suspend-to-disk on Mac, so I may have
overlooked something. Also, I'm not too sure about the requirements
of x86 laptops regarding those features. I'm lucky, Mac laptops
keep the RAM content during suspend ;)

Note also that if not doing suspend-to-disk, I think we should also
make sure to sync all buffers after suspend_prepare and before sending
the suspend_save_state messages. I noticed that recent 2.4 versions
are more sensitive to power failure during suspend (that is, the battery
running empty, or whatever else causes loss of RAM content). I used to
call fsync_devs(0) between those 2 steps in the Mac PM scheme, but
it appears that with recent 2.4's, this doesn't prevent fsck from
finding inconsistencies.

Ben.



2001-10-20 18:22:53

by kaih

Subject: Re: [RFC] New Driver Model for 2.5

[email protected] (Tim Jansen) wrote on 20.10.01 in <[email protected]>:

> On Friday 19 October 2001 22:24, you wrote:

> > > Ok, but I think no one doubts that it is a bad idea to assign ethX
> > > semi-randomly. Basically this is the same problem as with device files,
> > > only in a different namespace.
> > So is that in favor of changing the current ethX naming convention or not?
>
> I don't know. You don't need a device file for networking, but if there is
> some mechanism to allow stable names it would certainly be good to use it
> for network, too.

You need stable identifiers, but those identifiers don't need to be the
usual names, as long as you have a way to find out which identifier goes
with which name dynamically.


MfG Kai

2001-10-20 18:22:53

by kaih

Subject: Re: [RFC] New Driver Model for 2.5

[email protected] (Mike Fedyk) wrote on 19.10.01 in <[email protected]>:

> On Fri, Oct 19, 2001 at 09:02:09PM +0200, Tim Jansen wrote:
> > On Friday 19 October 2001 20:26, Patrick Mochel wrote:
> > > There are equivalents in USB. But, neither of them are globally unique
> > > identifiers for the device. That doesn't necessarily mean that one
> > > couldn't be ascertained from the device; ethernet cards do have MAC
> > > addresses. But, I don't think that many will have a ID/serial number.
> > > [...]
> > > Which leads me to the question: what real benefit does this have? Why
> > > would you ever want to do a global search in kernel space for a
> > > particular device?
> >
> > For example for harddisks. You usually want them to be mounted in the same
> > directory.
>
> When is /etc/fstab going to support this?

Know your tools.

/etc/fstab:
UUID=eba05cbf-55ff-44d7-846a-7846c6010843 /usr ext2 defaults,nocheck 0 2

I have this mounted right now, on 2.2.19:

/dev/sdb7 3936400 3597588 138852 97% /usr

That's an ext2 partition ID, so even if repartitioning renumbers the
partition, mount will still find it - only mkfs forces me to use a new ID.
Changing the controller and SCSI id obviously makes no difference
whatsoever. I could use labels, too, but they tend to be less unique.

/proc/partitions is necessary to know what partitions to look at, of
course.

> >Or for ethernet adapters:
> > because each is connected to a different network, so you need to assign
> > different IP addresses to them.
> >
>
> I haven't seen anything assign ethX assign a certain order, except for
> ordered module loading, and then if there are multiple devices with the same
> driver, the order is chosen by bus scanning order, or module option.

Exactly. So you can't use the order if there's any possibility of this, so
you need to use the MAC address.

MfG Kai

2001-10-21 18:03:03

by Pavel Machek

Subject: Re: [RFC] New Driver Model for 2.5

Hi!

> The problem of suspend-to-disk, which requires, I beleive, that the
> device used for the memory backup, to be state-saved last, is still
> a problem I don't know how to solve. Maybe using flags in the device

Don't care about it. It's easy.

> That would give us the following scenario:
>
> - The device for suspend-to-disk is identified and a flag is set
> in it's device structure. This flag (or a different one to make
> things clear eventually) is "broadcast" all the way up the tree
> so it's parent brigdes/controllers are marked as well.

You don't need this.

> - All devices get "suspend_prepare".
> - All devices get "suspend_save_state" and block normal IOs
> - All devices not marked above get "suspend"

... not needed. You are going to power down (suspend-to-disk ends in
powerdown, right?), so you don't care what state the devices are in. You
don't need to suspend them.

You just write state to disk and powerdown, now.
Pavel
--
I'm [email protected]. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [email protected]

2001-10-22 11:08:08

by Padraig Brady

Subject: Re: [RFC] New Driver Model for 2.5

Kai Henningsen wrote:

>[email protected] (Mike Fedyk) wrote on 19.10.01 in <[email protected]>:
>
>>On Fri, Oct 19, 2001 at 09:02:09PM +0200, Tim Jansen wrote:
>>
>>>On Friday 19 October 2001 20:26, Patrick Mochel wrote:
>>>
>>>>There are equivalents in USB. But, neither of them are globally unique
>>>>identifiers for the device. That doesn't necessarily mean that one
>>>>couldn't be ascertained from the device; ethernet cards do have MAC
>>>>addresses. But, I don't think that many will have a ID/serial number.
>>>>[...]
>>>>Which leads me to the question: what real benefit does this have? Why
>>>>would you ever want to do a global search in kernel space for a
>>>>particular device?
>>>>
>>>For example for harddisks. You usually want them to be mounted in the same
>>>directory.
>>>
>>When is /etc/fstab going to support this?
>>
>
>Know your tools.
>
>/etc/fstab:
>UUID=eba05cbf-55ff-44d7-846a-7846c6010843 /usr ext2 defaults,nocheck 0 2
>
>I have this mounted right now, on 2.2.19:
>
>/dev/sdb7 3936400 3597588 138852 97% /usr
>
>That's an ext2 partition ID, so even if repartitioning renumbers the
>partition, mount will still find it - only mkfs forces me to use a new ID.
>Changing the controller and SCSI id obviously makes no difference
>whatsoever. I could use labels, too, but they tend to be less unique.
>
>/proc/partitions is necessary to know what partitions to look at, of
>course.
>
>>>Or for ethernet adapters:
>>>because each is connected to a different network, so you need to assign
>>>different IP addresses to them.
>>>
>>I haven't seen anything assign ethX assign a certain order, except for
>>ordered module loading, and then if there are multiple devices with the same
>>driver, the order is chosen by bus scanning order, or module option.
>>
>
>Exactly. So you can't use the order if there's any possibility of this, so
>you need to use the MAC address.
>
Yes this (MAC address) is the only general way of doing it.
For ethernet cards (or anything else) on the PCI bus you can use
the following to specify an order:
ftp://platan.vc.cvut.cz/pub/linux/pciorder.patch-2.4.12-ac1.gz
This allows you to pass the following parameter to the kernel:
pciorder=Bus:Node.Fn,Bus:Node.Fn,... e.g.
pciorder=0:0d.0,0:0b.0,0:0a.0
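
For illustration, a string in that Bus:Node.Fn format can be split into its fields with a few lines of C. This is a stand-alone sketch and is not taken from the patch; struct pci_slot and parse_pciorder() are hypothetical names, and the fields are assumed to be hexadecimal, as the example value 0d suggests.

```c
#include <stdio.h>

/* One "Bus:Node.Fn" entry; a hypothetical type for this sketch. */
struct pci_slot {
    unsigned int bus;
    unsigned int node;
    unsigned int fn;
};

/*
 * Parse "Bus:Node.Fn,Bus:Node.Fn,..." into out[], reading at most
 * max entries. Returns the number of entries parsed, or -1 if an
 * entry is malformed. Fields are assumed to be hexadecimal.
 */
int parse_pciorder(const char *arg, struct pci_slot *out, int max)
{
    int count = 0;

    while (*arg != '\0' && count < max) {
        int used = 0;

        if (sscanf(arg, "%x:%x.%x%n", &out[count].bus,
                   &out[count].node, &out[count].fn, &used) != 3)
            return -1;
        count++;
        arg += used;
        if (*arg == ',')
            arg++;              /* step over the separator */
        else if (*arg != '\0')
            return -1;          /* unexpected trailing character */
    }
    return count;
}
```

Calling parse_pciorder() on the example value above yields three entries whose node fields are 0x0d, 0x0b and 0x0a, in that order.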

Padraig.


2001-10-23 00:20:05

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5


> > That would give us the following scenario:
> >
> > - The device for suspend-to-disk is identified and a flag is set
> > in it's device structure. This flag (or a different one to make
> > things clear eventually) is "broadcast" all the way up the tree
> > so it's parent brigdes/controllers are marked as well.
>
> You don't need this.

Correct. Suspend-to-disk is a process, not a discrete action. It goes much
like Ben described:

From the system point of view (slightly different from the driver's point
of view):

- notify devices of pending suspension; check for failures
- tell devices to save state; write this state to disk
- turn all devices off
- power off (aka suspend system)

> > - All devices get "suspend_prepare".
> > - All devices get "suspend_save_state" and block normal IOs
> > - All devices not marked above get "suspend"
>
> ... not needed. You are going powerdown (suspend-to-disk ends in
> powerdown, right?), so you don't care about state devices are in. You don't need to suspend them.
>
> You just write state to disk and powerdown, now.

Uhm...That would probably work. But, I would rather explicitly turn off
all the devices. In the world of ACPI, S4 and S5 (Suspend to disk and
soft-off respectively) don't power the system completely off; some power
remains to capture wake events.

[ One point to note is that in an ACPI-enabled system, we use S5 to power
the system off. But, there are things still running in the system.
Everything is not shut off until you perform a mechanical off (unplug it).
This means that you can get wake events and cause the system to boot after
you think you've turned it off. This is why some people are experiencing
reboots when the system should be powering down. ]

By explicitly turning off as many devices as possible, we're playing more
on the safe side of things, and possibly reducing the amount of power that
is being consumed.


Btw, I updated the model to support an n-stage suspend process, with 3
stages explicitly defined, as per some discussion about it. Those stages
are:

SUSPEND_NOTIFY
SUSPEND_SAVE_STATE
SUSPEND_POWER_DOWN

To suspend the device tree, one would do something like:

/* Tell all the devices we're going to sleep.
 * This also allows them to allocate memory before the swap
 * device stops taking orders.
 */
device_suspend(3, SUSPEND_NOTIFY);

/* if someone failed, get out now */

/* Now tell them to stop I/O and save their state */
device_suspend(3, SUSPEND_SAVE_STATE);

/* Write the state to disk... */

/* Finally, turn all of the devices off. */
device_suspend(3, SUSPEND_POWER_DOWN);

/* Now, put the system to sleep. */


I also updated the docs. It can all be found at:

http://kernel.org/pub/linux/kernel/people/mochel/device/


-pat

2001-10-23 00:26:26

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> /* Now tell them to stop I/O and save their state */
> device_suspend(3, SUSPEND_SAVE_STATE);

I'd very much like this one to be two pass, with the second pass occurring
after interrupts are disabled. There are some horrible cases to try and
handle otherwise (like devices that like to jam the irq line high).

Ditto on return from suspend where some devices also like to float the irq
high as you take them over (eg USB on my Palmax). From comments Ben made
ages back I believe ppc has similar issues if not worse.


Alan

2001-10-23 00:29:06

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5


On Tue, 23 Oct 2001, Alan Cox wrote:

> > /* Now tell them to stop I/O and save their state */
> > device_suspend(3, SUSPEND_SAVE_STATE);
>
> I'd very much like this one to be two pass, with the second pass occuring
> after interrupts are disabled. There are some horrible cases to try and
> handle otherwise (like devices that like to jam the irq line high).

I forgot to mention to disable interrupts after the SUSPEND_NOTIFY call.
The idea is to allocate all memory in the first pass, disable interrupts,
then save state. Would that work? Or, should some of the state saving take
place with interrupts enabled?


> Ditto on return from suspend where some devices also like to float the irq
> high as you take them over (eg USB on my Palmax). From comments Ben made
> ages back I believe ppc has similar issues if not worse

Yes, the resume sequence is broken into two stages:

device_resume(RESUME_POWER_ON);

/* enable interrupts */

device_resume(RESUME_RESTORE_STATE);

Do you see a need to break it up further?

-pat


2001-10-23 09:45:23

by Pavel Machek

Subject: Re: [RFC] New Driver Model for 2.5

Hi!

> > > /* Now tell them to stop I/O and save their state */
> > > device_suspend(3, SUSPEND_SAVE_STATE);
> >
> > I'd very much like this one to be two pass, with the second pass occuring
> > after interrupts are disabled. There are some horrible cases to try and
> > handle otherwise (like devices that like to jam the irq line high).
>
> I forgot to mention to disable interrupts after the SUSPEND_NOTIFY call.
> The idea is to allocate all memory in the first pass, disable interrupts,
> then save state. Would that work? Or, should some of the state saving take
> place with interrupts enabled?

That looks ugly, because you'd need to add DONT_SUSPEND_NOTIFY, called
when SUSPEND_NOTIFY fails.
Pavel
--
Casualities in World Trade Center: 6453 dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.

2001-10-23 10:55:10

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>I'd very much like this one to be two pass, with the second pass occuring
>after interrupts are disabled. There are some horrible cases to try and
>handle otherwise (like devices that like to jam the irq line high).
>
>Ditto on return from suspend where some devices also like to float the irq
>high as you take them over (eg USB on my Palmax). From comments Ben made
>ages back I believe ppc has similar issues if not worse

Well, the idea here was to disable them between the second and third
pass. Devices that can completely suspend with interrupts enabled can
do it at the end of step 2, while more broken devices can do it at
step 3. It might be semantically clearer to actually consider step
2 as exclusively "block io & save state", in which case, breaking up
the "suspend" state into 2 separate states with and without interrupts
makes sense. We just didn't feel that was necessary.

Ben.


2001-10-23 11:03:50

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>> I forgot to mention to disable interrupts after the SUSPEND_NOTIFY call.
>> The idea is to allocate all memory in the first pass, disable interrupts,
>> then save state. Would that work? Or, should some of the state saving take
>> place with interrupts enabled?
>
>That looks ugly, because you'd need to add DONT_SUSPEND_NOTIFY, called
>when SUSPEND_NOTIFY fails.
> Pavel

No, interrupts have to be shut down between SUSPEND_SAVE_STATE and
SUSPEND_POWER_DOWN, I believe.

SUSPEND_SAVE_STATE must run with interrupts enabled, as it's supposed
to both block new incoming IOs and wait for pending ones to complete (*).
It would be inefficient to force drivers to implement polled IOs for
this case.

SUSPEND_POWER_DOWN itself should perfectly be able to run with interrupts
disabled, I believe, as most of the actual suspend sequence can be done
in SUSPEND_SAVE_STATE on most chips.

There is no problem with failure there. Just call RESUME_POWER_ON if
SUSPEND_POWER_DOWN failed (or later), then RESUME_RESTORE_STATE in
all cases. The driver knows which state it comes from anyway,
and we don't have, I believe, that strict a VM need for separating
suspend from frees. Well... let's think more about it... we might
actually need to allocate memory in RESUME_RESTORE_STATE to create
new requests or whatever the driver needs... but we can also do that
earlier, inside SUSPEND_NOTIFY (or just use whatever memory we
pre-allocated to save state). So it might make sense to have the
resume process be an exact mirror of the suspend one, or not, maybe
just a matter of taste.

Keep in mind that most drivers won't need to implement
all of these.
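
The failure handling described above can be sketched as a small user-space simulation. The stage names come from this thread; everything else is an assumption made for illustration: struct pm_driver, pm_deliver(), pm_broadcast(), pm_suspend_all() and the irqs_enabled variable are hypothetical stand-ins, and interrupts are "disabled" between SUSPEND_SAVE_STATE and SUSPEND_POWER_DOWN, which is only one of the placements discussed here.

```c
#include <stddef.h>

enum pm_stage {
    SUSPEND_NOTIFY,
    SUSPEND_SAVE_STATE,
    SUSPEND_POWER_DOWN,
    RESUME_POWER_ON,
    RESUME_RESTORE_STATE
};

/* Hypothetical demo driver: fails at one chosen stage, or never (-1). */
struct pm_driver {
    int fail_at;     /* stage at which this driver reports failure */
    int last_stage;  /* last stage delivered to this driver */
};

int irqs_enabled = 1;  /* stand-in for real interrupt control */

/* Deliver one stage to one driver; 0 on success, -1 on failure. */
int pm_deliver(struct pm_driver *drv, enum pm_stage stage)
{
    drv->last_stage = (int)stage;
    return drv->fail_at == (int)stage ? -1 : 0;
}

/* Send a stage to every driver; returns the first error seen, or 0. */
int pm_broadcast(struct pm_driver **drv, size_t n, enum pm_stage stage,
                 int stop_on_error)
{
    int first_err = 0;

    for (size_t i = 0; i < n; i++) {
        int err = pm_deliver(drv[i], stage);

        if (err && !first_err)
            first_err = err;
        if (err && stop_on_error)
            break;
    }
    return first_err;
}

int pm_suspend_all(struct pm_driver **drv, size_t n)
{
    int err = pm_broadcast(drv, n, SUSPEND_NOTIFY, 1);
    if (err)
        goto restore;

    /* Runs with interrupts on so pending IO can complete. */
    err = pm_broadcast(drv, n, SUSPEND_SAVE_STATE, 1);
    if (err)
        goto restore;

    irqs_enabled = 0;
    err = pm_broadcast(drv, n, SUSPEND_POWER_DOWN, 1);
    if (!err)
        return 0;  /* the system may now be put to sleep */

    /* Power-down failed: power everything back on first. */
    pm_broadcast(drv, n, RESUME_POWER_ON, 0);
    irqs_enabled = 1;
restore:
    /* RESUME_RESTORE_STATE is sent in all failure cases. */
    pm_broadcast(drv, n, RESUME_RESTORE_STATE, 0);
    return err;
}
```

In this sketch the rollback broadcasts ignore errors, so every driver is told to restore even if one of them complains along the way.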

Ben.




2001-10-23 11:49:42

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>SUSPEND_SAVE_STATE must run with interrupts enabled, as it's supposed
>to both block new incoming IOs and wait for pending ones to complete (*).
>It would be sub-efficient to force drivers to implement polled IOs for
>this case.

I forgot...

(*) Did you decide whether to allow that "call me again later" result code
from SUSPEND_SAVE_STATE ? If yes, that would mean you must loop, notifying
all drivers that have not ack'ed it until they all do, before going to
SUSPEND_POWER_DOWN. It's probably not much bloat to leave the feature in;
as usual, it doesn't have to be used by drivers, but for those drivers
that know it will take some time for pending async requests to complete,
it makes sense to let others perform their job. It would slightly speed
up the suspend process, which is not critical, but still nice ;)
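
A rough sketch of such a "call me again later" loop is shown below. It is a simulation under assumptions, not code from the model: the PM_DONE and PM_AGAIN result codes, struct pm_client and both functions are hypothetical, and the demo driver simply completes one pending request per call.

```c
#include <stddef.h>

/* Hypothetical result codes for the save-state callback. */
enum pm_result { PM_DONE = 0, PM_AGAIN = 1 };

/* Hypothetical demo driver state. */
struct pm_client {
    int pending_io;  /* outstanding async requests */
    int acked;       /* set once the driver has finished saving state */
};

/* Demo callback: one pending request completes per call. */
enum pm_result pm_save_state(struct pm_client *c)
{
    if (c->pending_io > 0) {
        c->pending_io--;
        return PM_AGAIN;
    }
    return PM_DONE;
}

/*
 * Keep notifying the drivers that have not acked yet. Returns the
 * number of passes needed, or -1 if max_passes was not enough.
 */
int pm_save_state_all(struct pm_client **c, size_t n, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        size_t waiting = 0;

        for (size_t i = 0; i < n; i++) {
            if (c[i]->acked)
                continue;  /* already done, do not call again */
            if (pm_save_state(c[i]) == PM_DONE)
                c[i]->acked = 1;
            else
                waiting++;
        }
        if (waiting == 0)
            return pass;
    }
    return -1;
}
```

The max_passes bound is an addition of this sketch, so that a driver which never acks cannot stall the suspend forever.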

Ben.




2001-10-23 12:57:06

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> The idea is to allocate all memory in the first pass, disable interrupts,
> then save state. Would that work? Or, should some of the state saving take
> place with interrupts enabled?

Imagine the state saving done on a USB device. There you need interrupts
on while retrieving the state from say a USB scanner, and in some cases
off while killing the USB controller.

> > Ditto on return from suspend where some devices also like to float the irq
> > high as you take them over (eg USB on my Palmax). From comments Ben made
> > ages back I believe ppc has similar issues if not worse
>
> Yes, the resume sequence is broken into two stages:
>
> device_resume(RESUME_POWER_ON);
>
> /* enable interrupts */
>
> device_resume(RESUME_RESTORE_STATE);
>
> Do you see a need to break it up further?

Nope.

2001-10-23 15:10:52

by Jonathan Lundell

Subject: Re: [RFC] New Driver Model for 2.5

At 8:53 AM +0100 10/23/01, Alan Cox wrote:
> > The idea is to allocate all memory in the first pass, disable interrupts,
>> then save state. Would that work? Or, should some of the state saving take
>> place with interrupts enabled?
>
>Imagine the state saving done on a USB device. There you need interrupts
>on while retrieving the state from say a USB scanner, and in some cases
>off while killing the USB controller.

Is this a realistic example? That is, is a kernel-side driver likely
to be able to meaningfully extract state information from a scanner?
And is it necessary?

And for a scanner, if the current operation is a scan generating a GB
of data, what happens if the disk subsystem is no longer accepting
requests?

As Jeff Garzik pointed out, NIC drivers typically don't need to save
any state at all; it's all recreatable from software structures.
Perhaps that characteristic can and should be generalized to other
devices.

In that case, SUSPEND_SAVE_STATE becomes more like SUSPEND_QUIESCE:
stop accepting new requests, and complete current requests.

"Stop accepting new requests" is nontrivial as well, in the general
case. New requests that can't be discarded need to be queued
somewhere. Whose responsibility is that? Ideally at some point where
a queue already exists, possibly in the requester.
--
/Jonathan Lundell.

2001-10-23 15:44:35

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> Is this a realistic example? That is, is a kernel-side driver likely
> to be able to meaningfully extract state information from a scanner?
> And is it necessary?

It may be a bad example - but think about things like page settings. Do you
want a resume to scan in colour when you set black and white just before
suspend ?


> And for a scanner, if the current operation is a scan generating a GB
> of data, what happens if the disk subsystem is no longer accepting
> requests?

It should have refused to suspend because it was active

> In that case, SUSPEND_SAVE_STATE becomes more like SUSPEND_QUIESCE:
> stop accepting new requests, and complete current requests.

Maybe. That sounds like nice design and horrible implementation

2001-10-23 20:45:06

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>"Stop accepting new requests" is nontrivial as well, in the general
>case. New requests that can't be discarded need to be queued
>somewhere. Whose responsibility is that? Ideally at some point where
>a queue already exists, possibly in the requester.

Some drivers already handle queues. In the case of a network driver, just
stop your network queue and stop accepting incoming packets. If your
driver is too simple to have queues, a simple semaphore on entry points
can often be enough. You shouldn't deadlock as you are not supposed to
re-enter a sleeping driver in step 2.

The above is ensured by the tree layout, which does the dependency
ordering. You might have slightly off-tree dependencies, like I have
in a couple of cases on macs. But I figured that all of them could be
handled as special cases in some parent nodes without being that
dirty (in most cases, those are Apple-specific ASICs containing devices
with inter-deps, and the workaround is to move some devices' sleep code
to the node of the ASIC itself).

Ben.


2001-10-23 20:48:36

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> >"Stop accepting new requests" is nontrivial as well, in the general
> >case. New requests that can't be discarded need to be queued
> >somewhere. Whose responsibility is that? Ideally at some point where
> >a queue already exists, possibly in the requester.
>
> Some driver already handle queues. In the case of network driver, just
> stop your network queue and stop accepting incoming packets. If your
> driver is too simple to have queues, a simple semaphore on entry points
> can often be enough. You shouldn't deadlock as you are not supposed to
> re-enter a sleeping driver in step 2.

Stop accepting new requests is not simple. To complete existing requests you
might need an arbitrary other module to complete a new request you submit
as part of your shutdown.

Alan

2001-10-24 00:26:25

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>> Some driver already handle queues. In the case of network driver, just
>> stop your network queue and stop accepting incoming packets. If your
>> driver is too simple to have queues, a simple semaphore on entry points
>> can often be enough. You shouldn't deadlock as you are not supposed to
>> re-enter a sleeping driver in step 2.
>
>Stop accepting new requests is not simple. To complete existing requests you
>might need an arbitary other module to complete a new request you submit
>as part of your shutdown.

That means you have an ordering dependency: the driver you rely
upon must be stopped after you. That's the point of having a
tree here. Patrick and Linus feel the bus tree is enough to handle
that dependency, which might well be the case for 99% of drivers.

I have a couple of cases where that's not completely true on pmacs,
but nothing that can't be worked around simply, I believe.

If you feel more drivers will be affected, then we will probably
need to separate the power-tree from the device-tree and provide
some hooks so that ordering can be tweaked.

All this assumes you don't have circular dependencies of course ;)

I see a lot of cases where this "block IOs" is easily dealt with
in the drivers I maintain on pmac, that might not be that easy on
other archs, I can't tell.

Basically, simple drivers can just use a semaphore. I do that for
our sound driver, for example: I block any app doing an ioctl while
the driver is sleeping. (This happens late enough in the sleep
process that userland using /dev/apm_bios has already been
notified and has acked the suspend, so properly written apps
have stopped themselves already.)

Drivers using a request queue usually already have a way to mark
themselves busy (they use that to decide if they have to kick
the HW or not when getting a new request). In cases where a mid-layer
enters the scene, like SCSI, that wants to do timeouts, then well...
we can let it timeout (just stop processing requests), or we can
have the midlayer go to sleep as well :) That latter solution
may cause some interesting ordering issues, however...

Network drivers can stop their queue or just drop packets... I'd
like them to wait until the packets received from the network stack
before the callback was called have been sent. Those packets
may contain the request to a server to send a wake-on-lan magic
packet to your machine ;) For now, I just block the output
queue and flush the rings on pmac, but I also don't support WOL yet.

For fbdevs, I simply switch them to dummy functions when asleep.
This appears to work well. (Well, I do some additional state save
and PM, but all I do for "blocking IOs" is to drop them...)
Any printk done after they are suspended isn't displayed, but that's
not a real issue.
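
A minimal sketch of that "switch to dummy functions" trick, written as
userland C rather than real fbdev code. The sketch_fb structures, the
single write_pixel operation and all function names are assumptions made
for illustration; a real framebuffer driver has a much larger operations
table.

```c
#include <stddef.h>

/* Hypothetical operations table with a single drawing entry point. */
struct sketch_fb_ops {
    int (*write_pixel)(unsigned char *mem, size_t off, unsigned char val);
};

/* Normal path: actually touch the (simulated) video memory. */
int sketch_real_write(unsigned char *mem, size_t off, unsigned char val)
{
    mem[off] = val;
    return 0;
}

/* Asleep path: silently drop the request and report success. */
int sketch_dummy_write(unsigned char *mem, size_t off, unsigned char val)
{
    (void)mem;
    (void)off;
    (void)val;
    return 0;
}

const struct sketch_fb_ops sketch_live_ops = { sketch_real_write };
const struct sketch_fb_ops sketch_asleep_ops = { sketch_dummy_write };

struct sketch_fb {
    const struct sketch_fb_ops *ops;
    unsigned char mem[16];          /* stand-in for video memory */
};

void sketch_fb_init(struct sketch_fb *fb)
{
    /* Compound literal zero-fills mem and installs the live ops. */
    *fb = (struct sketch_fb){ .ops = &sketch_live_ops };
}

/* Suspend and resume only swap the table; callers never notice. */
void sketch_fb_suspend(struct sketch_fb *fb) { fb->ops = &sketch_asleep_ops; }
void sketch_fb_resume(struct sketch_fb *fb)  { fb->ops = &sketch_live_ops; }

/* Entry point used by everyone else; no "am I asleep?" test needed. */
int sketch_fb_write(struct sketch_fb *fb, size_t off, unsigned char val)
{
    if (off >= sizeof fb->mem)
        return -1;
    return fb->ops->write_pixel(fb->mem, off, val);
}
```

The design point is that callers keep using the same entry point and get
a successful return while the device is asleep, which is why this only
suits devices where dropping the request is harmless.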

So yes, "blocking IOs" can actually mean "dropping new IOs",
that depends very much on the driver.

For USB, for example, we can consider that when a device driver
(not a controller driver) suspend has been done, any URB it submits
can just be dropped (returned immediately with an error). We don't
need blocking here either. Of course, that means we have the
framework to call devices' suspend/resume callbacks when the
controller is about to go to sleep.

There might be other examples. I agree it's not a two-line fix
per driver, but that's the best I could imagine so far to have
something reliable.

If you have other ideas, please share.

Ben.


2001-10-24 09:51:58

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> upon must be stopped after you. That's the point of having a
> tree here. Patrick and Linus feel the bus tree is enough to handle
> that dependency, which might well be the case for 99% of drivers.

The two trees are certainly closely related - USB devices before USB hub,
USB hub before PCI etc. The scanner example works fine there, providing that
we are careful about memory issues - remember the USB layer allocates memory
to do any transaction, so the scanner has to complete its state save before
we do any interrupt disabling/memory alloc freezing.

That's still just ordering and maybe two passes.

> the HW or not when getting a new request). In cases where a mid-layer
> enters the scene, like SCSI, that wants to do timeouts, then well...
> we can let it timeout (just stop processing requests), or we can
> have the midlayer go to sleep as well :) That later solution
> may cause some interesting ordering issues however...

For scsi you have to complete the pending commands, you don't know what the
transaction granularity is in some cases and half completing the sequence
won't help you. In addition the upper layers have to queue additional scsi
commands to do stuff like cd drawer locking and to ask the drive firmware
to enter powerdown modes

> For USB, for example, we can consider that when a device driver
> (not a controller driver) suspend has been done, any URB it submits
> can just be dropped (returned immediately with an error). We don't
> need blocking here neither. Of course, that means we have the
> framework to call devices' suspend/resume callbacks when the
> controller is about to go to sleep.

That will scramble large numbers of devices. Randomly erroring pending block
writes is -not- civilised.

Alan

2001-10-24 10:35:41

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>
>The two trees are certainly closely related - USB devices before USB hub,
>USB hub before PCI etc. The scanner example works fine there, providing that
>we are careful about memory issues - remember the USB layer allocates memory
>to do any transaction, so the scanner has to complete its state save before
>we do any interrupt disabling/memory alloc freezing.

That's why we have that first step that is run before any device is blocked
and with interrupts still flowing. Devices that need memory for either state
save or for requests used during power down are supposed to allocate them
at this point.

>Thats still just ordering and maybe two passes

Well, 3 passes actually ;) I'd suggest you re-read Patrick Mochel's latest
document, which describes the 3-step process. A first pass gives devices
a chance to "prepare" for sleep, that is, allocate anything they will need
without actually blocking or suspending anything. On the second pass, devices
are asked to suspend IOs, complete pending ones, and save state. This is
done with interrupts still enabled. The 3rd pass is called with interrupts
disabled and is where the actual shutdown of the device is supposed to happen.

In practice, a lot of drivers will only need to implement 1 or 2 of these
3 passes. But the flexibility has to be there for a device that needs both
to allocate memory and to suspend with interrupts disabled.
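
The three passes can be sketched as follows, in userland C. The stage
names here are hypothetical (the thread itself only spells out
SUSPEND_SAVE_STATE and SUSPEND_POWER_DOWN), the interrupt flag is a plain
variable standing in for the real CPU state, and the driver structure is
an assumption of the sketch rather than the proposed struct device.

```c
#include <stddef.h>

/* Hypothetical stage identifiers for the three passes. */
enum sketch_stage {
    SKETCH_PREPARE = 0,     /* allocate whatever will be needed later */
    SKETCH_SAVE_STATE = 1,  /* block IOs, drain pending ones, save state */
    SKETCH_POWER_DOWN = 2   /* actual shutdown, interrupts disabled */
};

/* Stand-in for the CPU interrupt-enable flag. */
int sketch_irqs_enabled = 1;

struct sketch_driver {
    /* NULL means the driver does not take part in suspend at all. */
    int (*suspend)(struct sketch_driver *self, enum sketch_stage stage);
    unsigned int stages_seen;   /* bitmask, for illustration only */
    int irqs_at_power_down;     /* interrupt flag observed in pass 3 */
};

/* Example callback: records which stages it saw and the interrupt
 * state during power-down. A real driver would act on each stage. */
int sketch_example_suspend(struct sketch_driver *self, enum sketch_stage stage)
{
    self->stages_seen |= 1u << stage;
    if (stage == SKETCH_POWER_DOWN)
        self->irqs_at_power_down = sketch_irqs_enabled;
    return 0;
}

/* Run all three passes over every driver, in array order. Each pass
 * completes for all drivers before the next one begins. */
int sketch_suspend_all(struct sketch_driver **drivers, size_t n)
{
    int stage;

    for (stage = SKETCH_PREPARE; stage <= SKETCH_POWER_DOWN; stage++) {
        if (stage == SKETCH_POWER_DOWN)
            sketch_irqs_enabled = 0;    /* pass 3 runs with irqs off */
        for (size_t i = 0; i < n; i++) {
            int err;

            if (!drivers[i]->suspend)
                continue;
            err = drivers[i]->suspend(drivers[i], (enum sketch_stage)stage);
            if (err) {
                /* A real implementation would also undo earlier stages. */
                sketch_irqs_enabled = 1;
                return err;
            }
        }
    }
    return 0;
}
```

A driver that needs only one or two of the passes would simply return 0
for the stages it does not care about.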

An additional idea we had was to make pass 2 somewhat asynchronous by
allowing some kind of "call me later" result code (basically, a device
would block its queue, and return "call me later" while it still has
pending IOs). This is an optional "feature".

>> the HW or not when getting a new request). In cases where a mid-layer
>> enters the scene, like SCSI, that wants to do timeouts, then well...
>> we can let it timeout (just stop processing requests), or we can
>> have the midlayer go to sleep as well :) That later solution
>> may cause some interesting ordering issues however...
>
>For scsi you have to complete the pending commands, you don't know what the
>transaction granularity is in some cases and half completing the sequence
>won't help you. In addition the upper layers have to queue additional scsi
>commands to do stuff like cd drawer locking and to ask the drive firmware
>to enter powerdown modes

Yes. SCSI is a problem as the current SCSI layer is tricky enough to
make this difficult. I believe that if the SCSI devices are children of
the SCSI controller, then they can take care of suspending the device
(that is, sending whatever command to lock the drawer or stop the disk
spinning) before the controller is actually going to sleep. In that
case, there's not much left to the controller; it isn't supposed to
have any command in queue nor receive any new one once all its child
drivers have suspended.

>> For USB, for example, we can consider that when a device driver
>> (not a controller driver) suspend has been done, any URB it submits
>> can just be dropped (returned immediately with an error). We don't
>> need blocking here neither. Of course, that means we have the
>> framework to call devices' suspend/resume callbacks when the
>> controller is about to go to sleep.
>
>That will scramble large numbers of devices. Randomly erroring pending block
>writes is -not- civilised.

The drivers will have to be adapted for PM, whatever scheme we use. The
USB case is very similar to SCSI. The controller is a parent of all
devices. Devices will get the suspend before the controller, they will
have a chance to suspend requests and wait for pending ones to complete
(for example, the USB storage will have a chance to block new incoming
requests in the queue and wait for pending ones to complete) before
the USB controller is put to suspend.
That way, once we are reaching real USB controller suspend, we can safely
discard urbs as we are not supposed to get any until devices have been
resumed.

I currently don't do that on pmac (I just let URBs be handled by OHCI
and let OHCI fill the TD queues in memory, I only prevent the controller
from actually handling those queues). It's not good as some drivers will
get error messages due to their underlying device being suspended,
and won't understand why the URBs are going away.

In the case of USB devices that don't support suspend state, it's slightly
more tricky as on some HW, I may have to actually turn them off. That means
the driver must deal with re-doing configuration & set-interface on
wakeup. That is again a driver matter, and a bit like the PCI case we
mentioned previously: we need some way to know if the driver for a given
device can handle the requested power state or not. If not, we should
probably abort the suspend sequence, and find a clean interface to tell
userland which device caused the failure so the user can deal with
it (rmmod the driver, for example).


Ben.


2001-10-24 10:48:12

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> case, there's not much left to the controller, it isn't supposed to
> have any command in queue nor receive any new one once all it's child
> drivers have suspended.

scsi devices are children of the scsi subsystem (sd, sg, sr, st, osst) not
of the controller. That is how the state flows anyway. Only sr/sd etc know
what the state is for a given device on power off as they may issue
multiple requests per action true transaction. sg would have to simply
refuse any suspend if open (think about cd-burning or even worse firmware
download)

So the scsi devices hang off sd, sr etc which in turn hang off scsi and
the controllers hang off scsi (and or the bus layers)

This one at least I think I do understand

2001-10-24 13:04:48

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>> case, there's not much left to the controller, it isn't supposed to
>> have any command in queue nor receive any new one once all it's child
>> drivers have suspended.
>
>scsi devices are children of the scsi subststem (sd, sg, sr, st, osst) not
>of the controller. That is how the state flows anyway. Only sr/sd etc know
>what the state is for a given device on power off as they may issue
>multiple requests per action true transaction. sg would have to simply
>refuse any suspend if open (think about cd-burning or even worse firmware
>download)
>
>So the scsi devices hang off sd, sr etc which in turn hang off scsi and
>the controllers hang off scsi (and or the bus layers)
>
>This one at least I think I do understand

The problem with subsystems is that they don't fit well in the
power tree. They aren't "devices" in the sense that they are
not exposing a struct device, and they span several controllers,
which means the dependency can quickly become unmanageable, especially
when SCSI starts being layered on top of USB or FireWire.

Also, the dependency issue is made worse if you let RAID enter into
the dance as, I believe, ultimately nothing would prevent a volume from
spanning several devices on different controllers or even different
controller types.

So let's see if I properly understand what is needed in the SCSI case:

The parent is the controller. We can't do much about this since we need
that relationship for ordering. The controller can be a real SCSI host,
but it can also be a virtual host exposed by a USB storage device or
a firewire SBP2 device.

The child of this controller has to be a struct device for each physical
device on the bus (just one in the case of a USB storage device). The struct
device for this child is +/- generic, possibly created by the generic
SCSI probe code.

This device might (must ?) have children instantiated by whatever "client"
attaches to a given SCSI device. Clients like sg would effectively refuse
suspend, while clients like sd would do standard disk spindown commands.
That means there is not "one" PM node for the SCSI subsystem, but one
per instance of a given subsystem module.

Now, I'm not sure what would happen with RAID. If we need to have logical
volumes be children of the sd "client", then we have to face the fact that
a given child may have multiple parents... welcome to the power graph !
But do we really need logical volumes to be part of the PM tree or
can blocking of requests at the sd layer be enough ? Remember we are
in pass2, we have already done memory allocation, we are supposed to
no longer swap nor do any disk/storage related activity.

A tricky issue indeed...


Ben.


2001-10-24 13:19:29

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> Now, I'm not sure what would happen with RAID. If we need to have logical
> volumes be child of the sd "client", then we have to face the fact that
> a given child may have multiple parents... welcome to the power graph !
> But do we really need logical volumes to be part of the PM tree or
> can blocking of requests at the sd layer be enough ? Remember we are
> in pass2, we have already done memory allocation, we are supposed to
> no longer swap nor do any disk/storage related activity.

Assuming you want to synchronize the raid before suspend - a reasonable
policy but not essential - then you'd have to shut down the raid before
sd, then sd would let the devices shut down, which lets the controller
shut down.

2001-10-24 15:18:53

by Jonathan Lundell

Subject: Re: [RFC] New Driver Model for 2.5

At 10:57 AM +0100 10/24/01, Alan Cox wrote:
> > the HW or not when getting a new request). In cases where a mid-layer
>> enters the scene, like SCSI, that wants to do timeouts, then well...
>> we can let it timeout (just stop processing requests), or we can
>> have the midlayer go to sleep as well :) That later solution
>> may cause some interesting ordering issues however...
>
>For scsi you have to complete the pending commands, you don't know what the
>transaction granularity is in some cases and half completing the sequence
>won't help you. In addition the upper layers have to queue additional scsi
>commands to do stuff like cd drawer locking and to ask the drive firmware
>to enter powerdown modes
>
>> For USB, for example, we can consider that when a device driver
>> (not a controller driver) suspend has been done, any URB it submits
>> can just be dropped (returned immediately with an error). We don't
>> need blocking here neither. Of course, that means we have the
>> framework to call devices' suspend/resume callbacks when the
> > controller is about to go to sleep.
>
>That will scramble large numbers of devices. Randomly erroring pending block
>writes is -not- civilised.

In our "extreme prejudice" suspend (this is in the context of masking
& recovering from a fault in a fault-tolerant machine) we have cases
in which completion of pending commands isn't possible. Our solution
is to issue a SCSI bus reset, and terminate all outstanding commands
with an appropriate (retryable) error. This is especially easy to
implement in drivers that use SCSI bus reset as a routine (though
last resort) error recovery mechanism, since the requisite logic is
already in place. Not pretty, I suppose, but effective.

One model we've considered (but haven't implemented yet) is to make
parents in the device tree responsible for suspending their children,
so the suspend propagates down the tree and each node "knows" how to
suspend its children, assuming any special action is required. So a
SCSI HBA, for example, would be asked by its bus parent to suspend,
and in turn would suspend its SCSI device children before suspending
itself. I'm not quite sure how virtual device layers like md would
fit into this scheme, since they can cut across device and power
hierarchies.
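
A minimal sketch of that parent-suspends-children model, as a recursive
walk in userland C. The node layout, the fixed-size child array and the
trace structure are all assumptions made for illustration; they are not
taken from any implementation mentioned in the thread.

```c
#include <stddef.h>

#define SKETCH_MAX_CHILDREN 4
#define SKETCH_TRACE_MAX 16

/* Hypothetical device-tree node. */
struct sketch_node {
    const char *name;
    struct sketch_node *children[SKETCH_MAX_CHILDREN];
    size_t nr_children;
    int suspended;
};

/* Records the order in which nodes were suspended. */
struct sketch_trace {
    const char *names[SKETCH_TRACE_MAX];
    size_t count;
};

/* Each node suspends all of its children first and only then itself,
 * so a parent (e.g. an HBA) is still awake while its children
 * (e.g. disks) issue their last commands. */
int sketch_suspend_subtree(struct sketch_node *node, struct sketch_trace *trace)
{
    for (size_t i = 0; i < node->nr_children; i++) {
        int err = sketch_suspend_subtree(node->children[i], trace);

        if (err)
            return err;     /* a child refused: leave the parent awake */
    }
    if (trace->count >= SKETCH_TRACE_MAX)
        return -1;
    node->suspended = 1;
    trace->names[trace->count++] = node->name;
    return 0;
}
```

For a bus -> hba -> disk chain this suspends the disk first and the bus
last. A virtual layer such as md has no single parent in this structure,
which is exactly the open question raised above.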

At 11:54 AM +0100 10/24/01, Alan Cox wrote:
>So the scsi devices hang off sd, sr etc which in turn hang off scsi and
>the controllers hang off scsi (and or the bus layers)

Our first implementation was under Solaris 2.x (SPARC) in which the
parent->child relationship is bus->hba->sd. scsi isn't in the tree;
it's more of an interface layer between hba & sd. fwiw.
--
/Jonathan Lundell.

2001-10-24 15:44:13

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Wed, 24 Oct 2001, Alan Cox wrote:
>
> That will scramble large numbers of devices. Randomly erroring pending block
> writes is -not- civilised.

Note that one thing in suspending the machine that has _nothing_ to do
with the actual device tree is that higher layers have to suspend whatever
it is they are doing anyway.

Ie part of the suspend action (which is unrelated to the driver model) is
to stop all regularly scheduled activity - not necessarily flushing all
dirty buffers, but certainly waiting for all pending IO. That's a much
higher level thing than the device though - the devices themselves should
never ever see this (except in the sense that they don't see new requests
coming in).

There are other "higher-level" issues: while a device "prepare to suspend"
call might block for some device information, that does not mean that it
can allocate memory with GFP_KERNEL, for example: when we shut off device
X, the disk may have been prepared for shutdown already, and the VM layer
cannot do any IO. So the suspend (and resume) functions have to use
GFP_NOIO for their allocations - _regardless_ of any other device issues.

So sure, there are tons of issues here, but none of them have, in my
opinion, anything to do with the device model itself. More just normal
implementation details.

Linus

2001-10-24 15:53:33

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> call might block for some device information, that does not mean that it
> can allocate memory with GFP_KERNEL, for example: when we shut off device
> X, the disk may have been prepared for shutdown already, and the VM layer
> cannot do any IO. So the suspend (and resume) function have to use
> GFP_NOIO for their allocations - _regardless_ of any other device issues.

So I have to write a whole extra set of code paths to duplicate normal
functionality during power off

> So sure, there are tons of issues here, but none of them have, in my
> opinion, anything to do with the device model itself. More just normal
> implementation details.

My concern is that we need to make the implementation details simple, eg
so that simple things like "save state" can be done before we get into
"no this, no that, no the other" situations. Also so that, for the many
drivers where freezing the system once we have irqs off is easier (a lot
of sound for example is easiest done by disable irq, disable dma engine,
copy registers, return), it can be done late and with small amounts of code.

2001-10-24 16:00:03

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Wed, 24 Oct 2001, Alan Cox wrote:
>
> > call might block for some device information, that does not mean that it
> > can allocate memory with GFP_KERNEL, for example: when we shut off device
> > X, the disk may have been prepared for shutdown already, and the VM layer
> > cannot do any IO. So the suspend (and resume) function have to use
> > GFP_NOIO for their allocations - _regardless_ of any other device issues.
>
> So I have to write a whole extra set of code paths to duplicate normal
> functionality during power off

If that ends up being a problem, we can just make alloc_pages turn off the
IO bits on suspend. Easy enough..
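
A minimal sketch of that idea in userland C: a global mask strips the IO
bits from every allocation request while a suspend is in progress, so
callers do not need separate code paths. The flag names and bit values
below are made up for illustration and are not the kernel's actual GFP
definitions; the function names are likewise assumptions.

```c
/* Hypothetical allocation flag bits (NOT the kernel's real values). */
#define SKETCH_GFP_WAIT   0x1u   /* caller may sleep */
#define SKETCH_GFP_IO     0x2u   /* allocator may start disk IO */
#define SKETCH_GFP_FS     0x4u   /* allocator may call into filesystems */

#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)
#define SKETCH_GFP_NOIO   (SKETCH_GFP_WAIT)

/* Bits that allocations are currently allowed to use. */
unsigned int sketch_gfp_allowed = ~0u;

/* Called when suspend begins: no allocation may trigger IO from now on. */
void sketch_suspend_begin(void)
{
    sketch_gfp_allowed = ~(SKETCH_GFP_IO | SKETCH_GFP_FS);
}

/* Called once resume has finished: restore normal behaviour. */
void sketch_resume_end(void)
{
    sketch_gfp_allowed = ~0u;
}

/* What the allocator would apply to every request before acting on it. */
unsigned int sketch_effective_gfp(unsigned int requested)
{
    return requested & sketch_gfp_allowed;
}
```

With this in place, a driver that asks for SKETCH_GFP_KERNEL during
suspend is quietly treated as if it had asked for SKETCH_GFP_NOIO.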

Although I think you're making the problem bigger than it is. Most of the
suspend stuff should not need any "normal functionality" at all.

Linus

2001-10-24 16:17:05

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Wed, 24 Oct 2001, Benjamin Herrenschmidt wrote:
> >
> >So the scsi devices hang off sd, sr etc which in turn hang off scsi and
> >the controllers hang off scsi (and or the bus layers)
> >
> >This one at least I think I do understand
>
> The problem with subsystems is that they don't fit well in the
> power tree. They aren't "devices" in that sense that they are
> not exposing a struct device, and they spawn over several controllers
> which means the dependency can quickly become unmanageable, especially
> when SCSI starts beeing layered on top of USB or FireWire.

Why would you _ever_ get "sg.c" and other crap involved in the suspend
process?

The device tree is for _device_ suspend, not for "subsystem suspend". The
SCSI subsystem is a piece of cr*p, but even if it was perfect it should
never get involved with the act of suspension.

We should not have pending IO, but that's for a totally different reason:
the first thing the much much MUCH higher levels of suspend should be
doing is to make sure that user apps are "quiescent". And that isn't done
by getting involved with sg.c or anything similar, but by basically
stopping all user apps (think of the equivalent of a "kill -STOP -1", but
done internally in the kernel without actually using a signal).

> Also, the dependency issue is made worst if you let RAID enter into
> the dance as I beleive ultimately, nothing would prevent a volume to
> spawn over several devices from different controllers or even different
> controller types.

Why would you get RAID involved? There is no _IO_ involved in suspending:
we just stop doing what we're doing, and leave it at that. We don't try to
flush state, we just freeze the machine.

The act of "suspend" should basically be: shut off the SCSI controller,
screw all devices, reset the bus on resume.

The act of suspend on USB should be to turn off the host controller and
remove power from devices. End of story. Nothing fancy.

If somebody removes a disk or equivalent while we're suspended, that's
_his_ problem, and is exactly the same as removing a disk while the disk
is running. Either the subsystem (like USB) already handles it, or it
doesn't. Suspend is _not_ an excuse to do anything that isn't done at
run-time.

So suspend is _not_ supposed to be equivalent of a full clean shutdown
with just users not seeing it. That's way too expensive to be practical.
Remember: the main point of suspend is to have a laptop go to sleep, and
come back up on the order of a few _seconds_.

And if there are desktops which would like to suspend but cannot because
they aren't strictly designed for it, then tough - we should not try to
design a heavy suspend for hardware that doesn't live with it well.

Also, realize that the act of suspension is STARTED BY THE USER. Which
means that before the kernel suspends, you _can_ have user programs that
basically take disk arrays off-line etc if that is what you want. But
that's not a kernel suspend issue.

Linus

2001-10-24 16:21:05

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Wed, 24 Oct 2001, Alan Cox wrote:
>
> Assuming you want to synchronize the raid before suspend - a reasonably
> policy but not essential then you'd have to shut down the raid before
> sd, then sd would let the devices shut down which lets the controller
> shutdown

I will _refuse_ to have a kernel suspend that synchronizes the raid etc.
That would make suspend/resume potentially take a _loong_ time.

If you want to synchronize your raid thing, make the user-level thing that
triggers the suspend do it. Same goes for things like "sync network
filesystems" etc. This is not a kernel level issue, and the kernel
shouldn't even try to do it.

If somebody has pending stuff over NFS and suspends, and when it comes
back it's not on the network any more, that is 100% equivalent to removing
a PCMCIA network card while running. It's supposed to work - but if you
lose data that's YOUR problem, not the kernel's.

Linus

2001-10-24 16:36:18

by Michael H. Warfield

Subject: Re: [RFC] New Driver Model for 2.5

On Wed, Oct 24, 2001 at 09:19:45AM -0700, Linus Torvalds wrote:

> On Wed, 24 Oct 2001, Alan Cox wrote:

> > Assuming you want to synchronize the raid before suspend - a reasonably
> > policy but not essential then you'd have to shut down the raid before
> > sd, then sd would let the devices shut down which lets the controller
> > shutdown

> I will _refuse_ to have a kernel suspend that synchronizes the raid etc.
> That would make suspend/resume potentially take a _loong_ time.

If you have Magic SysRq enabled, would that do the job prior
to suspend? Typically with Pavel's swsusp package, I hit the Alt-SysRq-s
before hitting Alt-SysRq-d to suspend him. Does Alt-SysRq-s synchronize
a raid? Of course, at that point, the choice to take the "_loong_ time"
is in user space - meat space, user space - since I chose to hit that
key combination.

> If you want to synchronize your raid thing, make the user-level thing that
> triggers the suspend do it. Same goes for things like "sync network
> filesystems" etc. This is not a kernel level issue, and the kernel
> shouldn't even try to do it.

What does the Alt-SysRq-s combination do about networks then?

> If somebody has pending stuff over NFS and suspends, and when it comes
> back it's not on the network any more, that is 100% equivalent to removing
> a PCMCIA network card while running. It's supposed to work - but if you
> lose data that's YOUR problem, not the kernels.

> Linus

Mike
--
Michael H. Warfield | (770) 985-6132 | [email protected]
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0xDF1DD471 | possible worlds. A pessimist is sure of it!

2001-10-24 16:46:27

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Wed, 24 Oct 2001, Michael H. Warfield wrote:
>
> > I will _refuse_ to have a kernel suspend that synchronizes the raid etc.
> > That would make suspend/resume potentially take a _loong_ time.
>
> If you have Magic SysRq enabled, would that do the job prior
> to suspend? Typically with Pavel's swsusp package, I hit the Alt-SysRq-s
> before hitting Alt-SysRq-d to suspend him. Does Alt-SysRq-s synchronize
> a raid? Of course, at that point, the choice to take the "_loong_ time"
> is in user space - meat space, user space - since I chose to hit that
> key combination.

Sure. I only refuse to have it be "integrated" into the suspend - but it's
certainly perfectly fine to have "combination events", whether by having
special keystrokes that starts them or by having scripts or programs that
first do the sync and then do "echo 3 > /proc/acpi/sleep" or whatever.
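
Such a user-level "combination event" could be sketched as a small C
helper that syncs first and then writes the sleep state. The function
name is hypothetical, the path is passed in by the caller rather than
hard-coded, and the value 3 with /proc/acpi/sleep is simply the example
quoted above; whether that file exists depends on the system.

```c
#include <stdio.h>
#include <unistd.h>

/* Flush dirty data from user space, then ask for the given sleep state
 * by writing it to sleep_path (e.g. "/proc/acpi/sleep").
 * Returns 0 on success, -1 if the file could not be written. */
int sketch_sync_then_sleep(const char *sleep_path, int state)
{
    FILE *f;

    sync();                         /* the slow part stays in user space */

    f = fopen(sleep_path, "w");
    if (!f)
        return -1;
    if (fprintf(f, "%d\n", state) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

The point of keeping this outside the kernel is that the policy (sync or
not, which state) is the caller's choice, and the kernel suspend path
itself never waits on the sync.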

> What does the Alt-SysRq-s combination do about networks then?

I think it just does a "fsync_dev()", which will do the right thing for
network filesystems too.

But let's make an example: let's assume that I'm working on my laptop, and
the NFS server goes down so I decide to take a break. Should I not be able
to suspend, only because the sync won't finish?

That's the wrong answer. By _default_ I should just suspend, and when I
come back it will continue to try to write back the data (not by any
magical suspend/resume means, but just because that's what NFS does anyway
when the server hasn't answered)

Linus

2001-10-24 16:52:47

by Xavier Bestel

Subject: Re: [RFC] New Driver Model for 2.5

On Wed 24-10-2001 at 18:15, Linus Torvalds wrote:
> Also, realize that the act of suspension is STARTED BY THE USER. Which

... or triggered by some kind of inactivity timer, or low battery
condition.

Xav

2001-10-24 16:58:17

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On 24 Oct 2001, Xavier Bestel wrote:
>
> On Wed 24-10-2001 at 18:15, Linus Torvalds wrote:
> > Also, realize that the act of suspension is STARTED BY THE USER. Which
>
> ... or triggered by some kind of inactivity timer, or low battery
> condition.

Note that even when that happens, it's not supposed to be the kernel that
_starts_ the activity of suspension.

An inactivity timer or low battery notification will just notify the
proper daemon, and the policy on what to do should be in user space. For
example, on low battery you might want to set up an X window warning the
user that the machine _will_ suspend in five seconds. And the kernel
certainly won't do that.

So as far as the kernel is concerned, a suspend is _always_ started by
"the user". Of course, the whole point with computers is that many things
can be automated, and "the user" may not be a human sitting at the
machine.

Linus

2001-10-24 16:57:37

by Patrick Mochel

Subject: Re: [RFC] New Driver Model for 2.5


On 24 Oct 2001, Xavier Bestel wrote:

> On Wed 24-10-2001 at 18:15, Linus Torvalds wrote:
> > Also, realize that the act of suspension is STARTED BY THE USER. Which
>
> ... or triggered by some kind of inactivity timer, or low battery
> condition.

...which should be done in userspace.

-pat

2001-10-24 17:02:47

by Mike Anderson

Subject: Re: [RFC] New Driver Model for 2.5

Benjamin Herrenschmidt [[email protected]] wrote:
> Now, I'm not sure what would happen with RAID. If we need to have logical
> volumes be child of the sd "client", then we have to face the fact that
> a given child may have multiple parents... welcome to the power graph !

You do not have to add RAID to need to worry about multiple parents. If we
want to correctly represent devices that have multiple paths (e.g. twin-
tailed SCSI, fibre channel, multi-ported devices, etc.) we should have a
solution to handle this. Some OSes have moved to directed graphs to
address the multiple parent issue. Exposing only one block / character
device per real physical device would reduce OS resources (major /
minors, structs) and provide a single request queue.

The current model of a scsi_device having a single parent and being
attached to the scsi_host host_queue has made adding multi-path support
to Linux below the SCSI lower level driver difficult.
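
A minimal sketch of the directed-graph idea, assuming a toy node type (struct pnode and both functions are hypothetical, not existing kernel structures). A device reachable through two host adapters is linked under both and is still suspended only once:

```c
#include <stddef.h>

#define MAX_LINKS 4

struct pnode {
    const char *name;
    struct pnode *children[MAX_LINKS];
    size_t nchildren;
    int suspended;
};

/* Link child under parent.  The same child may be linked under several
 * parents, which turns the tree into a directed graph. */
int pnode_link(struct pnode *parent, struct pnode *child)
{
    if (parent->nchildren == MAX_LINKS)
        return -1;
    parent->children[parent->nchildren++] = child;
    return 0;
}

/* Suspend everything below n, then n itself.  A node that was already
 * suspended through another path is skipped.  Returns the number of
 * nodes newly suspended by this call. */
int pnode_suspend(struct pnode *n)
{
    int count = 0;
    size_t i;

    if (n->suspended)
        return 0;
    for (i = 0; i < n->nchildren; i++)
        count += pnode_suspend(n->children[i]);
    n->suspended = 1;
    return count + 1;
}
```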

> But do we really need logical volumes to be part of the PM tree or
> can blocking of requests at the sd layer be enough ? Remember we are
> in pass2, we have already done memory allocation, we are supposed to
> no longer swap nor do any disk/storage related activity.
>
> A tricky issue indeed...
>
>
> Ben.

-Mike
--
Michael Anderson
[email protected]

2001-10-24 17:34:05

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>Why would you _ever_ get "sg.c" and other crap involved in the suspend
>process?

Ahhh... ;)

>The device tree is for _device_ suspend, not for "subsystem suspend". The
>SCSI subsystem is a piece of cr*p, but even if it was perfect it should
>never get involved with the act of suspension.

I agree I'd like subsystems to avoid polluting the PM tree (or device tree).
If there are a few cases where a subsystem needs to know a driver it's using
is asleep, it's probably up to the interface of this subsystem to provide
a function to be called by the driver when it's going into suspend mode.

I can't really tell how it would be done for SCSI. I admit I quickly got
lost in the current drivers/scsi midlayer trying to figure out how
it layered things, but Andre told me it should be rewritten. I believe
the proper way here would be to have an _interface_ (no mid-layer)
exposed by drivers that can receive SCSI commands (whether they are USB,
ATAPI, FireWire or real SCSI devices).

However, Alan makes a point about state information. For example, the
"generic" CD-ROM driver for the SCSI protocol (those CD-ROMs being either
ATAPI, real SCSI, ...) would be the only one to know some state information
like rotation speed or whatever configuration info a given device may
support (placing ourselves in an ideal world of course, where all CDs
using the SCSI protocol share the same driver). So it must be involved
in the PM process in some way.

I don't know what people plan for the SCSI & ATAPI overhaul in 2.5, but
how that scheme is done will impact the PM process.

>We should not have pending IO, but that's for a totally different reason:
>the first thing the much much MUCH higher levels of suspend should be
>doing is to make sure that user apps are "quiescent". And that isn't done
>by getting involved with sg.c or anything similar, but by basically
>stopping all user apps (think of the equivalent of a "kill -STOP -1", but
>done internally in the kernel without actually using a signal).

Is this necessary? It would definitely make things easier to implement
if we could forget about incoming requests, but I'm not sure it's the
right way. In fact, does it really work? You could well have in-kernel
threads triggering IOs or launching new userland processes (what happens
if a not yet suspended driver for, let's say, USB sees a new device coming
in and starts /sbin/hotplug?). Some filesystems have garbage collector
threads that may do IOs. There are various kinds of IO that can in fact
be triggered entirely within the kernel.

Also, if you don't stop userland, wakeup is amazingly fast since
userland is up very soon: a sound app (an mp3 player for example) will
be blocked until the driver it's write()'ing to is woken up,
a swapping app is blocked until the IO request for that page is
completed (which means, if it was blocked in a block driver queue,
once the block driver has resumed operations), etc...
Currently, on pmacs, I don't have the tree but I do have some
kind of priority mechanism. Userland is woken up as soon as the
disk is ready. I think scheduling only has to be disabled between
steps 2 and 3 of the sleep process, that is after all IOs have been
blocked and before shutting down devices (that is before the step
that runs with IRQs off).

I really don't think it's _that_ difficult to properly do this blocking.
For things like sound drivers, a simple semaphore is plenty. For
network drivers, stopping the output queue is a one-line thing;
for block devices, it's usually a matter of marking ourselves busy and
letting requests pile up on our queue. There should be no risk of blocking
the sleep process itself this way since we made sure we did all allocations
needed by drivers prior to starting the actual blocking of requests, so we
won't block swap out.
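
A rough sketch of the block device case, assuming a toy request queue (struct blkq and its functions are hypothetical and far simpler than a real block driver). While the device is marked busy, submitted requests simply wait in the queue and are dispatched on resume:

```c
#include <stddef.h>

#define QLEN 16

struct blkq {
    int pending[QLEN];  /* FIFO ring of request ids waiting for the hardware */
    size_t head, count;
    int busy;           /* set while the device is going to sleep or asleep */
    int dispatched;     /* how many requests reached the "hardware" */
    int last_id;        /* id of the most recently dispatched request */
};

void blkq_init(struct blkq *q)
{
    q->head = q->count = 0;
    q->busy = 0;
    q->dispatched = 0;
    q->last_id = -1;
}

/* Hand queued requests to the hardware, oldest first, unless busy. */
static void blkq_run(struct blkq *q)
{
    while (!q->busy && q->count > 0) {
        q->last_id = q->pending[q->head];
        q->head = (q->head + 1) % QLEN;
        q->count--;
        q->dispatched++;
    }
}

/* Callers get no error just because the device sleeps; the request waits. */
int blkq_submit(struct blkq *q, int id)
{
    if (q->count == QLEN)
        return -1;                      /* queue full */
    q->pending[(q->head + q->count) % QLEN] = id;
    q->count++;
    blkq_run(q);
    return 0;
}

void blkq_suspend(struct blkq *q)
{
    q->busy = 1;
}

void blkq_resume(struct blkq *q)
{
    q->busy = 0;
    blkq_run(q);
}
```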

Userland apps that hack on hardware directly, like X, will have been
suspended earlier (possibly via the existing /dev/apm_bios interface,
but we should rather define a new, cleaner way to have PM control
from userland).

I have all of this more or less working on pmac laptops. I don't have
the new device model, so I handle dependencies manually with a priority
mechanism, but it's already good enough to let me resume userland before
ADB and sound, and possibly other stuff. (Which is nice since my sound
chips usually need one or two seconds to recalibrate and ADB needs a few
seconds to probe the bus; all this happens asynchronously.)

>> Also, the dependency issue is made worse if you let RAID enter into
>> the dance, as I believe ultimately nothing would prevent a volume from
>> spanning several devices from different controllers or even different
>> controller types.
>
>Why would you get RAID involved? There is no _IO_ involved in suspending:
>we just stop doing what we're doing, and leave it at that. We don't try to
>flush state, we just freeze the machine.

Ok.

>The act of "suspend" should basically be: shut off the SCSI controller,
>screw all devices, reset the bus on resume.

Well, some devices will need some state saving. You may need to save and
restore some speed setting for example. You want pending requests to
be properly terminated before you shut down, etc...

>The act of suspend on USB should be to turn off the host controller and
>remove power from devices. End of story. Nothing fancy.

Ehh... no, you want to put the USB bus into suspend state, which isn't
the same ;) Or you lose the ability to wake up the machine from the
USB keyboard, which on some iMacs is I think, the only way.

So before doing so, you need to iterate over all children of the controller
(that's fine, they have struct device entries with PM functions in them),
and tell them to suspend. Some devices need to be sent a command for
enabling remote wakeup. Others may need other housekeeping before the
bus goes to suspend.

But there's nothing magic there, nor anything fancy.
USB drivers just have a struct device for each USB device, and are
notified of suspend normally as part of the device tree walk.
They are children of the USB controller, so we are guaranteed the
controller won't go down before the devices.
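
A small sketch of that walk, assuming a toy struct dev (the type, its fields and the kbd_suspend hook are hypothetical, not the struct device from the proposal). Children are suspended before their parent, so a child hook can still talk over the bus:

```c
#include <stddef.h>

struct dev {
    const char *name;
    struct dev *parent, *child, *sibling;
    int (*suspend)(struct dev *);   /* optional PM hook */
    int asleep;
    int wakeup_armed;
};

/* Example hook: arming remote wakeup needs the bus (the parent) awake. */
int kbd_suspend(struct dev *d)
{
    if (d->parent && d->parent->asleep)
        return -1;
    d->wakeup_armed = 1;
    return 0;
}

/* Post-order walk: every child goes to sleep before its parent does.
 * Stops at the first hook that reports an error. */
int dev_tree_suspend(struct dev *d)
{
    struct dev *c;
    int err;

    for (c = d->child; c != NULL; c = c->sibling) {
        err = dev_tree_suspend(c);
        if (err)
            return err;
    }
    if (d->suspend) {
        err = d->suspend(d);
        if (err)
            return err;
    }
    d->asleep = 1;
    return 0;
}
```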

I agree that if this is properly done, then we know the controller
won't have to deal with pending or new incoming IOs once it's its turn
to go to sleep, and can just enter its suspend state immediately.

>If somebody removes a disk or equivalent while we're suspended, that's
>_his_ problem, and is exactly the same as removing a disk while the disk
>is running. Either the subsystem (like USB) already handles it, or it
>doesn't. Suspend is _not_ an excuse to do anything that isn't done at
>run-time.

Yup.

>So suspend is _not_ supposed to be equivalent of a full clean shutdown
>with just users not seeing it. That's way too expensive to be practical.
>Remember: the main point of suspend is to have a laptop go to sleep, and
>come back up on the order of a few _seconds_.

Yup.

>And if there are desktops which would like to suspend but cannot because
>they aren't strictly designed for it, then tough - we should not try to
>design a heavy suspend for hardware that doesn't live with it well.
>
>Also, realize that the act of suspension is STARTED BY THE USER. Which
>means that before the kernel suspends, you _can_ have user programs that
>basically take disk arrays off-line etc if that is what you want. But
>that's not a kernel suspend issue.

Yes. We do that on pmac as well. The lid of the laptop is monitored by
a userland daemon, which runs scripts before and after sending the real
suspend ioctl to our PM driver.

There is also /dev/apm_bios, which allows the suspend process to
notify (and wait for an ack from) userland apps that requested it (in our
case, X, so it stops banging the HW, properly saves its own state and
properly resumes the card on wakeup).

However, that interface could as well be completely userland (X
requesting notifications from the PM daemon).

Ben.

2001-10-24 17:56:35

by Andrew Grover

Subject: RE: [RFC] New Driver Model for 2.5

> From: Benjamin Herrenschmidt [mailto:[email protected]]
> I have all of this more or less working on pmac laptops. I don't have
> the new device model, so I handle dependencies manually with a priority
> mechanism, but it's already good enough to let me resume userland before
> ADB and sound, and possibly other stuff. (Which is nice since my sound
> chips usually need one or two seconds to recalibrate and ADB needs a few
> seconds to probe the bus; all this happens asynchronously.)

Awesome.

So non-i386 archs do not have the problem with the video BIOS having to run
on resume, or did you have to handle this somehow?

Regards -- Andy

2001-10-24 18:45:36

by Benjamin Herrenschmidt

Subject: RE: [RFC] New Driver Model for 2.5

>
>Awesome.
>
>So non i386 archs do not have the problem with the video bios having to run
>on resume, or did you have to handle this somehow?

Fortunately, Mac laptops don't shut the chip down, the PM microcontroller will
just suspend the clock to it. fbdev's are mandatory on macs, and so we use the
fbdev for mach64 or r128 (the 2 types of chips you find on mac laptops, radeon
is coming soon however) to save a few things and put the chip in D2 mode
(or vendor specific suspend mode for mach64).

The problem does exist with Mac desktops, as they power off the PCI and AGP
slots. That's the main reason why I don't add support for those in Linux
currently. We need some way to revive the card, which can be done either
with a chip-specific init sequence (in the fbdev), a small Forth emulator
with enough of an Open Firmware environment to run the OF driver for the
card, or a shell to run the MacOS driver for the card. All of these
solutions are tricky, however.

Ben.


2001-10-24 22:39:14

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> So as far as the kernel is concerned, a suspend is _always_ started by
> "the user". Of course, the whole point with computers is that many things
> can be automated, and "the user" may not be a human sitting at the
> machine.

How does that apply to the equivalent of an APM critical shutdown - do we
still vector that via userspace ?

2001-10-24 22:35:24

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> >The device tree is for _device_ suspend, not for "subsystem suspend". The
> >SCSI subsystem is a piece of cr*p, but even if it was perfect it should
> >never get involved with the act of suspension.
>
> I agree I'd like subsystems to avoid polluting the PM tree (or device tree).
> If there are a few cases where a subsystem needs to know a driver it's using
> is asleep, it's probably up to the interface of this subsystem to provide
> a function to be called by the driver when it's going to suspend mode.

I don't think it is a big problem. We can add virtual nodes. The way I
see it we either
a) put in grungy subsystem hacks
b) register virtual device nodes for subsystems when needed

b feels cleaner

> I really don't think it's _that_ difficult to properly do this blocking.
> For things like sound drivers, a simple semaphore is plenty enough. For

Sound is more easily handled by not blocking user space but waiting until
the final IRQ off moment and grabbing the registers. That avoids a lot
of ugly locking gunge. It literally comes down to

    case suspending
        kmalloc buffer
        done
    case final suspend point
        turn off DMA
        readl
        readl
        readl
        readl
        ...
        done
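
The pseudocode above could look roughly like this in C. This is a user-space sketch under stated assumptions: struct snd_chip, NREGS, the phase names and the resume case are hypothetical, malloc stands in for kmalloc, and plain memory reads stand in for readl:

```c
#include <stdint.h>
#include <stdlib.h>

#define NREGS 4

struct snd_chip {
    volatile uint32_t *regs;  /* stand-in for the chip's MMIO registers */
    uint32_t *saved;          /* register snapshot taken at the last moment */
    int dma_on;
};

enum pm_phase { PM_SUSPENDING, PM_FINAL_SUSPEND, PM_RESUME };

int snd_pm_event(struct snd_chip *c, enum pm_phase phase)
{
    int i;

    switch (phase) {
    case PM_SUSPENDING:         /* early phase: allocation is still allowed */
        c->saved = malloc(NREGS * sizeof *c->saved);
        return c->saved ? 0 : -1;
    case PM_FINAL_SUSPEND:      /* IRQs off: no allocation, no locking */
        c->dma_on = 0;
        for (i = 0; i < NREGS; i++)
            c->saved[i] = c->regs[i];
        return 0;
    case PM_RESUME:             /* assumed mirror image of the save */
        for (i = 0; i < NREGS; i++)
            c->regs[i] = c->saved[i];
        free(c->saved);
        c->saved = NULL;
        c->dma_on = 1;
        return 0;
    }
    return -1;
}
```

User space is never blocked here; the only state shared between the two phases is the buffer allocated up front.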

2001-10-24 22:42:14

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> If you want to synchronize your raid thing, make the user-level thing that
> triggers the suspend do it. Same goes for things like "sync network
> filesystems" etc. This is not a kernel level issue, and the kernel
> shouldn't even try to do it.

Makes good sense - I agree

2001-10-24 22:44:34

by Alan

Subject: Re: [RFC] New Driver Model for 2.5

> Why would you _ever_ get "sg.c" and other crap involved in the suspend
> process?
>
> The device tree is for _device_ suspend, not for "subsystem suspend". The
> SCSI subsystem is a piece of cr*p, but even if it was perfect it should
> never get involved with the act of suspension.

Well I don't want my laptop to suspend during a CD burn or firmware update.
The device itself doesn't know anything about how busy it is since it's
just sending packets; only the subsystem driver controlling it does.

> by getting involved with sg.c or anything similar, but by basically
> stopping all user apps (think of the equivalent of a "kill -STOP -1", but
> done internally in the kernel without actually using a signal).

Stopping all user apps really tends to ruin the cd and the firmware
update.

> Remember: the main point of suspend is to have a laptop go to sleep, and
> come back up on the order of a few _seconds_.

It also has to avoid unpleasant situations

> Also, realize that the act of suspension is STARTED BY THE USER. Which
> means that before the kernel suspends, you _can_ have user programs that
> basically take disk arrays off-line etc if that is what you want. But
> that's not a kernel suspend issue.

There are certain practicalities here with trying to make user space dig
around in fuser innards or patching every cd burner. The sg layer is one
that has to get involved (be it as a driver call back or a virtual driver)

2001-10-24 22:44:44

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Wed, 24 Oct 2001, Alan Cox wrote:
>
> I don't think it is a big problem. We can add virtual nodes. The way I
> see it we either
> a) put in grungy subsystem hacks
> b) register virtual device nodes for subsystems when needed
>
> b feels cleaner

I agree. I would personally see us using _more_ "virtual device node"
things already: right now we have things like SuperIO chips that contain
both a serial line and a parallel port (and...), and some drivers do
really ugly things with them - keep them as one "struct pci_dev", and then
have two drivers sharing the device.

It would be much cleaner to have _one_ driver for such SuperIO chips (a
"multinode" driver), which just creates two virtual pci_dev structures,
and lets the regular serial driver handle the "virtual serial device" etc.

That has the advantage of:
- not needing special hacks in various serial/parallel drivers
- the devices show up naturally and logically in whatever user mode
  "device manager" tree

So the device nodes do not have to match the physical tree. The physical
device tree only sets up the initial physical scanning, and obviously
limits _reality_ ;)
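
A minimal sketch of the "multinode" idea, assuming toy types (struct vdev, struct driver, struct superio and both functions are hypothetical; the real thing would create pci_dev structures). One driver owns the physical chip and publishes a virtual node per function, and ordinary drivers bind to those nodes by kind:

```c
#include <stddef.h>
#include <string.h>

struct vdev {
    const char *kind;          /* what the node looks like to drivers */
    struct vdev *parent;
    const char *bound_driver;  /* NULL until a driver claims the node */
};

struct driver {
    const char *name;
    const char *kind;          /* kind of node this driver handles */
};

struct superio {
    struct vdev self;          /* the one physical device */
    struct vdev fn[2];         /* virtual serial and parallel port nodes */
};

/* The multinode driver claims the chip and creates one virtual node
 * per function, each a child of the physical device. */
void superio_probe(struct superio *s)
{
    s->self.kind = "superio";
    s->self.parent = NULL;
    s->self.bound_driver = "multinode";
    s->fn[0].kind = "serial";
    s->fn[1].kind = "parport";
    s->fn[0].parent = s->fn[1].parent = &s->self;
    s->fn[0].bound_driver = s->fn[1].bound_driver = NULL;
}

/* Generic matching: a driver binds to any unclaimed node of its kind,
 * with no knowledge of the SuperIO chip behind it.  Returns 1 if bound. */
int driver_bind(const struct driver *drv, struct vdev *d)
{
    if (d->bound_driver != NULL || strcmp(drv->kind, d->kind) != 0)
        return 0;
    d->bound_driver = drv->name;
    return 1;
}
```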

Linus

2001-10-25 04:15:42

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On Wed, 24 Oct 2001, Alan Cox wrote:
>
> Well I don't want my laptop to suspend during a CD burn or firmware update.
> The device itself doesn't know anything about how busy it is since it's
> just sending packets; only the subsystem driver controlling it does.

But that's _your_ problem. Not the kernel's.

If you have an ACPI daemon that decides to make the machine go to sleep
while burning a CD, that has nothing to do with the kernel at all.

It has nothing to do with sg.c either, for that matter.

> > Remember: the main point of suspend is to have a laptop go to sleep, and
> > come back up on the order of a few _seconds_.
>
> It also has to avoid unpleasant situations

Absolutely NOT.

The kernel does not set policy. If the user says "suspend now", then we
suspend now. Whether a CD burn or anything else is going on is totally
irrelevant.

> There are certain practicalities here with trying to make user space dig
> around in fuser innards or patching every cd burner. The sg layer is one
> that has to get involved (be it as a driver call back or a virtual driver)

Not a way in hell. If the sg layer wants to export a "/proc/sgbusy",
that's its problem.

But if I say "suspend", and the kernel refuses, I will kill the offending
piece of crap from sg.c before you can blink an eye.

Linus

2001-10-25 07:58:48

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>> I really don't think it's _that_ difficult to properly do this blocking.
>> For things like sound drivers, a simple semaphore is plenty enough. For
>
>Sound is more easily handled by not blocking user space but waiting until
>the final IRQ off moment and grabbing the registers. That avoids a lot
>of ugly locking gunge. It literally comes down to

My point about using a semaphore was to avoid getting mixer ioctls
banging the HW while it is shut down.

Ben.


2001-10-25 08:04:19

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>
>I don't think it is a big problem. We can add virtual nodes. The way I
>see it we either
> a) put in grungy subsystem hacks
> b) register virtual device nodes for subsystems when needed
>
>b feels cleaner

Ok, provided that there is not one big "SCSI subsystem" virtual node,
which would screw up the entire dep. hierarchy, but rather virtual nodes
created by SCSI "clients" on the fly as children of their devices. That
looks fine to me ;)

An example would be:

* SCSI host
|
|
* SCSI disk device
|
|
* "sd" node

In this case, "sg" could add itself when opened, and possibly cause
sleep requests to be rejected, for example.

Well, in fact, I don't think there is real need for the "SCSI disk device"
node, but that depends pretty much on the new SCSI architecture.

Ben.


2001-10-25 08:09:19

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>In this case, "sg" could add itself when opened, and possibly cause
>sleep requests to be rejected, for example.

Well, looks like Linus won't let this one pass ;) A /proc/sgbusy would
possibly be ok, but I'd rather start defining a proper interface
to the PM daemon in userland for apps to request that the machine
doesn't go to sleep. That would be used, among other things, by
CD burners & firmware updaters. No need to hack thousands of apps;
I believe if we get a patch implementing support for that in cdrecord,
then all burner software will magically start getting it ;)
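
A rough sketch of what the daemon side of such an interface could track, assuming a simple counter (struct pm_daemon and the three functions are hypothetical; no such interface exists yet). Automatic sleep is deferred while any app holds a request, and an explicit "suspend now" from the user still goes through:

```c
struct pm_daemon {
    int inhibitors;   /* outstanding "please don't sleep" requests */
};

/* An app such as a CD burner asks the daemon to hold off sleeping.
 * Returns the number of outstanding requests. */
int pm_inhibit(struct pm_daemon *d)
{
    return ++d->inhibitors;
}

/* The app is done; an unbalanced release is ignored.
 * Returns the number of outstanding requests. */
int pm_release(struct pm_daemon *d)
{
    if (d->inhibitors > 0)
        d->inhibitors--;
    return d->inhibitors;
}

/* Policy check made by the daemon before it asks the kernel to suspend.
 * forced is nonzero for an explicit user request, which always wins. */
int pm_may_sleep(const struct pm_daemon *d, int forced)
{
    return forced || d->inhibitors == 0;
}
```

The whole decision stays in userland, so the kernel never has to refuse a suspend.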

Ben.




2001-10-25 08:27:15

by Rob Turk

Subject: Re: [RFC] New Driver Model for 2.5

"Linus Torvalds" <[email protected]> wrote in message
news:cistron.Pine.LNX.4.33.0110240901350.8049-100000@penguin.transmeta.com...
>
> On Wed, 24 Oct 2001, Benjamin Herrenschmidt wrote:
> > >
> > >So the scsi devices hang off sd, sr etc which in turn hang off scsi and
> > >the controllers hang off scsi (and or the bus layers)
> > >
> > >This one at least I think I do understand
> >
> > The problem with subsystems is that they don't fit well in the
> > power tree. They aren't "devices" in the sense that they are
> > not exposing a struct device, and they span several controllers,
> > which means the dependency can quickly become unmanageable, especially
> > when SCSI starts being layered on top of USB or FireWire.
>
> Why would you _ever_ get "sg.c" and other crap involved in the suspend
> process?
>
> The device tree is for _device_ suspend, not for "subsystem suspend". The
> SCSI subsystem is a piece of cr*p, but even if it was perfect it should
> never get involved with the act of suspension.
>
> We should not have pending IO, but that's for a totally different reason:
> the first thing the much much MUCH higher levels of suspend should be
> doing is to make sure that user apps are "quiescent". And that isn't done
> by getting involved with sg.c or anything similar, but by basically
> stopping all user apps (think of the equivalent of a "kill -STOP -1", but
> done internally in the kernel without actually using a signal).
>
> > Also, the dependency issue is made worse if you let RAID enter into
> > the dance, as I believe ultimately nothing would prevent a volume from
> > spanning several devices from different controllers or even different
> > controller types.
>
> Why would you get RAID involved? There is no _IO_ involved in suspending:
> we just stop doing what we're doing, and leave it at that. We don't try to
> flush state, we just freeze the machine.
>
> The act of "suspend" should basically be: shut off the SCSI controller,
> screw all devices, reset the bus on resume.
>

Doing so will create havoc on sequential devices, such as tape drives. If
your system simply suspends, then all is well. Any data that isn't flushed
yet is buffered inside the tapedrive. But when the system resumes and resets
the SCSI bus, it will cause all data in the tape drive to be lost, and for
most tape systems it will also re-position them at LBOT. Any running
tar/dump/whatever tape process would not survive such a suspend-resume
cycle.

Another more subtle issue is state information that exists between the SCSI
controller and the target devices. At some point they might have negotiated
synchronous and/or wide transfer parameters. This information must be
preserved, or you'll observe lockups, data corruption and the like. Since
these parameters are maintained at the lowest driver level, they should know
about suspend. The low-level driver must know to re-negotiate these
parameters when it comes back to life.
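
A minimal sketch of that bookkeeping, assuming a bus reset on resume drops every target back to asynchronous, narrow transfers. struct hba, struct xfer_params and both functions are hypothetical, and the actual message exchange with the target is only marked by a comment:

```c
#include <string.h>

#define MAX_TARGETS 8

struct xfer_params {
    unsigned char period;   /* synchronous period, 0 = asynchronous */
    unsigned char offset;   /* synchronous offset, 0 = asynchronous */
    unsigned char wide;     /* 0 = narrow, 1 = wide */
};

struct hba {
    struct xfer_params goal[MAX_TARGETS];  /* what was agreed before sleep */
    struct xfer_params cur[MAX_TARGETS];   /* what the bus is using now */
    int renegotiations;
};

/* Resume: the agreed parameters survive in goal[], but the bus itself
 * is assumed to be back at async/narrow for every target. */
void hba_resume(struct hba *h)
{
    memset(h->cur, 0, sizeof h->cur);
}

/* Called before each command: renegotiate lazily, once per target. */
void hba_prepare_cmd(struct hba *h, int target)
{
    if (memcmp(&h->cur[target], &h->goal[target],
               sizeof h->cur[target]) != 0) {
        /* a real driver would exchange negotiation messages here */
        h->cur[target] = h->goal[target];
        h->renegotiations++;
    }
}
```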

Rob





2001-10-25 09:15:49

by ebiederman

Subject: Re: [RFC] New Driver Model for 2.5

Benjamin Herrenschmidt <[email protected]> writes:

> >> case, there's not much left to the controller, it isn't supposed to
> >> have any command in queue nor receive any new one once all its child
> >> drivers have suspended.
> >
> >scsi devices are children of the scsi subsystem (sd, sg, sr, st, osst) not
> >of the controller. That is how the state flows anyway. Only sr/sd etc know
> >what the state is for a given device on power off as they may issue
> >multiple requests per action true transaction. sg would have to simply
> >refuse any suspend if open (think about cd-burning or even worse firmware
> >download)
> >
> >So the scsi devices hang off sd, sr etc which in turn hang off scsi and
> >the controllers hang off scsi (and or the bus layers)
> >
> >This one at least I think I do understand
>
> The problem with subsystems is that they don't fit well in the
> power tree. They aren't "devices" in the sense that they are
> not exposing a struct device, and they span several controllers,
> which means the dependency can quickly become unmanageable, especially
> when SCSI starts being layered on top of USB or FireWire.
>
> Also, the dependency issue is made worse if you let RAID enter into
> the dance, as I believe ultimately nothing would prevent a volume from
> spanning several devices from different controllers or even different
> controller types.

On the dependency case for x86 I have a fun common example.
To shut off the cpu, or the whole motherboard I need to talk to the
southbridge. To talk to the southbridge, I need to talk to the northbridge.

So at least to some extent shutting down busses is a really different
case from shutting down devices. And only in some cases can a tree
model it at all.

Equally fun are temperature monitors that appear on both the lpc/isa bus
and the i2c bus.

Or another fun common one. To shut down the interrupt controller, I first
need to shut down every device that thinks it can generate interrupts.
But my interrupt controller is way out on my pci->isa bridge. So I
can't shut that device down.

Sorry this whole device tree idea for shutdown ordering doesn't seem
to match my idea of reality.

Now I need to take a little time out and see what the code that is
being discussed will actually do about situations like the above.

> A tricky issue indeed...

Agreed.

Eric

2001-10-25 09:31:44

by Linus Torvalds

Subject: Re: [RFC] New Driver Model for 2.5


On 25 Oct 2001, Eric W. Biederman wrote:
>
> Or another fun common one. To shut down the interrupt controller, I first
> need to shut down every device that thinks it can generate interrupts.
> But my interrupt controller is way out on my pci->isa bridge. So I
> can't shut that device down.
>
> Sorry this whole device tree idea for shutdown ordering doesn't seem
> to match my idea of reality.

Your _examples_ do not match any reality.

Don't worry about things like the CPU shutdown: you have to have special
code for it anyway.

Let's face it, the device tree is for _devices_. It's for shutting down a
network card before we shut down the PCI bridge that is in front of it.

The issue of "core shutdown" is not covered - and isn't _meant_ to be
covered. That's the problem of the architecture-specific code. There is no
point in having a device tree for that, because it's going to be very much
architecture-specific anyway (ie on x86 we may have to just blindly trust
some silly ACPI table data etc).

Linus

2001-10-25 09:48:47

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>> Or another fun common one. To shut down the interrupt controller, I first
>> need to shut down every device that thinks it can generate interrupts.
>> But my interrupt controller is way out on my pci->isa bridge. So I
>> can't shut that device down.
>>
>> Sorry this whole device tree idea for shutdown ordering doesn't seem
>> to match my idea of reality.
>
>Your _examples_ do not match any reality.
>
>Don't worry about things like the CPU shutdown: you have to have special
>code for it anyway.
>
>Let's face it, the device tree is for _devices_. It's for shutting down a
>network card before we shut down the PCI bridge that is in front of it.
>
>The issue of "core shutdown" is not covered - and isn't _meant_ to be
>covered. That's the problem of the architecture-specific code. There is no
>point in having a device tree for that, because it's going to be very much
>architecture-specific anyway (ie on x86 we may have to just blindly trust
>some silly ACPI table data etc).

Definitely. I have similar issues with pmacs: clock generation
and the interrupt controller are in Apple's mac-io ASIC, which is on PCI,
so this ASIC can't be part of the normal PM tree and has to be handled
as part of the "core" PM code. This kind of issue will still happen; the
new scheme won't "magically" make PM work on every single laptop out
there. There will still be some corner cases to deal with, but at least
these will be limited to real corner cases and most "normal" drivers
will fit in the new, saner mechanism.

Ben.


2001-10-25 10:02:27

by Benjamin Herrenschmidt

Subject: Re: [RFC] New Driver Model for 2.5

>
>Doing so will create havoc on sequential devices, such as tape drives. If
>your system simply suspends, then all is well. Any data that isn't flushed
>yet is buffered inside the tapedrive. But when the system resumes and resets
>the SCSI bus, it will cause all data in the tape drive to be lost, and for
>most tape systems it will also re-position them at LBOT. Any running
>tar/dump/whatever tape process would not survive such a suspend-resume
>cycle.
>
>Another more subtle issue is state information that exists between the SCSI
>controller and the target devices. At some point they might have negotiated
>synchronous and/or wide transfer parameters. This information must be
>preserved, or you'll observe lockups, data corruption and the like. Since
>these parameters are maintained at the lowest driver level, they should know
>about suspend. The low-level driver must know to re-negotiate these
>parameters when it comes back to life.

This can be handled by having st (or sd, or whatever "client driver" decides
to take over a SCSI device) register a struct device node that is a child
of the actual SCSI device.

In fact, I'm wondering if we need a struct device node at all for the
SCSI device on the bus. The SCSI controller (or USB storage or
FireWire SBP2) will expose SCSI devices, that is "interfaces" to
which you can feed SCSI requests, but do those really need to have
a struct device associated? One possibility would be to only do
so once attached to a "client" driver like st, sd, sg, ...
The "client" would then create that structure.

But...

If we still want "unclaimed" devices to have a representation in the
device tree (because, for example, userland wants to know about them,
possibly in order to instantiate an sg driver), then we could have
the SCSI subsystem create a simple skeleton struct device when the
devices are probed, and have the client driver just populate this with
more info & PM hooks once attached to the device.
I don't think there's a need to have 2 struct devices stacked.

But it's mostly a matter of taste ;)

Thinking more about it, I think I prefer the second solution. That is
to have SCSI create a "standard" struct device for all devices probed
on the bus, thus ensuring they are visible from the userland
representation of the device-tree, and then possibly have drivers like
sd, st, etc... add entries & PM hooks to those devices if needed when
attached to them. But doing that, or just having them create a virtual
node as a child of the device is mostly a matter of taste, I believe.
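
A minimal sketch of that second option, in plain userspace C rather than
kernel code. The struct layout, the field names and every function here
(device_add_child, st_attach, device_suspend_tree) are assumptions made up
for illustration; none of them come from the proposed driver model. The bus
code creates a bare node per probed target, and the client driver only fills
in a PM hook when it attaches:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4
#define MAX_LOG 16

/* Hypothetical, simplified stand-in for the proposed struct device. */
struct device {
    const char *name;
    struct device *children[MAX_CHILDREN];
    size_t nr_children;
    int (*suspend)(struct device *dev); /* PM hook, NULL until a driver attaches */
    int flushed;                        /* set by the example st hook below */
};

/* Order in which nodes were suspended, kept only so it can be inspected. */
const char *suspend_log[MAX_LOG];
size_t suspend_log_len;

/* What the bus code would do when it probes a target. */
void device_add_child(struct device *parent, struct device *child)
{
    assert(parent->nr_children < MAX_CHILDREN);
    parent->children[parent->nr_children++] = child;
}

/* Example hook for a tape client: flush buffered data before sleep. */
int st_suspend(struct device *dev)
{
    dev->flushed = 1; /* a real driver would flush the drive here */
    return 0;
}

/* The client driver populates the existing node instead of stacking a new one. */
void st_attach(struct device *dev)
{
    dev->suspend = st_suspend;
}

/* Suspend every child before its parent; stop at the first error. */
int device_suspend_tree(struct device *dev)
{
    for (size_t i = 0; i < dev->nr_children; i++) {
        int err = device_suspend_tree(dev->children[i]);
        if (err)
            return err;
    }
    if (dev->suspend) {
        int err = dev->suspend(dev);
        if (err)
            return err;
    }
    assert(suspend_log_len < MAX_LOG);
    suspend_log[suspend_log_len++] = dev->name;
    return 0;
}
```

An unclaimed target still shows up in the walk, it just has no hook to call.
With the other option (a virtual child node created by st), the same walk
would work unchanged; only the number of nodes differs.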

Ben.



2001-10-25 10:02:47

by Helge Hafting

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Rob Turk wrote:

> Doing so will create havoc on sequential devices, such as tape drives. If
> your system simply suspends, then all is well. Any data that isn't flushed
> yet is buffered inside the tapedrive. But when the system resumes and resets
> the SCSI bus, it will cause all data in the tape drive to be lost, and for
> most tape systems it will also re-position them at LBOT. Any running
> tar/dump/whatever tape process would not survive such a suspend-resume
> cycle.
>
Well, why reset the scsi bus on resume then?
That seems unnecessary. At suspend time the devices simply
don't get more requests. (Except perhaps spin-down
requests for disks.) Then nothing much happens. Eventually
the system wakes up, and requests appear again. First spin-up
requests, then ordinary io.

Quite a few scsi bioses have an option for not resetting
the bus when booting. Less delay, and necessary for those
few with a shared scsi bus. Seems a reset won't be
necessary for suspend/resume either, which is supposed to
be a lighter operation than a reboot.

If your scsi adapter doesn't support this - it isn't
suspend/resume compatible the way I see it.

Helge Hafting

2001-10-25 10:23:47

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

"Linus Torvalds" <[email protected]> writes:

> On 25 Oct 2001, Eric W. Biederman wrote:
> >
> > Or another fun common one. To shut down the interrupt controller, I first
> > need to shut down every device that thinks it can generate interrupts.
> > But my interrupt controller is way out on my pci->isa bridge. So I
> > can't shut that device down.
> >
> > Sorry this whole device tree idea for shutdown ordering doesn't seem
> > to match my idea of reality.
>
> Your _examples_ do not match any reality.

I'll go as far as agreeing they do not match any _practical_ reality.

> Don't worry about things like the CPU shutdown: you have to have special
> code for it anyway.

But that is the case I plan on coding....

> Let's face it, the device tree is for _devices_. It's for shutting down a
> network card before we shut down the PCI bridge that is in front of it.
>
> The issue of "core shutdown" is not covered - and isn't _meant_ to be
> covered.

O.k. I'll step back and let you guys handle the normal cases. I rarely
get past "core startup" and "core shutdown".

> That's the problem of the architecture-specific code. There is no
> point in having a device tree for that, because it's going to be very much
> architecture-specific anyway (ie on x86 we may have to just blindly trust
> some silly ACPI table data etc).

I'm doing my best to provide a real world alternative to ACPI on some
boards.

My perspective is coming from linuxBIOS, or in general GPL'd
firmware, so it is a little different.

But at this point in the conversation it looks like I should just back
off, let the core API get the easy cases correct. And then come back
and figure out how to handle the truly weird cases.

Eric

2001-10-25 11:01:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5


On 25 Oct 2001, Eric W. Biederman wrote:
>
> > That's the problem of the architecture-specific code. There is no
> > point in having a device tree for that, because it's going to be very much
> > architecture-specific anyway (ie on x86 we may have to just blindly trust
> > some silly ACPI table data etc).
>
> I'm doing my best to provide a real world alternative to ACPI on some
> boards.

That will be much appreciated. ACPI is not all that wonderful to say the
least. With enough knowledge of the hardware you can mostly do a better
job (the problem is "enough knowledge", especially on most laptops where
most of the GPIO signals etc are pretty much ad-hoc and not defined by the
chipsets but by the board layout person.. And we can't query the board
revision even if they gave us the information ;)

> My perspective is coming from linuxBIOS, or in general GPL'd
> firmware, so it is a little different.

Understood.

Linus

2001-10-25 12:16:47

by Alan

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

> >Sound is more easily handled by not blocking user space but waiting until
> >the final IRQ off moment and grabbing the registers. That avoids a lot
> >of ugly locking gunge. It literally comes down to
>
> My point about using a semaphore was to avoid getting mixer ioctls
> banging the HW while it is shut down.

Yes I can follow that - you want to avoid the aclink being shut down while
active. That seems to be just part of the ordering. I'd also put the ac97
save/restore in the ac97_codec.c stuff - let's write it once 8)

2001-10-25 12:16:47

by Alan

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

> In this case, "sg" could add itself when opened, and eventually cause
> sleep requests to be rejected for example.

I think SG is the only really special case, reading over the code.
Disk and CD might want to issue a couple of things (cache flush, unlock
media type stuff) but nothing tricky.

> Well, in fact, I don't think there is real need for the "SCSI disk device"
> node, but that depends pretty much on the new SCSI architecture.

Sure

2001-10-25 12:36:42

by Alan

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

> If you have an acpi daemon that decides to make the machine go to sleep
> while burning a CD, that's nothing to do with the kernel at all.

One job kernel drivers have is to say "I can't safely sleep at this moment"
Even windows/XP beta gets this right.

> The kernel does not set policy. If the user says "suspend now", then we
> suspend now. Whether a CD burn or anything else is going on is totally
> irrelevant.

I know what the end user viewpoint on that would be. In a sense I do
agree with you - but that would assume we could re-invent every single
scsi generic driver, figure out how to make /proc/sg/%d/... work and the
like

> But if I say "suspend", and the kernel refuses, I will kill the offending
> piece of crap from sg.c before you can blink an eye.

That's fine by me. Anyone wanting to be able to burn cds safely can run a
-ac kernel tree

Alan

2001-10-25 13:57:03

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

>> My point about using a semaphore was to avoid getting mixer ioctls
>> banging the HW while it is shut down.
>
>Yes I can follow that - you want to avoid the aclink being shut down while
>active. That seems to be just part of the ordering. I'd also put the ac97
>save/restore in the ac97_codec.c stuff - lets write it once 8)

Not exactly ;) Since sleep/resume on pmac is somewhat asynchronous, things
like sound (which in my case can take 1 to 2 seconds to come back due to
the calibration delay of the chip) are done async. So userland is already
running again, and may be hitting the driver with various mixer ioctls,
while my HW isn't yet ready to get them (nothing fancy here).

I do also have the problem of having the sound chip on an i2c bus on some
machines, and so the problem of dependencies between the i2c controller
and the sound driver (samples use a separate i2s bus), but this is also
easily fixed by either having the sound chip a child of the i2c controller,
or just shutting down the i2c controller as part of the platform code
which is what I do now.

I did various experiments doing CD and/or MP3 playback and putting the
machine to sleep. The "smoothest" result I obtained was using this
semaphore on driver entrypoints. This "cleanly" suspends the sound app
on its next call to the sleeping sound driver and resumes it when the
sound driver is ready. Since the sound chip takes forever to come back,
it's long enough for disk & cd to be fully back up, and the player
not to skip when resumed :)

Ben.



2001-10-25 14:25:28

by Victor Yodaiken

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Thu, Oct 25, 2001 at 10:27:11AM +0200, Rob Turk wrote:
> > The act of "suspend" should basically be: shut off the SCSI controller,
> > screw all devices, reset the bus on resume.
> >
>
> Doing so will create havoc on sequential devices, such as tape drives. If

I'm failing to imagine a good case for suspending a system that has a
tape drive on it.




2001-10-25 14:44:08

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Victor Yodaiken wrote:
>
> On Thu, Oct 25, 2001 at 10:27:11AM +0200, Rob Turk wrote:
> > > The act of "suspend" should basically be: shut off the SCSI controller,
> > > screw all devices, reset the bus on resume.
> > >
> >
> > Doing so will create havoc on sequential devices, such as tape drives. If
>
> I'm failing to imagine a good case for suspending a system that has a
> tape drive on it.

I've often seen user workstations with tape drives.

I fail to see the need to suspend such a system while using the tape
drive, though :)

--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-25 14:45:09

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Victor Yodaiken wrote:
>
> On Thu, Oct 25, 2001 at 10:27:11AM +0200, Rob Turk wrote:
> > > The act of "suspend" should basically be: shut off the SCSI controller,
> > > screw all devices, reset the bus on resume.
> > >
> >
> > Doing so will create havoc on sequential devices, such as tape drives. If
>
> I'm failing to imagine a good case for suspending a system that has a
> tape drive on it.

I've often seen user workstations with tape drives. Very uncommon these
days, agreed.

I fail to see the need to suspend such a system while using the tape
drive, though :)

--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2001-10-25 15:25:00

by Rob Turk

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5


"Victor Yodaiken" <[email protected]> wrote in message
news:cistron.20011025082001.B764@hq2...
> On Thu, Oct 25, 2001 at 10:27:11AM +0200, Rob Turk wrote:
> > > The act of "suspend" should basically be: shut off the SCSI controller,
> > > screw all devices, reset the bus on resume.
> > >
> >
> > Doing so will create havoc on sequential devices, such as tape drives. If
>
> I'm failing to imagine a good case for suspending a system that has a
> tape drive on it.
>

Well, maybe the tape example wasn't all that good. The state information
(wide/sync negotiation) still needs to be retained for all SCSI devices though.

Rob




2001-10-25 15:44:40

by Jonathan Lundell

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

At 5:22 PM +0200 10/25/01, Rob Turk wrote:
>> I'm failing to imagine a good case for suspending a system that has a
>> tape drive on it.
>>
>
>Well, maybe the tape example wasn't all that good. The state information
>(wide/sync negotiation) still needs to be retained for all SCSI
>devices though.

Any driver that uses SCSI bus reset for last-resort error recovery
(and I think it's pretty typical) needs to be able to renegotiate the
connection. Maybe even after a SCSI device reset; I don't recall. So
initiating that negotiation as part of (or after) resume doesn't seem
all that burdensome.

You need that anyway for "deep sleep" that powers down devices completely.
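
A rough sketch of that idea in plain C: the low-level driver forgets every
transfer agreement on reset or resume and renegotiates lazily before the next
command to each target. The structure, the function names and the placeholder
period/width values are assumptions for illustration only, not taken from any
existing driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TARGETS 16

/* Hypothetical per-target transfer agreement kept by a low-level driver. */
struct target_state {
    bool negotiated;      /* false: no valid sync/wide agreement */
    unsigned sync_period; /* placeholder units */
    unsigned width_bits;
};

struct host {
    struct target_state targets[MAX_TARGETS];
    unsigned negotiations; /* counts negotiations, for illustration */
};

/* After a bus reset or a resume, no previous agreement can be trusted. */
void host_invalidate_negotiation(struct host *h)
{
    for (size_t i = 0; i < MAX_TARGETS; i++)
        h->targets[i].negotiated = false;
}

/* Called before each command; renegotiates only when needed. */
void host_prepare_command(struct host *h, size_t id)
{
    assert(id < MAX_TARGETS);
    struct target_state *t = &h->targets[id];
    if (!t->negotiated) {
        /* placeholder values; a real driver would exchange SDTR/WDTR messages */
        t->sync_period = 12;
        t->width_bits = 16;
        t->negotiated = true;
        h->negotiations++;
    }
}
```

The resume path then only has to call host_invalidate_negotiation(); the same
call serves last-resort error recovery after a bus reset.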
--
/Jonathan Lundell.

2001-10-25 17:48:40

by David Lang

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

let alone a reason for a suspend to be triggered while running the tape.

David Lang

On Thu, 25 Oct 2001, Victor Yodaiken wrote:

> Date: Thu, 25 Oct 2001 08:20:01 -0600
> From: Victor Yodaiken <[email protected]>
> To: Rob Turk <[email protected]>
> Cc: [email protected]
> Subject: Re: [RFC] New Driver Model for 2.5
>
> On Thu, Oct 25, 2001 at 10:27:11AM +0200, Rob Turk wrote:
> > > The act of "suspend" should basically be: shut off the SCSI controller,
> > > screw all devices, reset the bus on resume.
> > >
> >
> > Doing so will create havoc on sequential devices, such as tape drives. If
>
> I'm failing to imagine a good case for suspending a system that has a
> tape drive on it.

2001-10-25 21:14:05

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Hi!

> > The act of "suspend" should basically be: shut off the SCSI controller,
> > screw all devices, reset the bus on resume.
> >
>
> Doing so will create havoc on sequential devices, such as tape drives. If
> your system simply suspends, then all is well. Any data that isn't flushed
> yet is buffered inside the tapedrive. But when the system resumes and resets
> the SCSI bus, it will cause all data in the tape drive to be lost, and for
> most tape systems it will also re-position them at LBOT. Any running
> tar/dump/whatever tape process would not survive such a suspend-resume
> cycle.

Then there's something wrong with st.

Imagine EMI comes and SCSI gets reset. That should not mean tar
failing, right? So you have st broken in the first place.
Pavel
--
STOP THE WAR! Someone killed innocent Americans. That does not give
U.S. right to kill people in Afganistan.


2001-10-25 21:29:06

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Hi!

> >We should not have pending IO, but that's for a totally different reason:
> >the first thing the much much MUCH higher levels of suspend should be
> >doing is to make sure that user apps are "quiescent". And that isn't done
> >by getting involved with sg.c or anything similar, but by basically
> >stopping all user apps (think of the equivalent of a "kill -STOP -1", but
> >done internally in the kernel without actually using a signal).
>
> Is this necessary ? That would definitely make things easier to implement
> to forget about incoming requests, but I'm not sure it's the right way.
> In fact, is it really working ? You could well have in-kernel threads
> triggering IOs or launching new userland stuffs (what happens if a not
> yet suspended driver for, let's say USB, see a new device coming in
> and starts /sbin/hotplug). Some filesystems have garbage collector threads
> that can do IOs eventually. There are various kind of IOs that can in fact
> be triggered entirely within the kernel.

I need this for suspend to disk. I really need a quiescent system if I
want to write the image to swap, right?

I implemented a "refrigerator" mechanism, and manually marked threads
that should not be frozen. I'm doing my best to stop everyone who
might try to wake userland.
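
The marking scheme can be pictured roughly like this. This is a userspace
sketch only, not the code from the swsusp patch; the flag names, struct task
and the helper functions are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task flags; the names are illustrative only. */
#define TASK_NOFREEZE 0x1u /* task must keep running during suspend */
#define TASK_FROZEN   0x2u /* task is parked in the refrigerator */

struct task {
    const char *name;
    unsigned int flags;
};

/* Park every task that is not explicitly exempt. Returns how many are parked. */
size_t freeze_tasks(struct task *tasks, size_t n)
{
    size_t frozen = 0;
    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].flags & TASK_NOFREEZE)
            continue; /* e.g. a thread needed to write the image out */
        tasks[i].flags |= TASK_FROZEN;
        frozen++;
    }
    return frozen;
}

/* Release everything that freeze_tasks() parked. */
void thaw_tasks(struct task *tasks, size_t n)
{
    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        tasks[i].flags &= ~TASK_FROZEN;
}

bool task_is_frozen(const struct task *t)
{
    return (t->flags & TASK_FROZEN) != 0;
}
```

Anything that can start new userland work has to end up parked, otherwise
the image can change while it is being written.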

[I just mailed the patch as 2.4.13 - swsusp. Tell me if you want a
copy.]

Pavel
--
STOP THE WAR! Someone killed innocent Americans. That does not give
U.S. right to kill people in Afganistan.


2001-10-25 21:34:45

by Rob Turk

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5


"Pavel Machek" <[email protected]> wrote in message
news:[email protected]...
> Hi!
>
> > > The act of "suspend" should basically be: shut off the SCSI controller,
> > > screw all devices, reset the bus on resume.
> > >
> >
> > Doing so will create havoc on sequential devices, such as tape drives. If
> > your system simply suspends, then all is well. Any data that isn't flushed
> > yet is buffered inside the tapedrive. But when the system resumes and resets
> > the SCSI bus, it will cause all data in the tape drive to be lost, and for
> > most tape systems it will also re-position them at LBOT. Any running
> > tar/dump/whatever tape process would not survive such a suspend-resume
> > cycle.
>
> Then there's something wrong with st.
>
> Imagine EMI comes and SCSI gets reset. That should not mean tar
> failing, right? So you have st broken in first place.
> Pavel
> --

No, it's an inherent part of the SCSI spec. A SCSI bus reset will cause many, if
not all, tape devices to rewind to beginning of tape. The st driver can detect this
(a SCSI Unit Attention will be returned on the next SCSI command), and try to
re-position the tape to its previous location. Doing so is not easy, and on
many tape drives even impractical. On an almost full tape, a DLT drive would
take up to two hours to get back to where it last was. Too much for most
time-out mechanisms...

On a SCSI bus reset, tape related processes are better off passing the condition
upward to user space (tar, dump, whatever). Intelligent user space programs may
be able to recover, the dumb ones will fail.
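
As a sketch of the detection step: with fixed-format sense data, the sense
key sits in the low nibble of byte 2 and the additional sense code in byte 12;
sense key 06h is UNIT ATTENTION and ASC 29h means a power on, reset or bus
device reset occurred. The function names and the "return an error and let
user space decide" policy below are assumptions for illustration, not what st
actually does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SENSE_KEY_UNIT_ATTENTION 0x06
#define ASC_RESET_OCCURRED       0x29 /* power on, reset, or bus device reset */

/*
 * Returns true if fixed-format sense data reports that the target was reset,
 * in which case a tape's position and buffered data can no longer be trusted.
 */
bool sense_reports_reset(const uint8_t *sense, size_t len)
{
    assert(sense != NULL);
    if (len < 14)
        return false; /* too short to carry ASC/ASCQ */
    uint8_t response_code = sense[0] & 0x7f;
    if (response_code != 0x70 && response_code != 0x71)
        return false; /* not fixed-format sense data */
    uint8_t sense_key = sense[2] & 0x0f;
    uint8_t asc = sense[12];
    return sense_key == SENSE_KEY_UNIT_ATTENTION && asc == ASC_RESET_OCCURRED;
}

/* Hypothetical policy: report the reset upward instead of repositioning. */
int tape_check_after_command(const uint8_t *sense, size_t len)
{
    return sense_reports_reset(sense, len) ? -1 : 0;
}
```

A program like tar would then see the failed call and could decide for
itself whether it knows how to restart from a known position.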

By the way, EMI is very unlikely to cause a SCSI bus reset. It may cause
repeated parity errors to the point that a host or device decides to reset the
bus. If there's this much EMI, then you should use a better transport (HVD SCSI,
or fibre). But this part of the discussion is probably better held at
comp.periphs.scsi 8-)

Rob




2001-10-25 21:57:58

by Xavier Bestel

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

On Thu, 25-10-2001 at 14:42, Alan Cox wrote:
> > If you have an acpi daemon that decides to make the machine go to sleep
> > while burning a CD, that's nothing to do with the kernel at all.
>
> One job kernel drivers have is to say "I can't safely sleep at this moment"
> Even windows/XP beta gets this right.

The other solution (let cdrecord tell I-dunno-how the PM daemon that
it's doing something "important") is IMHO better: the PM daemon could
judge if it should honor the suspend request depending on its priority
(inactivity, power button or low battery) and the running "important"
jobs.

Xav

2001-10-25 22:54:05

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

>> One job kernel drivers have is to say "I can't safely sleep at this moment"
>> Even windows/XP beta gets this right.
>
>The other solution (let cdrecord tell I-dunno-how the PM daemon that
>it's doing something "important") is IMHO better: the PM daemon could
>judge if it should honor the suspend request depending on its priority
>(inactivity, power button or low battery) and the running "important"
>jobs.

The cdrecord case is a high level issue, and scsi is a mess ;)

We are not yet at a point where we can be more constructive
than what was already said. Ultimately we need to move a bit
forward with the real implementation and see how some problems
show up. The architecture as it was designed so far is light,
and most of the debate is not around it, it's around how it
should be used by drivers & kernel subsystems ;)

Also, there will always be corner cases where a kernel driver will
be able to override whatever policy was set. Drivers can register
with the PM layer and can return errors. It has to be exceptional,
but I believe some critical flash algorithm or whatever would
benefit from it.

I would however vote for making that a bit simpler by providing a
kernel-global sleep rw semaphore. The idea here is that drivers, file
systems, or whatever bits of kernel code that, for a given moment,
can't afford being put to sleep will take a read lock on it. The
sleep code will take a write lock. The read lockers won't make the
PM code fail. It will just block a bit, waiting for the critical
operations to complete.
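
To make the locking rule concrete, here is a small userspace sketch using C11
atomics. It is not a kernel rw semaphore: it only offers "try" operations, so
where the proposal has the sleep code block, a caller of this sketch would
have to retry. All names are assumptions for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical "sleep lock": state > 0 counts critical sections in progress,
 * state == -1 means the suspend path holds it, 0 means idle.
 */
static atomic_int sleep_lock_state;

/* Called by a driver before work that must not be interrupted by sleep. */
bool sleep_lock_try_read(void)
{
    int cur = atomic_load(&sleep_lock_state);
    while (cur >= 0) {
        if (atomic_compare_exchange_weak(&sleep_lock_state, &cur, cur + 1))
            return true;
        /* a failed CAS reloaded cur; retry unless the suspend path won */
    }
    return false; /* suspend in progress */
}

void sleep_unlock_read(void)
{
    int prev = atomic_fetch_sub(&sleep_lock_state, 1);
    assert(prev > 0);
    (void)prev;
}

/* Called by the suspend path; succeeds only when no critical section is active. */
bool sleep_lock_try_write(void)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&sleep_lock_state, &expected, -1);
}

void sleep_unlock_write(void)
{
    int prev = atomic_exchange(&sleep_lock_state, 0);
    assert(prev == -1);
    (void)prev;
}
```

Several readers can hold the lock at once, so unrelated critical sections do
not serialize against each other; only the sleep path excludes them all.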

I'm thinking about things like jffs2 on embedded. That (nice) flash
file system has a background thread that does garbage collecting
of your flash in order to be able to erase unused parts. It would
greatly benefit embedded boards that implement system sleep if
that thread could make sure sleep won't happen in the middle of
moving blocks, or whatever atomicity it may want for its operations.

A firmware flashing facility in a driver may use that mechanism too.

It's a corner case, it's not meant to be used a lot, but it's nice
to have.

Ben.


2001-10-25 23:50:03

by Alan

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

> The cdrecord case is a high level issue, and scsi is a mess ;)

Grin

> We are not yet at a point where we can be more constructive
> than what was already said. Ultimately we need to move a bit
> forward with the real implementation and see how some problems
> show up. The architecture as it was designed so far is light,
> and most of the debate is not around it, it's around how it
> should be used by drivers & kernel subsystems ;)

I think I understand how to handle this and avoid races. Linus' idea of
/proc files so you can ask "what is busy" solves most of it. Then the
policy daemon can make a choice about suspending or not.

If we make the proc file a large bitmask of events then the policy daemon
call to kernel becomes

"suspend even if [event-mask] set"

This means that a daemon call that races the start of say a CD burn will
fail and the daemon can rescan, rethink and if need be reissue the request
sanely.
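
Roughly, as a userspace sketch. The event bits, the function names and the
-EBUSY return are all assumptions for illustration; nothing like this exists
in the tree yet:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical event bits; the real set of events is not defined anywhere yet. */
#define PM_BUSY_CD_BURN  (1ul << 0)
#define PM_BUSY_FW_FLASH (1ul << 1)

static unsigned long pm_busy_mask; /* what the /proc file would report */

void pm_mark_busy(unsigned long event)
{
    assert(event != 0);
    pm_busy_mask |= event;
}

void pm_clear_busy(unsigned long event)
{
    pm_busy_mask &= ~event;
}

unsigned long pm_read_busy(void)
{
    return pm_busy_mask;
}

/*
 * "Suspend even if [ignore_mask] set": fail if any busy event outside
 * ignore_mask is active at the moment of the call.
 */
int pm_request_suspend(unsigned long ignore_mask)
{
    if (pm_busy_mask & ~ignore_mask)
        return -EBUSY;
    /* a real implementation would start the suspend sequence here */
    return 0;
}
```

The daemon passes back exactly the mask it based its decision on. If a burn
started in between, the bit is new, the call fails, and the daemon gets to
look again instead of suspending blind.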

2001-10-26 11:35:43

by Helge Hafting

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

Alan Cox wrote:

>
> > But if I say "suspend", and the kernel refuses, I will kill the offending
> > piece of crap from sg.c before you can blink an eye.
>
> Thats fine by me. Anyone wanting to be able to burn cds safely can run a
> -ac kernel tree

Telling the kernel to suspend while burning a CD is
on the same level as ejecting the CD while burning.
It has to go wrong. Someone explicitly asking for
trouble might as well get it.

The really dumb users are probably using a GUI tool
for either activity; that one may of course refuse
to ruin the burn.

Helge Hafting

2001-10-26 12:36:23

by Alan

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

> Telling the kernel to suspend while burning a CD is
> on the same level as ejecting the CD while burning.
> It has to go wrong. Someone explicitly asking for
> trouble might as well get it.

It need not be someone asking for trouble. It might just be a ten minute
"nothing happened" timeout that starts the decision making.

> The really dumb users are probably using a GUI tool
> for either activity; that one may of course refuse
> to ruin the burn.

The current GUI tools don't know anything about 2.5 power management, and
in some cases don't know when the driver has done needed work or not.

Alan

2001-10-29 19:33:26

by kaih

[permalink] [raw]
Subject: Re: [RFC] New Driver Model for 2.5

[email protected] (Padraig Brady) wrote on 22.10.01 in <[email protected]>:

> For ethernet cards (or anything else) on the PCI bus you can use
> the following to specify an order:
> ftp://platan.vc.cvut.cz/pub/linux/pciorder.patch-2.4.12-ac1.gz
> This allows you to pass the following parameter to the kernel:
> pciorder=Bus:Node.Fn,Bus:Node.Fn,... e.g.
> pciorder=0:0d.0,0:0b.0,0:0a.0

That doesn't look useful.

MfG Kai