Date: Wed, 17 Oct 2001 16:52:29 -0700 (PDT)
From: Patrick Mochel
To: linux-kernel@vger.kernel.org
Subject: [RFC] New Driver Model for 2.5

One July afternoon, while hacking on the pm_dev layer for the purpose of
system-wide power management support, I decided that I was quite tired of
trying to make this layer look like a tree and feel like a tree, but not
have any real integration with the actual device drivers. I had read the
accounts of what the goals were for 2.5. And, after some conversations
with Linus and the (gasp) ACPI guys, I realized that I had a good chunk of
the infrastructural code written; it was a matter of working out a few
crucial details and massaging it in nicely.

I have had the chance this week (after moving and vacationing) to update
the (read: write some) documentation for it. I will not go into details,
and will let the document speak for itself. With any luck, this should go
into the early stages of 2.5, and allow a significant cleanup of many
drivers. Such a model will also allow for neat tricks like full device
power management support, and Plug N Play capabilities.

In order to support the new driver model, I have written a small in-memory
filesystem, called ddfs, to export a unified interface to userland. It is
mentioned in the doc, and is pretty self-explanatory. More information
will be available soon.

There is code available for the model and ddfs at:

	http://kernel.org/pub/linux/kernel/people/mochel/device/

but there are some fairly large caveats concerning it.

First, I feel comfortable with the device layer code and the ddfs code.
The PCI code, though, is still a work in progress; I am still working out
some of the finer details concerning it.

Next is the environment under which I developed it all: an ia32 box, with
only PCI support, and using ACPI. The latter didn't have too much of an
effect on the development, but there are a few items explicitly inspired
by it.

I am hoping both the PCI code, and the structure in general, can be
further improved based on the input of the driver maintainers. This model
is not final, and may be way off from what most people actually want. It
has gotten tentative blessing from all those that have seen it, though
they number but a few. It's definitely not the only solution...

That said, enjoy; and have at it.

	-pat

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The (New) Linux Kernel Driver Model

Version 0.01
17 October 2001


Overview
~~~~~~~~

This driver model is a unification of all the disparate driver models
currently in the kernel. It is intended to augment the bus-specific
drivers for bridges and devices by consolidating a set of data and
operations into globally accessible data structures.

Current driver models implement some sort of tree-like structure
(sometimes just a list) for the devices they control, but there is no
linkage between the different bus types.

A common data structure can provide this linkage with little overhead:
when a bus driver discovers a particular device, it can insert it into the
global tree as well as its local tree. In fact, the local tree becomes
just a subset of the global tree. Common data fields can also be moved out
of the local bus models into the global model. Some of the manipulation of
these fields can also be consolidated. Most likely, manipulation functions
will become a set of helper functions, which the bus drivers wrap around
to include any bus-specific items.
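As a rough illustration of that consolidation, here is a minimal sketch of
what one such shared insertion helper might look like, using the struct
device and struct io_bus fields defined later in this document. The name
device_register() and the elided locking are assumptions for illustration,
not part of the posted code:

	/*
	 * Hypothetical sketch: insert a newly discovered device into the
	 * global tree and its bus's local device list in one step.  The
	 * local list becomes a subset of the global tree.
	 */
	int device_register(struct device *dev, struct io_bus *parent)
	{
		dev->parent = parent;

		/* the device's sibling list doubles as the bus's local tree */
		list_add_tail(&dev->bus_list, &parent->devices);

		/* ddfs directory creation, default files, etc. would go here */
		return 0;
	}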
The common device and bridge interface currently reflects the goals of the
modern PC: namely, the ability to do seamless Plug and Play, power
management, and hot plug. (The model dictated by Intel and Microsoft
(read: ACPI) assures us that any device in the system may fit any of these
criteria.)

In reality, not every bus will be able to support such operations. But
most buses will support a majority of those operations, and all future
buses will. In other words, a bus that doesn't support an operation is the
exception, instead of the other way around.


Drivers
~~~~~~~

The callbacks for bridges and devices are intended to be singular for a
particular type of bus. For each type of bus that has support compiled
into the kernel, there should be one statically allocated structure with
the appropriate callbacks that all devices (or bridges) of that type
share.

Each bus layer should implement the callbacks for these drivers. It then
forwards the calls on to the device-specific callbacks. This means that
device-specific drivers must still implement callbacks for each operation,
but they are not called from the top-level driver layer.

This does add another layer of indirection for calling one of these
functions, but there are benefits that are believed to outweigh this
slowdown.

First, it prevents device-specific drivers from having to know about the
global device layer. This greatly reduces integration time. It also allows
drivers to be more portable across kernel versions. Note that the former
was intentional; the latter is an added bonus.

Second, this added indirection allows the bus to perform any additional
logic necessary for its child devices. A bus layer may add information to
the call, or translate it into something meaningful for its children. This
could be done in the driver, but if it happens for every object of a
particular type, it is best done at a higher level.


Recap
~~~~~

Instances of devices and bridges are allocated dynamically as the system
discovers their existence. Their fields describe the individual object.

Drivers - in the global sense - are statically allocated and singular for
a particular type of bus. They describe a set of operations that every
type of bus could implement, the implementation following the bus's
semantics.


Downstream Access
~~~~~~~~~~~~~~~~~

Common data fields have been moved out of the individual bus layers into a
common data structure. But these fields must still be accessed by the bus
layers, and sometimes by the device-specific drivers.

Other bus layers are encouraged to do what has been done for the PCI
layer. struct pci_dev now looks like this:

struct pci_dev {
	...

	struct device device;
};

Note first that the struct device is statically allocated within struct
pci_dev. This means only one allocation on device discovery. Note also
that it is at the _end_ of struct pci_dev. This is to make people think
about what they're doing when switching between the bus driver and the
global driver, and to prevent mindless casts between the two.
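Because struct device is embedded rather than pointed to, the bus layer
can recover its own structure from a struct device pointer with the same
pointer arithmetic that list_entry() uses for list_heads. A minimal
sketch, assuming a helper named to_pci_dev() (an illustration, not part of
the posted code):

	/* Recover the bus-specific structure from the embedded member.
	 * Deliberately explicit, so the conversion is a conscious act
	 * rather than a mindless cast. */
	#include <linux/stddef.h>	/* offsetof() */

	#define to_pci_dev(d) \
		((struct pci_dev *)((char *)(d) - offsetof(struct pci_dev, device)))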
The PCI bus layer freely accesses the fields of struct device. It knows
about the structure of struct pci_dev, and it should know the structure of
struct device. PCI devices that have been converted generally do not touch
the fields of struct device. More precisely, device-specific drivers
should not touch the fields of struct device unless there is a strong,
compelling reason to do so.

This abstraction exists to prevent unnecessary pain during transitional
phases. If a field were renamed or removed, every downstream driver would
break. If only the bus layer (and not the device layer) accesses struct
device, only the bus layers need to change.


User Interface
~~~~~~~~~~~~~~

By virtue of having a complete hierarchical view of all the devices in the
system, exporting a complete hierarchical view to userspace becomes
relatively easy. Whenever a device is inserted into the tree, a file or
directory can be created for it. In this model, a directory is created for
each bridge and each device.

When a directory is created, it is populated with a set of default files,
first at the global layer, then at the bus layer. The device layer may
then add its own files. These files export data about the driver and can
be used to modify the behavior of the driver, or even of the device.

For example, at the global layer, a file named 'status' is created for
each device. When read, it reports the name of the device, its bus ID, its
current power state, and the name of the driver it is using. Writing to
this file gives the user control over the device: writing "suspend 3" to
it, for instance, places the device into power state 3. Basically, by
writing to this file, the user has access to the operations defined in
struct device_driver.

The PCI layer also adds default files. For devices, it adds a "resource"
file and a "wake" file. The former reports the BAR information for the
device; the latter reports the wake capabilities of the device. The device
layer could also add files for device-specific data reporting and control.

The dentry for the device's directory is kept in struct device, along with
a linked list of all the files in the directory and pointers to their read
and write callbacks. This allows the driver layer to maintain full control
of its destiny: if it wants to override the default behavior of a file, or
simply remove it, it can easily do so. (It is assumed that the files added
upstream will always be a known quantity.)

These features were initially implemented using procfs. However, after one
conversation with Linus, a new filesystem - ddfs - was created to
implement them. It is an in-memory filesystem, based heavily on ramfs,
though it uses procfs as inspiration for its callback functionality.
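The document does not show ddfs's actual callback signatures, so the
following is purely an illustrative guess at what an entry on a device's
'files' list, and a 'status' read callback, could look like. Every name
and signature here is an assumption:

	/*
	 * Hypothetical sketch only - the real ddfs interfaces may differ.
	 * One entry on a device's 'files' list, with its callbacks.
	 */
	struct ddfs_file_entry {
		struct list_head node;		/* entry in dev->files */
		const char *name;		/* e.g. "status" */
		ssize_t (*read) (struct device *dev, char *buf, size_t count);
		ssize_t (*write)(struct device *dev, const char *buf, size_t count);
	};

	/* A 'status' read callback might report the fields described in
	 * the next section. */
	static ssize_t status_read(struct device *dev, char *buf, size_t count)
	{
		return snprintf(buf, count, "%s %s power state: %u\n",
				dev->name, dev->bus_id, dev->current_state);
	}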
Device Structures
~~~~~~~~~~~~~~~~~

struct device {
	struct list_head bus_list;
	struct io_bus	*parent;
	struct io_bus	*subordinate;

	char	name[DEVICE_NAME_SIZE];
	char	bus_id[BUS_ID_SIZE];

	struct dentry		*dentry;
	struct list_head	files;

	struct semaphore	lock;

	struct device_driver	*driver;
	void			*driver_data;
	void			*platform_data;

	u32		current_state;
	unsigned char	*saved_state;
};

bus_list:
	List of all devices on a particular bus; i.e. the device's siblings.

parent:
	The parent bridge for the device.

subordinate:
	If the device is itself a bridge, this points to the struct io_bus
	that is created for it.

name:
	Human readable (descriptive) name of the device. E.g. "Intel EEPro
	100".

bus_id:
	Parsable (yet ASCII) bus id. E.g. "00:04.00" (PCI bus 0, device 4,
	function 0). It is necessary to have a searchable bus id for each
	device; making it ASCII allows us to use it for the device's
	directory name without translating it.

dentry:
	Pointer to the device's ddfs directory.

files:
	Linked list of all the files that the driver has in its ddfs
	directory.

lock:
	Driver-specific lock.

driver:
	Pointer to a struct device_driver, the common operations for each
	device. See the next section.

driver_data:
	Private data for the driver. Much like the PCI implementation of
	this field, it allows device-specific drivers to keep a pointer to
	device-specific data.

platform_data:
	Data that the platform (firmware) provides about the device. For
	example, the ACPI BIOS or EFI may have additional information about
	the device that is not directly mappable to any existing kernel
	data structure. It also allows the platform driver (e.g. ACPI) to
	pass data to a driver without the driver having explicit knowledge
	of (atrocities like) ACPI.

current_state:
	Current power state of the device. For PCI and other modern
	devices, this is 0-3, though it is not necessarily limited to those
	values.

saved_state:
	Pointer to a driver-specific set of saved state. Keeping it here
	allows modules to be unloaded on system suspend and reloaded on
	resume while maintaining state across the transition. It also
	allows generic drivers to maintain state across system state
	transitions. (I've implemented a generic PCI driver for devices
	that don't have a device-specific driver. Instead of managing some
	vector of saved state for each device the generic driver supports,
	it can simply store it here.)


struct device_driver {
	int (*probe)	    (struct device *dev);
	int (*remove)	    (struct device *dev);

	int (*init)	    (struct device *dev);
	int (*shutdown)	    (struct device *dev);

	int (*save_state)   (struct device *dev, u32 state);
	int (*restore_state)(struct device *dev);

	int (*suspend)	    (struct device *dev, u32 state);
	int (*resume)	    (struct device *dev);
};

probe:
	Check for device existence and associate the driver with it.

remove:
	Dissociate the driver from the device. Releases the device so that
	it can be used by another driver. Also, if it is a hotplug device
	(hotplug PCI, CardBus), an ejection event could take place here.

init:
	Initialise the device - allocate resources, irqs, etc.

shutdown:
	"De-initialise" the device - release resources, free memory, etc.

save_state:
	Save the current device state before entering a suspend state.

restore_state:
	Restore the device state after coming back from a suspend state.

suspend:
	Physically enter the suspend state.

resume:
	Physically leave the suspend state and re-initialise the hardware.

Initially, the probe/remove sequence followed the PCI semantics exactly,
but it has since been broken up into a four-stage process: probe(),
remove(), init(), and shutdown(). While not strictly necessary in all
environments, breaking them up so that each routine does only one thing
makes sense.

Hot-pluggable devices may also benefit from this model, especially ones
that can be subjected to surprise removals: only the remove function would
be called, and the driver can easily know whether there is still hardware
present to shut down. Drivers controlling failing or buggy hardware
benefit as well, by allowing the user to trigger removal of the driver
from userspace without the driver trying to shut down the device. In each
case where remove() is called without a shutdown(), it is important to
note that resources still need to be freed; it is only the hardware that
cannot be assumed to be present.
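To make the four-stage split concrete, here is a minimal sketch of a
device-specific driver filling in these callbacks. The 'foo' names are
placeholders and the bodies are stubs, not a real driver:

	/* Illustrative stub driver demonstrating the four-stage split. */
	static int foo_probe(struct device *dev)
	{
		/* verify the hardware is what we expect; claim the device */
		return 0;
	}

	static int foo_init(struct device *dev)
	{
		/* allocate resources, request irqs, bring the hardware up */
		return 0;
	}

	static int foo_shutdown(struct device *dev)
	{
		/* quiesce the hardware; may be skipped on surprise removal */
		return 0;
	}

	static int foo_remove(struct device *dev)
	{
		/* free resources and dissociate - hardware may be gone */
		return 0;
	}

	static struct device_driver foo_driver = {
		.probe	  = foo_probe,
		.remove	  = foo_remove,
		.init	  = foo_init,
		.shutdown = foo_shutdown,
	};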
Suspend/resume transitions are broken into four stages as well, both to
provide graceful recovery from a failed suspend attempt and to ensure that
state gets stored in a non-volatile location before the system (and its
devices) are suspended.

When a suspend transition is triggered, the device tree is first walked to
save the state of all the devices in the system. Once this is complete,
the saved state, now residing in memory, can be written to some
non-volatile location, like a disk partition or a network location. The
device tree is then walked again to suspend all of the devices. This
guarantees that the device controlling the location the state is written
to is still powered on while you take a snapshot of the system state.

If a device is in a critical I/O transaction, or for some other reason
cannot stand to be suspended, it can notify the kernel by failing the
save-state step. At that point, state can either be restored or dropped
for all the devices that have already been touched, and execution may
resume. No devices will have been powered off at this point, making it
much easier to recover.

The resume transition is broken up into two steps mainly to stress the
singularity of each step: resume() powers on the device and reinitialises
it; restore_state() restores the device- and bus-specific registers of the
device. resume() happens with interrupts disabled; restore_state() with
them enabled.
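A minimal sketch of the two-pass walk described above. for_each_device()
and write_snapshot() are hypothetical names; the real tree-walking and
error-recovery code is not shown in this document:

	/* Hypothetical sketch of the two-pass suspend sequence. */
	int suspend_system(u32 state)
	{
		struct device *dev;

		/* Pass 1: save state while every device is still powered. */
		for_each_device(dev) {
			if (dev->driver->save_state(dev, state))
				return -EBUSY;	/* a device refused; nothing is off yet */
		}

		/* The snapshot now lives in memory; the device that writes
		 * it out (disk, network) has not been suspended yet. */
		write_snapshot();

		/* Pass 2: actually suspend the devices. */
		for_each_device(dev)
			dev->driver->suspend(dev, state);

		return 0;
	}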
Bus Structures
~~~~~~~~~~~~~~

struct io_bus {
	struct list_head node;
	struct io_bus	*parent;
	struct list_head children;
	struct list_head devices;
	struct list_head bus_list;

	struct device	*self;
	struct dentry	*dentry;
	struct list_head files;

	char	name[DEVICE_NAME_SIZE];
	char	bus_id[BUS_ID_SIZE];

	struct bus_driver *driver;
};

node:
	The bus's node in its sibling list (its parent's list of child
	buses).

parent:
	Pointer to the parent bridge.

children:
	List of subordinate buses. In the children, this correlates to
	their 'node' field.

devices:
	List of devices on the bus this bridge controls. This field
	corresponds to the 'bus_list' field in each child device.

bus_list:
	Each type of bus keeps a list of all the bridges that it finds.
	This is the bridge's entry in that list.

self:
	Pointer to the struct device for this bridge.

dentry:
	Every bus also gets a ddfs directory in which to add files, as
	well as child device directories. Actually, every bridge will have
	two directories - one for the bridge device, and one for the
	subordinate device.

files:
	Each bus also gets a list of the files in its ddfs directory, for
	the same reason as the devices - to have explicit control over the
	behavior of, and easy access to, each file that any higher layer
	may have added.

name:
	Human readable ASCII name of the bus.

bus_id:
	Machine readable (though ASCII) description of its position on the
	parent bus.

driver:
	Pointer to the operations for the bus.

struct bus_driver {
	char	name[16];
	struct list_head node;

	int (*scan)	    (struct io_bus *);
	int (*rescan)	    (struct io_bus *);

	int (*add_device)   (struct io_bus *, char *);
	int (*remove_device)(struct io_bus *, struct device *);

	int (*add_bus)	    (struct io_bus *, char *);
	int (*remove_bus)   (struct io_bus *, struct io_bus *);
};

name:
	ASCII name of the bus.

node:
	List of buses of this type in the system.

scan:
	Search the bus for devices. This is meant to be done only once -
	when the bridge is initially discovered.

rescan:
	Search the bus again, looking for changes; i.e. check for device
	insertion or removal.

add_device:
	Trigger a device insertion at a particular location.

remove_device:
	Trigger the removal of a particular device.

add_bus:
	Trigger insertion of a new bridge device (and child bus) at a
	particular location on the bus.

remove_bus:
	Remove a particular bridge and its subordinate bus.
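As a closing illustration, a sketch of how a bus layer might instantiate
this structure. The function names here are placeholders; the actual PCI
conversion in the posted code may look quite different:

	/* Illustrative only - not the actual PCI conversion. */
	static int pci_bus_scan(struct io_bus *bus)
	{
		/* walk configuration space, register discovered devices */
		return 0;
	}

	static int pci_bus_rescan(struct io_bus *bus)
	{
		/* scan again and diff against the bus's known device list */
		return 0;
	}

	static struct bus_driver pci_bus_driver = {
		.name	= "pci",
		.scan	= pci_bus_scan,
		.rescan	= pci_bus_rescan,
		/* add_device/remove_device/add_bus/remove_bus would hook
		 * into hotplug-capable hardware where available */
	};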