diff options
| author | Linus Torvalds <torvalds@g5.osdl.org> | 2006-09-26 11:49:46 -0700 | 
|---|---|---|
| committer | Linus Torvalds <torvalds@g5.osdl.org> | 2006-09-26 11:49:46 -0700 | 
| commit | dd77a4ee0f3981693d4229aa1d57cea9e526ff47 (patch) | |
| tree | cb486be20b950201103a03636cbb1e1d180f0098 /Documentation | |
| parent | e8216dee838c09776680a6f1a2e54d81f3cdfa14 (diff) | |
| parent | 7e9f4b2d3e21e87c26025810413ef1592834e63b (diff) | |
| download | linux-dd77a4ee0f3981693d4229aa1d57cea9e526ff47.tar.bz2 | |
Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (47 commits)
  Driver core: Don't call put methods while holding a spinlock
  Driver core: Remove unneeded routines from driver core
  Driver core: Fix potential deadlock in driver core
  PCI: enable driver multi-threaded probe
  Driver Core: add ability for drivers to do a threaded probe
  sysfs: add proper sysfs_init() prototype
  drivers/base: check errors
  drivers/base: Platform notify needs to occur before drivers attach to the device
  v4l-dev2: handle __must_check
  add CONFIG_ENABLE_MUST_CHECK
  add __must_check to device management code
  Driver core: fixed add_bind_files() definition
  Driver core: fix comments in drivers/base/power/resume.c
  sysfs_remove_bin_file: no return value, dump_stack on error
  kobject: must_check fixes
  Driver core: add ability for devices to create and remove bin files
  Class: add support for class interfaces for devices
  Driver core: create devices/virtual/ tree
  Driver core: add device_rename function
  Driver core: add ability for classes to handle devices properly
  ...
Diffstat (limited to 'Documentation')
| -rw-r--r-- | Documentation/ABI/removed/devfs (renamed from Documentation/ABI/obsolete/devfs) | 5 | ||||
| -rw-r--r-- | Documentation/ABI/testing/sysfs-power | 88 | ||||
| -rw-r--r-- | Documentation/feature-removal-schedule.txt | 27 | ||||
| -rw-r--r-- | Documentation/power/devices.txt | 725 | 
4 files changed, 652 insertions, 193 deletions
diff --git a/Documentation/ABI/obsolete/devfs b/Documentation/ABI/removed/devfs index b8b87399bc8f..8195c4e0d0a1 100644 --- a/Documentation/ABI/obsolete/devfs +++ b/Documentation/ABI/removed/devfs @@ -1,13 +1,12 @@  What:		devfs -Date:		July 2005 +Date:		July 2005 (scheduled), finally removed in kernel v2.6.18  Contact:	Greg Kroah-Hartman <gregkh@suse.de>  Description:  	devfs has been unmaintained for a number of years, has unfixable  	races, contains a naming policy within the kernel that is  	against the LSB, and can be replaced by using udev. -	The files fs/devfs/*, include/linux/devfs_fs*.h will be removed, +	The files fs/devfs/*, include/linux/devfs_fs*.h were removed,  	along with the the assorted devfs function calls throughout the  	kernel tree.  Users: - diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power new file mode 100644 index 000000000000..d882f8093871 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-power @@ -0,0 +1,88 @@ +What:		/sys/power/ +Date:		August 2006 +Contact:	Rafael J. Wysocki <rjw@sisk.pl> +Description: +		The /sys/power directory will contain files that will +		provide a unified interface to the power management +		subsystem. + +What:		/sys/power/state +Date:		August 2006 +Contact:	Rafael J. Wysocki <rjw@sisk.pl> +Description: +		The /sys/power/state file controls the system power state. +		Reading from this file returns what states are supported, +		which is hard-coded to 'standby' (Power-On Suspend), 'mem' +		(Suspend-to-RAM), and 'disk' (Suspend-to-Disk). + +		Writing to this file one of these strings causes the system to +		transition into that state. Please see the file +		Documentation/power/states.txt for a description of each of +		these states. + +What:		/sys/power/disk +Date:		August 2006 +Contact:	Rafael J. Wysocki <rjw@sisk.pl> +Description: +		The /sys/power/disk file controls the operating mode of the +		suspend-to-disk mechanism.  Reading from this file returns +		the name of the method by which the system will be put to +		sleep on the next suspend.  There are four methods supported: +		'firmware' - means that the memory image will be saved to disk +		by some firmware, in which case we also assume that the +		firmware will handle the system suspend. +		'platform' - the memory image will be saved by the kernel and +		the system will be put to sleep by the platform driver (e.g. +		ACPI or other PM registers). +		'shutdown' - the memory image will be saved by the kernel and +		the system will be powered off. +		'reboot' - the memory image will be saved by the kernel and +		the system will be rebooted. + +		The suspend-to-disk method may be chosen by writing to this +		file one of the accepted strings: + +		'firmware' +		'platform' +		'shutdown' +		'reboot' + +		It will only change to 'firmware' or 'platform' if the system +		supports that. + +What:		/sys/power/image_size +Date:		August 2006 +Contact:	Rafael J. Wysocki <rjw@sisk.pl> +Description: +		The /sys/power/image_size file controls the size of the image +		created by the suspend-to-disk mechanism.  It can be written a +		string representing a non-negative integer that will be used +		as an upper limit of the image size, in bytes.  The kernel's +		suspend-to-disk code will do its best to ensure the image size +		will not exceed this number.  However, if it turns out to be +		impossible, the kernel will try to suspend anyway using the +		smallest image possible.  In particular, if "0" is written to +		this file, the suspend image will be as small as possible. + +		Reading from this file will display the current image size +		limit, which is set to 500 MB by default. + +What:		/sys/power/pm_trace +Date:		August 2006 +Contact:	Rafael J. Wysocki <rjw@sisk.pl> +Description: +		The /sys/power/pm_trace file controls the code which saves the +		last PM event point in the RTC across reboots, so that you can +		debug a machine that just hangs during suspend (or more +		commonly, during resume).  Namely, the RTC is only used to save +		the last PM event point if this file contains '1'.  Initially +		it contains '0' which may be changed to '1' by writing a +		string representing a nonzero integer into it. + +		To use this debugging feature you should attempt to suspend +		the machine, then reboot it and run + +		dmesg -s 1000000 | grep 'hash matches' + +		CAUTION: Using it will cause your machine's real-time (CMOS) +		clock to be set to a random invalid time after a resume. diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 552507fe9a7e..611acc32fdf5 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt @@ -6,6 +6,21 @@ be removed from this file.  --------------------------- +What:	/sys/devices/.../power/state +	dev->power.power_state +	dpm_runtime_{suspend,resume)() +When:	July 2007 +Why:	Broken design for runtime control over driver power states, confusing +	driver-internal runtime power management with:  mechanisms to support +	system-wide sleep state transitions; event codes that distinguish +	different phases of swsusp "sleep" transitions; and userspace policy +	inputs.  This framework was never widely used, and most attempts to +	use it were broken.  Drivers should instead be exposing domain-specific +	interfaces either to kernel or to userspace. +Who:	Pavel Machek <pavel@suse.cz> + +--------------------------- +  What:	RAW driver (CONFIG_RAW_DRIVER)  When:	December 2005  Why:	declared obsolete since kernel 2.6.3 @@ -294,3 +309,15 @@ Why:	The frame diverter is included in most distribution kernels, but is  	It is not clear if anyone is still using it.  Who:	Stephen Hemminger <shemminger@osdl.org> +--------------------------- + + +What:	PHYSDEVPATH, PHYSDEVBUS, PHYSDEVDRIVER in the uevent environment +When:	Oktober 2008 +Why:	The stacking of class devices makes these values misleading and +	inconsistent. +	Class devices should not carry any of these properties, and bus +	devices have SUBSYTEM and DRIVER as a replacement. +Who:	Kay Sievers <kay.sievers@suse.de> + +--------------------------- diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt index fba1e05c47c7..d0e79d5820a5 100644 --- a/Documentation/power/devices.txt +++ b/Documentation/power/devices.txt @@ -1,208 +1,553 @@ +Most of the code in Linux is device drivers, so most of the Linux power +management code is also driver-specific.  Most drivers will do very little; +others, especially for platforms with small batteries (like cell phones), +will do a lot. + +This writeup gives an overview of how drivers interact with system-wide +power management goals, emphasizing the models and interfaces that are +shared by everything that hooks up to the driver model core.  Read it as +background for the domain-specific work you'd do with any specific driver. + + +Two Models for Device Power Management +====================================== +Drivers will use one or both of these models to put devices into low-power +states: + +    System Sleep model: +	Drivers can enter low power states as part of entering system-wide +	low-power states like "suspend-to-ram", or (mostly for systems with +	disks) "hibernate" (suspend-to-disk). + +	This is something that device, bus, and class drivers collaborate on +	by implementing various role-specific suspend and resume methods to +	cleanly power down hardware and software subsystems, then reactivate +	them without loss of data. + +	Some drivers can manage hardware wakeup events, which make the system +	leave that low-power state.  This feature may be disabled using the +	relevant /sys/devices/.../power/wakeup file; enabling it may cost some +	power usage, but let the whole system enter low power states more often. + +    Runtime Power Management model: +	Drivers may also enter low power states while the system is running, +	independently of other power management activity.  Upstream drivers +	will normally not know (or care) if the device is in some low power +	state when issuing requests; the driver will auto-resume anything +	that's needed when it gets a request. + +	This doesn't have, or need much infrastructure; it's just something you +	should do when writing your drivers.  For example, clk_disable() unused +	clocks as part of minimizing power drain for currently-unused hardware. +	Of course, sometimes clusters of drivers will collaborate with each +	other, which could involve task-specific power management. + +There's not a lot to be said about those low power states except that they +are very system-specific, and often device-specific.  Also, that if enough +drivers put themselves into low power states (at "runtime"), the effect may be +the same as entering some system-wide low-power state (system sleep) ... and +that synergies exist, so that several drivers using runtime pm might put the +system into a state where even deeper power saving options are available. + +Most suspended devices will have quiesced all I/O:  no more DMA or irqs, no +more data read or written, and requests from upstream drivers are no longer +accepted.  A given bus or platform may have different requirements though. + +Examples of hardware wakeup events include an alarm from a real time clock, +network wake-on-LAN packets, keyboard or mouse activity, and media insertion +or removal (for PCMCIA, MMC/SD, USB, and so on). + + +Interfaces for Entering System Sleep States +=========================================== +Most of the programming interfaces a device driver needs to know about +relate to that first model:  entering a system-wide low power state, +rather than just minimizing power consumption by one device. + + +Bus Driver Methods +------------------ +The core methods to suspend and resume devices reside in struct bus_type. +These are mostly of interest to people writing infrastructure for busses +like PCI or USB, or because they define the primitives that device drivers +may need to apply in domain-specific ways to their devices: -Device Power Management +struct bus_type { +	... +	int  (*suspend)(struct device *dev, pm_message_t state); +	int  (*suspend_late)(struct device *dev, pm_message_t state); +	int  (*resume_early)(struct device *dev); +	int  (*resume)(struct device *dev); +}; -Device power management encompasses two areas - the ability to save -state and transition a device to a low-power state when the system is -entering a low-power state; and the ability to transition a device to -a low-power state while the system is running (and independently of -any other power management activity).  +Bus drivers implement those methods as appropriate for the hardware and +the drivers using it; PCI works differently from USB, and so on.  Not many +people write bus drivers; most driver code is a "device driver" that +builds on top of bus-specific framework code. + +For more information on these driver calls, see the description later; +they are called in phases for every device, respecting the parent-child +sequencing in the driver model tree.  Note that as this is being written, +only the suspend() and resume() are widely available; not many bus drivers +leverage all of those phases, or pass them down to lower driver levels. + + +/sys/devices/.../power/wakeup files +----------------------------------- +All devices in the driver model have two flags to control handling of +wakeup events, which are hardware signals that can force the device and/or +system out of a low power state.  These are initialized by bus or device +driver code using device_init_wakeup(dev,can_wakeup). + +The "can_wakeup" flag just records whether the device (and its driver) can +physically support wakeup events.  When that flag is clear, the sysfs +"wakeup" file is empty, and device_may_wakeup() returns false. + +For devices that can issue wakeup events, a separate flag controls whether +that device should try to use its wakeup mechanism.  The initial value of +device_may_wakeup() will be true, so that the device's "wakeup" file holds +the value "enabled".  Userspace can change that to "disabled" so that +device_may_wakeup() returns false; or change it back to "enabled" (so that +it returns true again). + + +EXAMPLE:  PCI Device Driver Methods +----------------------------------- +PCI framework software calls these methods when the PCI device driver bound +to a device device has provided them: + +struct pci_driver { +	... +	int  (*suspend)(struct pci_device *pdev, pm_message_t state); +	int  (*suspend_late)(struct pci_device *pdev, pm_message_t state); + +	int  (*resume_early)(struct pci_device *pdev); +	int  (*resume)(struct pci_device *pdev); +}; +Drivers will implement those methods, and call PCI-specific procedures +like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and +pci_restore_state() to manage PCI-specific mechanisms.  (PCI config space +could be saved during driver probe, if it weren't for the fact that some +systems rely on userspace tweaking using setpci.)  Devices are suspended +before their bridges enter low power states, and likewise bridges resume +before their devices. + + +Upper Layers of Driver Stacks +----------------------------- +Device drivers generally have at least two interfaces, and the methods +sketched above are the ones which apply to the lower level (nearer PCI, USB, +or other bus hardware).  The network and block layers are examples of upper +level interfaces, as is a character device talking to userspace. + +Power management requests normally need to flow through those upper levels, +which often use domain-oriented requests like "blank that screen".  In +some cases those upper levels will have power management intelligence that +relates to end-user activity, or other devices that work in cooperation. + +When those interfaces are structured using class interfaces, there is a +standard way to have the upper layer stop issuing requests to a given +class device (and restart later): + +struct class { +	... +	int  (*suspend)(struct device *dev, pm_message_t state); +	int  (*resume)(struct device *dev); +}; -Methods +Those calls are issued in specific phases of the process by which the +system enters a low power "suspend" state, or resumes from it. + + +Calling Drivers to Enter System Sleep States +============================================ +When the system enters a low power state, each device's driver is asked +to suspend the device by putting it into state compatible with the target +system state.  That's usually some version of "off", but the details are +system-specific.  Also, wakeup-enabled devices will usually stay partly +functional in order to wake the system. + +When the system leaves that low power state, the device's driver is asked +to resume it.  The suspend and resume operations always go together, and +both are multi-phase operations. + +For simple drivers, suspend might quiesce the device using the class code +and then turn its hardware as "off" as possible with late_suspend.  The +matching resume calls would then completely reinitialize the hardware +before reactivating its class I/O queues. + +More power-aware drivers drivers will use more than one device low power +state, either at runtime or during system sleep states, and might trigger +system wakeup events. + + +Call Sequence Guarantees +------------------------ +To ensure that bridges and similar links needed to talk to a device are +available when the device is suspended or resumed, the device tree is +walked in a bottom-up order to suspend devices.  A top-down order is +used to resume those devices. + +The ordering of the device tree is defined by the order in which devices +get registered:  a child can never be registered, probed or resumed before +its parent; and can't be removed or suspended after that parent. + +The policy is that the device tree should match hardware bus topology. +(Or at least the control bus, for devices which use multiple busses.) + + +Suspending Devices +------------------ +Suspending a given device is done in several phases.  Suspending the +system always includes every phase, executing calls for every device +before the next phase begins.  Not all busses or classes support all +these callbacks; and not all drivers use all the callbacks. + +The phases are seen by driver notifications issued in this order: + +   1	class.suspend(dev, message) is called after tasks are frozen, for +	devices associated with a class that has such a method.  This +	method may sleep. + +	Since I/O activity usually comes from such higher layers, this is +	a good place to quiesce all drivers of a given type (and keep such +	code out of those drivers). + +   2	bus.suspend(dev, message) is called next.  This method may sleep, +	and is often morphed into a device driver call with bus-specific +	parameters and/or rules. + +	This call should handle parts of device suspend logic that require +	sleeping.  It probably does work to quiesce the device which hasn't +	been abstracted into class.suspend() or bus.suspend_late(). + +   3	bus.suspend_late(dev, message) is called with IRQs disabled, and +	with only one CPU active.  Until the bus.resume_early() phase +	completes (see later), IRQs are not enabled again.  This method +	won't be exposed by all busses; for message based busses like USB, +	I2C, or SPI, device interactions normally require IRQs.  This bus +	call may be morphed into a driver call with bus-specific parameters. + +	This call might save low level hardware state that might otherwise +	be lost in the upcoming low power state, and actually put the +	device into a low power state ... so that in some cases the device +	may stay partly usable until this late.  This "late" call may also +	help when coping with hardware that behaves badly. + +The pm_message_t parameter is currently used to refine those semantics +(described later). + +At the end of those phases, drivers should normally have stopped all I/O +transactions (DMA, IRQs), saved enough state that they can re-initialize +or restore previous state (as needed by the hardware), and placed the +device into a low-power state.  On many platforms they will also use +clk_disable() to gate off one or more clock sources; sometimes they will +also switch off power supplies, or reduce voltages.  Drivers which have +runtime PM support may already have performed some or all of the steps +needed to prepare for the upcoming system sleep state. + +When any driver sees that its device_can_wakeup(dev), it should make sure +to use the relevant hardware signals to trigger a system wakeup event. +For example, enable_irq_wake() might identify GPIO signals hooked up to +a switch or other external hardware, and pci_enable_wake() does something +similar for PCI's PME# signal. + +If a driver (or bus, or class) fails it suspend method, the system won't +enter the desired low power state; it will resume all the devices it's +suspended so far. + +Note that drivers may need to perform different actions based on the target +system lowpower/sleep state.  At this writing, there are only platform +specific APIs through which drivers could determine those target states. + + +Device Low Power (suspend) States +--------------------------------- +Device low-power states aren't very standard.  One device might only handle +"on" and "off, while another might support a dozen different versions of +"on" (how many engines are active?), plus a state that gets back to "on" +faster than from a full "off". + +Some busses define rules about what different suspend states mean.  PCI +gives one example:  after the suspend sequence completes, a non-legacy +PCI device may not perform DMA or issue IRQs, and any wakeup events it +issues would be issued through the PME# bus signal.  Plus, there are +several PCI-standard device states, some of which are optional. + +In contrast, integrated system-on-chip processors often use irqs as the +wakeup event sources (so drivers would call enable_irq_wake) and might +be able to treat DMA completion as a wakeup event (sometimes DMA can stay +active too, it'd only be the CPU and some peripherals that sleep). + +Some details here may be platform-specific.  Systems may have devices that +can be fully active in certain sleep states, such as an LCD display that's +refreshed using DMA while most of the system is sleeping lightly ... and +its frame buffer might even be updated by a DSP or other non-Linux CPU while +the Linux control processor stays idle. + +Moreover, the specific actions taken may depend on the target system state. +One target system state might allow a given device to be very operational; +another might require a hard shut down with re-initialization on resume. +And two different target systems might use the same device in different +ways; the aforementioned LCD might be active in one product's "standby", +but a different product using the same SOC might work differently. + + +Meaning of pm_message_t.event +----------------------------- +Parameters to suspend calls include the device affected and a message of +type pm_message_t, which has one field:  the event.  If driver does not +recognize the event code, suspend calls may abort the request and return +a negative errno.  However, most drivers will be fine if they implement +PM_EVENT_SUSPEND semantics for all messages. + +The event codes are used to refine the goal of suspending the device, and +mostly matter when creating or resuming system memory image snapshots, as +used with suspend-to-disk: + +    PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power +	state.  When used with system sleep states like "suspend-to-RAM" or +	"standby", the upcoming resume() call will often be able to rely on +	state kept in hardware, or issue system wakeup events.  When used +	instead with suspend-to-disk, few devices support this capability; +	most are completely powered off. + +    PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into +	any low power mode.  A system snapshot is about to be taken, often +	followed by a call to the driver's resume() method.  Neither wakeup +	events nor DMA are allowed. + +    PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume() +	will restore a suspend-to-disk snapshot from a different kernel image. +	Drivers that are smart enough to look at their hardware state during +	resume() processing need that state to be correct ... a PRETHAW could +	be used to invalidate that state (by resetting the device), like a +	shutdown() invocation would before a kexec() or system halt.  Other +	drivers might handle this the same way as PM_EVENT_FREEZE.  Neither +	wakeup events nor DMA are allowed. + +To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or +the similarly named APM states, only PM_EVENT_SUSPEND is used; for "Suspend +to Disk" (STD, hibernate, ACPI S4), all of those event codes are used. + +There's also PM_EVENT_ON, a value which never appears as a suspend event +but is sometimes used to record the "not suspended" device state. + + +Resuming Devices +---------------- +Resuming is done in multiple phases, much like suspending, with all +devices processing each phase's calls before the next phase begins. + +The phases are seen by driver notifications issued in this order: + +   1	bus.resume_early(dev) is called with IRQs disabled, and with +   	only one CPU active.  As with bus.suspend_late(), this method +	won't be supported on busses that require IRQs in order to +	interact with devices. + +	This reverses the effects of bus.suspend_late(). + +   2	bus.resume(dev) is called next.  This may be morphed into a device +   	driver call with bus-specific parameters; implementations may sleep. + +	This reverses the effects of bus.suspend(). + +   3	class.resume(dev) is called for devices associated with a class +	that has such a method.  Implementations may sleep. + +	This reverses the effects of class.suspend(), and would usually +	reactivate the device's I/O queue. + +At the end of those phases, drivers should normally be as functional as +they were before suspending:  I/O can be performed using DMA and IRQs, and +the relevant clocks are gated on.  The device need not be "fully on"; it +might be in a runtime lowpower/suspend state that acts as if it were. + +However, the details here may again be platform-specific.  For example, +some systems support multiple "run" states, and the mode in effect at +the end of resume() might not be the one which preceded suspension. +That means availability of certain clocks or power supplies changed, +which could easily affect how a driver works. + + +Drivers need to be able to handle hardware which has been reset since the +suspend methods were called, for example by complete reinitialization. +This may be the hardest part, and the one most protected by NDA'd documents +and chip errata.  It's simplest if the hardware state hasn't changed since +the suspend() was called, but that can't always be guaranteed. + +Drivers must also be prepared to notice that the device has been removed +while the system was powered off, whenever that's physically possible. +PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses +where common Linux platforms will see such removal.  Details of how drivers +will notice and handle such removals are currently bus-specific, and often +involve a separate thread. -The methods to suspend and resume devices reside in struct bus_type:  -struct bus_type { -       ... -       int             (*suspend)(struct device * dev, pm_message_t state); -       int             (*resume)(struct device * dev); -}; +Note that the bus-specific runtime PM wakeup mechanism can exist, and might +be defined to share some of the same driver code as for system wakeup.  For +example, a bus-specific device driver's resume() method might be used there, +so it wouldn't only be called from bus.resume() during system-wide wakeup. +See bus-specific information about how runtime wakeup events are handled. -Each bus driver is responsible implementing these methods, translating -the call into a bus-specific request and forwarding the call to the -bus-specific drivers. For example, PCI drivers implement suspend() and -resume() methods in struct pci_driver. The PCI core is simply -responsible for translating the pointers to PCI-specific ones and -calling the low-level driver. - -This is done to a) ease transition to the new power management methods -and leverage the existing PM code in various bus drivers; b) allow -buses to implement generic and default PM routines for devices, and c) -make the flow of execution obvious to the reader.  - - -System Power Management - -When the system enters a low-power state, the device tree is walked in -a depth-first fashion to transition each device into a low-power -state. The ordering of the device tree is guaranteed by the order in -which devices get registered - children are never registered before -their ancestors, and devices are placed at the back of the list when -registered. By walking the list in reverse order, we are guaranteed to -suspend devices in the proper order.  - -Devices are suspended once with interrupts enabled. Drivers are -expected to stop I/O transactions, save device state, and place the -device into a low-power state. Drivers may sleep, allocate memory, -etc. at will.  - -Some devices are broken and will inevitably have problems powering -down or disabling themselves with interrupts enabled. For these -special cases, they may return -EAGAIN. This will put the device on a -list to be taken care of later. When interrupts are disabled, before -we enter the low-power state, their drivers are called again to put -their device to sleep.  - -On resume, the devices that returned -EAGAIN will be called to power -themselves back on with interrupts disabled. Once interrupts have been -re-enabled, the rest of the drivers will be called to resume their -devices. On resume, a driver is responsible for powering back on each -device, restoring state, and re-enabling I/O transactions for that -device.  +System Devices +--------------  System devices follow a slightly different API, which can be found in  	include/linux/sysdev.h  	drivers/base/sys.c -System devices will only be suspended with interrupts disabled, and -after all other devices have been suspended. On resume, they will be -resumed before any other devices, and also with interrupts disabled. +System devices will only be suspended with interrupts disabled, and after +all other devices have been suspended.  On resume, they will be resumed +before any other devices, and also with interrupts disabled. +That is, IRQs are disabled, the suspend_late() phase begins, then the +sysdev_driver.suspend() phase, and the system enters a sleep state.  Then +the sysdev_driver.resume() phase begins, followed by the resume_early() +phase, after which IRQs are enabled. -Runtime Power Management - -Many devices are able to dynamically power down while the system is -still running. This feature is useful for devices that are not being -used, and can offer significant power savings on a running system.  - -In each device's directory, there is a 'power' directory, which -contains at least a 'state' file. Reading from this file displays what -power state the device is currently in. Writing to this file initiates -a transition to the specified power state, which must be a decimal in -the range 1-3, inclusive; or 0 for 'On'. +Code to actually enter and exit the system-wide low power state sometimes +involves hardware details that are only known to the boot firmware, and +may leave a CPU running software (from SRAM or flash memory) that monitors +the system and manages its wakeup sequence. -The PM core will call the ->suspend() method in the bus_type object -that the device belongs to if the specified state is not 0, or -->resume() if it is.  -Nothing will happen if the specified state is the same state the -device is currently in.  - -If the device is already in a low-power state, and the specified state -is another, but different, low-power state, the ->resume() method will -first be called to power the device back on, then ->suspend() will be -called again with the new state.  - -The driver is responsible for saving the working state of the device -and putting it into the low-power state specified. If this was -successful, it returns 0, and the device's power_state field is -updated.  - -The driver must take care to know whether or not it is able to -properly resume the device, including all step of reinitialization -necessary. (This is the hardest part, and the one most protected by -NDA'd documents).  - -The driver must also take care not to suspend a device that is -currently in use. It is their responsibility to provide their own -exclusion mechanisms. - -The runtime power transition happens with interrupts enabled. If a -device cannot support being powered down with interrupts, it may -return -EAGAIN (as it would during a system power management -transition),  but it will _not_ be called again, and the transaction -will fail. - -There is currently no way to know what states a device or driver -supports a priori. This will change in the future.  - -pm_message_t meaning - -pm_message_t has two fields. event ("major"), and flags.  If driver -does not know event code, it aborts the request, returning error. Some -drivers may need to deal with special cases based on the actual type -of suspend operation being done at the system level. This is why -there are flags. - -Event codes are: - -ON -- no need to do anything except special cases like broken -HW. - -# NOTIFICATION -- pretty much same as ON? - -FREEZE -- stop DMA and interrupts, and be prepared to reinit HW from -scratch. That probably means stop accepting upstream requests, the -actual policy of what to do with them being specific to a given -driver. It's acceptable for a network driver to just drop packets -while a block driver is expected to block the queue so no request is -lost. (Use IDE as an example on how to do that). FREEZE requires no -power state change, and it's expected for drivers to be able to -quickly transition back to operating state. - -SUSPEND -- like FREEZE, but also put hardware into low-power state. If -there's need to distinguish several levels of sleep, additional flag -is probably best way to do that. - -Transitions are only from a resumed state to a suspended state, never -between 2 suspended states. (ON -> FREEZE or ON -> SUSPEND can happen, -FREEZE -> SUSPEND or SUSPEND -> FREEZE can not). - -All events are: - -[NOTE NOTE NOTE: If you are driver author, you should not care; you -should only look at event, and ignore flags.] - -#Prepare for suspend -- userland is still running but we are going to -#enter suspend state. This gives drivers chance to load firmware from -#disk and store it in memory, or do other activities taht require -#operating userland, ability to kmalloc GFP_KERNEL, etc... All of these -#are forbiden once the suspend dance is started.. event = ON, flags = -#PREPARE_TO_SUSPEND - -Apm standby -- prepare for APM event. Quiesce devices to make life -easier for APM BIOS. event = FREEZE, flags = APM_STANDBY - -Apm suspend -- same as APM_STANDBY, but it we should probably avoid -spinning down disks. event = FREEZE, flags = APM_SUSPEND - -System halt, reboot -- quiesce devices to make life easier for BIOS. event -= FREEZE, flags = SYSTEM_HALT or SYSTEM_REBOOT - -System shutdown -- at least disks need to be spun down, or data may be -lost. Quiesce devices, just to make life easier for BIOS. event = -FREEZE, flags = SYSTEM_SHUTDOWN - -Kexec    -- turn off DMAs and put hardware into some state where new -kernel can take over. event = FREEZE, flags = KEXEC - -Powerdown at end of swsusp -- very similar to SYSTEM_SHUTDOWN, except wake -may need to be enabled on some devices. This actually has at least 3 -subtypes, system can reboot, enter S4 and enter S5 at the end of -swsusp. event = FREEZE, flags = SWSUSP and one of SYSTEM_REBOOT, -SYSTEM_SHUTDOWN, SYSTEM_S4 - -Suspend to ram  -- put devices into low power state. event = SUSPEND, -flags = SUSPEND_TO_RAM - -Freeze for swsusp snapshot -- stop DMA and interrupts. No need to put -devices into low power mode, but you must be able to reinitialize -device from scratch in resume method. This has two flavors, its done -once on suspending kernel, once on resuming kernel. event = FREEZE, -flags = DURING_SUSPEND or DURING_RESUME - -Device detach requested from /sys -- deinitialize device; proably same as -SYSTEM_SHUTDOWN, I do not understand this one too much. probably event -= FREEZE, flags = DEV_DETACH. - -#These are not really events sent: -# -#System fully on -- device is working normally; this is probably never -#passed to suspend() method... event = ON, flags = 0 -# -#Ready after resume -- userland is now running, again. Time to free any -#memory you ate during prepare to suspend... event = ON, flags = -#READY_AFTER_RESUME -# +Runtime Power Management +======================== +Many devices are able to dynamically power down while the system is still +running. This feature is useful for devices that are not being used, and +can offer significant power savings on a running system.  These devices +often support a range of runtime power states, which might use names such +as "off", "sleep", "idle", "active", and so on.  Those states will in some +cases (like PCI) be partially constrained by a bus the device uses, and will +usually include hardware states that are also used in system sleep states. + +However, note that if a driver puts a device into a runtime low power state +and the system then goes into a system-wide sleep state, it normally ought +to resume into that runtime low power state rather than "full on".  Such +distinctions would be part of the driver-internal state machine for that +hardware; the whole point of runtime power management is to be sure that +drivers are decoupled in that way from the state machine governing phases +of the system-wide power/sleep state transitions. + + +Power Saving Techniques +----------------------- +Normally runtime power management is handled by the drivers without specific +userspace or kernel intervention, by device-aware use of techniques like: + +    Using information provided by other system layers +	- stay deeply "off" except between open() and close() +	- if transceiver/PHY indicates "nobody connected", stay "off" +	- application protocols may include power commands or hints + +    Using fewer CPU cycles +	- using DMA instead of PIO +	- removing timers, or making them lower frequency +	- shortening "hot" code paths +	- eliminating cache misses +	- (sometimes) offloading work to device firmware + +    Reducing other resource costs +	- gating off unused clocks in software (or hardware) +	- switching off unused power supplies +	- eliminating (or delaying/merging) IRQs +	- tuning DMA to use word and/or burst modes + +    Using device-specific low power states +	- using lower voltages +	- avoiding needless DMA transfers + +Read your hardware documentation carefully to see the opportunities that +may be available.  If you can, measure the actual power usage and check +it against the budget established for your project. + + +Examples:  USB hosts, system timer, system CPU +---------------------------------------------- +USB host controllers make interesting, if complex, examples.  In many cases +these have no work to do:  no USB devices are connected, or all of them are +in the USB "suspend" state.  Linux host controller drivers can then disable +periodic DMA transfers that would otherwise be a constant power drain on the +memory subsystem, and enter a suspend state.  In power-aware controllers, +entering that suspend state may disable the clock used with USB signaling, +saving a certain amount of power. + +The controller will be woken from that state (with an IRQ) by changes to the +signal state on the data lines of a given port, for example by an existing +peripheral requesting "remote wakeup" or by plugging a new peripheral.  The +same wakeup mechanism usually works from "standby" sleep states, and on some +systems also from "suspend to RAM" (or even "suspend to disk") states. +(Except that ACPI may be involved instead of normal IRQs, on some hardware.) + +System devices like timers and CPUs may have special roles in the platform +power management scheme.  For example, system timers using a "dynamic tick" +approach don't just save CPU cycles (by eliminating needless timer IRQs), +but they may also open the door to using lower power CPU "idle" states that +cost more than a jiffie to enter and exit.  On x86 systems these are states +like "C3"; note that periodic DMA transfers from a USB host controller will +also prevent entry to a C3 state, much like a periodic timer IRQ. + +That kind of runtime mechanism interaction is common.  "System On Chip" (SOC) +processors often have low power idle modes that can't be entered unless +certain medium-speed clocks (often 12 or 48 MHz) are gated off.  When the +drivers gate those clocks effectively, then the system idle task may be able +to use the lower power idle modes and thereby increase battery life. + +If the CPU can have a "cpufreq" driver, there also may be opportunities +to shift to lower voltage settings and reduce the power cost of executing +a given number of instructions.  (Without voltage adjustment, it's rare +for cpufreq to save much power; the cost-per-instruction must go down.) + + +/sys/devices/.../power/state files +================================== +For now you can also test some of this functionality using sysfs. + +	DEPRECATED:  USE "power/state" ONLY FOR DRIVER TESTING, AND +	AVOID USING dev->power.power_state IN DRIVERS. + +	THESE WILL BE REMOVED.  IF THE "power/state" FILE GETS REPLACED, +	IT WILL BECOME SOMETHING COUPLED TO THE BUS OR DRIVER. + +In each device's directory, there is a 'power' directory, which contains +at least a 'state' file.  The value of this field is effectively boolean, +PM_EVENT_ON or PM_EVENT_SUSPEND. + +   *	Reading from this file displays a value corresponding to +	the power.power_state.event field.  All nonzero values are +	displayed as "2", corresponding to a low power state; zero +	is displayed as "0", corresponding to normal operation. + +   *	Writing to this file initiates a transition using the +   	specified event code number; only '0', '2', and '3' are +	accepted (without a newline); '2' and '3' are both +	mapped to PM_EVENT_SUSPEND. + +On writes, the PM core relies on that recorded event code and the device/bus +capabilities to determine whether it uses a partial suspend() or resume() +sequence to change things so that the recorded event corresponds to the +numeric parameter. + +   -	If the bus requires the irqs-disabled suspend_late()/resume_early() +	phases, writes fail because those operations are not supported here. + +   -	If the recorded value is the expected value, nothing is done. + +   -	If the recorded value is nonzero, the device is partially resumed, +	using the bus.resume() and/or class.resume() methods. + +   -	If the target value is nonzero, the device is partially suspended, +	using the class.suspend() and/or bus.suspend() methods and the +	PM_EVENT_SUSPEND message. + +Drivers have no way to tell whether their suspend() and resume() calls +have come through the sysfs power/state file or as part of entering a +system sleep state, except that when accessed through sysfs the normal +parent/child sequencing rules are ignored.  Drivers (such as bus, bridge, +or hub drivers) which expose child devices may need to enforce those rules +on their own.  |