2010-08-23 19:17:42

Subject: Runtime PM and the block layer

Jens:

I want to implement runtime power management for the SCSI sd driver.
The idea is that the device should automatically be suspended after a
certain amount of time spent idle.

The basic outline is simple enough. If the device is in low power when
a request arrives, delay handling the request until the device can be
brought back to high power. When a request completes and the request
queue is empty, schedule a runtime-suspend for the appropriate time in
the future.
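
As a rough illustration of this outline, the behaviour can be modelled in
userspace as a small state machine. This is only a sketch: it is not kernel
code, and every name in it (sim_dev, sim_submit, and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical power states for the model; not kernel definitions. */
enum dev_power { DEV_ACTIVE, DEV_SUSPENDED };

struct sim_dev {
	enum dev_power power;
	int queued;               /* requests waiting on the queue */
	int in_flight;            /* requests handed to the device */
	bool resume_pending;      /* a resume has been asked for */
	bool suspend_timer_armed; /* idle timer is running */
};

/* A request arrives: cancel the idle timer, ask for a resume if needed. */
static void sim_submit(struct sim_dev *d)
{
	d->queued++;
	d->suspend_timer_armed = false;
	if (d->power == DEV_SUSPENDED)
		d->resume_pending = true;
}

/* Hand one request to the device; refused while in low power. */
static bool sim_dispatch(struct sim_dev *d)
{
	if (d->power != DEV_ACTIVE || d->queued == 0)
		return false;
	d->queued--;
	d->in_flight++;
	return true;
}

/* A request completes: arm the idle timer once nothing is left. */
static void sim_complete(struct sim_dev *d)
{
	assert(d->in_flight > 0);
	d->in_flight--;
	if (d->queued == 0 && d->in_flight == 0)
		d->suspend_timer_armed = true;
}

/* The device is back at full power. */
static void sim_resume_done(struct sim_dev *d)
{
	d->power = DEV_ACTIVE;
	d->resume_pending = false;
}

/* The idle timer fires: suspend only if the device is still idle. */
static void sim_timer_fired(struct sim_dev *d)
{
	if (d->suspend_timer_armed && d->queued == 0 && d->in_flight == 0) {
		d->power = DEV_SUSPENDED;
		d->suspend_timer_armed = false;
	}
}
```

The open questions in the rest of this message are about what sim_dispatch()
and sim_resume_done() correspond to in the real request-queue API.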

The difficulty is that I don't know the right way these things should
interact with the request-queue management. A request can be deferred
by making the prep_req_fn return BLKPREP_DEFER, right? But then what
happens to the request and to the queue? How does the runtime-resume
routine tell the block layer that the deferred request should be
restarted?

How does this all relate to the queue being stopped or plugged?

Another thing: The runtime-resume routine needs to send its own
commands to the device (to spin up a drive, for example). These
commands must be sent before anything on the request queue, and they
must be handled right away even though the normal requests on the queue
are still deferred.

What's the right way to do all this?

Thanks,

Alan Stern

2010-08-23 19:53:38

by Jens Axboe

Subject: Re: Runtime PM and the block layer

On 08/23/2010 09:17 PM, Alan Stern wrote:
> Jens:
>
> I want to implement runtime power management for the SCSI sd driver.
> The idea is that the device should automatically be suspended after a
> certain amount of time spent idle.
>
> The basic outline is simple enough. If the device is in low power when
> a request arrives, delay handling the request until the device can be
> brought back to high power. When a request completes and the request
> queue is empty, schedule a runtime-suspend for the appropriate time in
> the future.

So if it's in low power mode, you need to defer because you want to
issue some special request first to bring it back to life?

> The difficulty is that I don't know the right way these things should
> interact with the request-queue management. A request can be deferred
> by making the prep_req_fn return BLKPREP_DEFER, right? But then what

Right, that is used for resource starvation. So usually very short
conditions.

> happens to the request and to the queue? How does the runtime-resume
> routine tell the block layer that the deferred request should be
> restarted?

Internally, it uses the block queue plugging to set a timer to defer a
bit. That's purely implementation detail and it will change in the
not-so-distant future if I kill the per-queue plugging. The effect will
still be the same though, the action will be automatically retried after
some defined interval.

> How does this all relate to the queue being stopped or plugged?

A stopped queue is usually the driver telling the block layer to bugger
off for a while, and the driver will tell us when it's ok to resume
operations. So we can't control that part. Plugging we can control. But
if the device is plugged, the driver is idle _and_ we have IO pending.
So you would not be entering a lower power mode at that point, and the
driver should already be in an operational state; when it got plugged,
we should have issued the special req to send it into live mode.

> Another thing: The runtime-resume routine needs to send its own
> commands to the device (to spin up a drive, for example). These
> commands must be sent before anything on the request queue, and they
> must be handled right away even though the normal requests on the queue
> are still deferred.

We can flag those requests as being of some category that is allowed to
bypass the sleep state of the device. Handling right away can be
accomplished by just inserting at the front and having that flag set.

> What's the right way to do all this?

It needs to be done carefully. A queue can go in and out of idle/busy
state extremely fast. I did quite a few tricks on the queue timeout
handling to ensure that it didn't have much overhead on a per-rq basis.
So we could probably add an idle timer that is set to some suitable
timeout for this and would be added when the queue first goes empty. If
new requests come in, just let it simmer and defer checking the state to
when it actually fires. If nothing has happened, issue a new
q->power_mode(new_state) callback that would then queue a suitable
request to change the power state of the device. Queueing a new request
could check the state and issue a q->power_mode(RUNNING) or similar call
to bring things back to life.

Just a few ideas...

--
Jens Axboe

2010-08-23 21:51:38

by Alan Stern

Subject: Re: Runtime PM and the block layer

On Mon, 23 Aug 2010, Jens Axboe wrote:

> On 08/23/2010 09:17 PM, Alan Stern wrote:
> > Jens:
> >
> > I want to implement runtime power management for the SCSI sd driver.
> > The idea is that the device should automatically be suspended after a
> > certain amount of time spent idle.
> >
> > The basic outline is simple enough. If the device is in low power when
> > a request arrives, delay handling the request until the device can be
> > brought back to high power. When a request completes and the request
> > queue is empty, schedule a runtime-suspend for the appropriate time in
> > the future.
>
> So if it's in low power mode, you need to defer because you want to
> issue some special request first to bring it back to life?

Exactly. And also because if the device is in low-power mode then its
parent might be in low-power too, meaning that we would have to wait
for both the parent and the device to return to full power before
sending the request.

The PM framework is set up so that power-state changes are always done
in process context -- meaning in this case that a workqueue would be
needed. The PM core has a special workqueue for just this purpose.
But obviously a prep function can't sit around and wait for the work to
get done.

> > The difficulty is that I don't know the right way these things should
> > interact with the request-queue management. A request can be deferred
> > by making the prep_req_fn return BLKPREP_DEFER, right? But then what
>
> Right, that is used for resource starvation. So usually very short
> conditions.
>
> > happens to the request and to the queue? How does the runtime-resume
> > routine tell the block layer that the deferred request should be
> > restarted?
>
> Internally, it uses the block queue plugging to set a timer to defer a
> bit. That's purely implementation detail and it will change in the
> not-so-distant future if I kill the per-queue plugging. The effect will
> still be the same though, the action will be automatically retried after
> some defined interval.

Hmm. That doesn't sound quite like what I need. Ideally the request
would go back to the head of the queue and stay there until the driver
tells the block layer to let it through (when the device is ready to
accept it).

> > How does this all relate to the queue being stopped or plugged?
>
> A stopped queue is usually the driver telling the block layer to bugger
> off for a while, and the driver will tell us when it's ok to resume
> operations.

Yes, that sounds more like it. Put the request back on the queue
and stop the queue. If the prep fn calls blk_stop_queue() and then
returns BLKPREP_DEFER, will that do it?

> So we can't control that part. Plugging we can control. But

I probably didn't make it clear in the earlier message: The changes
to implement all this PM stuff will go in the driver, with nothing (or
almost nothing) changed in the block layer. Hence stopping the queue
_is_ under my control.

Unless you think it would be better to change the block layer
instead...

> if the device is plugged, the driver is idle _and_ we have IO pending.
> So you would not be entering a lower power mode at that point, and the
> driver should already be in an operational state; when it got plugged,
> we should have issued the special req to send it into live mode.

Plugging doesn't seem like the right mechanism for this.

> > Another thing: The runtime-resume routine needs to send its own
> > commands to the device (to spin up a drive, for example). These
> > commands must be sent before anything on the request queue, and they
> > must be handled right away even though the normal requests on the queue
> > are still deferred.
>
> We can flag those requests as being of some category that is allowed to
> bypass the sleep state of the device. Handling right away can be
> accomplished by just inserting at the front and having that flag set.

Okay, good. But if the queue is stopped when the requests are
inserted at the front (with the flag set), will they be allowed to go
through to the driver? In other words, is there a way to force certain
requests to be processed even while the queue is stopped?

> > What's the right way to do all this?
>
> It needs to be done carefully. A queue can go in and out of idle/busy
> state extremely fast. I did quite a few tricks on the queue timeout
> handling to ensure that it didn't have much overhead on a per-rq basis.
> So we could probably add an idle timer that is set to some suitable
> timeout for this and would be added when the queue first goes empty. If
> new requests come in, just let it simmer and defer checking the state to
> when it actually fires. If nothing has happened, issue a new
> q->power_mode(new_state) callback that would then queue a suitable
> request to change the power state of the device. Queueing a new request
> could check the state and issue a q->power_mode(RUNNING) or similar call
> to bring things back to life.
>
> Just a few ideas...

The idle-time management can be handled in a couple of different ways,
and the PM core already contains routines to do it. I'm not worried
about that (I have a very clear understanding of the PM core). The
interactions with the block layer are where I need help.

Speaking of which... What is this q->power_mode stuff? I haven't run
across it before and it doesn't seem to be mentioned in
include/linux/blkdev.h. Is it connected with request_pm_state? I
don't know what that is either, or how it is meant to be used.

Alan Stern

2010-08-24 13:15:14

by Raj Kumar

Subject: Runtime power management during system resume

Hi Alan,

I have implemented runtime power management in my drivers. I have one
issue regarding system resume.

As documented, when a system sleep is triggered the power management
core increments the power_usage counter during prepare and decrements it when complete
is called.

Now I have a few questions:

1) When the system resume is done, it does not increase the power_usage counter,
right?

So does the driver then need to update the power_usage counter with the runtime power management
core and set the status back to active, meaning RPM_ACTIVE?

2) Suppose the device is active, meaning its power_usage counter is already one. During system
sleep, does the driver first suspend it through the runtime power management core and then continue
with system suspend?

3) Because I have seen the code of power management core and I did not see that during
system suspend, run time power management status is updated means RPM_SUSPENDED.
right?

What do you think?

Regards
Raj

2010-08-24 13:38:11

by Jens Axboe

Subject: Re: Runtime PM and the block layer

On 2010-08-23 23:51, Alan Stern wrote:
>>> happens to the request and to the queue? How does the runtime-resume
>>> routine tell the block layer that the deferred request should be
>>> restarted?
>>
>> Internally, it uses the block queue plugging to set a timer to defer a
>> bit. That's purely implementation detail and it will change in the
>> not-so-distant future if I kill the per-queue plugging. The effect will
>> still be the same though, the action will be automatically retried after
>> some defined interval.
>
> Hmm. That doesn't sound quite like what I need. Ideally the request
> would go back to the head of the queue and stay there until the driver
> tells the block layer to let it through (when the device is ready to
> accept it).

It depends on where you want to handle it. If you want the driver to
reject it, then we don't have to change the block layer bits a lot. We
could add a DEFER_AND_STOP or something, which would never retry and it
would stop the queue. If the driver passed that back, then it would be
responsible for starting the queue at some point in the future.

>>> How does this all relate to the queue being stopped or plugged?
>>
>> A stopped queue is usually the driver telling the block layer to bugger
>> off for a while, and the driver will tell us when it's ok to resume
>> operations.
>
> Yes, that sounds more like it. Put the request back on the queue
> and stop the queue. If the prep fn calls blk_stop_queue() and then
> returns BLKPREP_DEFER, will that do it?

I think it will be a lot cleaner to add specific support for this, as
per the DEFER_AND_STOP above.

>> So we can't control that part. Plugging we can control. But
>
> I probably didn't make it clear in the earlier message: The changes
> to implement all this PM stuff will go in the driver, with nothing (or
> almost nothing) changed in the block layer. Hence stopping the queue
> _is_ under my control.
>
> Unless you think it would be better to change the block layer
> instead...

Doing it in the driver is fine. We can always make things more generic
and share them across drivers if there's sharing to be had there.

It also means we don't need special request types that are allowed to
bypass certain queue states, since the driver will track the state and
know what to defer and what to pass through.

>> It needs to be done carefully. A queue can go in and out of idle/busy
>> state extremely fast. I did quite a few tricks on the queue timeout
>> handling to ensure that it didn't have much overhead on a per-rq basis.
>> So we could probably add an idle timer that is set to some suitable
>> timeout for this and would be added when the queue first goes empty. If
>> new requests come in, just let it simmer and defer checking the state to
>> when it actually fires. If nothing has happened, issue a new
>> q->power_mode(new_state) callback that would then queue a suitable
>> request to change the power state of the device. Queueing a new request
>> could check the state and issue a q->power_mode(RUNNING) or similar call
>> to bring things back to life.
>>
>> Just a few ideas...
>
> The idle-time management can be handled in a couple of different ways,
> and the PM core already contains routines to do it. I'm not worried
> about that (I have a very clear understanding of the PM core). The
> interactions with the block layer are where I need help.
>
> Speaking of which... What is this q->power_mode stuff? I haven't run
> across it before and it doesn't seem to be mentioned in
> include/linux/blkdev.h. Is it connected with request_pm_state? I
> don't know what that is either, or how it is meant to be used.

->power_mode() was just a suggested way to implement this, it doesn't
exist. But if you want to push it to the driver, then great, less work
for me :-)

Sounds like all you need is a way to return BLKPREP_DEFER_AND_STOP and
have the block layer stop the queue for you. When you need to restart,
you would insert a special request at the head of the queue and call
blk_start_queue() to get things going again.
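
A minimal userspace model of that flow might look like the following. It is
a sketch under stated assumptions, not block-layer code: PREP_DEFER_AND_STOP
stands in for the proposed (not yet existing) BLKPREP_DEFER_AND_STOP, and the
toy_* types and functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for prep return codes; DEFER_AND_STOP is only a proposal. */
enum prep_result { PREP_OK, PREP_DEFER_AND_STOP };

struct toy_request {
	bool special;             /* driver-internal wake-up command */
	struct toy_request *next;
};

struct toy_queue {
	struct toy_request *head;
	bool stopped;
	bool device_awake;
};

static void toy_insert_front(struct toy_queue *q, struct toy_request *rq)
{
	rq->next = q->head;
	q->head = rq;
}

/* Driver side: refuse normal requests while the device is asleep. */
static enum prep_result toy_prep(const struct toy_queue *q,
				 const struct toy_request *rq)
{
	if (!q->device_awake && !rq->special)
		return PREP_DEFER_AND_STOP;
	return PREP_OK;
}

/*
 * Block-layer side: dispatch the head request, or leave it in place and
 * stop the queue when the driver asks for that. Returns the dispatched
 * request, or NULL if nothing was dispatched.
 */
static struct toy_request *toy_run_queue(struct toy_queue *q)
{
	struct toy_request *rq = q->head;

	assert(q != NULL);
	if (q->stopped || rq == NULL)
		return NULL;
	if (toy_prep(q, rq) == PREP_DEFER_AND_STOP) {
		q->stopped = true;
		return NULL;
	}
	q->head = rq->next;
	rq->next = NULL;
	return rq;
}

/* Driver resume path: special request at the head, then restart. */
static void toy_start_resume(struct toy_queue *q, struct toy_request *wake)
{
	wake->special = true;
	toy_insert_front(q, wake);
	q->stopped = false;
}
```

In this model the deferred request never leaves the head of the queue; the
driver alone decides when the queue runs again.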

The only missing bit would then be the idle detection. That would need
to be in the block layer itself, and the scheme I described should be
fine for that still.

--
Jens Axboe

2010-08-24 14:30:28

by Alan Stern

Subject: Re: Runtime power management during system resume

On Tue, 24 Aug 2010, Raj Kumar wrote:

> Hi Alan,
>
> I have implemented the run time power management in my drivers. I have one
> issue regarding System resume.
>
> When the system sleep is triggered as it is mentioned that Power management
> core will increment the power_usage counter during prepare and decrements when complete
> is called.
>
> Now I have few questions:
>
> 1) When the system resume is done, it does not increase the power_usage counter.
> right?

That's right.

> So Does then the driver need to update the power_usage counter with run time power management
> core and again set it to active means RPM_ACTIVE?

Read section 6 of Documentation/power/runtime_pm.txt. It explains this.

> 2) Suppose device is active, means its power_usage counter is already one, Now during system
> sleep, does the driver first suspend it with run time power management core and then continue
> System suspend?

No.

> 3) Because I have seen the code of power management core and I did not see the that during
> system suspend, run time power management status is updated means RPM_SUSPENDED.
> right?

I don't understand your question.

Alan Stern

2010-08-24 14:42:30

by Alan Stern

Subject: Re: Runtime PM and the block layer

On Tue, 24 Aug 2010, Jens Axboe wrote:

> > Hmm. That doesn't sound quite like what I need. Ideally the request
> > would go back to the head of the queue and stay there until the driver
> > tells the block layer to let it through (when the device is ready to
> > accept it).
>
> It depends on where you want to handle it. If you want the driver to
> reject it, then we don't have to change the block layer bits a lot. We
> could add a DEFER_AND_STOP or something, which would never retry and it
> would stop the queue. If the driver passed that back, then it would be
> responsible for starting the queue at some point in the future.
>
> >>> How does this all relate to the queue being stopped or plugged?
> >>
> >> A stopped queue is usually the driver telling the block layer to bugger
> >> off for a while, and the driver will tell us when it's ok to resume
> >> operations.
> >
> > Yes, that sounds more like it. Put the request back on the queue
> > and stop the queue. If the prep fn calls blk_stop_queue() and then
> > returns BLKPREP_DEFER, will that do it?
>
> I think it will be a lot cleaner to add specific support for this, as
> per the DEFER_AND_STOP above.

Okay, good. I'll try to implement that and see how it goes.

> Sounds like all you need is a way to return BLKPREP_DEFER_AND_STOP and
> have the block layer stop the queue for you. When you need to restart,
> you would insert a special request at the head of the queue and call
> blk_start_queue() to get things going again.

Yes.

Suppose the driver needs to send two of these special requests before
going back to normal operation. Won't restarting the queue for the
first special request also cause the following regular request to be
passed to the driver before the second special request can be inserted?
Of course, the driver could cope with this simply by returning another
BLKPREP_DEFER_AND_STOP.

> The only missing bit would then be the idle detection. That would need
> to be in the block layer itself, and the scheme I described should be
> fine for that still.

Are you sure it needs to be in the block layer? Is there no way for
the driver's completion handler to tell whether the queue is now empty?
Certainly it already has enough information to know whether the device
is still busy processing another request. When the device is no longer
busy and the queue is empty, that's when the idle timer should be
started or restarted.

Alan Stern

2010-08-24 17:08:58

by Jens Axboe

Subject: Re: Runtime PM and the block layer

On 2010-08-24 16:42, Alan Stern wrote:
>> Sounds like all you need is a way to return BLKPREP_DEFER_AND_STOP and
>> have the block layer stop the queue for you. When you need to restart,
>> you would insert a special request at the head of the queue and call
>> blk_start_queue() to get things going again.
>
> Yes.
>
> Suppose the driver needs to send two of these special requests before
> going back to normal operation. Won't restarting the queue for the
> first special request also cause the following regular request to be
> passed to the driver before the second special request can be inserted?
> Of course, the driver could cope with this simply by returning another
> BLKPREP_DEFER_AND_STOP.

For that special request, you are sure to have some ->end_io() hook to
know when it's complete. When that triggers, you queue the 2nd special
request. And so on, for how many you need.

>> The only missing bit would then be the idle detection. That would need
>> to be in the block layer itself, and the scheme I described should be
>> fine for that still.
>
> Are you sure it needs to be in the block layer? Is there no way for
> the driver's completion handler to tell whether the queue is now empty?
> Certainly it already has enough information to know whether the device
> is still busy processing another request. When the device is no longer
> busy and the queue is empty, that's when the idle timer should be
> started or restarted.

To some extent there is, but there can be context outside of the queue
it doesn't know about. That is the case for the plugging rework, for
instance. That also removes the queue_empty() call. Then there's
blk_fetch_request(), but that may return NULL while there's IO pending
in the block layer - so not reliable for that either. The block layer is
tracking this state anyway, if you are leaving it to the driver then it
would have to check every time it completes the last request it has. It's
cheaper to do in the block layer.

--
Jens Axboe

2010-08-24 20:06:38

by Alan Stern

Subject: Re: Runtime PM and the block layer

On Tue, 24 Aug 2010, Jens Axboe wrote:

> On 2010-08-24 16:42, Alan Stern wrote:
> >> Sounds like all you need is a way to return BLKPREP_DEFER_AND_STOP and
> >> have the block layer stop the queue for you. When you need to restart,
> >> you would insert a special request at the head of the queue and call
> >> blk_start_queue() to get things going again.
> >
> > Yes.
> >
> > Suppose the driver needs to send two of these special requests before
> > going back to normal operation. Won't restarting the queue for the
> > first special request also cause the following regular request to be
> > passed to the driver before the second special request can be inserted?
> > Of course, the driver could cope with this simply by returning another
> > BLKPREP_DEFER_AND_STOP.
>
> For that special request, you are sure to have some ->end_io() hook to
> know when it's complete. When that triggers, you queue the 2nd special
> request. And so on, for how many you need.

That's not what I meant. Suppose the driver wants to carry out special
requests A and B before carrying out request R, which is initially at
the head of the queue. The driver inserts A at the front, calls
blk_start_queue(), and inserts B at the front when A completes.
What's to prevent the block layer from sending R to the driver while A
is running?

> >> The only missing bit would then be the idle detection. That would need
> >> to be in the block layer itself, and the scheme I described should be
> >> fine for that still.
> >
> > Are you sure it needs to be in the block layer? Is there no way for
> > the driver's completion handler to tell whether the queue is now empty?
> > Certainly it already has enough information to know whether the device
> > is still busy processing another request. When the device is no longer
> > busy and the queue is empty, that's when the idle timer should be
> > started or restarted.
>
> To some extent there is, but there can be context outside of the queue
> it doesn't know about. That is the case for the plugging rework, for
> instance. That also removes the queue_empty() call. Then there's
> blk_fetch_request(), but that may return NULL while there's IO pending
> in the block layer - so not reliable for that either. The block layer is
> tracking this state anyway, if you are leaving it to the driver then it
> would have to check every time it completes the last request it has. It's
> cheaper to do in the block layer.

I see. You're suggesting we add a new "power_mode" or "queue_idle"
callback to the request_queue struct, and make the block layer invoke
this callback whenever a request completes and there are no other
requests pending or in flight. Right? And similarly, invoke the
callback (with a different argument) when the first request gets added
to an otherwise empty queue.

That would suit my needs.
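
To make sure we mean the same thing, here is a userspace sketch of such a
callback. It assumes a single counter of pending plus in-flight requests;
the names (acct_queue, idle_fn, QUEUE_BUSY, QUEUE_IDLE) are hypothetical and
nothing like them exists in the block layer today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical argument telling the driver which transition occurred. */
enum idle_event { QUEUE_BUSY, QUEUE_IDLE };

struct acct_queue;
typedef void (*idle_fn)(struct acct_queue *q, enum idle_event ev);

struct acct_queue {
	int nr_pending;   /* queued plus in-flight requests */
	idle_fn notify;   /* the proposed per-queue callback */
	int busy_calls;   /* bookkeeping for the example only */
	int idle_calls;
};

/* Example callback: just count the transitions. */
static void acct_count(struct acct_queue *q, enum idle_event ev)
{
	if (ev == QUEUE_BUSY)
		q->busy_calls++;
	else
		q->idle_calls++;
}

/* First request added to an otherwise empty queue: report busy. */
static void acct_add(struct acct_queue *q)
{
	if (q->nr_pending++ == 0 && q->notify != NULL)
		q->notify(q, QUEUE_BUSY);
}

/* Last outstanding request completed: report idle. */
static void acct_finish(struct acct_queue *q)
{
	assert(q->nr_pending > 0);
	if (--q->nr_pending == 0 && q->notify != NULL)
		q->notify(q, QUEUE_IDLE);
}
```

The callback fires only on the 0 -> 1 and 1 -> 0 transitions, so the
per-request cost is one increment or decrement and a comparison.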

Alan Stern

2010-08-24 20:10:37

by Jens Axboe

Subject: Re: Runtime PM and the block layer

On 08/24/2010 10:06 PM, Alan Stern wrote:
> On Tue, 24 Aug 2010, Jens Axboe wrote:
>
>> On 2010-08-24 16:42, Alan Stern wrote:
>>>> Sounds like all you need is a way to return BLKPREP_DEFER_AND_STOP and
>>>> have the block layer stop the queue for you. When you need to restart,
>>>> you would insert a special request at the head of the queue and call
>>>> blk_start_queue() to get things going again.
>>>
>>> Yes.
>>>
>>> Suppose the driver needs to send two of these special requests before
>>> going back to normal operation. Won't restarting the queue for the
>>> first special request also cause the following regular request to be
>>> passed to the driver before the second special request can be inserted?
>>> Of course, the driver could cope with this simply by returning another
>>> BLKPREP_DEFER_AND_STOP.
>>
>> For that special request, you are sure to have some ->end_io() hook to
>> know when it's complete. When that triggers, you queue the 2nd special
>> request. And so on, for how many you need.
>
> That's not what I meant. Suppose the driver wants to carry out special
> requests A and B before carrying out request R, which is initially at
> the head of the queue. The driver inserts A at the front, calls
> blk_start_queue(), and inserts B at the front when A completes.
> What's to prevent the block layer from sending R to the driver while A
> is running?

Nothing, you will have to maintain that state and defer when
appropriate. Which should happen automatically, since you would not be
switching your state to running until request B has completed anyway.
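
Roughly, the driver-side bookkeeping could look like the userspace sketch
below. It assumes the special requests are chained from a completion hook,
one at a time; the drv_* names and states are hypothetical, not an existing
interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical driver power states for the model. */
enum drv_state { DRV_SUSPENDED, DRV_RESUMING, DRV_RUNNING };

struct drv {
	enum drv_state state;
	int specials_left;    /* wake-up commands not yet completed */
	int specials_issued;  /* bookkeeping for the example only */
};

/* Stand-in for inserting a special request at the head of the queue. */
static void drv_issue_special(struct drv *d)
{
	d->specials_issued++;
}

/* Start resuming: issue the first special request (A). */
static void drv_begin_resume(struct drv *d, int nspecials)
{
	assert(nspecials > 0);
	d->state = DRV_RESUMING;
	d->specials_left = nspecials;
	drv_issue_special(d);
}

/*
 * Completion hook for a special request: issue the next one (B, ...),
 * and switch to running only after the last one has completed.
 */
static void drv_special_done(struct drv *d)
{
	assert(d->state == DRV_RESUMING && d->specials_left > 0);
	if (--d->specials_left > 0)
		drv_issue_special(d);
	else
		d->state = DRV_RUNNING;
}

/* Regular requests (R) are deferred in every state except running. */
static bool drv_accepts_regular(const struct drv *d)
{
	return d->state == DRV_RUNNING;
}
```

So even if the block layer offers R while A is in flight, the driver defers
it again, because the state is still "resuming" at that point.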

>>>> The only missing bit would then be the idle detection. That would need
>>>> to be in the block layer itself, and the scheme I described should be
>>>> fine for that still.
>>>
>>> Are you sure it needs to be in the block layer? Is there no way for
>>> the driver's completion handler to tell whether the queue is now empty?
>>> Certainly it already has enough information to know whether the device
>>> is still busy processing another request. When the device is no longer
>>> busy and the queue is empty, that's when the idle timer should be
>>> started or restarted.
>>
>> To some extent there is, but there can be context outside of the queue
>> it doesn't know about. That is the case for the plugging rework, for
>> instance. That also removes the queue_empty() call. Then there's
>> blk_fetch_request(), but that may return NULL while there's IO pending
>> in the block layer - so not reliable for that either. The block layer is
>> tracking this state anyway, if you are leaving it to the driver then it
>> would have to check every time it completes the last request it has. It's
>> cheaper to do in the block layer.
>
> I see. You're suggesting we add a new "power_mode" or "queue_idle"
> callback to the request_queue struct, and make the block layer invoke
> this callback whenever a request completes and there are no other
> requests pending or in flight. Right? And similarly, invoke the
> callback (with a different argument) when the first request gets added
> to an otherwise empty queue.
>
> That would suit my needs.

Yep, that is what I'm suggesting.

--
Jens Axboe

2010-08-24 21:09:09

by Alan Stern

Subject: Re: Runtime PM and the block layer

On Tue, 24 Aug 2010, Jens Axboe wrote:

> > I see. You're suggesting we add a new "power_mode" or "queue_idle"
> > callback to the request_queue struct, and make the block layer invoke
> > this callback whenever a request completes and there are no other
> > requests pending or in flight. Right? And similarly, invoke the
> > callback (with a different argument) when the first request gets added
> > to an otherwise empty queue.
> >
> > That would suit my needs.
>
> Yep, that is what I'm suggesting.

All right, I'll work on this and get back to you when I need more help.
Thanks for the advice.

Alan Stern

2010-08-30 16:32:13

by Alan Stern

Subject: Re: Runtime PM and the block layer

On Tue, 24 Aug 2010, Jens Axboe wrote:

> > Unless you think it would be better to change the block layer
> > instead...
>
> Doing it in the driver is fine. We can always make things more generic
> and share them across drivers if there's sharing to be had there.

After giving this some thought, I have decided that it would be best to
implement much of this in the block layer. It's a simpler approach and
it offers greater generality.

The changes would be fairly small. Two additional fields will be added
to struct request_queue: a PM status (active, suspending, suspended,
resuming) and a pointer to the queue's struct device (for carrying out
PM operations). Actually I'm a little surprised there isn't already a
pointer to the struct device; it seems like a very natural thing to
have.

There also will be four new functions for drivers/subsystems to call at
the beginning and end of their suspend and resume routines.

> It also means we don't need special request types that are allowed to
> bypass certain queue states, since the driver will track the state and
> know what to defer and what to pass through.

It turns out there already are a couple of special request types for
this: REQ_TYPE_PM_SUSPEND and REQ_TYPE_PM_RESUME. It's not clear why
two different types are needed, but blk_execute_rq_nowait() contains a
clue:

	/* the queue is stopped so it won't be plugged+unplugged */
	if (rq->cmd_type == REQ_TYPE_PM_RESUME)
		q->request_fn(q);

The purpose for this is unclear. It seems to have been added
specifically for the IDE driver, which is the only driver using these
request types. (In fact, the entire request_pm_state structure also
isn't used anywhere else -- which indicates that it should be defined
in a private header for IDE alone instead of in blkdev.h.) Maybe it
won't be needed after these changes.

My idea is that a queue shouldn't need to be explicitly stopped when
its device is suspended. Instead, blk_peek_request() can check the
queue state and simply return NULL if the queue is suspending,
suspended, or resuming and the request type isn't REQ_TYPE_PM_SUSPEND
or _RESUME. That should work, since blk_peek_request() is the only
path for moving requests from the queue to the driver, right?
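
In userspace terms, the check I have in mind is roughly the following. This
is only a sketch of the idea, not a patch: the pm_* types and the Q_* and
RQ_* constants are hypothetical stand-ins for the new queue PM status and
the existing REQ_TYPE_PM_SUSPEND/_RESUME request types.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the proposed PM status field in the request queue. */
enum q_pm_status { Q_ACTIVE, Q_SUSPENDING, Q_SUSPENDED, Q_RESUMING };

/* Stand-in for the request types; only the PM distinction matters. */
enum rq_kind { RQ_NORMAL, RQ_PM_SUSPEND, RQ_PM_RESUME };

struct pm_request {
	enum rq_kind kind;
};

struct pm_queue {
	enum q_pm_status status;
	struct pm_request *head;  /* request at the front of the queue */
};

/*
 * Model of the peek path: while the queue is not fully active, hide
 * ordinary requests from the driver but let PM requests through.
 */
static struct pm_request *pm_peek_request(struct pm_queue *q)
{
	struct pm_request *rq;

	assert(q != NULL);
	rq = q->head;
	if (rq == NULL)
		return NULL;
	if (q->status != Q_ACTIVE && rq->kind == RQ_NORMAL)
		return NULL;
	return rq;
}
```

With a check like this the queue never has to be stopped explicitly; the
driver simply sees an empty queue until the status returns to active.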

> The only missing bit would then be the idle detection. That would need
> to be in the block layer itself, and the scheme I described should be
> fine for that still.

This is where I will need help. From what I gather, a request's path
through the block layer starts at __elv_add_request() and ends at
blk_finish_request(). Updating a counter at these points should be
good enough -- except for elv_merge() and possibly other things I don't
know about. Not to mention any changes you may be planning.

Basically I just need to call some new routines when a request is first
added to an idle queue and when a queue becomes idle because the last
request has completed. Can you suggest the best way to do this?

Alan Stern