2008-11-01 16:54:44

by Alan Stern

Subject: Problems with the block-layer timeouts

James and Jens:

I spent most of the day yesterday debugging some tricky problems in the
new block-layer timeout scheme. Clearly it is in need of more work.

A major reason for these problems was that there doesn't seem to be a
clear idea of when the timeout period should begin. In
blk_add_timer() a comment says:

* Each request has its own timer, and as it is added to the queue, we
* set up the timer.

On the other hand, elv_next_request() says:

* We are now handing the request to the hardware,
* add the timeout handler

(Note that this comment is wrong for an additional reason.
elv_next_request() doesn't hand the request to the hardware; it hands
the request to a driver. What the driver chooses to do with the
request is its own affair.)

So when should the timeout begin? The most logical time is when the
driver does send the request to the hardware. Of course, the block
core has no way to know when that happens, so a suitable proxy might be
when the request is removed from the block queue. (On the other hand,
are there drivers which don't bother to dequeue a request until it has
completed?) Either way, both the comments above and the actual code
should be changed.


The real source of the problems I encountered is in the SCSI midlayer.
scsi_request_fn() can call elv_next_request() long before it is ready
to begin executing the request. In particular, it does so before
checking scsi_dev_queue_ready(). So if the lower-level driver can
handle up to N simultaneous requests, the midlayer will call
elv_next_request() N+1 times. The last request has to wait until one
of the first N completes. Surely this waiting period doesn't deserve
to be counted as part of the last request's timeout.

In my case N was 1, and request #1 ended up being requeued. Requeuing
restarts the timer; as a result, the timer for request #2 expired
before request #1's second incarnation timed out. And this was before
request #2 had even begun!

That's not so terribly bad in itself. However the scsi_times_out()
routine is completely unprepared to handle timeouts for requests that
haven't yet been dispatched to the LLD. Or, for that matter, requests
which have been returned by the LLD but were requeued and have not yet
been sent back down.

So what happened was that the midlayer tried to abort a request which
hadn't started yet. Or, depending on the exact timing, it found itself
being asked to abort two requests when only one was running, so it gave
up and did nothing. One way leads to processes hanging, the other way
leads to a system crash.

How should this be fixed? It would help to call scsi_dev_queue_ready()
before elv_next_request(), but that's not sufficient.
scsi_times_out() needs to recognize that a timeout for a non-running
request can be handled by directly returning BLK_EH_HANDLED. Right?


While I'm on the subject, there are a few related items that could be
improved. In my tests, I was generating I/O requests simply by doing

dd if=/dev/sda ...

I don't know where the timeouts for these requests are determined, but
they were set to 60 seconds. That seems much too long.

In blk_del_timer(), there's no reason to test q->rq_timed_out_fn. If
the method pointer is NULL then req->deadline would be 0 anyway. In
addition, req->deadline should be set to 0 at the end of the routine,
just in case req gets requeued.

In blk_add_timer(), the line

expiry = round_jiffies(req->deadline);

is not optimal. round_jiffies() will sometimes round a value _down_ to
the nearest second. But blk_rq_timed_out_timer() tests whether
req->deadline is in the past -- and if the deadline was rounded down
then this won't be true the first time through. You wind up getting an
unnecessary timer interrupt. Instead there should be a
round_jiffies_up() utility routine, and it should be used in both
blk_add_timer() and blk_rq_timed_out_timer().

Alan Stern


2008-11-02 20:35:59

by Mike Anderson

Subject: Re: Problems with the block-layer timeouts

Alan Stern <[email protected]> wrote:
> I spent most of the day yesterday debugging some tricky problems in the
> new block-layer timeout scheme. Clearly it is in need of more work.
>
> A major reason for these problems was that there doesn't seem to be a
> clear idea of when the timeout period should begin. In
> blk_add_timer() a comment says:

> How should this be fixed? It would help to call scsi_dev_queue_ready()
> before elv_next_request(), but that's not sufficient.
> scsi_times_out() needs to recognize that a timeout for a non-running
> request can be handled by directly returning BLK_EH_HANDLED. Right?
>
>

Tejun described a similar issue here.
http://thread.gmane.org/gmane.linux.ide/35603

And a fix to address the issue here.
http://thread.gmane.org/gmane.linux.ide/35725

Does the patch posted by Tejun address your issue?

> While I'm on the subject, there are a few related items that could be
> improved. In my tests, I was generating I/O requests simply by doing
>
> dd if=/dev/sda ...
>
> I don't know where the timeouts for these requests are determined, but
> they were set to 60 seconds. That seems much too long.
>

It is set by a udev rule and the value is historical.
http://thread.gmane.org/gmane.linux.scsi/45631/focus=45646


-andmike
--
Michael Anderson
[email protected]

2008-11-03 08:54:20

by Jens Axboe

Subject: Re: Problems with the block-layer timeouts

On Sat, Nov 01 2008, Alan Stern wrote:
> James and Jens:
>
> I spent most of the day yesterday debugging some tricky problems in the
> new block-layer timeout scheme. Clearly it is in need of more work.

It is.

> A major reason for these problems was that there doesn't seem to be a
> clear idea of when the timeout period should begin. In
> blk_add_timer() a comment says:
>
> * Each request has its own timer, and as it is added to the queue, we
> * set up the timer.
>
> On the other hand, elv_next_request() says:
>
> * We are now handing the request to the hardware,
> * add the timeout handler
>
> (Note that this comment is wrong for an additional reason.
> elv_next_request() doesn't hand the request to the hardware; it hands
> the request to a driver. What the driver chooses to do with the
> request is its own affair.)
>
> So when should the timeout begin? The most logical time is when the
> driver does send the request to the hardware. Of course, the block
> core has no way to know when that happens, so a suitable proxy might be
> when the request is removed from the block queue. (On the other hand,
> are there drivers which don't bother to dequeue a request until it has
> completed?) Either way, both the comments above and the actual code
> should be changed.
>
>
> The real source of the problems I encountered is in the SCSI midlayer.
> scsi_request_fn() can call elv_next_request() long before it is ready
> to begin executing the request. In particular, it does so before
> checking scsi_dev_queue_ready(). So if the lower-level driver can
> handle up to N simultaneous requests, the midlayer will call
> elv_next_request() N+1 times. The last request has to wait until one
> of the first N completes. Surely this waiting period doesn't deserve
> to be counted as part of the last request's timeout.
>
> In my case N was 1, and request #1 ended up being requeued. Requeuing
> restarts the timer; as a result, the timer for request #2 expired
> before request #1's second incarnation timed out. And this was before
> request #2 had even begun!
>
> That's not so terribly bad in itself. However the scsi_times_out()
> routine is completely unprepared to handle timeouts for requests that
> haven't yet been dispatched to the LLD. Or, for that matter, requests
> which have been returned by the LLD but were requeued and have not yet
> been sent back down.
>
> So what happened was that the midlayer tried to abort a request which
> hadn't started yet. Or, depending on the exact timing, it found itself
> being asked to abort two requests when only one was running, so it gave
> up and did nothing. One way leads to processes hanging, the other way
> leads to a system crash.
>
> How should this be fixed? It would help to call scsi_dev_queue_ready()
> before elv_next_request(), but that's not sufficient.
> scsi_times_out() needs to recognize that a timeout for a non-running
> request can be handled by directly returning BLK_EH_HANDLED. Right?

We already discussed this issue with Tejun. There's a hack in my tree
now that just moves the activate call to dequeue time, which works OK
for SCSI (but won't work for e.g. IDE). The real fix is to have a peek
and fetch interface for retrieving requests. We've actually wanted that
for some time, since the current 'peek and mark active' approach doesn't
work well even now: it forces pushing of requests to the dispatch list
and marks them unmergeable, because the block layer does not know
whether the driver has started handling the request or not.

So, in summary, a short term fix will be merged soon and a longer term
fix will be right after.

> While I'm on the subject, there are a few related items that could be
> improved. In my tests, I was generating I/O requests simply by doing
>
> dd if=/dev/sda ...
>
> I don't know where the timeouts for these requests are determined, but
> they were set to 60 seconds. That seems much too long.

Fully agreed, as Mike mentioned this actually looks like a dumb udev
rule that didn't have any effect until this generic timeout work. For
normal IO, something in the 10 second range is a lot more appropriate.

> In blk_del_timer(), there's no reason to test q->rq_timed_out_fn. If
> the method pointer is NULL then req->deadline would be 0 anyway. In
> addition, req->deadline should be set to 0 at the end of the routine,
> just in case req gets requeued.
>
> In blk_add_timer(), the line
>
> expiry = round_jiffies(req->deadline);
>
> is not optimal. round_jiffies() will sometimes round a value _down_ to
> the nearest second. But blk_rq_timed_out_timer() tests whether
> req->deadline is in the past -- and if the deadline was rounded down
> then this won't be true the first time through. You wind up getting an
> unnecessary timer interrupt. Instead there should be a
> round_jiffies_up() utility routine, and it should be used in both
> blk_add_timer() and blk_rq_timed_out_timer().

Very good point, we do indeed want a round_jiffies_up() for this!

--
Jens Axboe

2008-11-03 14:18:13

by James Smart

Subject: Re: Problems with the block-layer timeouts

Jens Axboe wrote:
>> While I'm on the subject, there are a few related items that could be
>> improved. In my tests, I was generating I/O requests simply by doing
>>
>> dd if=/dev/sda ...
>>
>> I don't know where the timeouts for these requests are determined, but
>> they were set to 60 seconds. That seems much too long.
>
> Fully agreed, as Mike mentioned this actually looks like a dumb udev
> rule that didn't have any effect until this generic timeout work. For
> normal IO, something in the 10 second range is a lot more appropriate.

Yes and no. For direct-attach storage with no other initiators, ok. But
for larger arrays, potentially with multiple initiators - no. I can
name several arrays that depend on a 30 second timeout, and a few that,
under load, require 60 seconds. I assume that there are usually "best
practices" guides for the integrators to ensure the defaults are set right.

-- james s

2008-11-03 15:59:21

by Alan Stern

Subject: Re: Problems with the block-layer timeouts

Hi, Tejun! I ran across the same bug as you, but about a day later.

On Mon, 3 Nov 2008, Jens Axboe wrote:

> > So when should the timeout begin? The most logical time is when the
> > driver does send the request to the hardware. Of course, the block
> > core has no way to know when that happens, so a suitable proxy might be
> > when the request is removed from the block queue. (On the other hand,
> > are there drivers which don't bother to dequeue a request until it has
> > completed?) Either way, both the comments above and the actual code
> > should be changed.

> We already discussed this issue with Tejun. There's a hack in my tree
> now that just moves the activate call to dequeue time, which works OK
> for SCSI (but won't work for e.g. IDE). The real fix is to have a peek
> and fetch interface for retrieving requests. We've actually wanted that
> for some time, since the current 'peek and mark active' approach doesn't
> work well even now: it forces pushing of requests to the dispatch list
> and marks them unmergeable, because the block layer does not know
> whether the driver has started handling the request or not.
>
> So, in summary, a short term fix will be merged soon and a longer term
> fix will be right after.

Even a "peek and fetch" interface might not be best, at least as far as
timer issues are concerned. Ideally, the timer shouldn't be started
until the SCSI midlayer knows that the request has successfully been
sent to the lower-level driver.

Therefore the best approach would be to EXPORT blk_add_timer(). It
should be called at the end of scsi_dispatch_cmd(), when the return
value from the queuecommand method is known to be 0.

With something like this, Mike's fix to end_that_request_last()
wouldn't be needed, since blkdev_dequeue_request() wouldn't
automatically start the timer. It seems silly to start the timer when
you know you're just going to stop it immediately afterwards.

Alan Stern

2008-11-03 16:40:17

by Tejun Heo

Subject: Re: Problems with the block-layer timeouts

Hello, Alan Stern! :-)

Alan Stern wrote:
> Even a "peek and fetch" interface might not be best, at least as far as
> timer issues are concerned. Ideally, the timer shouldn't be started
> until the SCSI midlayer knows that the request has successfully been
> sent to the lower-level driver.
>
> Therefore the best approach would be to EXPORT blk_add_timer(). It
> should be called at the end of scsi_dispatch_cmd(), when the return
> value from the queuecommand method is known to be 0.
>
> With something like this, Mike's fix to end_that_request_last()
> wouldn't be needed, since blkdev_dequeue_request() wouldn't
> automatically start the timer. It seems silly to start the timer when
> you know you're just going to stop it immediately afterwards.

The block layer currently doesn't know when a request is actually being
issued. For timeouts, blk_add_timer() can be exported, but I think that
would only aggravate the already highly fragmented block layer interface
(different users use it differently, to the point of scary chaos). As a
minor example, block tracing considers elv_next_request() the command
issue point, which isn't quite true for SCSI and many other drivers. We
could export the tracing interface for that too, but I don't think
that's the right direction. More stuff is scheduled to be moved into
the block layer, and exporting more and more implementation details to
block layer users won't scale.

I'm trying to convert all drivers to use the same command issue model -
elv_next_request() -> blkdev_dequeue_request() on actual issue ->
blk_end_request(). I have a first draft of the conversion patchset, but
it's gonna take me a few more days to review and test what I can as
several drivers (mostly legacy ones) are a bit tricky.

For the time being, SCSI layer is the only block layer timeout user and
completion w/o dequeuing is only for error cases in SCSI, so the
inefficiency there shouldn't matter too much.

Jens, what do you think?

Thanks.

--
tejun

2008-11-03 17:07:51

by Alan Stern

Subject: Re: Problems with the block-layer timeouts

On Tue, 4 Nov 2008, Tejun Heo wrote:

> Hello, Alan Stern! :-)
>
> Alan Stern wrote:
> > Even a "peek and fetch" interface might not be best, at least as far as
> > timer issues are concerned. Ideally, the timer shouldn't be started
> > until the SCSI midlayer knows that the request has successfully been
> > sent to the lower-level driver.
> >
> > Therefore the best approach would be to EXPORT blk_add_timer(). It
> > should be called at the end of scsi_dispatch_cmd(), when the return
> > value from the queuecommand method is known to be 0.
> >
> > With something like this, Mike's fix to end_that_request_last()
> > wouldn't be needed, since blkdev_dequeue_request() wouldn't
> > automatically start the timer. It seems silly to start the timer when
> > you know you're just going to stop it immediately afterwards.
>
> The block layer currently doesn't know when a request is actually being
> issued. For timeouts, blk_add_timer() can be exported, but I think that
> would only aggravate the already highly fragmented block layer interface
> (different users use it differently, to the point of scary chaos). As a
> minor example, block tracing considers elv_next_request() the command
> issue point, which isn't quite true for SCSI and many other drivers. We
> could export the tracing interface for that too, but I don't think
> that's the right direction. More stuff is scheduled to be moved into
> the block layer, and exporting more and more implementation details to
> block layer users won't scale.
>
> I'm trying to convert all drivers to use the same command issue model -
> elv_next_request() -> blkdev_dequeue_request() on actual issue ->
> blk_end_request(). I have a first draft of the conversion patchset, but
> it's gonna take me a few more days to review and test what I can as
> several drivers (mostly legacy ones) are a bit tricky.
>
> For the time being, SCSI layer is the only block layer timeout user and
> completion w/o dequeuing is only for error cases in SCSI, so the
> inefficiency there shouldn't matter too much.

In fact, I have changed my mind. Starting the timer after the command
has been sent to the low-level driver would mean that the command might
finish before the timer was started!

So never mind. I did confirm at least that your patch together with
Mike's fixed the problem I encountered last week.

I have a couple of small fixes for the block timer routines. They'll
get posted separately later on.

Alan Stern

2008-11-03 17:25:27

by Jens Axboe

Subject: Re: Problems with the block-layer timeouts

On Mon, Nov 03 2008, James Smart wrote:
> Jens Axboe wrote:
> >>While I'm on the subject, there are a few related items that could be
> >>improved. In my tests, I was generating I/O requests simply by doing
> >>
> >> dd if=/dev/sda ...
> >>
> >>I don't know where the timeouts for these requests are determined, but
> >>they were set to 60 seconds. That seems much too long.
> >
> >Fully agreed, as Mike mentioned this actually looks like a dumb udev
> >rule that didn't have any effect until this generic timeout work. For
> >normal IO, something in the 10 second range is a lot more appropriate.
>
> Yes and no. For direct-attach storage with no other initiators, ok.
> But for larger arrays, potentially with multiple initiators - no. I
> can name several arrays that depend on a 30 second timeout, and a few
> that, under load, require 60 seconds. I assume that there are usually
> "best practices" guides for the integrators to ensure the defaults are
> set right.

Sure I agree, it depends on what kind of storage you have. What I mean
is that for a normal disk you want something like 10 seconds.

--
Jens Axboe

2008-11-03 17:28:29

by Jens Axboe

Subject: Re: Problems with the block-layer timeouts

On Tue, Nov 04 2008, Tejun Heo wrote:
> I'm trying to convert all drivers to use the same command issue model -
> elv_next_request() -> blkdev_dequeue_request() on actual issue ->
> blk_end_request(). I have a first draft of the conversion patchset, but
> it's gonna take me a few more days to review and test what I can as
> several drivers (mostly legacy ones) are a bit tricky.

Don't do that, please. What we need to do is ensure that the model fits
with whatever the driver wants to do there, and for some drivers it's
actually a lot easier to rely on elv_next_request() returning the same
request again instead of keeping track of it yourself. So any patch
that enforces the model that you must dequeue before starting the
request will get a big nak from me.

What we need to do is add a 'peek' mode only, which doesn't start the
request. Along with something to mark it started that the driver can
use, or it can just use elv_next_request() of course.

> For the time being, SCSI layer is the only block layer timeout user and
> completion w/o dequeuing is only for error cases in SCSI, so the
> inefficiency there shouldn't matter too much.

Agree, for now we are ok since it's just SCSI.

--
Jens Axboe

2008-11-04 03:02:24

by Tejun Heo

Subject: Re: Problems with the block-layer timeouts

Jens Axboe wrote:
> On Tue, Nov 04 2008, Tejun Heo wrote:
>> I'm trying to convert all drivers to use the same command issue model -
>> elv_next_request() -> blkdev_dequeue_request() on actual issue ->
>> blk_end_request(). I have a first draft of the conversion patchset, but
>> it's gonna take me a few more days to review and test what I can as
>> several drivers (mostly legacy ones) are a bit tricky.
>
> Don't do that, please. What we need to do is ensure that the model
> fits with whatever the driver wants to do there, and for some
> drivers it's actually a lot easier to rely on elv_next_request()
> returning the same request again instead of keeping track of it
> yourself. So any patch that enforces the model that you must
> dequeue before starting the request will get a big nak from me.

I audited every user of elv_next_request(), and converting them to
follow a single model isn't that difficult. Most drivers which peek at
the top of the queue already have the current rq pointer, and they are
not difficult to convert as it's guaranteed that they process one
request at any given time and nothing else until the current request
is finished. Usually, the only thing that's required is to check
whether the previous request is completely done before fetching the
next one.

A model which fits different drivers' usage patterns is a good goal,
but as with everything else it needs to be balanced against other
goals, and while enforcing a single fetch/completion model doesn't
cost drivers much, not having one does hurt the block layer
abstraction. I think a preemptive NACK is a bit too hasty. With
sensible helper routines, I think we can satisfy both the block layer
and its legacy users.

> What we need to do is add a 'peek' mode only, which doesn't start the
> request. Along with something to mark it started that the driver can
> use, or it can just use elv_next_request() of course.

I don't agree that adding more interfaces is a good idea when the
problem is lack of uniformity. There is no inherent convenience or
advantage to having two different operation modes, but there are quite
a number of rough edges from code written with only one model in mind.
BLK_TA_ISSUE was one minor example. Another would be the prep
function and the DONTPREP flag, and now the timer. Elevator accounting
becomes different for different drivers. Padding adjustment should be
done right before the low-level driver issues the command and undone
on requeue, but it can't be done that way now, and so on.

Thanks.

--
tejun

2008-11-06 00:01:58

by FUJITA Tomonori

Subject: Re: Problems with the block-layer timeouts

On Mon, 3 Nov 2008 09:52:48 +0100
Jens Axboe <[email protected]> wrote:

> > In blk_del_timer(), there's no reason to test q->rq_timed_out_fn. If
> > the method pointer is NULL then req->deadline would be 0 anyway. In
> > addition, req->deadline should be set to 0 at the end of the routine,
> > just in case req gets requeued.
> >
> > In blk_add_timer(), the line
> >
> > expiry = round_jiffies(req->deadline);
> >
> > is not optimal. round_jiffies() will sometimes round a value _down_ to
> > the nearest second. But blk_rq_timed_out_timer() tests whether
> > req->deadline is in the past -- and if the deadline was rounded down
> > then this won't be true the first time through. You wind up getting an
> > unnecessary timer interrupt. Instead there should be a
> > round_jiffies_up() utility routine, and it should be used in both
> > blk_add_timer() and blk_rq_timed_out_timer().
>
> Very good point, we do indeed want a round_jiffies_up() for this!

Just out of curiosity, why do we need to use round_jiffies here? We
didn't do that for SCSI, right?

2008-11-06 07:25:39

by Jens Axboe

Subject: Re: Problems with the block-layer timeouts

On Thu, Nov 06 2008, FUJITA Tomonori wrote:
> On Mon, 3 Nov 2008 09:52:48 +0100
> Jens Axboe <[email protected]> wrote:
>
> > > In blk_del_timer(), there's no reason to test q->rq_timed_out_fn. If
> > > the method pointer is NULL then req->deadline would be 0 anyway. In
> > > addition, req->deadline should be set to 0 at the end of the routine,
> > > just in case req gets requeued.
> > >
> > > In blk_add_timer(), the line
> > >
> > > expiry = round_jiffies(req->deadline);
> > >
> > > is not optimal. round_jiffies() will sometimes round a value _down_ to
> > > the nearest second. But blk_rq_timed_out_timer() tests whether
> > > req->deadline is in the past -- and if the deadline was rounded down
> > > then this won't be true the first time through. You wind up getting an
> > > unnecessary timer interrupt. Instead there should be a
> > > round_jiffies_up() utility routine, and it should be used in both
> > > blk_add_timer() and blk_rq_timed_out_timer().
> >
> > Very good point, we do indeed want a round_jiffies_up() for this!
>
> Just out of curiosity, why do we need to use round_jiffies here? We
> didn't do that for SCSI, right?

We don't have to, but given that we don't care about exact timeouts, we
may as well. It's not a new thing, we've done that since pretty much the
beginning of the generic timeout development.

--
Jens Axboe

2008-11-07 04:06:22

by FUJITA Tomonori

Subject: Re: Problems with the block-layer timeouts

On Thu, 6 Nov 2008 08:23:54 +0100
Jens Axboe <[email protected]> wrote:

> On Thu, Nov 06 2008, FUJITA Tomonori wrote:
> > On Mon, 3 Nov 2008 09:52:48 +0100
> > Jens Axboe <[email protected]> wrote:
> >
> > > > In blk_del_timer(), there's no reason to test q->rq_timed_out_fn. If
> > > > the method pointer is NULL then req->deadline would be 0 anyway. In
> > > > addition, req->deadline should be set to 0 at the end of the routine,
> > > > just in case req gets requeued.
> > > >
> > > > In blk_add_timer(), the line
> > > >
> > > > expiry = round_jiffies(req->deadline);
> > > >
> > > > is not optimal. round_jiffies() will sometimes round a value _down_ to
> > > > the nearest second. But blk_rq_timed_out_timer() tests whether
> > > > req->deadline is in the past -- and if the deadline was rounded down
> > > > then this won't be true the first time through. You wind up getting an
> > > > unnecessary timer interrupt. Instead there should be a
> > > > round_jiffies_up() utility routine, and it should be used in both
> > > > blk_add_timer() and blk_rq_timed_out_timer().
> > >
> > > Very good point, we do indeed want a round_jiffies_up() for this!
> >
> > Just out of curiosity, why do we need to use round_jiffies here? We
> > didn't do that for SCSI, right?
>
> We don't have to, but given that we don't care about exact timeouts, we
> may as well. It's not a new thing, we've done that since pretty much the
> beginning of the generic timeout development.

I'm not sure that the users of the timeout feature can control exact
timeouts because the block layer doesn't let the users directly play
with the timer. elv_dequeue_request() is not necessarily the exact
time that the users want to start the timer. Instead, the block layer
hides the details behind the elevator (note that, as I said before, I
think it's the right thing). So the round_jiffies in the block layer
doesn't make sense to me. I'd prefer to remove them instead of adding
a bunch of round_jiffies_up_* variants (I bet some of them will never
be used).

2008-11-07 11:26:29

by Jens Axboe

Subject: Re: Problems with the block-layer timeouts

On Fri, Nov 07 2008, FUJITA Tomonori wrote:
> On Thu, 6 Nov 2008 08:23:54 +0100
> Jens Axboe <[email protected]> wrote:
>
> > On Thu, Nov 06 2008, FUJITA Tomonori wrote:
> > > On Mon, 3 Nov 2008 09:52:48 +0100
> > > Jens Axboe <[email protected]> wrote:
> > >
> > > > > In blk_del_timer(), there's no reason to test q->rq_timed_out_fn. If
> > > > > the method pointer is NULL then req->deadline would be 0 anyway. In
> > > > > addition, req->deadline should be set to 0 at the end of the routine,
> > > > > just in case req gets requeued.
> > > > >
> > > > > In blk_add_timer(), the line
> > > > >
> > > > > expiry = round_jiffies(req->deadline);
> > > > >
> > > > > is not optimal. round_jiffies() will sometimes round a value _down_ to
> > > > > the nearest second. But blk_rq_timed_out_timer() tests whether
> > > > > req->deadline is in the past -- and if the deadline was rounded down
> > > > > then this won't be true the first time through. You wind up getting an
> > > > > unnecessary timer interrupt. Instead there should be a
> > > > > round_jiffies_up() utility routine, and it should be used in both
> > > > > blk_add_timer() and blk_rq_timed_out_timer().
> > > >
> > > > Very good point, we do indeed want a round_jiffies_up() for this!
> > >
> > > Just out of curiosity, why do we need to use round_jiffies here? We
> > > didn't do that for SCSI, right?
> >
> > We don't have to, but given that we don't care about exact timeouts, we
> > may as well. It's not a new thing, we've done that since pretty much the
> > beginning of the generic timeout development.
>
> I'm not sure that the users of the timeout feature can control exact
> timeouts because the block layer doesn't let the users directly play
> with the timer. elv_dequeue_request() is not the exact time that the
> users want to start the timer. Instead, the block layer hides the
> details behind the elevator (note that as I said before, I think that
> it's the right thing). So the round_jiffies in the block layer doesn't
> make sense to me. I'd prefer to remove them instead of adding a bunch of
> round_jiffies_up_* (I bet that some of them will never be used).

I don't understand your concern, to be honest. We only need to round up
once, and that is when we add/mod the timer. And we do that simply to
play nice and group the timeout with other timers, to save a bit of
power.

--
Jens Axboe

2008-11-11 06:58:57

by FUJITA Tomonori

Subject: Re: Problems with the block-layer timeouts

On Fri, 7 Nov 2008 12:24:50 +0100
Jens Axboe <[email protected]> wrote:

> On Fri, Nov 07 2008, FUJITA Tomonori wrote:
> > On Thu, 6 Nov 2008 08:23:54 +0100
> > Jens Axboe <[email protected]> wrote:
> >
> > > On Thu, Nov 06 2008, FUJITA Tomonori wrote:
> > > > On Mon, 3 Nov 2008 09:52:48 +0100
> > > > Jens Axboe <[email protected]> wrote:
> > > >
> > > > > > In blk_del_timer(), there's no reason to test q->rq_timed_out_fn. If
> > > > > > the method pointer is NULL then req->deadline would be 0 anyway. In
> > > > > > addition, req->deadline should be set to 0 at the end of the routine,
> > > > > > just in case req gets requeued.
> > > > > >
> > > > > > In blk_add_timer(), the line
> > > > > >
> > > > > > expiry = round_jiffies(req->deadline);
> > > > > >
> > > > > > is not optimal. round_jiffies() will sometimes round a value _down_ to
> > > > > > the nearest second. But blk_rq_timed_out_timer() tests whether
> > > > > > req->deadline is in the past -- and if the deadline was rounded down
> > > > > > then this won't be true the first time through. You wind up getting an
> > > > > > unnecessary timer interrupt. Instead there should be a
> > > > > > round_jiffies_up() utility routine, and it should be used in both
> > > > > > blk_add_timer() and blk_rq_timed_out_timer().
> > > > >
> > > > > Very good point, we do indeed want a round_jiffies_up() for this!
> > > >
> > > > Just out of curiosity, why do we need to use round_jiffies here? We
> > > > didn't do that for SCSI, right?
> > >
> > > We don't have to, but given that we don't care about exact timeouts, we
> > > may as well. It's not a new thing, we've done that since pretty much the
> > > beginning of the generic timeout development.
> >
> > I'm not sure that the users of the timeout feature can control exact
> > timeouts because the block layer doesn't let the users directly play
> > with the timer. elv_dequeue_request() is not the exact time that the
> > users want to start the timer. Instead, the block layer hides the
> > details behind the elevator (note that as I said before, I think that
> > it's the right thing). So the round_jiffies in the block layer doesn't
> > make sense to me. I prefer remove them instead of adding a bunch of
> > round_jiffies_up_* (I bet that some of them will never be used).
>
> I don't understand your concern, to be honest. We only need to round up
> once, and that is when we add/mod the timer. And we do that simply to
> play nice and group the timeout with other timers, to save a bit of
> power.

I'm not worried about anything. I just think these round_jiffies_up
calls are pointless: they were added for block-layer users that care
about exact timeouts, but the block layer doesn't export
blk_add_timer(), so its users can't control the exact time when the
timer starts. Doing the round_jiffies_up calculation for every request
therefore doesn't make sense to me.

Anyway, it's trivial. If you like these round_jiffies_up, it's fine by
me.

2008-11-11 17:12:13

by Alan Stern

[permalink] [raw]
Subject: Re: Problems with the block-layer timeouts

On Tue, 11 Nov 2008, FUJITA Tomonori wrote:

> I don't worry about anything. I just think that these round_jiffies_up
> are pointless because they were added for the block-layer users that
> care about exact timeouts, however the block-layer doesn't export
> blk_add_timer() so the block-layer users can't control the exact time
> when the timer starts. So doing round_jiffies_up calculation per every
> request doesn't make sense for me.

In fact the round_jiffies_up() routines were added for other users as
well as the block layer. However none of the others could be changed
until the routines were merged. Now that the routines are in the
mainline, you should see them start to be called in multiple places.

Also, the users of the block layer _don't_ care about exact timeouts.
That's an important aspect of round_jiffies() and round_jiffies_up() --
you don't use them if you want an exact timeout.

The reason for using round_jiffies() is to ensure that the timeout
will occur at a 1-second boundary. If several timeouts are set for
about the same time and they all use round_jiffies() or
round_jiffies_up(), then they will all occur at the same tick instead
of spread out among several different ticks during the course of that
1-second interval. As a result, the system will need to wake up only
once to service all those timeouts, instead of waking up several
different times. It is a power-saving scheme.

Alan Stern

2008-11-11 19:21:22

by Jens Axboe

[permalink] [raw]
Subject: Re: Problems with the block-layer timeouts

On Tue, Nov 11 2008, Alan Stern wrote:
> On Tue, 11 Nov 2008, FUJITA Tomonori wrote:
>
> > I don't worry about anything. I just think that these round_jiffies_up
> > are pointless because they were added for the block-layer users that
> > care about exact timeouts, however the block-layer doesn't export
> > blk_add_timer() so the block-layer users can't control the exact time
> > when the timer starts. So doing round_jiffies_up calculation per every
> > request doesn't make sense for me.
>
> In fact the round_jiffies_up() routines were added for other users as
> well as the block layer. However none of the others could be changed
> until the routines were merged. Now that the routines are in the
> mainline, you should see them start to be called in multiple places.
>
> Also, the users of the block layer _don't_ care about exact timeouts.
> That's an important aspect of round_jiffies() and round_jiffies_up() --
> you don't use them if you want an exact timeout.
>
> The reason for using round_jiffies() is to insure that the timeout
> will occur at a 1-second boundary. If several timeouts are set for
> about the same time and they all use round_jiffies() or
> round_jiffies_up(), then they will all occur at the same tick instead
> of spread out among several different ticks during the course of that
> 1-second interval. As a result, the system will need to wake up only
> once to service all those timeouts, instead of waking up several
> different times. It is a power-saving scheme.

I can't add anything else, and I can't say it any better either. The
main point of using round_jiffies_up() is to align with other timers. I
don't understand why you (Tomo) think that timeouts are exact; they
really are not, and within the same second is quite adequate here.

--
Jens Axboe

2008-11-12 02:09:23

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: Problems with the block-layer timeouts

On Tue, 11 Nov 2008 20:19:36 +0100
Jens Axboe <[email protected]> wrote:

> On Tue, Nov 11 2008, Alan Stern wrote:
> > On Tue, 11 Nov 2008, FUJITA Tomonori wrote:
> >
> > > I don't worry about anything. I just think that these round_jiffies_up
> > > are pointless because they were added for the block-layer users that
> > > care about exact timeouts, however the block-layer doesn't export
> > > blk_add_timer() so the block-layer users can't control the exact time
> > > when the timer starts. So doing round_jiffies_up calculation per every
> > > request doesn't make sense for me.
> >
> > In fact the round_jiffies_up() routines were added for other users as
> > well as the block layer. However none of the others could be changed
> > until the routines were merged. Now that the routines are in the
> > mainline, you should see them start to be called in multiple places.
> >
> > Also, the users of the block layer _don't_ care about exact timeouts.
> > That's an important aspect of round_jiffies() and round_jiffies_up() --
> > you don't use them if you want an exact timeout.
> >
> > The reason for using round_jiffies() is to insure that the timeout
> > will occur at a 1-second boundary. If several timeouts are set for
> > about the same time and they all use round_jiffies() or
> > round_jiffies_up(), then they will all occur at the same tick instead
> > of spread out among several different ticks during the course of that
> > 1-second interval. As a result, the system will need to wake up only
> > once to service all those timeouts, instead of waking up several
> > different times. It is a power-saving scheme.

Hmm, but in 99.9% of cases the block-layer timeout doesn't expire;
timeouts rarely happen. So the power-saving scheme applies to only
0.1% of requests, yet we pay the round_jiffies overhead for every
request.

If I understand correctly, round_jiffies() is designed for timers that
actually will expire, such as periodic checks. The power-saving scheme
works nicely for such uses.


> I can't add anything else, can't say it any better either. The main
> point of using round_jiffies_up() is to align with other timers. I don't
> understand why you (Tomo) think that timeouts are exact? They really are
> not, and within the same second is quite adequate here.

My argument about exact timeouts was for switching from round_jiffies()
to round_jiffies_up(). But as I wrote above, the round_jiffies call
didn't make sense to me in the first place.

2008-11-13 10:36:46

by Jens Axboe

[permalink] [raw]
Subject: Re: Problems with the block-layer timeouts

On Wed, Nov 12 2008, FUJITA Tomonori wrote:
> On Tue, 11 Nov 2008 20:19:36 +0100
> Jens Axboe <[email protected]> wrote:
>
> > On Tue, Nov 11 2008, Alan Stern wrote:
> > > On Tue, 11 Nov 2008, FUJITA Tomonori wrote:
> > >
> > > > I don't worry about anything. I just think that these round_jiffies_up
> > > > are pointless because they were added for the block-layer users that
> > > > care about exact timeouts, however the block-layer doesn't export
> > > > blk_add_timer() so the block-layer users can't control the exact time
> > > > when the timer starts. So doing round_jiffies_up calculation per every
> > > > request doesn't make sense for me.
> > >
> > > In fact the round_jiffies_up() routines were added for other users as
> > > well as the block layer. However none of the others could be changed
> > > until the routines were merged. Now that the routines are in the
> > > mainline, you should see them start to be called in multiple places.
> > >
> > > Also, the users of the block layer _don't_ care about exact timeouts.
> > > That's an important aspect of round_jiffies() and round_jiffies_up() --
> > > you don't use them if you want an exact timeout.
> > >
> > > The reason for using round_jiffies() is to insure that the timeout
> > > will occur at a 1-second boundary. If several timeouts are set for
> > > about the same time and they all use round_jiffies() or
> > > round_jiffies_up(), then they will all occur at the same tick instead
> > > of spread out among several different ticks during the course of that
> > > 1-second interval. As a result, the system will need to wake up only
> > > once to service all those timeouts, instead of waking up several
> > > different times. It is a power-saving scheme.
>
> Hmm, but for 99.9% of the cases, the timeout of the block layer
> doesn't expire, the timeout rarely happens. The power-saving scheme
> can be applied to only 0.1%, but at the cost of the round_jiffies
> overhead per every request.
>
> If I understand correctly, round_jiffies() is designed for timers that
> will expire, such as periodic checking. The power-saving scheme nicely
> works for such usages.

Your understanding is correct. The overhead of round_jiffies() is not
large, though.

I want to get rid of this in blk_delete_timer():

	if (list_empty(&q->timeout_list))
		del_timer(&q->timeout);

and simply let the timer run even if the list is empty, since for sync
sequential IO we would otherwise fiddle as much with the timer as we
did before unifying it. The timer will then always expire every x
seconds, which makes the grouping more important.

--
Jens Axboe

2008-11-17 03:49:18

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: Problems with the block-layer timeouts

On Thu, 13 Nov 2008 11:34:56 +0100
Jens Axboe <[email protected]> wrote:

> On Wed, Nov 12 2008, FUJITA Tomonori wrote:
> > On Tue, 11 Nov 2008 20:19:36 +0100
> > Jens Axboe <[email protected]> wrote:
> >
> > > On Tue, Nov 11 2008, Alan Stern wrote:
> > > > On Tue, 11 Nov 2008, FUJITA Tomonori wrote:
> > > >
> > > > > I don't worry about anything. I just think that these round_jiffies_up
> > > > > are pointless because they were added for the block-layer users that
> > > > > care about exact timeouts, however the block-layer doesn't export
> > > > > blk_add_timer() so the block-layer users can't control the exact time
> > > > > when the timer starts. So doing round_jiffies_up calculation per every
> > > > > request doesn't make sense for me.
> > > >
> > > > In fact the round_jiffies_up() routines were added for other users as
> > > > well as the block layer. However none of the others could be changed
> > > > until the routines were merged. Now that the routines are in the
> > > > mainline, you should see them start to be called in multiple places.
> > > >
> > > > Also, the users of the block layer _don't_ care about exact timeouts.
> > > > That's an important aspect of round_jiffies() and round_jiffies_up() --
> > > > you don't use them if you want an exact timeout.
> > > >
> > > > The reason for using round_jiffies() is to insure that the timeout
> > > > will occur at a 1-second boundary. If several timeouts are set for
> > > > about the same time and they all use round_jiffies() or
> > > > round_jiffies_up(), then they will all occur at the same tick instead
> > > > of spread out among several different ticks during the course of that
> > > > 1-second interval. As a result, the system will need to wake up only
> > > > once to service all those timeouts, instead of waking up several
> > > > different times. It is a power-saving scheme.
> >
> > Hmm, but for 99.9% of the cases, the timeout of the block layer
> > doesn't expire, the timeout rarely happens. The power-saving scheme
> > can be applied to only 0.1%, but at the cost of the round_jiffies
> > overhead per every request.
> >
> > If I understand correctly, round_jiffies() is designed for timers that
> > will expire, such as periodic checking. The power-saving scheme nicely
> > works for such usages.
>
> Your understanding is correct. The overhead of round_jiffies() is not
> large, though.
>
> I want to get rid of this in blk_delete_timer():
>
> if (list_empty(&q->timeout_list))
> del_timer(&q->timeout);
>
> though and simply let the timer run even if the list is empty, since for
> sync sequential IO we'll be fiddling a much with the timer as we did
> before unifying it. And then the timer will expire every x seconds
> always and it becomes more important with the grouping.

I see. It depends on the workload, but the 'periodic expiration'
scheme above might be better than the current one, I guess. It doesn't
have a large impact, though.

Thanks for the clarification.