I have heard that a patch to reduce io_request_lock contention by
breaking it into per queue locks was released in the past. Does
anyone know where I could find this patch if it exists?
Thanks,
Jonathan
--
Jonathan Lahr
IBM Linux Technology Center
Beaverton, Oregon
[email protected]
503-578-3385
On Mon, Jul 09 2001, Jonathan Lahr wrote:
>
> I have heard that a patch to reduce io_request_lock contention by
> breaking it into per queue locks was released in the past. Does
> anyone know where I could find this patch if it exists?
I had a patch about a year ago that did it safely for the block layer
and IDE at least, and also for selected SCSI hba's. Some of the latter
variety are pretty hard and/or tedious to fix up; Eric Y has done some
work automating this process almost completely. Until that is done, the
general patch has no chance of being integrated.
It's also interesting to take a look at _why_ there's contention on the
io_request_lock. And fix those up first.
--
Jens Axboe
Hi Jens,
In article <[email protected]> you wrote:
> On Mon, Jul 09 2001, Jonathan Lahr wrote:
> It's also interesting to take a look at _why_ there's contention on the
> io_request_lock. And fix those up first.
> --
> Jens Axboe
Here are some lockmeter outputs for tiobench
(http://sourceforge.net/projects/tiobench), a simple benchmark
that we tried on ext2 filesystem. 4 concurrent threads doing
random/sequential read/write on 10MB files on a 4-way pIII 700MHz
machine with 1MB L2 cache -
SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
2.9% 26.7% 7.4us( 706us) 72us( 920us)( 1.9%) 1557496 73.3% 26.7% 0% io_request_lock
0.00% 34.9% 0.5us( 2.8us) 63us( 839us)(0.04%) 29478 65.1% 34.9% 0% __get_request_wait+0x98
2.6% 4.7% 17us( 706us) 69us( 740us)(0.13%) 617741 95.3% 4.7% 0% __make_request+0x110
0.07% 60.2% 0.5us( 4.0us) 72us( 920us)( 1.7%) 610820 39.8% 60.2% 0% blk_get_queue+0x10
0.09% 2.9% 6.6us( 55us) 102us( 746us)(0.01%) 55327 97.1% 2.9% 0% do_aic7xxx_isr+0x24
0.00% 3.7% 0.3us( 22us) 29us( 569us)(0.00%) 22602 96.3% 3.7% 0% generic_unplug_device+0x10
0.02% 4.9% 1.3us( 27us) 54us( 621us)(0.01%) 55382 95.1% 4.9% 0% scsi_dispatch_cmd+0x12c
0.02% 1.3% 1.2us( 8.0us) 23us( 554us)(0.00%) 55382 98.7% 1.3% 0% scsi_old_done+0x5b8
0.04% 3.2% 2.8us( 31us) 200us( 734us)(0.02%) 55382 96.8% 3.2% 0% scsi_queue_next_request+0x18
0.02% 1.4% 1.1us( 7.8us) 46us( 638us)(0.00%) 55382 98.6% 1.4% 0% scsi_request_fn+0x350
1557496*26.7%*72us makes it about 30 seconds of time waiting for
io_request_lock. That is nearly one-third of the total system time
(about 98 seconds). As the number of CPUs increases, this will likely
worsen.
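For reference, here is the arithmetic as a standalone sketch (plain
userspace C, nothing kernel-specific; the inputs are just the TOTAL, CON
and mean WAIT columns from the lockmeter output above):

#include <stdio.h>

int main(void)
{
	double total = 1557496;   /* TOTAL: io_request_lock acquisitions     */
	double con   = 0.267;     /* CON:   fraction that had to spin        */
	double wait  = 72e-6;     /* WAIT:  mean spin-wait per contended try */
	double sys   = 98.0;      /* approximate total system time (seconds) */
	double spin  = total * con * wait;

	printf("io_request_lock spin-wait: %.1fs (%.0f%% of %.0fs system time)\n",
	       spin, 100.0 * spin / sys, sys);
	return 0;
}
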
It also seems that __make_request() holds the lock for the largest
amount of time. This hold time isn't likely to change significantly
for a per-queue lock, but at least it will not affect queueing i/o
requests to other devices. Besides, I am not sure if blk_get_queue()
really needs to grab the io_request_lock. blk_dev[] entries aren't
likely to be updated in an open device and hence it should be
safe to look up the queue of an open device. For mutual
exclusion in the device-specific queue() function, it might be
better to leave it to the driver instead of forcing the mutual
exclusion. For example, a driver might want to use a reader/writer
lock to lookup its device table for the queue. It also might make sense to
have separate mutual exclusion mechanisms for block device
and scsi device level queues.
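To illustrate the reader/writer lock idea, here is roughly what I have
in mind -- purely a sketch, the mydrv_* names and the device table are
made up and not from any real driver:

#include <linux/blkdev.h>
#include <linux/kdev_t.h>
#include <linux/spinlock.h>

#define MYDRV_MAX_DEVS 16

struct mydrv_device {
	request_queue_t queue;
	/* ... other per-device state ... */
};

static rwlock_t mydrv_table_lock = RW_LOCK_UNLOCKED;
static struct mydrv_device *mydrv_table[MYDRV_MAX_DEVS];

/*
 * Look up the queue of an open device.  Readers can run concurrently,
 * so a lookup does not serialize against every other device the way
 * io_request_lock does; only device table updates take the write lock.
 */
static request_queue_t *mydrv_find_queue(kdev_t dev)
{
	struct mydrv_device *d;
	request_queue_t *q = NULL;

	read_lock(&mydrv_table_lock);
	d = mydrv_table[MINOR(dev)];
	if (d)
		q = &d->queue;
	read_unlock(&mydrv_table_lock);
	return q;
}
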
Thanks
Dipankar
--
Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
Jens Axboe [[email protected]] wrote:
> On Mon, Jul 09 2001, Jonathan Lahr wrote:
> >
> > I have heard that a patch to reduce io_request_lock contention by
> > breaking it into per queue locks was released in the past. Does
> > anyone know where I could find this patch if it exists?
>
> I had a patch about a year ago that did it safely for the block layer
> and IDE at least, and also for selected SCSI hba's. Some of the latter
> variety are pretty hard and/or tedious to fix up; Eric Y has done some
> work automating this process almost completely. Until that is done, the
> general patch has no chance of being integrated.
I am investigating reducing io_request_lock contention in the shorter term
if possible with smaller incremental modifications. So I'm first trying to
discover any previous work that might have been done toward this purpose.
--
Jonathan Lahr
IBM Linux Technology Center
Beaverton, Oregon
[email protected]
503-578-3385
The bit that I had automated was to essentially fix each and every
low-level SCSI driver such that each low-level driver would be responsible
for its own locking. At this point the patches and the tool are on hold -
once the 2.5 kernel series gets underway, I can generate some fairly massive
patchsets.
-Eric
----- Original Message -----
From: "Jonathan Lahr" <[email protected]>
To: "Jens Axboe" <[email protected]>
Cc: <[email protected]>; <[email protected]>
Sent: Tuesday, July 10, 2001 3:49 PM
Subject: Re: io_request_lock patch?
> Jens Axboe [[email protected]] wrote:
> > On Mon, Jul 09 2001, Jonathan Lahr wrote:
> > >
> > > I have heard that a patch to reduce io_request_lock contention by
> > > breaking it into per queue locks was released in the past. Does
> > > anyone know where I could find this patch if it exists?
> >
> > I had a patch about a year ago that did it safely for the block layer
> > and IDE at least, and also for selected SCSI hba's. Some of the latter
> > variety are pretty hard and/or tedious to fix up; Eric Y has done some
> > work automating this process almost completely. Until that is done, the
> > general patch has no chance of being integrated.
>
> I am investigating reducing io_request_lock contention in the shorter term
> if possible with smaller incremental modifications. So I'm first trying
to
> discover any previous work that might have been done toward this purpose.
>
> --
> Jonathan Lahr
> IBM Linux Technology Center
> Beaverton, Oregon
> [email protected]
> 503-578-3385
>
The call to do_aic7xxx_isr suggests that you are running the aic7xxx_old.c
code. This driver is using the io_request_lock to protect internal data.
The newer aic driver has its own lock. This is related to previous
comments by Jens and Eric about lower level use of this lock.
I would like to know why the request_freelist is going empty. Having
__get_request_wait being called a lot would appear to be not optimal.
-Mike
Dipankar Sarma [[email protected]] wrote:
> Hi Jens,
>
> In article <[email protected]> you wrote:
> > On Mon, Jul 09 2001, Jonathan Lahr wrote:
> > It's also interesting to take a look at _why_ there's contention on the
> > io_request_lock. And fix those up first.
>
> > --
> > Jens Axboe
>
> Here are some lockmeter outputs for tiobench
> (http://sourceforge.net/projects/tiobench), a simple benchmark
> that we tried on ext2 filesystem. 4 concurrent threads doing
> random/sequential read/write on 10MB files on a 4-way pIII 700MHz
> machine with 1MB L2 cache -
>
> SPINLOCKS HOLD WAIT
> UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
>
> 2.9% 26.7% 7.4us( 706us) 72us( 920us)( 1.9%) 1557496 73.3% 26.7% 0% io_request_lock
> 0.00% 34.9% 0.5us( 2.8us) 63us( 839us)(0.04%) 29478 65.1% 34.9% 0% __get_request_wait+0x98
> 2.6% 4.7% 17us( 706us) 69us( 740us)(0.13%) 617741 95.3% 4.7% 0% __make_request+0x110
> 0.07% 60.2% 0.5us( 4.0us) 72us( 920us)( 1.7%) 610820 39.8% 60.2% 0% blk_get_queue+0x10
> 0.09% 2.9% 6.6us( 55us) 102us( 746us)(0.01%) 55327 97.1% 2.9% 0% do_aic7xxx_isr+0x24
> 0.00% 3.7% 0.3us( 22us) 29us( 569us)(0.00%) 22602 96.3% 3.7% 0% generic_unplug_device+0x10
> 0.02% 4.9% 1.3us( 27us) 54us( 621us)(0.01%) 55382 95.1% 4.9% 0% scsi_dispatch_cmd+0x12c
> 0.02% 1.3% 1.2us( 8.0us) 23us( 554us)(0.00%) 55382 98.7% 1.3% 0% scsi_old_done+0x5b8
> 0.04% 3.2% 2.8us( 31us) 200us( 734us)(0.02%) 55382 96.8% 3.2% 0% scsi_queue_next_request+0x18
> 0.02% 1.4% 1.1us( 7.8us) 46us( 638us)(0.00%) 55382 98.6% 1.4% 0% scsi_request_fn+0x350
>
> 1557496*26.7%*72us makes it about 30 seconds of time waiting for
> io_request_lock. That is nearly one-third of the total system time
> (about 98 seconds). As the number of CPUs increases, this will likely
> worsen.
>
> It also seems that __make_request() holds the lock for the largest
> amount of time. This hold time isn't likely to change significantly
> for a per-queue lock, but at least it will not affect queueing i/o
> requests to other devices. Besides, I am not sure if blk_get_queue()
> really needs to grab the io_request_lock. blk_dev[] entries aren't
> likely to be updated in an open device and hence it should be
> safe to look up the queue of an open device. For mutual
> exclusion in the device-specific queue() function, it might be
> better to leave it to the driver instead of forcing the mutual
> exclusion. For example, a driver might want to use a reader/writer
> lock to lookup its device table for the queue. It also might make sense to
> have separate mutual exclusion mechanisms for block device
> and scsi device level queues.
>
> Thanks
> Dipankar
> --
> Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
> Linux Technology Center, IBM Software Lab, Bangalore, India.
--
Michael Anderson
[email protected]
On Tue, Jul 10 2001, Mike Anderson wrote:
> The call to do_aic7xxx_isr suggests that you are running the aic7xxx_old.c
> code. This driver is using the io_request_lock to protect internal data.
> The newer aic driver has its own lock. This is related to previous
> comments by Jens and Eric about lower level use of this lock.
Yes
> I would like to know why the request_freelist is going empty? Having
Well, it's a limited resource so it's bound to go empty when under load.
In fact, if you have plenty of RAM this is what will end up starting I/O
on the queued buffers. With the batch freeing of request slots, this is
ok.
> __get_request_wait being called a lot would appear to be not optimal.
?? Care to explain your reasoning behind this statement.
--
Jens Axboe
On Tue, Jul 10 2001, Dipankar Sarma wrote:
> In article <[email protected]> you wrote:
> > On Mon, Jul 09 2001, Jonathan Lahr wrote:
> > It's also interesting to take a look at _why_ there's contention on the
> > io_request_lock. And fix those up first.
>
> > --
> > Jens Axboe
>
> Here are some lockmeter outputs for tiobench
> (http://sourceforge.net/projects/tiobench), a simple benchmark
> that we tried on ext2 filesystem. 4 concurrent threads doing
> random/sequential read/write on 10MB files on a 4-way pIII 700MHz
> machine with 1MB L2 cache -
>
> SPINLOCKS HOLD WAIT
> UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
>
> 2.9% 26.7% 7.4us( 706us) 72us( 920us)( 1.9%) 1557496 73.3% 26.7% 0% io_request_lock
> 0.00% 34.9% 0.5us( 2.8us) 63us( 839us)(0.04%) 29478 65.1% 34.9% 0% __get_request_wait+0x98
> 2.6% 4.7% 17us( 706us) 69us( 740us)(0.13%) 617741 95.3% 4.7% 0% __make_request+0x110
> 0.07% 60.2% 0.5us( 4.0us) 72us( 920us)( 1.7%) 610820 39.8% 60.2% 0% blk_get_queue+0x10
> 0.09% 2.9% 6.6us( 55us) 102us( 746us)(0.01%) 55327 97.1% 2.9% 0% do_aic7xxx_isr+0x24
> 0.00% 3.7% 0.3us( 22us) 29us( 569us)(0.00%) 22602 96.3% 3.7% 0% generic_unplug_device+0x10
> 0.02% 4.9% 1.3us( 27us) 54us( 621us)(0.01%) 55382 95.1% 4.9% 0% scsi_dispatch_cmd+0x12c
> 0.02% 1.3% 1.2us( 8.0us) 23us( 554us)(0.00%) 55382 98.7% 1.3% 0% scsi_old_done+0x5b8
> 0.04% 3.2% 2.8us( 31us) 200us( 734us)(0.02%) 55382 96.8% 3.2% 0% scsi_queue_next_request+0x18
> 0.02% 1.4% 1.1us( 7.8us) 46us( 638us)(0.00%) 55382 98.6% 1.4% 0% scsi_request_fn+0x350
>
> 1557496*26.7%*72us makes it about 30 seconds of time waiting for
> io_request_lock. That is nearly one-third of the total system time
> (about 98 seconds). As the number of CPUs increases, this will likely
> worsen.
Auch! That's pretty bad.
> It also seems that __make_request() holds the lock for the largest
> amount of time. This hold time isn't likely to change significantly
__make_request -> elevator merge/insertion scan. This is what is taking
all the time, not __make_request itself. With the bio-XX patches I have
completely eliminated merge scans, so that can be done in O(1) time. For
now insertion is still an O(N) scan; maybe that will change too [1].
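To make the cost concrete, here is a toy userspace model of that scan --
this is not the kernel code, just an illustration of why queueing onto a
long queue is expensive while the lock is held:

#include <stdio.h>
#include <stdlib.h>

/* One queued request covering [sector, sector + nr_sectors). */
struct req {
	unsigned long sector;
	unsigned long nr_sectors;
	struct req *next;
};

/*
 * Queue one buffer: scan the whole list for a back-merge candidate and,
 * failing that, scan again for the sector-sorted insertion point.  Both
 * passes are O(N) in the queue length, all under the lock.
 */
static void enqueue(struct req **headp, unsigned long sector, unsigned long nr)
{
	struct req **pp, *rq;

	for (rq = *headp; rq; rq = rq->next) {
		if (rq->sector + rq->nr_sectors == sector) {
			rq->nr_sectors += nr;	/* back merge */
			return;
		}
	}
	for (pp = headp; *pp && (*pp)->sector < sector; pp = &(*pp)->next)
		;			/* insertion scan */
	rq = malloc(sizeof(*rq));
	rq->sector = sector;
	rq->nr_sectors = nr;
	rq->next = *pp;
	*pp = rq;
}

int main(void)
{
	struct req *head = NULL;

	enqueue(&head, 64, 8);	/* lone request at sector 64 */
	enqueue(&head, 8, 8);	/* inserted in sorted order  */
	enqueue(&head, 16, 8);	/* back-merges onto [8,16)   */
	enqueue(&head, 72, 8);	/* back-merges onto [64,72)  */
	for (; head; head = head->next)
		printf("request: sector %lu, %lu sectors\n",
		       head->sector, head->nr_sectors);
	return 0;
}
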
> for a per-queue lock, but at least it will not affect queueing i/o
> requests to other devices. Besides, I am not sure if blk_get_queue()
> really needs to grab the io_request_lock. blk_dev[] entries aren't
Funny, this is one thing I've been looking at too. blk_get_queue _will_
die, don't worry. And yes, ie at open we can assign the queue. Or simply
map it and protect it otherwise. It all ties in with being able to up or
down a queue too; currently grabbing io_request_lock from blk_get_queue
accomplishes exactly nothing and may as well be removed. If you do that,
does it change the contention numbers significantly?
[1] Appropriate data structures that both allow barriers and quick
insertion sorts are needed.
--
Jens Axboe
On Tue, Jul 10 2001, Eric Youngdale wrote:
>
> The bit that I had automated was to essentially fix each and every
> low-level SCSI driver such that each low-level driver would be responsible
> for its own locking. At this point the patches and the tool are on hold -
> once the 2.5 kernel series gets underway, I can generate some fairly massive
> patchsets.
Cool, sounds good, Eric. I'll probably go ahead and complete the initial
ll_rw_blk and IDE work, along with a few selected SCSI drivers. Then
leave the grunt work to your tools.
--
Jens Axboe
Hi Jens,
On Wed, Jul 11, 2001 at 09:19:00AM +0200, Jens Axboe wrote:
> On Tue, Jul 10 2001, Dipankar Sarma wrote:
> > In article <[email protected]> you wrote:
> > > On Mon, Jul 09 2001, Jonathan Lahr wrote:
> >
> > SPINLOCKS HOLD WAIT
> > UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
> >
> > 0.07% 60.2% 0.5us( 4.0us) 72us( 920us)( 1.7%) 610820 39.8% 60.2% 0% blk_get_queue+0x10
> >
> > 1557496*26.7%*72us makes it about 30 seconds of time waiting for
> > io_request_lock. That is nearly one-third of the total system time
> > (about 98 seconds). As the number of CPUs increases, this will likely
> > worsen.
>
> Auch! That's pretty bad.
>
> > It also seems that __make_request() holds the lock for the largest
> > amount of time. This hold time isn't likely to change significantly
>
> __make_request -> elevator merge/insertion scan. This is what is taking
> all the time, not __make_request itself. With the bio-XX patches I have
> completely eliminated merge scans, so that can be done in O(1) time. For
> now insertion is still an O(N) scan; maybe that will change too [1].
I haven't got as far down as the elevator algorithm yet, but I would
like to, at some point in time. In any case, my point was that because
of disk block sorting done during initial queueing, there is likely
to be a slightly longer lock (per-queue or otherwise) hold time there compared
to, say, dequeueing for dispatch to low-level drivers.
Where can I get the bio patches from?
>
> > for a per-queue lock, but at least it will not affect queueing i/o
> > requests to other devices. Besides, I am not sure if blk_get_queue()
> > really needs to grab the io_request_lock. blk_dev[] entries aren't
>
> Funny, this is one thing I've been looking at too. blk_get_queue _will_
> die, don't worry. And yes, ie at open we can assign the queue. Or simply
> map it and protect it otherwise. It all ties in with being able to up or
> down a queue too, currently grabbing io_request_lock from blk_get_queue
> accomplishes exactly nothing and may as well be removed. If you do that,
> does it change the contention numbers significantly?
I haven't experimented with this yet, but theoretically speaking
yes, it should make a big difference. blk_get_queue() grabs the lock
very often and holds it for a very short period of time on average,
so it is the one that is affected most. Out of the 30 seconds of
spin-wait for io_request_lock, blk_get_queue() seems to take up
610820*60.2%*72us = 26.5 seconds. I will get to this soon though.
BTW, where can I get some of these lock-splitting patches from? I
can do one myself for scsi+aic7xxx, but if such work already exists,
I would like to start off with it.
Thanks
Dipankar
--
Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
Hi Mike,
On Tue, Jul 10, 2001 at 04:05:12PM -0700, Mike Anderson wrote:
> The call to do_aic7xxx_isr suggests that you are running the aic7xxx_old.c
> code. This driver is using the io_request_lock to protect internal data.
> The newer aic driver has its own lock. This is related to previous
> comments by Jens and Eric about lower level use of this lock.
There were some problems booting with the new aic7xxx driver and 2.4.4
kernel. This may have been fixed in later kernels, so we will check
this again. Besides, I wasn't aware that the new aic7xxx driver uses
a different locking model. Thanks for letting me know.
>
> I would like to know why the request_freelist is going empty? Having
> __get_request_wait being called a lot would appear to be not optimal.
It is not unreasonable for request IOCB pools to go empty; the important
issue is at what rate. If a large portion of I/Os have to wait for
request structures to be freed, we may not be able to utilize the available
hardware bandwidth of the system optimally when we need, say, a large
# of IOs/sec. On the other hand, having a large number of request structures
available may not necessarily give you a large number of IOs/sec. The thing
to look at would be how well we are utilizing the queueing capability
of the hardware given a particular type of workload.
Thanks
Dipankar
--
Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Wed, Jul 11 2001, Dipankar Sarma wrote:
> > > It also seems that __make_request() holds the lock for the largest
> > > amount of time. This hold time isn't likely to change significantly
> >
> > __make_request -> elevator merge/insertion scan. This is what is taking
> > all the time, not __make_request itself. With the bio-XX patches I have
> > completely eliminated merge scans, so that can be done in O(1) time. For
> > now insertion is still an O(N) scan; maybe that will change too [1].
>
> I haven't got as far down as the elevator algorithm yet, but I would
> like to, at some point in time. In any case, my point was that because
> of disk block sorting done during initial queueing, there is likely
> to be a slightly longer lock (per-queue or otherwise) hold time there
> compared to, say, dequeueing for dispatch to low-level drivers.
That's exactly right. The merging/insertion is O(N) now, so for queueing
100 buffers it will take some time. Especially for bigger hardware,
where we allocate a much bigger freelist (and potentially get very long
queues). Dequeueing is O(1) always, and thus doesn't hold the lock for
long.
> Where can I get the bio patches from ?
kernel.org/pub/linux/kernel/people/axboe/v2.5
wait for bio-14 final though.
> > > for a per-queue lock, but at least it will not affect queueing i/o
> > > requests to other devices. Besides, I am not sure if blk_get_queue()
> > > really needs to grab the io_request_lock. blk_dev[] entries aren't
> >
> > Funny, this is one thing I've been looking at too. blk_get_queue _will_
> > die, don't worry. And yes, ie at open we can assign the queue. Or simply
> > map it and protect it otherwise. It all ties in with being able to up or
> > down a queue too, currently grabbing io_request_lock from blk_get_queue
> > accomplishes exactly nothing and may as well be removed. If you do that,
> > does it change the contention numbers significantly?
>
> I haven't experimented with this yet, but theoretically speaking
> yes, it should make a big difference. blk_get_queue() grabs the lock
> very often and holds it for a very short period of time on average,
> so it is the one that is affected most. Out of the 30 seconds of
> spin-wait for io_request_lock, blk_get_queue() seems to take up
> 610820*60.2%*72us = 26.5 seconds. I will get to this soon though.
In that case, I'll make sure to rip it out immediately in the stock
kernel too. You'll note that the bio-XX patches don't use it either, and
haven't for some time.
> BTW, where can I get some of these lock-splitting patches from ? I
> can do one myself for scsi+aic7xxx, but if there already exist some
> work, I would like to start off with them.
I have some old patches somewhere, but no chance of them applying now.
It's mostly a case of s/io_request_lock/q->queue_lock in some way for
all cases, so it's probably just as easy if you do it yourself for
testing.
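As an illustration of the flavour of change, an unplug function shaped
like generic_unplug_device() would end up looking something like this --
a sketch only, assuming the patched queue carries its own spinlock
(called queue_lock below); it is not taken from an actual patch:

#include <linux/blkdev.h>

static void mydrv_unplug_device(request_queue_t *q)
{
	unsigned long flags;

	/*
	 * Previously this would be spin_lock_irqsave(&io_request_lock, flags),
	 * serializing against every queue in the system; now only this
	 * queue is serialized.
	 */
	spin_lock_irqsave(&q->queue_lock, flags);
	if (q->plugged) {
		q->plugged = 0;
		if (!list_empty(&q->queue_head))
			q->request_fn(q);
	}
	spin_unlock_irqrestore(&q->queue_lock, flags);
}
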
--
Jens Axboe
On Wed, Jul 11 2001, Dipankar Sarma wrote:
> > I would like to know why the request_freelist is going empty? Having
> > __get_request_wait being called a lot would appear to be not optimal.
>
> It is not unreasonable for request IOCB pools to go empty, the important
> issue is at what rate ? If a large portion of I/Os have to wait for
> request structures to be freed, we may not be able to utilize the available
> hardware bandwidth of the system optimally when we need, say, large
> # of IOs/Sec. On the other hand, having large number of request structures
> available may not necessarily give you large IOs/sec. The thing to look
> at would be - how well are we utilizing the queueing capability
> of the hardware given a particular type of workload.
The queue lengths should always be long enough to keep the hw busy of
course. And in addition, the bigger the queues the bigger the chance of
skipping seeks due to reordering. But don't worry, I've scaled the queue
lengths so I'm pretty sure that they are always on the safe side in
size.
It's pretty easy to test for yourself if you want: just change
QUEUE_NR_REQUESTS in blkdev.h. It's currently 8192, and the request slots
are scaled down from this value. 8k will give you twice as many
slots as you have RAM in MB, i.e. 2048 on a 1 GB machine.
block: queued sectors max/low 683554kB/552482kB, 2048 slots per queue
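In userspace terms the relationship is roughly this -- a model of the
numbers above, not the actual ll_rw_blk.c scaling code, which may differ
in detail:

#include <stdio.h>

#define QUEUE_NR_REQUESTS 8192	/* the blkdev.h ceiling mentioned above */

/* Twice the RAM size in MB, capped at QUEUE_NR_REQUESTS. */
static unsigned long slots_per_queue(unsigned long ram_mb)
{
	unsigned long slots = 2 * ram_mb;

	return slots > QUEUE_NR_REQUESTS ? QUEUE_NR_REQUESTS : slots;
}

int main(void)
{
	printf("128 MB -> %lu slots per queue\n", slots_per_queue(128));
	printf("  1 GB -> %lu slots per queue\n", slots_per_queue(1024));
	printf("  8 GB -> %lu slots per queue\n", slots_per_queue(8192));
	return 0;
}
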
--
Jens Axboe
On Wed, Jul 11 2001, Dipankar Sarma wrote:
> On Wed, Jul 11, 2001 at 10:53:39AM +0200, Jens Axboe wrote:
> > The queue lengths should always be long enough to keep the hw busy of
> > course. And in addition, the bigger the queues the bigger the chance of
> > skipping seeks due to reordering. But don't worry, I've scaled the queue
> > lengths so I'm pretty sure that they are always on the safe side in
> > size.
> >
> > It's pretty easy to test for yourself if you want, just change
> > QUEUE_NR_REQUESTS in blkdev.h. It's currently 8192, the request slots
> > are scaled down from this value. 8k will give you twice the amount of
> > slots that you have RAM in mb, ie 2048 on a 1gig machine.
> >
> > block: queued sectors max/low 683554kB/552482kB, 2048 slots per queue
>
> Hmm.. The tiobench run was done on a 1GB machine and we still ran
> out of request slots. Will investigate.
Sure, that's to be expected. If we never ran out we would be wasting
memory. My point is that you should rerun the same test with more
request slots -- and I'd be surprised if you gained any significant
performance on that account. I never said that you'd never run out;
that's of course not true. In fact, running out is what starts the I/O
typically on a 1GB machine and bigger.
--
Jens Axboe
On Wed, Jul 11, 2001 at 10:53:39AM +0200, Jens Axboe wrote:
> The queue lengths should always be long enough to keep the hw busy of
> course. And in addition, the bigger the queues the bigger the chance of
> skipping seeks due to reordering. But don't worry, I've scaled the queue
> lengths so I'm pretty sure that they are always on the safe side in
> size.
>
> It's pretty easy to test for yourself if you want, just change
> QUEUE_NR_REQUESTS in blkdev.h. It's currently 8192, the request slots
> are scaled down from this value. 8k will give you twice the amount of
> slots that you have RAM in mb, ie 2048 on a 1gig machine.
>
> block: queued sectors max/low 683554kB/552482kB, 2048 slots per queue
Hmm.. The tiobench run was done on a 1GB machine and we still ran
out of request slots. Will investigate.
Thanks
Dipankar
--
Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
Hi Jens,
On Wed, Jul 11, 2001 at 04:01:48PM +0200, Jens Axboe wrote:
> On Wed, Jul 11 2001, Dipankar Sarma wrote:
> > On Wed, Jul 11, 2001 at 10:53:39AM +0200, Jens Axboe wrote:
> > > The queue lengths should always be long enough to keep the hw busy of
> > > course. And in addition, the bigger the queues the bigger the chance of
> > > skipping seeks due to reordering. But don't worry, I've scaled the queue
> > > lengths so I'm pretty sure that they are always on the safe side in
> > > size.
> > >
> > > It's pretty easy to test for yourself if you want, just change
> > > QUEUE_NR_REQUESTS in blkdev.h. It's currently 8192, the request slots
> > > are scaled down from this value. 8k will give you twice the amount of
> > > slots that you have RAM in mb, ie 2048 on a 1gig machine.
> > >
> > > block: queued sectors max/low 683554kB/552482kB, 2048 slots per queue
> >
> > Hmm.. The tiobench run was done on a 1GB machine and we still ran
> > out of request slots. Will investigate.
>
> Sure, that's to be expected. If we never ran out we would be wasting
> memory. My point is that you should rerun the same test with more
> request slots -- and I'd be surprised if you gained any significant
> performance on that account. I never said that you'd never run out,
> that's of course not true. In fact, running out is what starts the I/O
> typically on a 1GB machine and bigger.
I agree that running out of slots doesn't necessarily mean anything. I
was just wondering about the queue lengths. On the face of it, 2048
requests pending for a single disk seems like a fairly large queue length.
Is this all related to plugging of queues for disk block optimization?
Thanks
Dipankar
--
Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
Sorry for the slow fingers; just trying to catch up on this thread.
Dipankar Sarma [[email protected]] wrote:
> Hi Mike,
>
> On Tue, Jul 10, 2001 at 04:05:12PM -0700, Mike Anderson wrote:
> > The call to do_aic7xxx_isr suggests that you are running the aic7xxx_old.c
> > code. This driver is using the io_request_lock to protect internal data.
> > The newer aic driver has its own lock. This is related to previous
> > comments by Jens and Eric about lower level use of this lock.
>
> There were some problems booting with the new aic7xxx driver and 2.4.4
> kernel. This may have been fixed in later kernels, so we will check
> this again. Besides, I wasn't aware that the new aic7xxx driver uses
> a different locking model. Thanks for letting me know.
>
> >
> > I would like to know why the request_freelist is going empty? Having
> > __get_request_wait being called a lot would appear to be not optimal.
>
> It is not unreasonable for request IOCB pools to go empty, the important
> issue is at what rate ? If a large portion of I/Os have to wait for
> request structures to be freed, we may not be able to utilize the available
> hardware bandwidth of the system optimally when we need, say, large
> # of IOs/Sec. On the other hand, having large number of request structures
> available may not necessarily give you large IOs/sec. The thing to look
> at would be - how well are we utilizing the queueing capability
> of the hardware given a particular type of workload.
Jens, I think Dipankar might have stated my comment about questioning
optimal utilization of a pool of resources shared by all device queues in
the last sentence of the above paragraph.
My thought was that if one has enough IO that cannot be merged on a queue,
you eat up request descriptors. If a request queue contains more requests
than can be put in flight by the lower level to a spindle, then this
resource might be better used by other request queues.
I might be missing something and I will look at the code some more.
Thanks.
>
> Thanks
> Dipankar
> --
> Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
> Linux Technology Center, IBM Software Lab, Bangalore, India.
--
Michael Anderson
[email protected]
On Wed, Jul 11 2001, Dipankar Sarma wrote:
> On Wed, Jul 11, 2001 at 04:01:48PM +0200, Jens Axboe wrote:
> > On Wed, Jul 11 2001, Dipankar Sarma wrote:
> > > On Wed, Jul 11, 2001 at 10:53:39AM +0200, Jens Axboe wrote:
> > > > The queue lengths should always be long enough to keep the hw busy of
> > > > course. And in addition, the bigger the queues the bigger the chance of
> > > > skipping seeks due to reordering. But don't worry, I've scaled the queue
> > > > lengths so I'm pretty sure that they are always on the safe side in
> > > > size.
> > > >
> > > > It's pretty easy to test for yourself if you want, just change
> > > > QUEUE_NR_REQUESTS in blkdev.h. It's currently 8192, the request slots
> > > > are scaled down from this value. 8k will give you twice the amount of
> > > > slots that you have RAM in mb, ie 2048 on a 1gig machine.
> > > >
> > > > block: queued sectors max/low 683554kB/552482kB, 2048 slots per queue
> > >
> > > Hmm.. The tiobench run was done on a 1GB machine and we still ran
> > > out of request slots. Will investigate.
> >
> > Sure, that's to be expected. If we never ran out we would be wasting
> > memory. My point is that you should rerun the same test with more
> > request slots -- and I'd be surprised if you gained any significant
> > performance on that account. I never said that you'd never run out,
> > that's of course not true. In fact, running out is what starts the I/O
> > typically on a 1GB machine and bigger.
>
> I agree that running out of slots doesn't necessarily mean anything. I
> was just wondering about the queue lengths. On the face of it, 2048
> requests pending for a single disk seems like a fairly large queue length.
> Is this all related to plugging of queues for disk block optimization ?
Plugging to a degree, although when you get queues this long you can
get nice merging even though the queue will probably never get replugged
(because it never empties). The merging is the most important step, and
also the insertion sort of course.
--
Jens Axboe
On Wed, Jul 11 2001, Mike Anderson wrote:
> Dipankar Sarma [[email protected]] wrote:
> > Hi Mike,
> >
> > On Tue, Jul 10, 2001 at 04:05:12PM -0700, Mike Anderson wrote:
> > > The call to do_aic7xxx_isr suggests that you are running the aic7xxx_old.c
> > > code. This driver is using the io_request_lock to protect internal data.
> > > The newer aic driver has its own lock. This is related to previous
> > > comments by Jens and Eric about lower level use of this lock.
> >
> > There were some problems booting with the new aic7xxx driver and 2.4.4
> > kernel. This may have been fixed in later kernels, so we will check
> > this again. Besides, I wasn't aware that the new aic7xxx driver uses
> > a different locking model. Thanks for letting me know.
> >
> > >
> > > I would like to know why the request_freelist is going empty? Having
> > > __get_request_wait being called a lot would appear to be not optimal.
> >
> > It is not unreasonable for request IOCB pools to go empty, the important
> > issue is at what rate ? If a large portion of I/Os have to wait for
> > request structures to be freed, we may not be able to utilize the available
> > hardware bandwidth of the system optimally when we need, say, large
> > # of IOs/Sec. On the other hand, having large number of request structures
> > available may not necessarily give you large IOs/sec. The thing to look
> > at would be - how well are we utilizing the queueing capability
> > of the hardware given a particular type of workload.
>
> Jens, I think Dipankar might have stated my comment about questioning
> optimal utilization of a pool of resources shared by all device queues in
> the last sentence of the above paragraph.
>
> My thought was that if one has enough IO that cannot be merged on a queue
> you eat up request descriptors. If a request queue contains more requests
> than can be put in flight by the lower level to a spindle, then this
> resource might be better used by other request queues.
True. In theory it would be possible to do request slot stealing from
idle queues; in fact it's doable without adding any additional overhead
to struct request. I did discuss this with [someone, forgot who] last
year, when the per-queue slots were introduced.
I'm not sure I want to do this though. If you have lots of disks, then
yes there will be some wastage if they are idle. IMO that's ok. What's
not ok and what I do want to fix is that slower devices get just as many
slots as a 15K disk for instance. For, say, floppy or CDROM devices we
really don't need to waste that much RAM. This will change for 2.5, not
before.
--
Jens Axboe
On Wed, Jul 11, 2001 at 09:20:22PM +0200, Jens Axboe wrote:
> True. In theory it would be possible to do request slot stealing from
> idle queues, in fact it's doable without adding any additional overhead
> to struct request. I did discuss this with [someone, forgot who] last
> year, when the per-queue slots were introduced.
>
> I'm not sure I want to do this though. If you have lots of disks, then
> yes there will be some wastage if they are idle. IMO that's ok. What's
> not ok and what I do want to fix is that slower devices get just as many
> slots as a 15K disk for instance. For, say, floppy or CDROM devices we
> really don't need to waste that much RAM. This will change for 2.5, not
> before.
Unless there is some serious evidence substantiating the need for
stealing request slots from other devices to avoid starvation, it
makes sense to avoid it and go for a simpler scheme. I suspect that device
type based slot allocation should just suffice.
Thanks
Dipankar
--
Dipankar Sarma <[email protected]> Project: http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.
On Thu, Jul 12 2001, Dipankar Sarma wrote:
> On Wed, Jul 11, 2001 at 09:20:22PM +0200, Jens Axboe wrote:
> > True. In theory it would be possible to do request slot stealing from
> > idle queues, in fact it's doable without adding any additional overhead
> > to struct request. I did discuss this with [someone, forgot who] last
> > year, when the per-queue slots were introduced.
> >
> > I'm not sure I want to do this though. If you have lots of disks, then
> > yes there will be some wastage if they are idle. IMO that's ok. What's
> > not ok and what I do want to fix is that slower devices get just as many
> > slots as a 15K disk for instance. For, say, floppy or CDROM devices we
> > really don't need to waste that much RAM. This will change for 2.5, not
> > before.
>
> Unless there is some serious evidence substantiating the need for
> stealing request slots from other devices to avoid starvation, it
> makes sense to avoid it and go for a simpler scheme. I suspect that device
> type based slot allocation should just suffice.
My point exactly. And typically, if you have lots of queues you have
lots of RAM. A standard 128meg desktop machine does not waste a whole
lot.
--
Jens Axboe
Jens Axboe [[email protected]] wrote:
> On Thu, Jul 12 2001, Dipankar Sarma wrote:
> > On Wed, Jul 11, 2001 at 09:20:22PM +0200, Jens Axboe wrote:
> > > True. In theory it would be possible to do request slot stealing from
> > > idle queues, in fact it's doable without adding any additional overhead
> > > to struct request. I did discuss this with [someone, forgot who] last
> > > year, when the per-queue slots were introduced.
> > >
> > > I'm not sure I want to do this though. If you have lots of disks, then
> > > yes there will be some wastage if they are idle. IMO that's ok. What's
> > > not ok and what I do want to fix is that slower devices get just as many
> > > slots as a 15K disk for instance. For, say, floppy or CDROM devices we
> > > really don't need to waste that much RAM. This will change for 2.5, not
> > > before.
> >
> > Unless there is some serious evidence substantiating the need for
> > stealing request slots from other devices to avoid starvation, it
> > makes sense to avoid it and go for a simpler scheme. I suspect that device
> > type based slot allocation should just suffice.
>
> My point exactly. And typically, if you have lots of queues you have
> lots of RAM. A standard 128meg desktop machine does not waste a whole
> lot.
I would vote for the simpler approach of slots. :-).
>
> --
> Jens Axboe
--
Michael Anderson
[email protected]