2002-03-01 20:09:08

by Jeff V. Merkey

Subject: queue_nr_requests needs to be selective


Linus/Alan/Linux,

Performance numbers can be increased dramatically (> 300 MB/S)
by increasing queue_nr_requests in ll_rw_blk.c on large RAID
controllers that are hosting a lot of drives.

We have noticed that with the defaults of 64/128 for queue_nr_requests,
2.4.18 performance hits a wall at 230 MB/S system wide. We have gotten
this number to approach 300 MB/S across 4 adapters (for short periods
during cache flushing) by putting 2 cards on the 33 MHz bus and 2 cards
on the 66 MHz bus in a Serverworks configuration running Dolphin's SCI
adapters as the interconnect to an SCI clustering fabric.

With queue_nr_requests set to 64/128, we have noticed via profile=2
that a lot of time is spent needlessly going to sleep in
__get_request_wait(), with calling threads sleeping far more than is
needed. Since SCI does direct memory DMAs, we are at present pushing
238 MB/S from an SCI adapter into local cache, and the cache is bursting
writes to the 3Ware adapters at very high data rates (> 300 MB/S), but
only after I have hard coded 1024 into queue_nr_requests for the
disk queues in 2.4.18. We are at 34% utilization on a single
processor (no SMP) moving data at these rates with SCI.

What is really needed here is to allow queue_nr_requests to be
configurable on a per adapter/device basis for these high end
RAID cards like 3Ware since in a RAID 0 configuration, 8 drives
are in essence a terabyte (1.3 terabytes in our configuration)
and each adapter is showing up as a 1.3 TB device. 64/128
requests are simply not enough to get the full spectrum of
performance attainable with these cards.

These numbers need to be increased. I cannot get any 2.4.18
system to get above 230 MB/S with 3Ware RAID adapters with
the numbers set to 64/128 on these big devices. An array
of 8 x 160GB disks eats up these request queue free lists,
then the feeding threads spend most of their time asleep.

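For context, the 64/128 figures come from a one-time global sizing
decision in blk_dev_init() (2.4 drivers/block/ll_rw_blk.c, abridged;
it is also visible in the hunks removed by the patches later in this
thread). Every queue in the system gets the same depth, keyed only to
total RAM, with no awareness of what sits behind each queue:

	#define MB(kb)	((kb) << 10)

	total_ram = nr_free_pages() << (PAGE_SHIFT - 10);

	/*
	 * Free request slots per queue.
	 * (Half for reads, half for writes)
	 */
	queue_nr_requests = 64;
	if (total_ram > MB(32))
		queue_nr_requests = 128;

	/*
	 * Batch frees according to queue length
	 */
	batch_requests = queue_nr_requests/4;
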
Respectfully submitted,

Jeff Merkey








2002-03-01 20:45:00

by Andrew Morton

Subject: Re: queue_nr_requests needs to be selective

"Jeff V. Merkey" wrote:
>
> Linus/Alan/Linux,
>
> Performance numbers can be increased dramatically (> 300 MB/S)
> by increasing queue_nr_requests in ll_rw_blk.c on large RAID
> controllers that are hosting a lot of drives.

I don't immediately see why increasing the queue length should
increase bandwidth in this manner. One possibility is that
the shorter queue results in tasks sleeping in __get_request_wait
more often, and the real problem is the "request starvation" thing.

The "request starvation" thing could conceivably result in more
seeky behaviour. In your kernel, disk writeouts come from
two places:

- Off the tail of the dirty buffer LRU

- Basically random guess, from the page LRU.

It's competition between these two writeout sources which causes
decreased bandwidth - I've seen kernels in which ext2 writeback
performance was down 40% due to this.

Anyway. You definitely need to try 2.4.19-pre1. Max sleep times
in __get_request_wait will be improved, and it's possible that the
bandwidth will improve. Or not. My gut feel is that it won't
help.
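
For reference, the sleep path in question looks roughly like this in
2.4.18 (abridged from ll_rw_blk.c; exact details vary between
versions). A thread that cannot get a free request sleeps until the
freelist refills past batch_requests:

	static struct request *__get_request_wait(request_queue_t *q, int rw)
	{
		register struct request *rq;
		DECLARE_WAITQUEUE(wait, current);

		generic_unplug_device(q);
		add_wait_queue(&q->wait_for_request, &wait);
		do {
			set_current_state(TASK_UNINTERRUPTIBLE);
			/* sleep until the freelist has refilled somewhat */
			if (q->rq[rw].count < batch_requests)
				schedule();
			spin_lock_irq(&io_request_lock);
			rq = get_request(q, rw);
			spin_unlock_irq(&io_request_lock);
		} while (rq == NULL);
		remove_wait_queue(&q->wait_for_request, &wait);
		current->state = TASK_RUNNING;
		return rq;
	}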

And yes, 128 requests is too few. It used to be ~1000. I think
this change was made in a (misguided, unsuccessful) attempt to
manage latency for readers. The request queue is the only mechanism
we have for realigning out-of-order requests and it needs to be
larger so it can do this better. I've seen 15-25% throughput
improvements from a 1024-slot request queue.

And if a return to a large request queue damages latency (it doesn't)
then we need to fix that latency *without* damaging request merging.

First step: please test 2.4.19-pre1 or -pre2. Also 2.4.19-pre1-ac2
may provide some surprises..


2002-03-01 20:49:20

by Alan

Subject: Re: queue_nr_requests needs to be selective

> I don't immediately see why increasing the queue length should
> increase bandwidth in this manner. One possibility is that

Latency on the controllers ? With caches on the controller and disks you
can have a lot of requests actually in flight

2002-03-01 23:06:25

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective

On Fri, Mar 01, 2002 at 12:43:09PM -0800, Andrew Morton wrote:
> "Jeff V. Merkey" wrote:
> >
> > Linus/Alan/Linux,
> >
> > Performance numbers can be increased dramatically (> 300 MB/S)
> > by increasing queue_nr_requests in ll_rw_blk.c on large RAID
> > controllers that are hosting a lot of drives.
>
> I don't immediately see why increasing the queue length should
> increase bandwidth in this manner. One possibility is that
> the shorter queue results in tasks sleeping in __get_request_wait
> more often, and the real problem is the "request starvation" thing.

This is the case. We end up sleeping with 8,000 buffer head requests
(4K block size) queued per adapter. After I made the change and
increased the size to 1024, this number increased to 17,000
buffer heads queued per adapter. Performance went up and processor
utilization went down.

From the profiling I ran, we were sleeping in __get_request_wait
far too much. This value should be maintained on a per card
basis, and for RAID controllers that present a single virtual disk
for many physical disks (i.e. on 3Ware this number is 8), we should
make the queue 8 x the default. I guess each driver would need to
change this value based on how many actual drives were attached
to the controller.

>
> The "request starvation" thing could conceivably result in more
> seeky behaviour. In your kernel, disk writeouts come from
> two places:

This is not happening. The elevator code is above this level, and
I am seeing that the requests are ordered for the most part at this
layer.

>
> - Off the tail of the dirty buffer LRU
>
> - Basically random guess, from the page LRU.

There are two scenarios: one is not using Linus' buffer cache
but a custom cache maintained between SCI nodes; the other implementation
is using Linus' buffer cache. We are seeing > 300 MB/S on the SCI
cache.

>
> It's competition between these two writeout sources which causes
> decreased bandwidth - I've seen kernels in which ext2 writeback
> performance was down 40% due to this.
>
> Anyway. You definitely need to try 2.4.19-pre1. Max sleep times
> in __get_request_wait will be improved, and it's possible that the
> bandwidth will improve. Or not. My gut feel is that it won't
> help.
>

How about just increasing the value of queue_nr_requests or making
it adapter specific?


> And yes, 128 requests is too few. It used to be ~1000. I think
> this change was made in a (misguided, unsuccessful) attempt to
> manage latency for readers. The request queue is the only mechanism
> we have for realigning out-of-order requests and it needs to be
> larger so it can do this better. I've seen 15-25% throughput
> improvements from a 1024-slot request queue.


>
> And if a return to a large request queue damages latency (it doesn't)
> then we need to fix that latency *without* damaging request merging.
>
> First step: please test 2.4.19-pre1 or -pre2. Also 2.4.19-pre1-ac2
> may provide some surprises..
>

I will test, but unless this value is higher, I am skeptical I will see
the needed improvement. The issue here is that it sleeps too much;
what's really happening is that we are forcing 8 disk drives
to share 64/128 request buffers rather than providing each physical disk
with what it really needs.


Jeff


2002-03-01 23:08:39

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective

On Fri, Mar 01, 2002 at 09:03:22PM +0000, Alan Cox wrote:
> > I don't immediately see why increasing the queue length should
> > increase bandwidth in this manner. One possibility is that
>
> Latency on the controllers ? With caches on the controller and disks you
> can have a lot of requests actually in flight


Since 3Ware uses a switched architecture, there are a lot of requests
in flight. Can we get the default value increased or, alternately, allow
RAID drivers to submit a local queue_nr_requests value per adapter,
so that for those cards with 8 drives that make themselves look like
a single drive to Linux, sufficient buffers are available to prevent
threads feeding the device from going to sleep all the time?

:-)

Jeff

2002-03-01 23:25:09

by Andrew Morton

Subject: Re: queue_nr_requests needs to be selective

"Jeff V. Merkey" wrote:
>
> The issue here is that it sleeps too much;
> what's really happening is that we are forcing 8 disk drives
> to share 64/128 request buffers rather than providing each physical disk
> with what it really needs.

OK. So would it suffice to make queue_nr_requests an argument to
a new blk_init_queue()?

- blk_init_queue(q, sci_request);
+ blk_init_queue_ng(q, sci_request, 1024);




2002-03-02 00:13:15

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective

On Fri, Mar 01, 2002 at 03:23:18PM -0800, Andrew Morton wrote:
> "Jeff V. Merkey" wrote:
> >
> > The issue here is that it sleeps too much;
> > what's really happening is that we are forcing 8 disk drives
> > to share 64/128 request buffers rather than providing each physical disk
> > with what it really needs.
>
> OK. So would it suffice to make queue_nr_requests an argument to
> a new blk_init_queue()?
>
> - blk_init_queue(q, sci_request);
> + blk_init_queue_ng(q, sci_request, 1024);
>

Andrew,

Yep. This will do it. 3Ware (Adam) and the other RAID card folks
would need to call it and set up their queues based on how many disks
are actually attached to the controller (which they know based on
their local state). It would be great if this value were maintained as
adapter specific, i.e.

void blk_init_queue(request_queue_t * q, request_fn_proc * rfn, int depth)
{
	INIT_LIST_HEAD(&q->queue_head);
	elevator_init(&q->elevator, ELEVATOR_LINUS);

	// NEW CODE: record the clamped depth before blk_init_free_list()
	// runs, since the free list is built from this value

	if (depth > 1024)
		depth = 1024;
	q->queue_nr_requests_local = (depth ? depth : -1);

	//

	blk_init_free_list(q);
	q->request_fn = rfn;
	q->back_merge_fn = ll_back_merge_fn;
	q->front_merge_fn = ll_front_merge_fn;
	q->merge_requests_fn = ll_merge_requests_fn;
	q->make_request_fn = __make_request;
	q->plug_tq.sync = 0;
	q->plug_tq.routine = &generic_unplug_device;
	q->plug_tq.data = q;
	q->plugged = 0;

	/*
	 * These booleans describe the queue properties. We set the
	 * default (and most common) values here. Other drivers can
	 * use the appropriate functions to alter the queue properties
	 * as appropriate.
	 */
	q->plug_device_fn = generic_plug_device;
	q->head_active = 1;
}

and then modify this function something like this:

static void blk_init_free_list(request_queue_t *q)
{
	struct request *rq;
	int i;

	// NEW CODE
	int queue_depth;

	queue_depth = ((q->queue_nr_requests_local != -1)
			? q->queue_nr_requests_local
			: queue_nr_requests);
	//

	INIT_LIST_HEAD(&q->rq[READ].free);
	INIT_LIST_HEAD(&q->rq[WRITE].free);
	q->rq[READ].count = 0;
	q->rq[WRITE].count = 0;

	/*
	 * Divide requests in half between read and write
	 */
	// for (i = 0; i < queue_nr_requests; i++) {
	for (i = 0; i < queue_depth; i++) {
		rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
		if (rq == NULL) {
			/* We'll get a `leaked requests' message from blk_cleanup_queue */
			printk(KERN_EMERG "blk_init_free_list: error allocating requests\n");
			break;
		}
		memset(rq, 0, sizeof(struct request));
		rq->rq_status = RQ_INACTIVE;
		list_add(&rq->queue, &q->rq[i&1].free);
		q->rq[i&1].count++;
	}

	init_waitqueue_head(&q->wait_for_request);
	spin_lock_init(&q->queue_lock);
}

and we're there.
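
A hypothetical call site for such a driver, sizing the queue by
physical disk count (the function names and per-disk figure are
illustrative only):

	/* hypothetical RAID driver: size the queue by physical disk count */
	static void my_raid_setup_queue(request_queue_t *q, int nr_disks)
	{
		/* 128 requests per physical disk, clamped to 1024 above */
		blk_init_queue(q, my_raid_request_fn, 128 * nr_disks);
	}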

:-)

Jeff


2002-03-02 00:51:35

by Andrew Morton

Subject: Re: queue_nr_requests needs to be selective

"Jeff V. Merkey" wrote:
>
> ...
> > OK. So would it suffice to make queue_nr_requests an argument to
> > a new blk_init_queue()?
> >
> > - blk_init_queue(q, sci_request);
> > + blk_init_queue_ng(q, sci_request, 1024);
> >
>
> Andrew,
>
> Yep. This will do it.

OK. The requirement seems eminently sensible to me. Could
you please verify the suitability of this patch? And if Jens
is OK with it, I'll stick it in the pile marked "Brazil".


--- 2.4.19-pre2/drivers/block/ll_rw_blk.c~blk-queue Fri Mar 1 15:30:13 2002
+++ 2.4.19-pre2-akpm/drivers/block/ll_rw_blk.c Fri Mar 1 15:57:07 2002
@@ -117,16 +117,6 @@ int * max_readahead[MAX_BLKDEV];
*/
int * max_sectors[MAX_BLKDEV];

-/*
- * The total number of requests in each queue.
- */
-static int queue_nr_requests;
-
-/*
- * The threshold around which we make wakeup decisions
- */
-static int batch_requests;
-
static inline int get_max_sectors(kdev_t dev)
{
if (!max_sectors[MAJOR(dev)])
@@ -180,7 +170,7 @@ static int __blk_cleanup_queue(struct re
**/
void blk_cleanup_queue(request_queue_t * q)
{
- int count = queue_nr_requests;
+ int count = q->nr_requests;

count -= __blk_cleanup_queue(&q->rq[READ]);
count -= __blk_cleanup_queue(&q->rq[WRITE]);
@@ -330,9 +320,11 @@ void generic_unplug_device(void *data)
spin_unlock_irqrestore(&io_request_lock, flags);
}

-static void blk_init_free_list(request_queue_t *q)
+static void blk_init_free_list(request_queue_t *q, int nr_requests)
{
struct request *rq;
+ struct sysinfo si;
+ int megs; /* Total memory, in megabytes */
int i;

INIT_LIST_HEAD(&q->rq[READ].free);
@@ -340,10 +332,17 @@ static void blk_init_free_list(request_q
q->rq[READ].count = 0;
q->rq[WRITE].count = 0;

+ si_meminfo(&si);
+ megs = si.totalram >> (20 - PAGE_SHIFT);
+ q->nr_requests = nr_requests;
+ if (megs < 32)
+ q->nr_requests /= 2;
+ q->batch_requests = q->nr_requests / 4;
+
/*
* Divide requests in half between read and write
*/
- for (i = 0; i < queue_nr_requests; i++) {
+ for (i = 0; i < nr_requests; i++) {
rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
if (rq == NULL) {
/* We'll get a `leaked requests' message from blk_cleanup_queue */
@@ -364,15 +363,17 @@ static void blk_init_free_list(request_q
static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);

/**
- * blk_init_queue - prepare a request queue for use with a block device
+ * __blk_init_queue - prepare a request queue for use with a block device
* @q: The &request_queue_t to be initialised
* @rfn: The function to be called to process requests that have been
* placed on the queue.
+ * @nr_requests: the number of request structures on each of the device's
+ * read and write request lists.
*
* Description:
* If a block device wishes to use the standard request handling procedures,
* which sorts requests and coalesces adjacent requests, then it must
- * call blk_init_queue(). The function @rfn will be called when there
+ * call __blk_init_queue(). The function @rfn will be called when there
* are requests on the queue that need to be processed. If the device
* supports plugging, then @rfn may not be called immediately when requests
* are available on the queue, but may be called at some time later instead.
@@ -392,15 +393,21 @@ static int __make_request(request_queue_
* whenever the given queue is unplugged. This behaviour can be changed with
* blk_queue_headactive().
*
+ * Your selected value of nr_requests will be scaled down for small
+ * machines. See blk_init_free_list().
+ *
+ * If nr_requests is less than 16, things will probably fail mysteriously.
+ *
* Note:
- * blk_init_queue() must be paired with a blk_cleanup_queue() call
+ * __blk_init_queue() must be paired with a blk_cleanup_queue() call
* when the block device is deactivated (such as at module unload).
**/
-void blk_init_queue(request_queue_t * q, request_fn_proc * rfn)
+void __blk_init_queue(request_queue_t * q,
+ request_fn_proc * rfn, int nr_requests)
{
INIT_LIST_HEAD(&q->queue_head);
elevator_init(&q->elevator, ELEVATOR_LINUS);
- blk_init_free_list(q);
+ blk_init_free_list(q, nr_requests);
q->request_fn = rfn;
q->back_merge_fn = ll_back_merge_fn;
q->front_merge_fn = ll_front_merge_fn;
@@ -610,7 +617,7 @@ void blkdev_release_request(struct reque
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests &&
+ if (++q->rq[rw].count >= q->batch_requests &&
waitqueue_active(&q->wait_for_requests[rw]))
wake_up(&q->wait_for_requests[rw]);
}
@@ -802,7 +809,7 @@ get_rq:
* See description above __get_request_wait()
*/
if (rw_ahead) {
- if (q->rq[rw].count < batch_requests) {
+ if (q->rq[rw].count < q->batch_requests) {
spin_unlock_irq(&io_request_lock);
goto end_io;
}
@@ -1149,12 +1156,9 @@ void end_that_request_last(struct reques
blkdev_release_request(req);
}

-#define MB(kb) ((kb) << 10)
-
int __init blk_dev_init(void)
{
struct blk_dev_struct *dev;
- int total_ram;

request_cachep = kmem_cache_create("blkdev_requests",
sizeof(struct request),
@@ -1170,22 +1174,6 @@ int __init blk_dev_init(void)
memset(max_readahead, 0, sizeof(max_readahead));
memset(max_sectors, 0, sizeof(max_sectors));

- total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
-
- /*
- * Free request slots per queue.
- * (Half for reads, half for writes)
- */
- queue_nr_requests = 64;
- if (total_ram > MB(32))
- queue_nr_requests = 128;
-
- /*
- * Batch frees according to queue length
- */
- batch_requests = queue_nr_requests/4;
- printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
-
#ifdef CONFIG_AMIGA_Z2RAM
z2_init();
#endif
@@ -1296,7 +1284,7 @@ int __init blk_dev_init(void)
EXPORT_SYMBOL(io_request_lock);
EXPORT_SYMBOL(end_that_request_first);
EXPORT_SYMBOL(end_that_request_last);
-EXPORT_SYMBOL(blk_init_queue);
+EXPORT_SYMBOL(__blk_init_queue);
EXPORT_SYMBOL(blk_get_queue);
EXPORT_SYMBOL(blk_cleanup_queue);
EXPORT_SYMBOL(blk_queue_headactive);
--- 2.4.19-pre2/include/linux/blkdev.h~blk-queue Fri Mar 1 15:31:09 2002
+++ 2.4.19-pre2-akpm/include/linux/blkdev.h Fri Mar 1 15:41:50 2002
@@ -79,6 +79,16 @@ struct request_queue
struct request_list rq[2];

/*
+ * The total number of requests on each queue
+ */
+ int nr_requests;
+
+ /*
+ * Batching threshold for sleep/wakeup decisions
+ */
+ int batch_requests;
+
+ /*
* Together with queue_head for cacheline sharing
*/
struct list_head queue_head;
@@ -157,7 +167,8 @@ extern void blkdev_release_request(struc
/*
* Access functions for manipulating queue properties
*/
-extern void blk_init_queue(request_queue_t *, request_fn_proc *);
+#define blk_init_queue(q, fn) __blk_init_queue(q, fn, 128)
+extern void __blk_init_queue(request_queue_t *, request_fn_proc *, int);
extern void blk_cleanup_queue(request_queue_t *);
extern void blk_queue_headactive(request_queue_t *, int);
extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
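
As a usage sketch under this patch: existing drivers keep the old
behaviour through the blk_init_queue() macro (which passes 128), while
a driver that knows its own topology could call the new entry point
directly (driver function names hypothetical):

	/* hypothetical: a driver fronting 8 disks asks for a deep freelist */
	static void my_raid_init(request_queue_t *q)
	{
		__blk_init_queue(q, my_raid_request_fn, 1024);
	}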



2002-03-02 02:02:51

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective



Andrew,

Patch tested and stable. I had to manually modify the value
in /usr/src/linux/include/linux/blkdev.h from 128 to 1024,
since I noticed the 3Ware driver is going through the scsi.c
call to the blk_init_queue() macro in this include file. I am
assuming that at some point, Adam and the other folks will
call this API directly.

Performance increased 2-3% to 319 MB/S on dual 33/66 buses,
and utilization dropped to 27% +/-. All of this activity is
DMA based and copyless between reading/writing nodes, and
I am running this UP and getting these numbers. I ran SMP
as well and it also looks good.

Someone needs to look at just how the drivers will call
__blk_init_queue() directly to set up custom-depth queues
from scsi.c, since this call originates there for
those IDE RAID drivers that fake out the system into thinking
they are SCSI.

:-)

Thanks for the help.

Jeff


On Fri, Mar 01, 2002 at 04:49:47PM -0800, Andrew Morton wrote:
> "Jeff V. Merkey" wrote:
> >
> > ...
> > > OK. So would it suffice to make queue_nr_requests an argument to
> > > a new blk_init_queue()?
> > >
> > > - blk_init_queue(q, sci_request);
> > > + blk_init_queue_ng(q, sci_request, 1024);
> > >
> >
> > Andrew,
> >
> > Yep. This will do it.
>
> OK. The requirement seems eminently sensible to me. Could
> you please verify the suitability of this patch? And if Jens
> is OK with it, I'll stick it in the pile marked "Brazil".
>
>
> --- 2.4.19-pre2/drivers/block/ll_rw_blk.c~blk-queue Fri Mar 1 15:30:13 2002
> +++ 2.4.19-pre2-akpm/drivers/block/ll_rw_blk.c Fri Mar 1 15:57:07 2002
> @@ -117,16 +117,6 @@ int * max_readahead[MAX_BLKDEV];
> */
> int * max_sectors[MAX_BLKDEV];
>
> -/*
> - * The total number of requests in each queue.
> - */
> -static int queue_nr_requests;
> -
> -/*
> - * The threshold around which we make wakeup decisions
> - */
> -static int batch_requests;
> -
> static inline int get_max_sectors(kdev_t dev)
> {
> if (!max_sectors[MAJOR(dev)])
> @@ -180,7 +170,7 @@ static int __blk_cleanup_queue(struct re
> **/
> void blk_cleanup_queue(request_queue_t * q)
> {
> - int count = queue_nr_requests;
> + int count = q->nr_requests;
>
> count -= __blk_cleanup_queue(&q->rq[READ]);
> count -= __blk_cleanup_queue(&q->rq[WRITE]);
> @@ -330,9 +320,11 @@ void generic_unplug_device(void *data)
> spin_unlock_irqrestore(&io_request_lock, flags);
> }
>
> -static void blk_init_free_list(request_queue_t *q)
> +static void blk_init_free_list(request_queue_t *q, int nr_requests)
> {
> struct request *rq;
> + struct sysinfo si;
> + int megs; /* Total memory, in megabytes */
> int i;
>
> INIT_LIST_HEAD(&q->rq[READ].free);
> @@ -340,10 +332,17 @@ static void blk_init_free_list(request_q
> q->rq[READ].count = 0;
> q->rq[WRITE].count = 0;
>
> + si_meminfo(&si);
> + megs = si.totalram >> (20 - PAGE_SHIFT);
> + q->nr_requests = nr_requests;
> + if (megs < 32)
> + q->nr_requests /= 2;
> + q->batch_requests = q->nr_requests / 4;
> +
> /*
> * Divide requests in half between read and write
> */
> - for (i = 0; i < queue_nr_requests; i++) {
> + for (i = 0; i < nr_requests; i++) {
> rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
> if (rq == NULL) {
> /* We'll get a `leaked requests' message from blk_cleanup_queue */
> @@ -364,15 +363,17 @@ static void blk_init_free_list(request_q
> static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
>
> /**
> - * blk_init_queue - prepare a request queue for use with a block device
> + * __blk_init_queue - prepare a request queue for use with a block device
> * @q: The &request_queue_t to be initialised
> * @rfn: The function to be called to process requests that have been
> * placed on the queue.
> + * @nr_requests: the number of request structures on each of the device's
> + * read and write request lists.
> *
> * Description:
> * If a block device wishes to use the standard request handling procedures,
> * which sorts requests and coalesces adjacent requests, then it must
> - * call blk_init_queue(). The function @rfn will be called when there
> + * call __blk_init_queue(). The function @rfn will be called when there
> * are requests on the queue that need to be processed. If the device
> * supports plugging, then @rfn may not be called immediately when requests
> * are available on the queue, but may be called at some time later instead.
> @@ -392,15 +393,21 @@ static int __make_request(request_queue_
> * whenever the given queue is unplugged. This behaviour can be changed with
> * blk_queue_headactive().
> *
> + * Your selected value of nr_requests will be scaled down for small
> + * machines. See blk_init_free_list().
> + *
> + * If nr_requests is less than 16, things will probably fail mysteriously.
> + *
> * Note:
> - * blk_init_queue() must be paired with a blk_cleanup_queue() call
> + * __blk_init_queue() must be paired with a blk_cleanup_queue() call
> * when the block device is deactivated (such as at module unload).
> **/
> -void blk_init_queue(request_queue_t * q, request_fn_proc * rfn)
> +void __blk_init_queue(request_queue_t * q,
> + request_fn_proc * rfn, int nr_requests)
> {
> INIT_LIST_HEAD(&q->queue_head);
> elevator_init(&q->elevator, ELEVATOR_LINUS);
> - blk_init_free_list(q);
> + blk_init_free_list(q, nr_requests);
> q->request_fn = rfn;
> q->back_merge_fn = ll_back_merge_fn;
> q->front_merge_fn = ll_front_merge_fn;
> @@ -610,7 +617,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= batch_requests &&
> + if (++q->rq[rw].count >= q->batch_requests &&
> waitqueue_active(&q->wait_for_requests[rw]))
> wake_up(&q->wait_for_requests[rw]);
> }
> @@ -802,7 +809,7 @@ get_rq:
> * See description above __get_request_wait()
> */
> if (rw_ahead) {
> - if (q->rq[rw].count < batch_requests) {
> + if (q->rq[rw].count < q->batch_requests) {
> spin_unlock_irq(&io_request_lock);
> goto end_io;
> }
> @@ -1149,12 +1156,9 @@ void end_that_request_last(struct reques
> blkdev_release_request(req);
> }
>
> -#define MB(kb) ((kb) << 10)
> -
> int __init blk_dev_init(void)
> {
> struct blk_dev_struct *dev;
> - int total_ram;
>
> request_cachep = kmem_cache_create("blkdev_requests",
> sizeof(struct request),
> @@ -1170,22 +1174,6 @@ int __init blk_dev_init(void)
> memset(max_readahead, 0, sizeof(max_readahead));
> memset(max_sectors, 0, sizeof(max_sectors));
>
> - total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
> -
> - /*
> - * Free request slots per queue.
> - * (Half for reads, half for writes)
> - */
> - queue_nr_requests = 64;
> - if (total_ram > MB(32))
> - queue_nr_requests = 128;
> -
> - /*
> - * Batch frees according to queue length
> - */
> - batch_requests = queue_nr_requests/4;
> - printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
> -
> #ifdef CONFIG_AMIGA_Z2RAM
> z2_init();
> #endif
> @@ -1296,7 +1284,7 @@ int __init blk_dev_init(void)
> EXPORT_SYMBOL(io_request_lock);
> EXPORT_SYMBOL(end_that_request_first);
> EXPORT_SYMBOL(end_that_request_last);
> -EXPORT_SYMBOL(blk_init_queue);
> +EXPORT_SYMBOL(__blk_init_queue);
> EXPORT_SYMBOL(blk_get_queue);
> EXPORT_SYMBOL(blk_cleanup_queue);
> EXPORT_SYMBOL(blk_queue_headactive);
> --- 2.4.19-pre2/include/linux/blkdev.h~blk-queue Fri Mar 1 15:31:09 2002
> +++ 2.4.19-pre2-akpm/include/linux/blkdev.h Fri Mar 1 15:41:50 2002
> @@ -79,6 +79,16 @@ struct request_queue
> struct request_list rq[2];
>
> /*
> + * The total number of requests on each queue
> + */
> + int nr_requests;
> +
> + /*
> + * Batching threshold for sleep/wakeup decisions
> + */
> + int batch_requests;
> +
> + /*
> * Together with queue_head for cacheline sharing
> */
> struct list_head queue_head;
> @@ -157,7 +167,8 @@ extern void blkdev_release_request(struc
> /*
> * Access functions for manipulating queue properties
> */
> -extern void blk_init_queue(request_queue_t *, request_fn_proc *);
> +#define blk_init_queue(q, fn) __blk_init_queue(q, fn, 128)
> +extern void __blk_init_queue(request_queue_t *, request_fn_proc *, int);
> extern void blk_cleanup_queue(request_queue_t *);
> extern void blk_queue_headactive(request_queue_t *, int);
> extern void blk_queue_make_request(request_queue_t *, make_request_fn *);

2002-03-02 02:45:24

by Mike Anderson

Subject: Re: queue_nr_requests needs to be selective

Jeff V. Merkey [[email protected]] wrote:
>
> ..snip..
>
> What is really needed here is to allow queue_nr_requests to be
> configurable on a per adapter/device basis for these high end
> RAID cards like 3Ware since in a RAID 0 configuration, 8 drives
> are in essence a terabyte (1.3 terabytes in our configuration)
> and each adapter is showing up as a 1.3 TB device. 64/128
> requests are simply not enough to get the full spectrum of
> performance attainable with these cards.
>
Not having direct experience with this card, it appears that increasing
the queue_nr_requests number will not allow you to have more I/Os in
flight.

Unless I am reading the driver wrong, you will be limited to
TW_MAX_CMDS_PER_LUN (15). This value is used by scsi_build_commandblocks
to allocate scsi commands for your scsi_device. This driver does not
provide a select_queue_depths function, which would allow the default
template value to be increased.

Could it be that the experiment of increasing this number has
allowed for better merging?

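For reference, such a hook would look roughly like this in 2.4 SCSI (a
minimal sketch; the depth value is illustrative, and this is not the
3ware driver's actual code):

	static void my_select_queue_depths(struct Scsi_Host *host,
					   Scsi_Device *devlist)
	{
		Scsi_Device *device;

		/* raise the per-device command allocation above the
		   template default for devices on this host */
		for (device = devlist; device; device = device->next) {
			if (device->host == host)
				device->queue_depth = 64;	/* illustrative */
		}
	}
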
-Mike
--
Michael Anderson
[email protected]

2002-03-02 03:51:59

by Andrew Morton

Subject: Re: queue_nr_requests needs to be selective

"Jeff V. Merkey" wrote:
>
> Andrew,
>
> Patch tested and stable. I had to manually modify the value
> in /usr/src/linux/include/linux/blkdev.h from 128 to 1024,
> since I noticed the 3Ware driver is going through the scsi.c
> call to the blk_init_queue() macro in this include file. I am
> assuming that at some point, Adam and the other folks will
> call this API directly.
>
> Performance increased 2-3% to 319 MB/S on dual 33/66 buses,
> and utilization dropped to 27% +/-. All of this activity is
> DMA based and copyless between reading/writing nodes, and
> I am running this UP and getting these numbers. I ran SMP
> as well and it also looks good.
>
> Someone needs to look at just how the drivers will call
> __blk_init_queue() directly to set up custom-depth queues
> from scsi.c, since this call originates there for
> those IDE RAID drivers that fake out the system into thinking
> they are SCSI.
>

So it would be more straightforward to just allow the queue
to be grown later on?



--- 2.4.19-pre2/drivers/block/ll_rw_blk.c~blk-queue Fri Mar 1 19:40:12 2002
+++ 2.4.19-pre2-akpm/drivers/block/ll_rw_blk.c Fri Mar 1 19:48:33 2002
@@ -117,16 +117,6 @@ int * max_readahead[MAX_BLKDEV];
*/
int * max_sectors[MAX_BLKDEV];

-/*
- * The total number of requests in each queue.
- */
-static int queue_nr_requests;
-
-/*
- * The threshold around which we make wakeup decisions
- */
-static int batch_requests;
-
static inline int get_max_sectors(kdev_t dev)
{
if (!max_sectors[MAJOR(dev)])
@@ -180,7 +170,7 @@ static int __blk_cleanup_queue(struct re
**/
void blk_cleanup_queue(request_queue_t * q)
{
- int count = queue_nr_requests;
+ int count = q->nr_requests;

count -= __blk_cleanup_queue(&q->rq[READ]);
count -= __blk_cleanup_queue(&q->rq[WRITE]);
@@ -330,31 +320,62 @@ void generic_unplug_device(void *data)
spin_unlock_irqrestore(&io_request_lock, flags);
}

+/** blk_grow_request_list
+ * @q: The &request_queue_t
+ * @nr_requests: how many requests are desired
+ *
+ * More free requests are added to the queue's free lists, bringing
+ * the total number of requests to @nr_requests.
+ *
+ * The requests are added equally to the request queue's read
+ * and write freelists.
+ *
+ * This function can sleep.
+ *
+ * Returns the (new) number of requests which the queue has available.
+ */
+int blk_grow_request_list(request_queue_t *q, int nr_requests)
+{
+ spin_lock_irq(&io_request_lock);
+ while (q->nr_requests < nr_requests) {
+ struct request *rq;
+ int rw;
+
+ spin_unlock_irq(&io_request_lock);
+ rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
+ spin_lock_irq(&io_request_lock);
+ if (rq == NULL)
+ break;
+ memset(rq, 0, sizeof(*rq));
+ rq->rq_status = RQ_INACTIVE;
+ rw = q->nr_requests & 1;
+ list_add(&rq->queue, &q->rq[rw].free);
+ q->rq[rw].count++;
+ q->nr_requests++;
+ }
+ q->batch_requests = q->nr_requests / 4;
+ spin_unlock_irq(&io_request_lock);
+ return q->nr_requests;
+}
+
static void blk_init_free_list(request_queue_t *q)
{
- struct request *rq;
- int i;
+ struct sysinfo si;
+ int megs; /* Total memory, in megabytes */
+ int nr_requests;

INIT_LIST_HEAD(&q->rq[READ].free);
INIT_LIST_HEAD(&q->rq[WRITE].free);
q->rq[READ].count = 0;
q->rq[WRITE].count = 0;
+ q->nr_requests = 0;

- /*
- * Divide requests in half between read and write
- */
- for (i = 0; i < queue_nr_requests; i++) {
- rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
- if (rq == NULL) {
- /* We'll get a `leaked requests' message from blk_cleanup_queue */
- printk(KERN_EMERG "blk_init_free_list: error allocating requests\n");
- break;
- }
- memset(rq, 0, sizeof(struct request));
- rq->rq_status = RQ_INACTIVE;
- list_add(&rq->queue, &q->rq[i&1].free);
- q->rq[i&1].count++;
- }
+ si_meminfo(&si);
+ megs = si.totalram >> (20 - PAGE_SHIFT);
+ nr_requests = 128;
+ if (megs < 32)
+ nr_requests /= 2;
+ blk_grow_request_list(q, nr_requests);

init_waitqueue_head(&q->wait_for_requests[0]);
init_waitqueue_head(&q->wait_for_requests[1]);
@@ -610,7 +631,7 @@ void blkdev_release_request(struct reque
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests &&
+ if (++q->rq[rw].count >= q->batch_requests &&
waitqueue_active(&q->wait_for_requests[rw]))
wake_up(&q->wait_for_requests[rw]);
}
@@ -802,7 +823,7 @@ get_rq:
* See description above __get_request_wait()
*/
if (rw_ahead) {
- if (q->rq[rw].count < batch_requests) {
+ if (q->rq[rw].count < q->batch_requests) {
spin_unlock_irq(&io_request_lock);
goto end_io;
}
@@ -1149,12 +1170,9 @@ void end_that_request_last(struct reques
blkdev_release_request(req);
}

-#define MB(kb) ((kb) << 10)
-
int __init blk_dev_init(void)
{
struct blk_dev_struct *dev;
- int total_ram;

request_cachep = kmem_cache_create("blkdev_requests",
sizeof(struct request),
@@ -1170,22 +1188,6 @@ int __init blk_dev_init(void)
memset(max_readahead, 0, sizeof(max_readahead));
memset(max_sectors, 0, sizeof(max_sectors));

- total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
-
- /*
- * Free request slots per queue.
- * (Half for reads, half for writes)
- */
- queue_nr_requests = 64;
- if (total_ram > MB(32))
- queue_nr_requests = 128;
-
- /*
- * Batch frees according to queue length
- */
- batch_requests = queue_nr_requests/4;
- printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
-
#ifdef CONFIG_AMIGA_Z2RAM
z2_init();
#endif
@@ -1296,6 +1298,7 @@ int __init blk_dev_init(void)
EXPORT_SYMBOL(io_request_lock);
EXPORT_SYMBOL(end_that_request_first);
EXPORT_SYMBOL(end_that_request_last);
+EXPORT_SYMBOL(blk_grow_request_list);
EXPORT_SYMBOL(blk_init_queue);
EXPORT_SYMBOL(blk_get_queue);
EXPORT_SYMBOL(blk_cleanup_queue);
--- 2.4.19-pre2/include/linux/blkdev.h~blk-queue Fri Mar 1 19:40:12 2002
+++ 2.4.19-pre2-akpm/include/linux/blkdev.h Fri Mar 1 19:40:12 2002
@@ -79,6 +79,16 @@ struct request_queue
struct request_list rq[2];

/*
+ * The total number of requests on each queue
+ */
+ int nr_requests;
+
+ /*
+ * Batching threshold for sleep/wakeup decisions
+ */
+ int batch_requests;
+
+ /*
* Together with queue_head for cacheline sharing
*/
struct list_head queue_head;
@@ -157,6 +167,7 @@ extern void blkdev_release_request(struc
/*
* Access functions for manipulating queue properties
*/
+extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
extern void blk_init_queue(request_queue_t *, request_fn_proc *);
extern void blk_cleanup_queue(request_queue_t *);
extern void blk_queue_headactive(request_queue_t *, int);
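
As a usage sketch under this version, a hypothetical driver would not
need a new init path at all; it grows the queue after the fact, once it
knows how many physical disks sit behind the logical device it exported
(names and the per-disk figure are illustrative):

	/* hypothetical call site: nr_disks discovered by the driver */
	static void my_raid_grow_queue(request_queue_t *q, int nr_disks)
	{
		int granted;

		granted = blk_grow_request_list(q, 128 * nr_disks);
		printk(KERN_INFO "myraid: queue grown to %d requests\n",
			granted);
	}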

2002-03-02 04:20:14

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective



Andrew,

I will merge and test this patch tonight. Yes, this makes much more
sense.

:-)

Jeff

On Fri, Mar 01, 2002 at 07:50:08PM -0800, Andrew Morton wrote:
> "Jeff V. Merkey" wrote:
> >
> > Andrew,
> >
> > Patch tested and stable. I had to manually modify the value
> > in /usr/src/linux/include/linux/blkdev.h from 128 to 1024,
> > since I noticed the 3Ware driver is going through the scsi.c
> > call to the blk_init_queue() macro in this include file. I am
> > assuming that at some point, Adam and the other folks will
> > call this API directly.
> >
> > Performance increased 2-3% to 319 MB/S on dual 33/66 buses,
> > and utilization dropped to 27% +/-. All of this activity is
> > DMA based and copyless between reading/writing nodes, and
> > I am running this UP and getting these numbers. I ran SMP
> > as well and it also looks good.
> >
> > Someone needs to look at just how the drivers will call
> > __blk_init_queue() directly to set up custom-depth queues
> > from scsi.c, since this call originates there for
> > those IDE RAID drivers that fake out the system into thinking
> > they are SCSI.
> >
>
> So it would be more straightforward to just allow the queue
> to be grown later on?
>
>
>
> --- 2.4.19-pre2/drivers/block/ll_rw_blk.c~blk-queue Fri Mar 1 19:40:12 2002
> +++ 2.4.19-pre2-akpm/drivers/block/ll_rw_blk.c Fri Mar 1 19:48:33 2002
> @@ -117,16 +117,6 @@ int * max_readahead[MAX_BLKDEV];
> */
> int * max_sectors[MAX_BLKDEV];
>
> -/*
> - * The total number of requests in each queue.
> - */
> -static int queue_nr_requests;
> -
> -/*
> - * The threshold around which we make wakeup decisions
> - */
> -static int batch_requests;
> -
> static inline int get_max_sectors(kdev_t dev)
> {
> if (!max_sectors[MAJOR(dev)])
> @@ -180,7 +170,7 @@ static int __blk_cleanup_queue(struct re
> **/
> void blk_cleanup_queue(request_queue_t * q)
> {
> - int count = queue_nr_requests;
> + int count = q->nr_requests;
>
> count -= __blk_cleanup_queue(&q->rq[READ]);
> count -= __blk_cleanup_queue(&q->rq[WRITE]);
> @@ -330,31 +320,62 @@ void generic_unplug_device(void *data)
> spin_unlock_irqrestore(&io_request_lock, flags);
> }
>
> +/** blk_grow_request_list
> + * @q: The &request_queue_t
> + * @nr_requests: how many requests are desired
> + *
> + * More free requests are added to the queue's free lists, bringing
> + * the total number of requests to @nr_requests.
> + *
> + * The requests are added equally to the request queue's read
> + * and write freelists.
> + *
> + * This function can sleep.
> + *
> + * Returns the (new) number of requests which the queue has available.
> + */
> +int blk_grow_request_list(request_queue_t *q, int nr_requests)
> +{
> + spin_lock_irq(&io_request_lock);
> + while (q->nr_requests < nr_requests) {
> + struct request *rq;
> + int rw;
> +
> + spin_unlock_irq(&io_request_lock);
> + rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
> + spin_lock_irq(&io_request_lock);
> + if (rq == NULL)
> + break;
> + memset(rq, 0, sizeof(*rq));
> + rq->rq_status = RQ_INACTIVE;
> + rw = q->nr_requests & 1;
> + list_add(&rq->queue, &q->rq[rw].free);
> + q->rq[rw].count++;
> + q->nr_requests++;
> + }
> + q->batch_requests = q->nr_requests / 4;
> + spin_unlock_irq(&io_request_lock);
> + return q->nr_requests;
> +}
> +
> static void blk_init_free_list(request_queue_t *q)
> {
> - struct request *rq;
> - int i;
> + struct sysinfo si;
> + int megs; /* Total memory, in megabytes */
> + int nr_requests;
>
> INIT_LIST_HEAD(&q->rq[READ].free);
> INIT_LIST_HEAD(&q->rq[WRITE].free);
> q->rq[READ].count = 0;
> q->rq[WRITE].count = 0;
> + q->nr_requests = 0;
>
> - /*
> - * Divide requests in half between read and write
> - */
> - for (i = 0; i < queue_nr_requests; i++) {
> - rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
> - if (rq == NULL) {
> - /* We'll get a `leaked requests' message from blk_cleanup_queue */
> - printk(KERN_EMERG "blk_init_free_list: error allocating requests\n");
> - break;
> - }
> - memset(rq, 0, sizeof(struct request));
> - rq->rq_status = RQ_INACTIVE;
> - list_add(&rq->queue, &q->rq[i&1].free);
> - q->rq[i&1].count++;
> - }
> + si_meminfo(&si);
> + megs = si.totalram >> (20 - PAGE_SHIFT);
> + nr_requests = 128;
> + if (megs < 32)
> + nr_requests /= 2;
> + blk_grow_request_list(q, nr_requests);
>
> init_waitqueue_head(&q->wait_for_requests[0]);
> init_waitqueue_head(&q->wait_for_requests[1]);
> @@ -610,7 +631,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= batch_requests &&
> + if (++q->rq[rw].count >= q->batch_requests &&
> waitqueue_active(&q->wait_for_requests[rw]))
> wake_up(&q->wait_for_requests[rw]);
> }
> @@ -802,7 +823,7 @@ get_rq:
> * See description above __get_request_wait()
> */
> if (rw_ahead) {
> - if (q->rq[rw].count < batch_requests) {
> + if (q->rq[rw].count < q->batch_requests) {
> spin_unlock_irq(&io_request_lock);
> goto end_io;
> }
> @@ -1149,12 +1170,9 @@ void end_that_request_last(struct reques
> blkdev_release_request(req);
> }
>
> -#define MB(kb) ((kb) << 10)
> -
> int __init blk_dev_init(void)
> {
> struct blk_dev_struct *dev;
> - int total_ram;
>
> request_cachep = kmem_cache_create("blkdev_requests",
> sizeof(struct request),
> @@ -1170,22 +1188,6 @@ int __init blk_dev_init(void)
> memset(max_readahead, 0, sizeof(max_readahead));
> memset(max_sectors, 0, sizeof(max_sectors));
>
> - total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
> -
> - /*
> - * Free request slots per queue.
> - * (Half for reads, half for writes)
> - */
> - queue_nr_requests = 64;
> - if (total_ram > MB(32))
> - queue_nr_requests = 128;
> -
> - /*
> - * Batch frees according to queue length
> - */
> - batch_requests = queue_nr_requests/4;
> - printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
> -
> #ifdef CONFIG_AMIGA_Z2RAM
> z2_init();
> #endif
> @@ -1296,6 +1298,7 @@ int __init blk_dev_init(void)
> EXPORT_SYMBOL(io_request_lock);
> EXPORT_SYMBOL(end_that_request_first);
> EXPORT_SYMBOL(end_that_request_last);
> +EXPORT_SYMBOL(blk_grow_request_list);
> EXPORT_SYMBOL(blk_init_queue);
> EXPORT_SYMBOL(blk_get_queue);
> EXPORT_SYMBOL(blk_cleanup_queue);
> --- 2.4.19-pre2/include/linux/blkdev.h~blk-queue Fri Mar 1 19:40:12 2002
> +++ 2.4.19-pre2-akpm/include/linux/blkdev.h Fri Mar 1 19:40:12 2002
> @@ -79,6 +79,16 @@ struct request_queue
> struct request_list rq[2];
>
> /*
> + * The total number of requests on each queue
> + */
> + int nr_requests;
> +
> + /*
> + * Batching threshold for sleep/wakeup decisions
> + */
> + int batch_requests;
> +
> + /*
> * Together with queue_head for cacheline sharing
> */
> struct list_head queue_head;
> @@ -157,6 +167,7 @@ extern void blkdev_release_request(struc
> /*
> * Access functions for manipulating queue properties
> */
> +extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
> extern void blk_init_queue(request_queue_t *, request_fn_proc *);
> extern void blk_cleanup_queue(request_queue_t *);
> extern void blk_queue_headactive(request_queue_t *, int);

2002-03-02 04:25:24

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective



We are going to sleep a lot in __get_request_wait(). This
means the write queue has no free request blocks. We are mostly writing
to the adapter in this test case, and the data we are writing
is already in order when it's posted.

We are also posting via submit_bh(), so you should trace
that path. I am seeing 22,000+ buffer heads of 4K each posted
concurrently on each 3Ware card with this application
on the patch for 2.4.19-pre2. I will post the actual data
for you. Stand by.

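For tracing purposes, the posting path I describe is essentially this
2.4-style sequence (a minimal sketch; the completion handler is
hypothetical):

	/* hypothetical completion callback for a 4K buffer head */
	static void my_write_done(struct buffer_head *bh, int uptodate)
	{
		mark_buffer_uptodate(bh, uptodate);
		unlock_buffer(bh);
	}

	static void post_write(struct buffer_head *bh)
	{
		lock_buffer(bh);
		bh->b_end_io = my_write_done;
		/* ends up in __make_request(), which consumes a request slot */
		submit_bh(WRITE, bh);
	}
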
These 3Ware cards are incredible.

Jeff


On Fri, Mar 01, 2002 at 04:51:04PM -0800, Mike Anderson wrote:
> Jeff V. Merkey [[email protected]] wrote:
> >
> > ..snip..
> >
> > What is really needed here is to allow queue_nr_requests to be
> > configurable on a per adapter/device basis for these high end
> > RAID cards like 3Ware since in a RAID 0 configuration, 8 drives
> > are in essence a terabyte (1.3 terabytes in our configuration)
> > and each adapter is showing up as a 1.3 TB device. 64/128
> > requests are simply not enough to get the full spectrum of
> > performance attainable with these cards.
> >
> Not having direct experience with this card, it appears that increasing
> the queue_nr_requests number will not allow you to have more I/Os in
> flight.
>
> Unless I am reading the driver wrong, you will be limited to
> TW_MAX_CMDS_PER_LUN (15). This value is used by scsi_build_commandblocks
> to allocate scsi commands for your scsi_device. This driver does not
> provide a select_queue_depths function, which would allow the default
> template value to be increased.
>
> Could it be that the experiment of increasing this number has
> allowed for better merging?
>
> -Mike
> --
> Michael Anderson
> [email protected]
>

2002-03-02 05:45:38

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective



Mike,

Here are some numbers from the running system. This system is
running at 120 MB/S on a single 3Ware adapter. Stats attached. You
will note that the max commands hitting the adapter are way above
15. I can only presume this is due to caching behavior on the card.
I do have these cards enabled with caching. I have had these numbers
as high as 319 MB/S with multiple cards on separate buses. The system
this test is running on has 3 PCI buses, 2 x 33 MHz and 1 x 66 MHz,
with a Serverworks HE chipset.

We are averaging complete saturation of the inbound command queue
on the 3Ware adapter and are posting 64K writes (128 sectors * 512
bytes) with an average of 3 scatter-gather entries per command. I
have no idea why the 15 queue depth limit is being exceeded here,
possibly due to the fact the adapter is caching, but the stats report
up to 255 requests being posted per driver pass.

cat /proc/scsi/3w-xxxx/3

scsi3: 3ware Storage Controller
Driver version: 1.02.00.016
Current commands posted: 205
Max commands posted: 255
Current pending commands: 0
Max pending commands: 0
Last sgl length: 0
Max sgl length: 3
Last sector count: 128
Max sector count: 256
Resets: 0
Aborts: 0
AEN's: 0


cat /proc/scsi/3w-xxxx/3

scsi3: 3ware Storage Controller
Driver version: 1.02.00.016
Current commands posted: 7
Max commands posted: 255
Current pending commands: 0
Max pending commands: 0
Last sgl length: 0
Max sgl length: 3
Last sector count: 128
Max sector count: 256
Resets: 0
Aborts: 0
AEN's: 0


This test is posting a max of 54,000 4K buffer heads to a single 3Ware
adapter. See below:


trxinfo -a

ioctl TRXDRV_QUERY_AIO ret 0
TRXDRV-AIO STATS
aio_submitted-3 aio_completed-25601331 aio_error-0
disk_aio_submitted-25601539 disk_aio_completed-25601331
sync_active [ x x x x x x x x ]
async_active [ x x x x x x x x ]
cb_active-0 aio_sequence-0
bh_count-65536 bh_inuse-32208 bh_max_inuse-54593 bh_waiters-0
hash_hits-0 hash_misses-0 hash_fill-3 hash_total-3
probe_avg-0 probe_max-1
total_read_req-0 total_write_req-25601536 total_fill_req-0
total_complete-25739847
req_sec-0 seconds-12925

trxinfo -a

ioctl TRXDRV_QUERY_AIO ret 0
TRXDRV-AIO STATS
aio_submitted-3 aio_completed-25605732 aio_error-0
disk_aio_submitted-25605891 disk_aio_completed-25605732
sync_active [ x x x x x x x x ]
async_active [ x x x x x x x x ]
cb_active-0 aio_sequence-0
bh_count-65536 bh_inuse-31440 bh_max_inuse-54593 bh_waiters-0
hash_hits-0 hash_misses-0 hash_fill-3 hash_total-3
probe_avg-0 probe_max-1
total_read_req-0 total_write_req-25605888 total_fill_req-0
total_complete-25744271
req_sec-0 seconds-12927

This test is posting a max of 54,000 4K buffer heads to a single 3Ware
adapter.


Jeff




On Fri, Mar 01, 2002 at 09:39:08PM -0700, Jeff V. Merkey wrote:
>
>
> We are going to sleep a lot in __get_request_wait(). This
> means the write queue has no free request blocks. We are mostly writing
> to the adapter in this test case, and the data we are writing
> is already in order when it's posted.
>
> We are also posting via submit_bh(), so you should trace
> that path. I am seeing 22,000+ buffer heads of 4K each posted
> concurrently on each 3Ware card with this application
> on the patch for 2.4.19-pre2. I will post the actual data
> for you. Stand by.
>
> These 3Ware cards are incredible.
>
> Jeff
>
>
> On Fri, Mar 01, 2002 at 04:51:04PM -0800, Mike Anderson wrote:
> > Jeff V. Merkey [[email protected]] wrote:
> > >
> > > ..snip..
> > >
> > > What is really needed here is to allow queue_nr_requests to be
> > > configurable on a per adapter/device basis for these high end
> > > RAID cards like 3Ware since in a RAID 0 configuration, 8 drives
> > > are in essence a terabyte (1.3 terabytes in our configuration)
> > > and each adapter is showing up as a 1.3 TB device. 64/128
> > > requests are simply not enough to get the full spectrum of
> > > performance attainable with these cards.
> > >
> > Not having direct experience with this card, it appears that increasing
> > the queue_nr_requests number will not allow you to have more I/Os in
> > flight.
> >
> > Unless I am reading the driver wrong, you will be limited to
> > TW_MAX_CMDS_PER_LUN (15). This value is used by scsi_build_commandblocks
> > to allocate scsi commands for your scsi_device. This driver does not
> > provide a select_queue_depths function, which would allow the default
> > template value to be increased.
> >
> > Could it be that the experiment of increasing this number has
> > allowed for better merging?
> >
> > -Mike
> > --
> > Michael Anderson
> > [email protected]
> >

2002-03-02 05:47:58

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective

On Fri, Mar 01, 2002 at 10:59:18PM -0700, Jeff V. Merkey wrote:
>
>
> Mike,
>
> Here are some numbers from the running system. This system is
> running at 120 MB/S on a single 3Ware adapter. Stats attached. You
> will note that the max commands hitting the adapter are way above
> 15. I can only presume this is due to caching behavior on the card.
> I do have these cards enabled with caching. I have had these numbers
> as high as 319 MB/S with multiple cards on separate buses. The system
> this test is running on has 3 PCI buses, 2 x 33 MHz and 1 x 66 MHz,
> with a Serverworks HE chipset.
>
> We are averaging complete saturation of the inbound command queue
> on the 3Ware adapter and are posting 64K writes (128 sectors * 512
> bytes) with an average of 3 scatter-gather entries per command. I
> have no idea why the 15 queue depth limit is being exceeded here,
> possibly due to the fact the adapter is caching, but the stats report
> up to 255 requests being posted per driver pass.
>
> cat /proc/scsi/3w-xxxx/3
>
> scsi3: 3ware Storage Controller
> Driver version: 1.02.00.016
> Current commands posted: 205


This is why 64/64 request blocks in the read/write queues were not enough:

===========>

> Max commands posted: 255

Jeff


> Current pending commands: 0
> Max pending commands: 0
> Last sgl length: 0
> Max sgl length: 3
> Last sector count: 128
> Max sector count: 256
> Resets: 0
> Aborts: 0
> AEN's: 0
>
>
> cat /proc/scsi/3w-xxxx/3
>
> scsi3: 3ware Storage Controller
> Driver version: 1.02.00.016
> Current commands posted: 7
> Max commands posted: 255
> Current pending commands: 0
> Max pending commands: 0
> Last sgl length: 0
> Max sgl length: 3
> Last sector count: 128
> Max sector count: 256
> Resets: 0
> Aborts: 0
> AEN's: 0
>
>
> This test is posting a max of 54,000 4K buffer heads to a single 3Ware
> adapter. See below:
>
>
> trxinfo -a
>
> ioctl TRXDRV_QUERY_AIO ret 0
> TRXDRV-AIO STATS
> aio_submitted-3 aio_completed-25601331 aio_error-0
> disk_aio_submitted-25601539 disk_aio_completed-25601331
> sync_active [ x x x x x x x x ]
> async_active [ x x x x x x x x ]
> cb_active-0 aio_sequence-0
> bh_count-65536 bh_inuse-32208 bh_max_inuse-54593 bh_waiters-0
> hash_hits-0 hash_misses-0 hash_fill-3 hash_total-3
> probe_avg-0 probe_max-1
> total_read_req-0 total_write_req-25601536 total_fill_req-0
> total_complete-25739847
> req_sec-0 seconds-12925
>
> trxinfo -a
>
> ioctl TRXDRV_QUERY_AIO ret 0
> TRXDRV-AIO STATS
> aio_submitted-3 aio_completed-25605732 aio_error-0
> disk_aio_submitted-25605891 disk_aio_completed-25605732
> sync_active [ x x x x x x x x ]
> async_active [ x x x x x x x x ]
> cb_active-0 aio_sequence-0
> bh_count-65536 bh_inuse-31440 bh_max_inuse-54593 bh_waiters-0
> hash_hits-0 hash_misses-0 hash_fill-3 hash_total-3
> probe_avg-0 probe_max-1
> total_read_req-0 total_write_req-25605888 total_fill_req-0
> total_complete-25744271
> req_sec-0 seconds-12927
>
> This test is posting a max of 54,000 4K buffer heads to a single 3Ware
> adapter.
>
>
> Jeff
>
>
>
>
> On Fri, Mar 01, 2002 at 09:39:08PM -0700, Jeff V. Merkey wrote:
> >
> >
> > We are going to sleep a lot in __get_request_wait(). This
> > means the write queue has no free request blocks. We are mostly writing
> > to the adapter in this test case, and the data we are writing
> > is already in order when it's posted.
> >
> > We are also posting via submit_bh(), so you should trace
> > that path. I am seeing 22,000+ buffer heads of 4K each posted
> > concurrently on each 3Ware card with this application
> > on the patch for 2.4.19-pre2. I will post the actual data
> > for you. Stand by.
> >
> > These 3Ware cards are incredible.
> >
> > Jeff
> >
> >
> > On Fri, Mar 01, 2002 at 04:51:04PM -0800, Mike Anderson wrote:
> > > Jeff V. Merkey [[email protected]] wrote:
> > > >
> > > > ..snip..
> > > >
> > > > What is really needed here is to allow queue_nr_requests to be
> > > > configurable on a per adapter/device basis for these high end
> > > > RAID cards like 3Ware since in a RAID 0 configuration, 8 drives
> > > > are in essence a terabyte (1.3 terabytes in our configuration)
> > > > and each adapter is showing up as a 1.3 TB device. 64/128
> > > > requests are simply not enough to get the full spectrum of
> > > > performance attainable with these cards.
> > > >
> > > Not having direct experience with this card, it appears that increasing
> > > the queue_nr_requests number will not allow you to have more I/Os in
> > > flight.
> > >
> > > Unless I am reading the driver wrong, you will be limited to
> > > TW_MAX_CMDS_PER_LUN (15). This value is used by scsi_build_commandblocks
> > > to allocate scsi commands for your scsi_device. This driver does not
> > > provide a select_queue_depths function, which would allow the default
> > > template value to be increased.
> > >
> > > Could it be that the experiment of increasing this number has
> > > allowed for better merging?
> > >
> > > -Mike
> > > --
> > > Michael Anderson
> > > [email protected]
> > >

2002-03-02 06:02:51

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective

On Fri, Mar 01, 2002 at 11:01:42PM -0700, Jeff V. Merkey wrote:
> On Fri, Mar 01, 2002 at 10:59:18PM -0700, Jeff V. Merkey wrote:
> >
> >
> > Mike,
> >
> > Here are some numbers from the running system. This system is
> > running at 120 MB/S

Mike,

Also, I've had a single adapter above 150 MB/S. The test I am running
here was to check this 15 queue depth issue, so I took the system down
to a single adapter so I could get the stats you raised the
issue about.

Jeff


> > on a single 3Ware adapter. Stats attached. You
> > will note that the max commands hitting the adapter are way above
> > 15. I can only presume this is due to caching behavior on the card.
> > I do have these cards enabled with caching. I have had these numbers
> > as high as 319 MB/S with multiple cards on separate buses. The system
> > this test is running on has 3 PCI buses, 2 x 33 MHz and 1 x 66 MHz,
> > with a Serverworks HE chipset.
> >

2002-03-02 07:19:29

by Jeff V. Merkey

Subject: Re: queue_nr_requests needs to be selective



Andrew,

Patch is tested and stable.

Jeff


On Fri, Mar 01, 2002 at 07:50:08PM -0800, Andrew Morton wrote:
> "Jeff V. Merkey" wrote:
> >
> > Andrew,
> >
> > Patch tested and stable. I had to manually modify the value
> > in /usr/src/linux/include/linux/blkdev.h from 128 to 1024,
> > since I noticed the 3Ware driver is going through the scsi.c
> > call to the blk_init_queue() macro in this include file. I am
> > assuming that at some point, Adam and the other folks will
> > call this API directly.
> >
> > Performance increased 2-3% to 319 MB/S on dual 33/66 buses,
> > and utilization dropped to 27% +/-. All of this activity is
> > DMA based and copyless between reading/writing nodes, and
> > I am running this UP and getting these numbers. I ran SMP
> > as well and it also looks good.
> >
> > Someone needs to look at just how the drivers will call
> > __blk_init_queue() directly to set up custom-depth queues
> > from scsi.c, since this call originates there for
> > those IDE RAID drivers that fake out the system into thinking
> > they are SCSI.
> >
>
> So it would be more straightforward to just allow the queue
> to be grown later on?
>
>
>
> --- 2.4.19-pre2/drivers/block/ll_rw_blk.c~blk-queue Fri Mar 1 19:40:12 2002
> +++ 2.4.19-pre2-akpm/drivers/block/ll_rw_blk.c Fri Mar 1 19:48:33 2002
> @@ -117,16 +117,6 @@ int * max_readahead[MAX_BLKDEV];
> */
> int * max_sectors[MAX_BLKDEV];
>
> -/*
> - * The total number of requests in each queue.
> - */
> -static int queue_nr_requests;
> -
> -/*
> - * The threshold around which we make wakeup decisions
> - */
> -static int batch_requests;
> -
> static inline int get_max_sectors(kdev_t dev)
> {
> if (!max_sectors[MAJOR(dev)])
> @@ -180,7 +170,7 @@ static int __blk_cleanup_queue(struct re
> **/
> void blk_cleanup_queue(request_queue_t * q)
> {
> - int count = queue_nr_requests;
> + int count = q->nr_requests;
>
> count -= __blk_cleanup_queue(&q->rq[READ]);
> count -= __blk_cleanup_queue(&q->rq[WRITE]);
> @@ -330,31 +320,62 @@ void generic_unplug_device(void *data)
> spin_unlock_irqrestore(&io_request_lock, flags);
> }
>
> +/** blk_grow_request_list
> + * @q: The &request_queue_t
> + * @nr_requests: how many requests are desired
> + *
> + * More free requests are added to the queue's free lists, bringing
> + * the total number of requests to @nr_requests.
> + *
> + * The requests are added equally to the request queue's read
> + * and write freelists.
> + *
> + * This function can sleep.
> + *
> + * Returns the (new) number of requests which the queue has available.
> + */
> +int blk_grow_request_list(request_queue_t *q, int nr_requests)
> +{
> + spin_lock_irq(&io_request_lock);
> + while (q->nr_requests < nr_requests) {
> + struct request *rq;
> + int rw;
> +
> + spin_unlock_irq(&io_request_lock);
> + rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
> + spin_lock_irq(&io_request_lock);
> + if (rq == NULL)
> + break;
> + memset(rq, 0, sizeof(*rq));
> + rq->rq_status = RQ_INACTIVE;
> + rw = q->nr_requests & 1;
> + list_add(&rq->queue, &q->rq[rw].free);
> + q->rq[rw].count++;
> + q->nr_requests++;
> + }
> + q->batch_requests = q->nr_requests / 4;
> + spin_unlock_irq(&io_request_lock);
> + return q->nr_requests;
> +}
> +
> static void blk_init_free_list(request_queue_t *q)
> {
> - struct request *rq;
> - int i;
> + struct sysinfo si;
> + int megs; /* Total memory, in megabytes */
> + int nr_requests;
>
> INIT_LIST_HEAD(&q->rq[READ].free);
> INIT_LIST_HEAD(&q->rq[WRITE].free);
> q->rq[READ].count = 0;
> q->rq[WRITE].count = 0;
> + q->nr_requests = 0;
>
> - /*
> - * Divide requests in half between read and write
> - */
> - for (i = 0; i < queue_nr_requests; i++) {
> - rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
> - if (rq == NULL) {
> - /* We'll get a `leaked requests' message from blk_cleanup_queue */
> - printk(KERN_EMERG "blk_init_free_list: error allocating requests\n");
> - break;
> - }
> - memset(rq, 0, sizeof(struct request));
> - rq->rq_status = RQ_INACTIVE;
> - list_add(&rq->queue, &q->rq[i&1].free);
> - q->rq[i&1].count++;
> - }
> + si_meminfo(&si);
> + megs = si.totalram >> (20 - PAGE_SHIFT);
> + nr_requests = 128;
> + if (megs < 32)
> + nr_requests /= 2;
> + blk_grow_request_list(q, nr_requests);
>
> init_waitqueue_head(&q->wait_for_requests[0]);
> init_waitqueue_head(&q->wait_for_requests[1]);
> @@ -610,7 +631,7 @@ void blkdev_release_request(struct reque
> */
> if (q) {
> list_add(&req->queue, &q->rq[rw].free);
> - if (++q->rq[rw].count >= batch_requests &&
> + if (++q->rq[rw].count >= q->batch_requests &&
> waitqueue_active(&q->wait_for_requests[rw]))
> wake_up(&q->wait_for_requests[rw]);
> }
> @@ -802,7 +823,7 @@ get_rq:
> * See description above __get_request_wait()
> */
> if (rw_ahead) {
> - if (q->rq[rw].count < batch_requests) {
> + if (q->rq[rw].count < q->batch_requests) {
> spin_unlock_irq(&io_request_lock);
> goto end_io;
> }
> @@ -1149,12 +1170,9 @@ void end_that_request_last(struct reques
> blkdev_release_request(req);
> }
>
> -#define MB(kb) ((kb) << 10)
> -
> int __init blk_dev_init(void)
> {
> struct blk_dev_struct *dev;
> - int total_ram;
>
> request_cachep = kmem_cache_create("blkdev_requests",
> sizeof(struct request),
> @@ -1170,22 +1188,6 @@ int __init blk_dev_init(void)
> memset(max_readahead, 0, sizeof(max_readahead));
> memset(max_sectors, 0, sizeof(max_sectors));
>
> - total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
> -
> - /*
> - * Free request slots per queue.
> - * (Half for reads, half for writes)
> - */
> - queue_nr_requests = 64;
> - if (total_ram > MB(32))
> - queue_nr_requests = 128;
> -
> - /*
> - * Batch frees according to queue length
> - */
> - batch_requests = queue_nr_requests/4;
> - printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
> -
> #ifdef CONFIG_AMIGA_Z2RAM
> z2_init();
> #endif
> @@ -1296,6 +1298,7 @@ int __init blk_dev_init(void)
> EXPORT_SYMBOL(io_request_lock);
> EXPORT_SYMBOL(end_that_request_first);
> EXPORT_SYMBOL(end_that_request_last);
> +EXPORT_SYMBOL(blk_grow_request_list);
> EXPORT_SYMBOL(blk_init_queue);
> EXPORT_SYMBOL(blk_get_queue);
> EXPORT_SYMBOL(blk_cleanup_queue);
> --- 2.4.19-pre2/include/linux/blkdev.h~blk-queue Fri Mar 1 19:40:12 2002
> +++ 2.4.19-pre2-akpm/include/linux/blkdev.h Fri Mar 1 19:40:12 2002
> @@ -79,6 +79,16 @@ struct request_queue
> struct request_list rq[2];
>
> /*
> + * The total number of requests on each queue
> + */
> + int nr_requests;
> +
> + /*
> + * Batching threshold for sleep/wakeup decisions
> + */
> + int batch_requests;
> +
> + /*
> * Together with queue_head for cacheline sharing
> */
> struct list_head queue_head;
> @@ -157,6 +167,7 @@ extern void blkdev_release_request(struc
> /*
> * Access functions for manipulating queue properties
> */
> +extern int blk_grow_request_list(request_queue_t *q, int nr_requests);
> extern void blk_init_queue(request_queue_t *, request_fn_proc *);
> extern void blk_cleanup_queue(request_queue_t *);
> extern void blk_queue_headactive(request_queue_t *, int);
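
(With the patch applied, a driver that knows its hardware can sustain a
deep queue could grow the free list itself after queue setup. A minimal
sketch, using the 1024 figure from the testing above; my_request_fn is
a placeholder, not a real function:)

	/*
	 * Sketch only: grow this queue's free request list to 1024
	 * entries after initialising it. 1024 is the experimental
	 * value from the tests above, not a recommended default.
	 */
	blk_init_queue(q, my_request_fn);
	if (blk_grow_request_list(q, 1024) < 1024)
		printk(KERN_WARNING "request list smaller than requested\n");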

2002-03-02 09:11:01

by Jens Axboe

[permalink] [raw]
Subject: Re: queue_nr_requests needs to be selective

On Fri, Mar 01 2002, Andrew Morton wrote:
> So it would be more straightforward to just allow the queue
> to be grown later on?

I agree with that too. I'm fine with the patch, I'm just a bit worried
about the batch_requests vs nr_requests ratio. Are you sure 1/4 is always
a good ratio? In my previous testing, a batch value of more than 32 had
little impact and usually changed things for the worse.

--
Jens Axboe

2002-03-02 09:24:07

by Andrew Morton

[permalink] [raw]
Subject: Re: queue_nr_requests needs to be selective

Jens Axboe wrote:
>
> On Fri, Mar 01 2002, Andrew Morton wrote:
> > So it would be more straightforward to just allow the queue
> > to be grown later on?
>
> I agree with that too. I'm fine with the patch,

Thanks.

> I'm just a bit worried
> about the batch_requests vs nr_requests ratio. Are you sure 1/4 is always
> a good ratio? In my previous testing, a batch value of more than 32 had
> little impact and usually changed things for the worse.
>

Well I just left it as it was for the default case...

I haven't tested much at all for different batching levels. And
in this area, tuning it for my combination of hardware probably
doesn't carry much relevance for Jeff's setup (for example).

And the change to FIFO wakeup may have invalidated your earlier
testing.

So hmm. I'll have a play with it, and if nothing obvious jumps
out, I'll just clamp it at 32.
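
(The clamp itself would be a two-line change at the end of
blk_grow_request_list; a sketch, not a committed patch:)

	q->batch_requests = q->nr_requests / 4;
	if (q->batch_requests > 32)
		q->batch_requests = 32;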


2002-03-04 07:17:15

by Mike Anderson

[permalink] [raw]
Subject: Re: queue_nr_requests needs to be selective

Thanks, Jeff, for the data. I went back and re-read the code.

It looks like the template cmd_per_lun value is recalculated before the
scsi_register call is made by the driver. The number is 255 / number of
units (which, from your previous post, is one). This matches
your proc info. From the numbers, these look like pretty nice cards :-).
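
A rough sketch of the recalculation, as I read it; the names here are
illustrative rather than the literal 3w-xxxx source:

	/*
	 * Before scsi_register(), spread the adapter's 255 command
	 * slots across the exported units. With a single RAID 0
	 * unit, that one device gets all 255.
	 */
	nr_units = tw_unit_count(tw_dev);	/* hypothetical helper */
	tw_template.cmd_per_lun = 255 / nr_units;
	host = scsi_register(&tw_template, sizeof(TW_Device_Extension));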

-Mike

Jeff V. Merkey [[email protected]] wrote:
>
>
> Mike,
>
> Here are some numbers from the running system. This system is
> running at 120 MB/S on a single 3Ware adapter. Stats attached. You
> will note that the max commands hitting the adapter are way above
> 15. I can only presume this is due to caching behavior on the card.
> I do have these cards enabled with caching. I have had these numbers
> as high as 319 MB/S with multiple cards in separate buses. The system
> this test is running on has 3 PCI buses, 2 x 33 MHz and 1 x 66 MHz,
> with a Serverworks HE chipset.
>
> We are averaging complete saturation of the inbound command queue
> on the 3Ware adapter and are posting 64K writes (128 * 512) * 3 on
> average with each command. I have no idea why the 15 queue depth
> limit is being exceeded here, possibly due to the fact that the adapter
> is caching, but the stats report up to 255 requests being posted
> per driver pass.
>
> cat /proc/scsi/3w-xxxx/3
>
> scsi3: 3ware Storage Controller
> Driver version: 1.02.00.016
> Current commands posted: 205
> Max commands posted: 255
> Current pending commands: 0
> Max pending commands: 0
> Last sgl length: 0
> Max sgl length: 3
> Last sector count: 128
> Max sector count: 256
> Resets: 0
> Aborts: 0
> AEN's: 0
>
>
> cat /proc/scsi/3w-xxxx/3
>
> scsi3: 3ware Storage Controller
> Driver version: 1.02.00.016
> Current commands posted: 7
> Max commands posted: 255
> Current pending commands: 0
> Max pending commands: 0
> Last sgl length: 0
> Max sgl length: 3
> Last sector count: 128
> Max sector count: 256
> Resets: 0
> Aborts: 0
> AEN's: 0
>
>
> This test is posting a max of 54,000 4K buffer heads to a single 3Ware
> adapter. See below:
>
>
> trxinfo -a
>
> ioctl TRXDRV_QUERY_AIO ret 0
> TRXDRV-AIO STATS
> aio_submitted-3 aio_completed-25601331 aio_error-0
> disk_aio_submitted-25601539 disk_aio_completed-25601331
> sync_active [ x x x x x x x x ]
> async_active [ x x x x x x x x ]
> cb_active-0 aio_sequence-0
> bh_count-65536 bh_inuse-32208 bh_max_inuse-54593 bh_waiters-0
> hash_hits-0 hash_misses-0 hash_fill-3 hash_total-3
> probe_avg-0 probe_max-1
> total_read_req-0 total_write_req-25601536 total_fill_req-0
> total_complete-25739847
> req_sec-0 seconds-12925
>
> trxinfo -a
>
> ioctl TRXDRV_QUERY_AIO ret 0
> TRXDRV-AIO STATS
> aio_submitted-3 aio_completed-25605732 aio_error-0
> disk_aio_submitted-25605891 disk_aio_completed-25605732
> sync_active [ x x x x x x x x ]
> async_active [ x x x x x x x x ]
> cb_active-0 aio_sequence-0
> bh_count-65536 bh_inuse-31440 bh_max_inuse-54593 bh_waiters-0
> hash_hits-0 hash_misses-0 hash_fill-3 hash_total-3
> probe_avg-0 probe_max-1
> total_read_req-0 total_write_req-25605888 total_fill_req-0
> total_complete-25744271
> req_sec-0 seconds-12927
>
> This test is posting a max of 54,000 4K buffer heads to a single 3Ware
> adapter.
>
>
> Jeff
>
>
>
>
> On Fri, Mar 01, 2002 at 09:39:08PM -0700, Jeff V. Merkey wrote:
> >
> >
> > We are going to sleep a lot in __get_request_wait(). This
> > means the write queue has no free request blocks. We are mostly writing
> > to the adapter in this test case, and the data we are writing
> > is already in order when it's posted.
> >
> > We are also posting via submit_bh(), so you should trace
> > that path. I am seeing 22,000+ buffer heads of 4K each posted
> > concurrently on each 3Ware card with this application
> > on the patch for 2.4.19-pre2. I will post the actual data
> > for you. Stand by.
> >
> > These 3Ware cards are incredible.
> >
> > Jeff
> >
> >
> > On Fri, Mar 01, 2002 at 04:51:04PM -0800, Mike Anderson wrote:
> > >
> > > ..snip..

--
Michael Anderson
[email protected]

2002-03-04 09:12:41

by Jens Axboe

[permalink] [raw]
Subject: Re: queue_nr_requests needs to be selective

On Sat, Mar 02 2002, Andrew Morton wrote:
> > I'm just a bit worried
> > about the batch_requests vs nr_requests ratio. Are you sure 1/4 is always
> > a good ratio? In my previous testing, a batch value of more than 32 had
> > little impact and usually changed things for the worse.
> >
>
> Well I just left it as it was for the default case...
>
> I haven't tested much at all for different batching levels. And
> in this area, tuning it for my combination of hardware probably
> doesn't carry much relevance for Jeff's setup (for example).

Just the fact that you are using much bigger free queue sizes now means
that the batching in itself has less of an effect than with the smaller
queue sizes (where it's absolutely vital to get good merging).

> And the change to FIFO wakeup may have invalidated your earlier
> testing.

Could be. I'd be inclined to think that FIFO behaviour would aid the
batching as well. Remember that we have had FIFO wakeups there before, so
that's not entirely new.

> So hmm. I'll have a play with it, and if nothing obvious jumps
> out, I'll just clamp it at 32.

I think that would be best; I'll wait for numbers though :-). To be
honest, batch counts of e.g. 256 just sound a bit insane to me.

--
Jens Axboe

2002-03-04 17:25:06

by Jeff V. Merkey

[permalink] [raw]
Subject: Re: queue_nr_requests needs to be selective


:-)

They're great cards.

Jeff


On Sun, Mar 03, 2002 at 11:16:34PM -0800, Mike Anderson wrote:
> Thanks, Jeff, for the data. I went back and re-read the code.
>
> It looks like the template cmd_per_lun value is recalculated before the
> scsi_register call is made by the driver. The number is 255 / number of
> units (which, from your previous post, is one). This matches
> your proc info. From the numbers, these look like pretty nice cards :-).
>
> -Mike
>
> Jeff V. Merkey [[email protected]] wrote:
> >
> > ..snip..
>
> --
> Michael Anderson
> [email protected]