2011-05-10 06:40:44

by Alex Shi

Subject: Performance drop on SCSI hard disk

commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
scsi_run_queue() to punt all requests on starved_list devices to
kblockd. Yes, as Jens mentioned, the performance on slow SCSI disks is
hurt here. :) (Intel SSDs aren't affected here)

In our testing on a 12 SAS disk JBOD, fio write with the sync ioengine drops
about 30~40% in throughput, and fio randread/randwrite with the aio ioengine
drop about 20%/50% in throughput. The fio mmap testing is hurt as well.

With the following debug patch, the performance is fully recovered
in our testing. But without the REENTER flag here, in some corner cases,
e.g. a device that keeps getting blocked and then unblocked repeatedly,
__blk_run_queue() may recursively call scsi_run_queue() and then cause
a kernel stack overflow.
I don't know the details of the block device driver; I am just wondering why
SCSI needs the REENTER flag here. :)
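
To make the recursion worry concrete, here is a rough user-space toy of the
pattern; the names are made up and this is not the real block-layer code,
just an illustration of what the old REENTER guard prevented:

/*
 * Toy illustration only: a queue run that hits a BUSY device requeues
 * the request and immediately runs the queue again.  With the guard
 * (what the REENTER flag used to provide) the nested run is detected
 * and punted; without it every requeue adds another stack frame.
 */
#include <stdio.h>
#include <stdbool.h>

static bool reenter_guard = true;   /* set to false to see the recursion */
static bool queue_running;
static int depth;

static void run_queue(int busy_requeues)
{
    if (reenter_guard && queue_running) {
        printf("re-enter detected at depth %d, deferring to a worker\n", depth);
        return;
    }

    queue_running = true;
    depth++;
    printf("running queue, depth %d\n", depth);

    if (busy_requeues > 0)
        run_queue(busy_requeues - 1);   /* BUSY: requeue and run again */

    depth--;
    queue_running = false;
}

int main(void)
{
    run_queue(5);
    return 0;
}

With the guard the nested call is punted after one level; with it cleared, a
device that keeps blocking and unblocking keeps adding stack frames, which is
the overflow risk mentioned above.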

James, do you have any ideas on this?

Regards
Alex
======
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9901b8..24e8589 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -432,8 +432,11 @@ static void scsi_run_queue(struct request_queue *q)
&shost->starved_list);
continue;
}
-
- blk_run_queue_async(sdev->request_queue);
+ spin_unlock(shost->host_lock);
+ spin_lock(sdev->request_queue->queue_lock);
+ __blk_run_queue(sdev->request_queue);
+ spin_unlock(sdev->request_queue->queue_lock);
+ spin_lock(shost->host_lock);
}
/* put any unprocessed entries back */
list_splice(&starved_list, &shost->starved_list);







2011-05-10 06:52:41

by Shaohua Li

Subject: Re: Performance drop on SCSI hard disk

On Tue, May 10, 2011 at 02:40:00PM +0800, Shi, Alex wrote:
> commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> scsi_run_queue() to punt all requests on starved_list devices to
> kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> hurt here. :) (Intel SSD isn't effected here)
>
> In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> about 20%/50% throughput. and fio mmap testing was hurt also.
>
> With the following debug patch, the performance can be totally recovered
> in our testing. But without REENTER flag here, in some corner case, like
> a device is keeping blocked and then unblocked repeatedly,
> __blk_run_queue() may recursively call scsi_run_queue() and then cause
> kernel stack overflow.
> I don't know details of block device driver, just wondering why on scsi
> need the REENTER flag here. :)
Hi Jens,
I want to add some more analysis of the problem to help understand the issue.
This is a severe problem; hopefully we can solve it before the 2.6.39 release.

Basically the offending patch has some problems:
a. more context switches
b. __blk_run_queue() loses the recursion detection. In some cases it could be
 called recursively, for example blk_run_queue() in scsi_run_queue()
c. fairness issue. Say we have sdev1 and sdev2 in the starved_list, then run
scsi_run_queue():
 1. remove both sdev1 and sdev2 from the starved_list
 2. the async queue run dispatches sdev1's request; the host becomes busy again
 3. sdev1 is added to the starved_list again; since the starved_list is empty,
    sdev1 lands at the head
 4. the async queue run checks sdev2 and adds sdev2 at the starved_list tail
 In this scenario sdev1 is serviced first next time, so sdev2 is starved
 (see the sketch below).
 In our test, 12 disks connect to one HBA card. Each disk's queue depth is 64,
 while the HBA card's queue depth is 128. Our test does sync writes, so the
 block size is big and just a few requests can occupy one disk's bandwidth.
 Saturating one disk while starving the others hurts total throughput.
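
Below is a rough user-space model of that ordering; the names are
hypothetical and nothing here is kernel code, it just replays the four steps
above with one free host slot:

/*
 * Toy model: splice both starved devices to a private list, let the
 * first one dispatch (filling the host), and watch both land back on
 * the shared list with sdev1 ahead of sdev2 again.
 */
#include <stdio.h>
#include <string.h>

#define MAX 8

static const char *starved[MAX];
static int nstarved;

static void starved_add_tail(const char *sdev)
{
    starved[nstarved++] = sdev;
    printf("  %s back on starved_list at position %d\n", sdev, nstarved);
}

int main(void)
{
    const char *local[MAX];
    int host_free_slots = 1;    /* host can take only one more command */
    int i, nlocal;

    /* initial state: sdev1 was starved first, then sdev2 */
    starved[nstarved++] = "sdev1";
    starved[nstarved++] = "sdev2";

    /* scsi_run_queue(): move everything onto a private list ... */
    nlocal = nstarved;
    memcpy(local, starved, sizeof(starved));
    nstarved = 0;

    /* ... then each device's queue gets run in turn */
    for (i = 0; i < nlocal; i++) {
        if (host_free_slots > 0) {
            host_free_slots--;
            printf("%s dispatches one request, host is now full\n", local[i]);
        } else {
            printf("%s finds the host busy, dispatches nothing\n", local[i]);
        }
        /* host is busy (again), so the device goes back on the list */
        starved_add_tail(local[i]);
    }

    printf("next pass services %s before %s, so %s keeps starving\n",
           starved[0], starved[1], starved[1]);
    return 0;
}

Repeated on every pass this keeps sdev2 starved, which is what hurts the
total throughput in the 12 disk setup.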

Problem a isn't a big problem in our test (we did observe higher context
switching, about 4x more), but I guess it will hurt high-end systems.

Problem b is easy to fix for SCSI: just replace blk_run_queue() with
blk_run_queue_async() in scsi_run_queue().

Problem c is the root cause of the regression, and I have a patch for it.
Basically, with my patch we don't remove the sdev from the starved_list in
scsi_run_queue(); we delay the removal until scsi_request_fn(), when a
starved device really dispatches a request. My patch fully fixes
the regression.
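
For illustration only, here is a made-up user-space sketch of that idea (an
illustration of the idea, not my actual patch): the removal is deferred until
a device really dispatches, so a device that could not dispatch keeps its
place at the front of the list.

/*
 * Toy sketch of "delay the removal": walk the starved list in place
 * and unlink a device only at the moment it dispatches a request.
 */
#include <stdio.h>

#define MAX 8

static const char *starved[MAX] = { "sdev1", "sdev2" };
static int nstarved = 2;

int main(void)
{
    int host_free_slots = 1;    /* host can take only one more command */
    int i = 0, j;

    while (i < nstarved) {
        if (host_free_slots > 0) {
            host_free_slots--;
            printf("%s dispatches, removed from starved_list\n", starved[i]);
            /* unlink only now, after a real dispatch */
            for (j = i; j < nstarved - 1; j++)
                starved[j] = starved[j + 1];
            nstarved--;
        } else {
            printf("%s cannot dispatch, stays where it is\n", starved[i]);
            i++;
        }
    }

    printf("next pass starts with %s, so it is serviced first\n", starved[0]);
    return 0;
}

Here sdev2 ends up first in line for the next pass instead of being pushed
behind sdev1 again.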

But given problem a, we should revert the patch (or apply Alex's patch if the
stack overflow isn't a big deal here), so I didn't post my patch here. Problem
c actually exists even if we revert the patch (we could still do the async
execution with a small chance), but it is not that severe. I can post a fix
after the patch is reverted.

Thanks,
Shaohua

2011-05-12 00:36:08

by Shaohua Li

Subject: Re: Performance drop on SCSI hard disk

2011/5/10 Shaohua Li <[email protected]>:
> On Tue, May 10, 2011 at 02:40:00PM +0800, Shi, Alex wrote:
>> commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
>> scsi_run_queue() to punt all requests on starved_list devices to
>> kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
>> hurt here. :) (Intel SSD isn't effected here)
>>
>> In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
>> about 30~40% throughput, fio randread/randwrite with aio ioengine drop
>> about 20%/50% throughput. and fio mmap testing was hurt also.
>>
>> With the following debug patch, the performance can be totally recovered
>> in our testing. But without REENTER flag here, in some corner case, like
>> a device is keeping blocked and then unblocked repeatedly,
>> __blk_run_queue() may recursively call scsi_run_queue() and then cause
>> kernel stack overflow.
>> I don't know details of block device driver, just wondering why on scsi
>> need the REENTER flag here. :)
> Hi Jens,
> I want to add more analysis about the problem to help understand the issue.
> This is a severe problem, hopefully we can solve it before 2.6.39 release.
>
> Basically the offending patch has some problems:
> a. more context switches
> b. __blk_run_queue losts the recursive detection. In some cases, it could be
> called recursively. For example, blk_run_queue in scsi_run_queue()
> c. fairness issue. Say we have sdev1 and sdev2 in starved_list. Then run
> scsi_run_queue():
> 1. remove both sdev1 and sdev2 from starved_list
> 2. async queue dispatches sdev1's request. host becames busy again.
> 3. add sdev1 into starved_list again. Since starved_list is empty,
>    sdev1 is added at the head
> 4. async queue checks sdev2, and add sdev2 into starved_list tail.
> In this scenario, sdev1 is serviced first next time, so sdev2 is starved.
> In our test, 12 disks connect to one HBA card. disk's queue depth is 64,
> while HBA card queue depth is 128. Our test does sync write, so block size
> is big, so just several requests can occurpy one disk's bandwidth. Saturate
> one disk but starve others will hurt total throughput.
>
> problem a isn't a big problem in our test (we did observe higher context
> switch, which is about 4x more CS), but guess it will hurt high end system.
>
> problem b is easy to fix for scsi. just replace blk_run_queue with
> blk_run_queue_async in scsi_run_queue
>
> problem c is the root cause for the regression. I had a patch for it.
> Basically with my patch, we don't remove sdev from starved_list in
> scsi_run_queue, but we delay the removal in scsi_request_fn() till a
> starved device really dispatches a request. My patch can fully fix
> the regression.
>
> But given problem a, we should revert the patch (or Alex's patch if stack
> overflow isn't a big deal here), so I didn't post my patch here. Problem
> c actually exsists even we revert the patch (we could do async execution
> with small chance), but not that severe. I can post a fix after the
> patch is reverted.
Hi,
Ping again; I hope this issue isn't being missed. The regression is big,
from 20% to 50% of IO throughput.

Thanks,
Shaohua

2011-05-12 20:29:59

by Jens Axboe

Subject: Re: Performance drop on SCSI hard disk

On 2011-05-10 08:40, Alex,Shi wrote:
> commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> scsi_run_queue() to punt all requests on starved_list devices to
> kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> hurt here. :) (Intel SSD isn't effected here)
>
> In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> about 20%/50% throughput. and fio mmap testing was hurt also.
>
> With the following debug patch, the performance can be totally recovered
> in our testing. But without REENTER flag here, in some corner case, like
> a device is keeping blocked and then unblocked repeatedly,
> __blk_run_queue() may recursively call scsi_run_queue() and then cause
> kernel stack overflow.
> I don't know details of block device driver, just wondering why on scsi
> need the REENTER flag here. :)

This is a problem and we should do something about it for 2.6.39. I knew
that there would be cases where the async offload would cause a
performance degradation, but not to the extent that you are reporting.
Must be hitting the pathological case.

I can think of two scenarios where it could potentially recurse:

- request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
repeat.
- Running starved list from request_fn, two (or more) devices could
alternately recurse.

The first case should be fairly easy to handle. The second one is
already handled by the local list splice.

Looking at the code, is this a real scenario? The only potential recursion I
see is:

scsi_request_fn()
scsi_dispatch_cmd()
scsi_queue_insert()
__scsi_queue_insert()
scsi_run_queue()

Why are we even re-running the queue immediately on a BUSY condition?
Should only be needed if we have zero pending commands from this
particular queue, and for that particular case async run is just fine
since it's a rare condition (or performance would suck already).

And it should only really be needed for the 'q' being passed in, not the
others. Something like the below.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0bac91e..0b01c1f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
*/
#define SCSI_QUEUE_DELAY 3

-static void scsi_run_queue(struct request_queue *q);
+static void scsi_run_queue_async(struct request_queue *q);

/*
* Function: scsi_unprep_request()
@@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
blk_requeue_request(q, cmd->request);
spin_unlock_irqrestore(q->queue_lock, flags);

- scsi_run_queue(q);
+ scsi_run_queue_async(q);

return 0;
}
@@ -391,13 +391,14 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
* Purpose: Select a proper request queue to serve next
*
* Arguments: q - last request's queue
+ * async - prevent potential request_fn recurse by running async
*
* Returns: Nothing
*
* Notes: The previous command was completely finished, start
* a new one if possible.
*/
-static void scsi_run_queue(struct request_queue *q)
+static void __scsi_run_queue(struct request_queue *q, bool async)
{
struct scsi_device *sdev = q->queuedata;
struct Scsi_Host *shost;
@@ -438,13 +439,30 @@ static void scsi_run_queue(struct request_queue *q)
continue;
}

- blk_run_queue_async(sdev->request_queue);
+ spin_unlock(shost->host_lock);
+ spin_lock(sdev->request_queue->queue_lock);
+ __blk_run_queue(sdev->request_queue);
+ spin_unlock(sdev->request_queue->queue_lock);
+ spin_lock(shost->host_lock);
}
/* put any unprocessed entries back */
list_splice(&starved_list, &shost->starved_list);
spin_unlock_irqrestore(shost->host_lock, flags);

- blk_run_queue(q);
+ if (async)
+ blk_run_queue_async(q);
+ else
+ blk_run_queue(q);
+}
+
+static void scsi_run_queue(struct request_queue *q)
+{
+ __scsi_run_queue(q, false);
+}
+
+static void scsi_run_queue_async(struct request_queue *q)
+{
+ __scsi_run_queue(q, true);
}

/*

--
Jens Axboe

2011-05-13 00:12:33

by Alex Shi

Subject: Re: Performance drop on SCSI hard disk

On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> On 2011-05-10 08:40, Alex,Shi wrote:
> > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > scsi_run_queue() to punt all requests on starved_list devices to
> > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > hurt here. :) (Intel SSD isn't effected here)
> >
> > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > about 20%/50% throughput. and fio mmap testing was hurt also.
> >
> > With the following debug patch, the performance can be totally recovered
> > in our testing. But without REENTER flag here, in some corner case, like
> > a device is keeping blocked and then unblocked repeatedly,
> > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > kernel stack overflow.
> > I don't know details of block device driver, just wondering why on scsi
> > need the REENTER flag here. :)
>
> This is a problem and we should do something about it for 2.6.39. I knew
> that there would be cases where the async offload would cause a
> performance degredation, but not to the extent that you are reporting.
> Must be hitting the pathological case.
>
> I can think of two scenarios where it could potentially recurse:
>
> - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
> repeat.
> - Running starved list from request_fn, two (or more) devices could
> alternately recurse.
>
> The first case should be fairly easy to handle. The second one is
> already handled by the local list splice.
>
> Looking at the code, is this a real scenario? Only potential recurse I
> see is:
>
> scsi_request_fn()
> scsi_dispatch_cmd()
> scsi_queue_insert()
> __scsi_queue_insert()
> scsi_run_queue()
>
> Why are we even re-running the queue immediately on a BUSY condition?
> Should only be needed if we have zero pending commands from this
> particular queue, and for that particular case async run is just fine
> since it's a rare condition (or performance would suck already).

Yeah, this is the correct way to fix it. Let me try the patch on our
machine.

2011-05-13 00:48:28

by Shaohua Li

Subject: Re: Performance drop on SCSI hard disk

On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> On 2011-05-10 08:40, Alex,Shi wrote:
> > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > scsi_run_queue() to punt all requests on starved_list devices to
> > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > hurt here. :) (Intel SSD isn't effected here)
> >
> > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > about 20%/50% throughput. and fio mmap testing was hurt also.
> >
> > With the following debug patch, the performance can be totally recovered
> > in our testing. But without REENTER flag here, in some corner case, like
> > a device is keeping blocked and then unblocked repeatedly,
> > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > kernel stack overflow.
> > I don't know details of block device driver, just wondering why on scsi
> > need the REENTER flag here. :)
>
> This is a problem and we should do something about it for 2.6.39. I knew
> that there would be cases where the async offload would cause a
> performance degredation, but not to the extent that you are reporting.
> Must be hitting the pathological case.
Async offload is expected to increase context switches, but the real root
cause of the issue is the fairness issue. Please see my previous email.

> I can think of two scenarios where it could potentially recurse:
>
> - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
> repeat.
> - Running starved list from request_fn, two (or more) devices could
> alternately recurse.
>
> The first case should be fairly easy to handle. The second one is
> already handled by the local list splice.
This isn't true to me. If you unlock the host_lock in scsi_run_queue(), other
CPUs can add an sdev to the starved device list again. In the recursive
call of scsi_run_queue(), the starved device list might not be empty, so
the local list_splice doesn't help.

>
> Looking at the code, is this a real scenario? Only potential recurse I
> see is:
>
> scsi_request_fn()
> scsi_dispatch_cmd()
> scsi_queue_insert()
> __scsi_queue_insert()
> scsi_run_queue()
>
> Why are we even re-running the queue immediately on a BUSY condition?
> Should only be needed if we have zero pending commands from this
> particular queue, and for that particular case async run is just fine
> since it's a rare condition (or performance would suck already).
>
> And it should only really be needed for the 'q' being passed in, not the
> others. Something like the below.
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 0bac91e..0b01c1f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
> */
> #define SCSI_QUEUE_DELAY 3
>
> -static void scsi_run_queue(struct request_queue *q);
> +static void scsi_run_queue_async(struct request_queue *q);
>
> /*
> * Function: scsi_unprep_request()
> @@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> blk_requeue_request(q, cmd->request);
> spin_unlock_irqrestore(q->queue_lock, flags);
>
> - scsi_run_queue(q);
> + scsi_run_queue_async(q);
So you could still recursively run into the starved list. Do you want to put
the whole __scsi_run_queue() into a workqueue?

Thanks,
Shaohua

2011-05-13 03:02:01

by Shaohua Li

Subject: Re: Performance drop on SCSI hard disk

On Fri, 2011-05-13 at 08:48 +0800, Shaohua Li wrote:
> On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> > On 2011-05-10 08:40, Alex,Shi wrote:
> > > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > > scsi_run_queue() to punt all requests on starved_list devices to
> > > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > > hurt here. :) (Intel SSD isn't effected here)
> > >
> > > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > > about 20%/50% throughput. and fio mmap testing was hurt also.
> > >
> > > With the following debug patch, the performance can be totally recovered
> > > in our testing. But without REENTER flag here, in some corner case, like
> > > a device is keeping blocked and then unblocked repeatedly,
> > > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > > kernel stack overflow.
> > > I don't know details of block device driver, just wondering why on scsi
> > > need the REENTER flag here. :)
> >
> > This is a problem and we should do something about it for 2.6.39. I knew
> > that there would be cases where the async offload would cause a
> > performance degredation, but not to the extent that you are reporting.
> > Must be hitting the pathological case.
> async offload is expected to increase context switch. But the real root
> cause of the issue is fairness issue. Please see my previous email.
>
> > I can think of two scenarios where it could potentially recurse:
> >
> > - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
> > repeat.
> > - Running starved list from request_fn, two (or more) devices could
> > alternately recurse.
> >
> > The first case should be fairly easy to handle. The second one is
> > already handled by the local list splice.
> this isn't true to me. if you unlock host_lock in scsi_run_queue, other
> cpus can add sdev to the starved device list again. In the recursive
> call of scsi_run_queue, the starved device list might not be empty. So
> the local list_splice doesn't help.
>
> >
> > Looking at the code, is this a real scenario? Only potential recurse I
> > see is:
> >
> > scsi_request_fn()
> > scsi_dispatch_cmd()
> > scsi_queue_insert()
> > __scsi_queue_insert()
> > scsi_run_queue()
> >
> > Why are we even re-running the queue immediately on a BUSY condition?
> > Should only be needed if we have zero pending commands from this
> > particular queue, and for that particular case async run is just fine
> > since it's a rare condition (or performance would suck already).
> >
> > And it should only really be needed for the 'q' being passed in, not the
> > others. Something like the below.
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 0bac91e..0b01c1f 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
> > */
> > #define SCSI_QUEUE_DELAY 3
> >
> > -static void scsi_run_queue(struct request_queue *q);
> > +static void scsi_run_queue_async(struct request_queue *q);
> >
> > /*
> > * Function: scsi_unprep_request()
> > @@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> > blk_requeue_request(q, cmd->request);
> > spin_unlock_irqrestore(q->queue_lock, flags);
> >
> > - scsi_run_queue(q);
> > + scsi_run_queue_async(q);
> so you could still recursivly run into starved list. Do you want to put
> the whole __scsi_run_queue into workqueue?
What I mean is that the current sdev (other devices too) can still be added to
the starved list, so only doing the async execution for the current q isn't
enough; we'd better put the whole __scsi_run_queue() into a workqueue,
something like below on top of yours, untested. Not sure if there are other
recursive cases.

Index: linux/drivers/scsi/scsi_lib.c
===================================================================
--- linux.orig/drivers/scsi/scsi_lib.c 2011-05-13 10:32:28.000000000 +0800
+++ linux/drivers/scsi/scsi_lib.c 2011-05-13 10:52:51.000000000 +0800
@@ -74,8 +74,6 @@ struct kmem_cache *scsi_sdb_cache;
*/
#define SCSI_QUEUE_DELAY 3

-static void scsi_run_queue_async(struct request_queue *q);
-
/*
* Function: scsi_unprep_request()
*
@@ -161,7 +159,7 @@ static int __scsi_queue_insert(struct sc
blk_requeue_request(q, cmd->request);
spin_unlock_irqrestore(q->queue_lock, flags);

- scsi_run_queue_async(q);
+ kblockd_schedule_work(q, &device->requeue_work);

return 0;
}
@@ -391,14 +389,13 @@ static inline int scsi_host_is_busy(stru
* Purpose: Select a proper request queue to serve next
*
* Arguments: q - last request's queue
- * async - prevent potential request_fn recurse by running async
*
* Returns: Nothing
*
* Notes: The previous command was completely finished, start
* a new one if possible.
*/
-static void __scsi_run_queue(struct request_queue *q, bool async)
+static void scsi_run_queue(struct request_queue *q)
{
struct scsi_device *sdev = q->queuedata;
struct Scsi_Host *shost;
@@ -449,20 +446,17 @@ static void __scsi_run_queue(struct requ
list_splice(&starved_list, &shost->starved_list);
spin_unlock_irqrestore(shost->host_lock, flags);

- if (async)
- blk_run_queue_async(q);
- else
- blk_run_queue(q);
+ blk_run_queue(q);
}

-static void scsi_run_queue(struct request_queue *q)
+void scsi_requeue_run_queue(struct work_struct *work)
{
- __scsi_run_queue(q, false);
-}
+ struct scsi_device *sdev;
+ struct request_queue *q;

-static void scsi_run_queue_async(struct request_queue *q)
-{
- __scsi_run_queue(q, true);
+ sdev = container_of(work, struct scsi_device, requeue_work);
+ q = sdev->request_queue;
+ scsi_run_queue(q);
}

/*
Index: linux/drivers/scsi/scsi_scan.c
===================================================================
--- linux.orig/drivers/scsi/scsi_scan.c 2011-05-13 10:44:09.000000000 +0800
+++ linux/drivers/scsi/scsi_scan.c 2011-05-13 10:45:41.000000000 +0800
@@ -242,6 +242,7 @@ static struct scsi_device *scsi_alloc_sd
int display_failure_msg = 1, ret;
struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
extern void scsi_evt_thread(struct work_struct *work);
+ extern void scsi_requeue_run_queue(struct work_struct *work);

sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
GFP_ATOMIC);
@@ -264,6 +265,7 @@ static struct scsi_device *scsi_alloc_sd
INIT_LIST_HEAD(&sdev->event_list);
spin_lock_init(&sdev->list_lock);
INIT_WORK(&sdev->event_work, scsi_evt_thread);
+ INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);

sdev->sdev_gendev.parent = get_device(&starget->dev);
sdev->sdev_target = starget;
Index: linux/include/scsi/scsi_device.h
===================================================================
--- linux.orig/include/scsi/scsi_device.h 2011-05-13 10:36:31.000000000 +0800
+++ linux/include/scsi/scsi_device.h 2011-05-13 10:40:46.000000000 +0800
@@ -169,6 +169,7 @@ struct scsi_device {
sdev_dev;

struct execute_work ew; /* used to get process context on put */
+ struct work_struct requeue_work;

struct scsi_dh_data *scsi_dh_data;
enum scsi_device_state sdev_state;


2011-05-16 08:04:18

by Shaohua Li

Subject: Re: Performance drop on SCSI hard disk

On Fri, 2011-05-13 at 11:01 +0800, Shaohua Li wrote:
> On Fri, 2011-05-13 at 08:48 +0800, Shaohua Li wrote:
> > On Fri, 2011-05-13 at 04:29 +0800, Jens Axboe wrote:
> > > On 2011-05-10 08:40, Alex,Shi wrote:
> > > > commit c21e6beba8835d09bb80e34961 removed the REENTER flag and changed
> > > > scsi_run_queue() to punt all requests on starved_list devices to
> > > > kblockd. Yes, like Jens mentioned, the performance on slow SCSI disk was
> > > > hurt here. :) (Intel SSD isn't effected here)
> > > >
> > > > In our testing on 12 SAS disk JBD, the fio write with sync ioengine drop
> > > > about 30~40% throughput, fio randread/randwrite with aio ioengine drop
> > > > about 20%/50% throughput. and fio mmap testing was hurt also.
> > > >
> > > > With the following debug patch, the performance can be totally recovered
> > > > in our testing. But without REENTER flag here, in some corner case, like
> > > > a device is keeping blocked and then unblocked repeatedly,
> > > > __blk_run_queue() may recursively call scsi_run_queue() and then cause
> > > > kernel stack overflow.
> > > > I don't know details of block device driver, just wondering why on scsi
> > > > need the REENTER flag here. :)
> > >
> > > This is a problem and we should do something about it for 2.6.39. I knew
> > > that there would be cases where the async offload would cause a
> > > performance degredation, but not to the extent that you are reporting.
> > > Must be hitting the pathological case.
> > async offload is expected to increase context switch. But the real root
> > cause of the issue is fairness issue. Please see my previous email.
> >
> > > I can think of two scenarios where it could potentially recurse:
> > >
> > > - request_fn enter, end up requeuing IO. Run queue at the end. Rinse,
> > > repeat.
> > > - Running starved list from request_fn, two (or more) devices could
> > > alternately recurse.
> > >
> > > The first case should be fairly easy to handle. The second one is
> > > already handled by the local list splice.
> > this isn't true to me. if you unlock host_lock in scsi_run_queue, other
> > cpus can add sdev to the starved device list again. In the recursive
> > call of scsi_run_queue, the starved device list might not be empty. So
> > the local list_splice doesn't help.
> >
> > >
> > > Looking at the code, is this a real scenario? Only potential recurse I
> > > see is:
> > >
> > > scsi_request_fn()
> > > scsi_dispatch_cmd()
> > > scsi_queue_insert()
> > > __scsi_queue_insert()
> > > scsi_run_queue()
> > >
> > > Why are we even re-running the queue immediately on a BUSY condition?
> > > Should only be needed if we have zero pending commands from this
> > > particular queue, and for that particular case async run is just fine
> > > since it's a rare condition (or performance would suck already).
> > >
> > > And it should only really be needed for the 'q' being passed in, not the
> > > others. Something like the below.
> > >
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index 0bac91e..0b01c1f 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -74,7 +74,7 @@ struct kmem_cache *scsi_sdb_cache;
> > > */
> > > #define SCSI_QUEUE_DELAY 3
> > >
> > > -static void scsi_run_queue(struct request_queue *q);
> > > +static void scsi_run_queue_async(struct request_queue *q);
> > >
> > > /*
> > > * Function: scsi_unprep_request()
> > > @@ -161,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> > > blk_requeue_request(q, cmd->request);
> > > spin_unlock_irqrestore(q->queue_lock, flags);
> > >
> > > - scsi_run_queue(q);
> > > + scsi_run_queue_async(q);
> > so you could still recursivly run into starved list. Do you want to put
> > the whole __scsi_run_queue into workqueue?
> what I mean is current sdev (other devices too) can still be added into
> starved list, so only does async execute for current q isn't enough,
> we'd better put whole __scsi_run_queue into workqueue. something like
> below on top of yours, untested. Not sure if there are other recursive
> cases.
Verified that the regression can be fully fixed by your patch (with my
suggested fix to avoid the race). Can we put a formal patch upstream?

Thanks,
Shaohua

2011-05-16 08:38:56

by Alex Shi

Subject: Re: Performance drop on SCSI hard disk


> > what I mean is current sdev (other devices too) can still be added into
> > starved list, so only does async execute for current q isn't enough,
> > we'd better put whole __scsi_run_queue into workqueue. something like
> > below on top of yours, untested. Not sure if there are other recursive
> > cases.
> verified the regression can be fully fixed by your patch (with my
> suggested fix to avoid race). Can we put a formal patch upstream?

Yes, we tested Jens' patch alone and also with Shaohua's patch on top. Both of
them recovered the SAS disk performance. Now I am testing them on SSD
disks with kbuild and fio cases; in theory, they will both work.

>
> Thanks,
> Shaohua
>

2011-05-17 06:10:34

by Alex Shi

Subject: Re: Performance drop on SCSI hard disk

On Mon, 2011-05-16 at 16:37 +0800, Alex,Shi wrote:
> > > what I mean is current sdev (other devices too) can still be added into
> > > starved list, so only does async execute for current q isn't enough,
> > > we'd better put whole __scsi_run_queue into workqueue. something like
> > > below on top of yours, untested. Not sure if there are other recursive
> > > cases.
> > verified the regression can be fully fixed by your patch (with my
> > suggested fix to avoid race). Can we put a formal patch upstream?
>
> Yes, we tested Jens patch alone and plus Shaohua's patch too. Both of
> them recovered SAS disk performance too. Now I am testing them on SSD
> disk with kbuild and fio cases, In theory, they will both work.
>

As expected, both patches have no effect on SSD disks in the JBOD.

> >
> > Thanks,
> > Shaohua
> >
>

2011-05-17 07:20:39

by Jens Axboe

Subject: Re: Performance drop on SCSI hard disk

On 2011-05-17 08:09, Alex,Shi wrote:
> On Mon, 2011-05-16 at 16:37 +0800, Alex,Shi wrote:
>>>> what I mean is current sdev (other devices too) can still be added into
>>>> starved list, so only does async execute for current q isn't enough,
>>>> we'd better put whole __scsi_run_queue into workqueue. something like
>>>> below on top of yours, untested. Not sure if there are other recursive
>>>> cases.
>>> verified the regression can be fully fixed by your patch (with my
>>> suggested fix to avoid race). Can we put a formal patch upstream?
>>
>> Yes, we tested Jens patch alone and plus Shaohua's patch too. Both of
>> them recovered SAS disk performance too. Now I am testing them on SSD
>> disk with kbuild and fio cases, In theory, they will both work.
>>
>
> As expected, both patches has no effect on SSD disks in JBD.

I will queue up the combined patch, it looks fine from here as well.

--
Jens Axboe

2011-05-19 08:27:40

by Alex Shi

Subject: Re: Performance drop on SCSI hard disk


> I will queue up the combined patch, it looks fine from here as well.
>

When I had some time to study Jens' and Shaohua's patches today, I found a
simpler way to resolve the re-enter issue on the starved_list. Following
Jens' idea, we can just punt the starved_list devices to kblockd if we
come from __scsi_queue_insert().
It resolves the re-enter issue and recovers the performance totally, and it
does not need a work_struct in every scsi_device. The logic/code also looks a
bit simpler.
What's your opinion of this?


Signed-off-by: Alex Shi <[email protected]>
---
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ec1803a..de7c569 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -74,6 +74,8 @@ struct kmem_cache *scsi_sdb_cache;
*/
#define SCSI_QUEUE_DELAY 3

+static void scsi_run_queue_async(struct request_queue *q);
+
/*
* Function: scsi_unprep_request()
*
@@ -159,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
blk_requeue_request(q, cmd->request);
spin_unlock_irqrestore(q->queue_lock, flags);

- kblockd_schedule_work(q, &device->requeue_work);
+ scsi_run_queue_async(q);

return 0;
}
@@ -395,7 +397,7 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
* Notes: The previous command was completely finished, start
* a new one if possible.
*/
-static void scsi_run_queue(struct request_queue *q)
+static void __scsi_run_queue(struct request_queue *q, bool async)
{
struct scsi_device *sdev = q->queuedata;
struct Scsi_Host *shost;
@@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue *q)
&shost->starved_list);
continue;
}
-
- spin_unlock(shost->host_lock);
- spin_lock(sdev->request_queue->queue_lock);
- __blk_run_queue(sdev->request_queue);
- spin_unlock(sdev->request_queue->queue_lock);
- spin_lock(shost->host_lock);
+ if (async)
+ blk_run_queue_async(sdev->request_queue);
+ else {
+ spin_unlock(shost->host_lock);
+ spin_lock(sdev->request_queue->queue_lock);
+ __blk_run_queue(sdev->request_queue);
+ spin_unlock(sdev->request_queue->queue_lock);
+ spin_lock(shost->host_lock);
+ }
}
/* put any unprocessed entries back */
list_splice(&starved_list, &shost->starved_list);
spin_unlock_irqrestore(shost->host_lock, flags);

- blk_run_queue(q);
+ if (async)
+ blk_run_queue_async(q);
+ else
+ blk_run_queue(q);
}

-void scsi_requeue_run_queue(struct work_struct *work)
+static void scsi_run_queue(struct request_queue *q)
{
- struct scsi_device *sdev;
- struct request_queue *q;
-
- sdev = container_of(work, struct scsi_device, requeue_work);
- q = sdev->request_queue;
- scsi_run_queue(q);
+ __scsi_run_queue(q, false);
}

+static void scsi_run_queue_async(struct request_queue *q)
+{
+ __scsi_run_queue(q, true);
+}
/*
* Function: scsi_requeue_command()
*
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 58584dc..087821f 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -242,7 +242,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
int display_failure_msg = 1, ret;
struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
extern void scsi_evt_thread(struct work_struct *work);
- extern void scsi_requeue_run_queue(struct work_struct *work);

sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
GFP_ATOMIC);
@@ -265,7 +264,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
INIT_LIST_HEAD(&sdev->event_list);
spin_lock_init(&sdev->list_lock);
INIT_WORK(&sdev->event_work, scsi_evt_thread);
- INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);

sdev->sdev_gendev.parent = get_device(&starget->dev);
sdev->sdev_target = starget;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index dd82e02..2d3ec50 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -169,7 +169,6 @@ struct scsi_device {
sdev_dev;

struct execute_work ew; /* used to get process context on put */
- struct work_struct requeue_work;

struct scsi_dh_data *scsi_dh_data;
enum scsi_device_state sdev_state;

2011-05-19 08:48:50

by Alex Shi

Subject: Re: Performance drop on SCSI hard disk

On Thu, 2011-05-19 at 16:26 +0800, Alex,Shi wrote:
> > I will queue up the combined patch, it looks fine from here as well.
> >
>
> When I have some time to study Jens and shaohua's patch today. I find a
> simpler way to resolved the re-enter issue on starved_list. Following
> Jens' idea, we can just put the starved_list device into kblockd if it
> come from __scsi_queue_insert().
> It can resolve the re-enter issue and recover performance totally, and
> need not a work_struct in every scsi_device. The logic/code also looks a
> bit simpler.
> What's your opinion of this?

BTW, Jens, this patch is against the latest kernel, 2.6.39.
>
>
> Signed-off-by: Alex Shi <[email protected]>
> ---
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index ec1803a..de7c569 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -74,6 +74,8 @@ struct kmem_cache *scsi_sdb_cache;
> */
> #define SCSI_QUEUE_DELAY 3
>
> +static void scsi_run_queue_async(struct request_queue *q);
> +
> /*
> * Function: scsi_unprep_request()
> *
> @@ -159,7 +161,7 @@ static int __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> blk_requeue_request(q, cmd->request);
> spin_unlock_irqrestore(q->queue_lock, flags);
>
> - kblockd_schedule_work(q, &device->requeue_work);
> + scsi_run_queue_async(q);
>
> return 0;
> }
> @@ -395,7 +397,7 @@ static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> * Notes: The previous command was completely finished, start
> * a new one if possible.
> */
> -static void scsi_run_queue(struct request_queue *q)
> +static void __scsi_run_queue(struct request_queue *q, bool async)
> {
> struct scsi_device *sdev = q->queuedata;
> struct Scsi_Host *shost;
> @@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue *q)
> &shost->starved_list);
> continue;
> }
> -
> - spin_unlock(shost->host_lock);
> - spin_lock(sdev->request_queue->queue_lock);
> - __blk_run_queue(sdev->request_queue);
> - spin_unlock(sdev->request_queue->queue_lock);
> - spin_lock(shost->host_lock);
> + if (async)
> + blk_run_queue_async(sdev->request_queue);
> + else {
> + spin_unlock(shost->host_lock);
> + spin_lock(sdev->request_queue->queue_lock);
> + __blk_run_queue(sdev->request_queue);
> + spin_unlock(sdev->request_queue->queue_lock);
> + spin_lock(shost->host_lock);
> + }
> }
> /* put any unprocessed entries back */
> list_splice(&starved_list, &shost->starved_list);
> spin_unlock_irqrestore(shost->host_lock, flags);
>
> - blk_run_queue(q);
> + if (async)
> + blk_run_queue_async(q);
> + else
> + blk_run_queue(q);
> }
>
> -void scsi_requeue_run_queue(struct work_struct *work)
> +static void scsi_run_queue(struct request_queue *q)
> {
> - struct scsi_device *sdev;
> - struct request_queue *q;
> -
> - sdev = container_of(work, struct scsi_device, requeue_work);
> - q = sdev->request_queue;
> - scsi_run_queue(q);
> + __scsi_run_queue(q, false);
> }
>
> +static void scsi_run_queue_async(struct request_queue *q)
> +{
> + __scsi_run_queue(q, true);
> +}
> /*
> * Function: scsi_requeue_command()
> *
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 58584dc..087821f 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -242,7 +242,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
> int display_failure_msg = 1, ret;
> struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
> extern void scsi_evt_thread(struct work_struct *work);
> - extern void scsi_requeue_run_queue(struct work_struct *work);
>
> sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
> GFP_ATOMIC);
> @@ -265,7 +264,6 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
> INIT_LIST_HEAD(&sdev->event_list);
> spin_lock_init(&sdev->list_lock);
> INIT_WORK(&sdev->event_work, scsi_evt_thread);
> - INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);
>
> sdev->sdev_gendev.parent = get_device(&starget->dev);
> sdev->sdev_target = starget;
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index dd82e02..2d3ec50 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -169,7 +169,6 @@ struct scsi_device {
> sdev_dev;
>
> struct execute_work ew; /* used to get process context on put */
> - struct work_struct requeue_work;
>
> struct scsi_dh_data *scsi_dh_data;
> enum scsi_device_state sdev_state;
>

2011-05-19 18:27:27

by Jens Axboe

Subject: Re: Performance drop on SCSI hard disk

On 2011-05-19 10:26, Alex,Shi wrote:
>
>> I will queue up the combined patch, it looks fine from here as well.
>>
>
> When I have some time to study Jens and shaohua's patch today. I find a
> simpler way to resolved the re-enter issue on starved_list. Following
> Jens' idea, we can just put the starved_list device into kblockd if it
> come from __scsi_queue_insert().
> It can resolve the re-enter issue and recover performance totally, and
> need not a work_struct in every scsi_device. The logic/code also looks a
> bit simpler.
> What's your opinion of this?

Isn't this _identical_ to my original patch, with the added async run of
the queue passed in (which is important, an oversight)?

--
Jens Axboe

2011-05-20 00:23:22

by Alex Shi

Subject: Re: Performance drop on SCSI hard disk

On Fri, 2011-05-20 at 02:27 +0800, Jens Axboe wrote:
> On 2011-05-19 10:26, Alex,Shi wrote:
> >
> >> I will queue up the combined patch, it looks fine from here as well.
> >>
> >
> > When I have some time to study Jens and shaohua's patch today. I find a
> > simpler way to resolved the re-enter issue on starved_list. Following
> > Jens' idea, we can just put the starved_list device into kblockd if it
> > come from __scsi_queue_insert().
> > It can resolve the re-enter issue and recover performance totally, and
> > need not a work_struct in every scsi_device. The logic/code also looks a
> > bit simpler.
> > What's your opinion of this?
>
> Isn't this _identical_ to my original patch, with the added async run of
> the queue passed in (which is important, an oversight)?

Not exactly the same. It is based on your patch, but adds a bypass for
starved_list devices. If a starved_list device comes from
__scsi_queue_insert(), which may be caused by the recursion we were talking
about, kblockd will take over the processing. Maybe you overlooked this point
in the original patch. :)

The part that differs from yours is below:
---
static void __scsi_run_queue(struct request_queue *q, bool async)
{
struct scsi_device *sdev = q->queuedata;
struct Scsi_Host *shost;
@@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue
*q)
&shost->starved_list);
continue;
}
-
- spin_unlock(shost->host_lock);
- spin_lock(sdev->request_queue->queue_lock);
- __blk_run_queue(sdev->request_queue);
- spin_unlock(sdev->request_queue->queue_lock);
- spin_lock(shost->host_lock);
+ if (async)
+ blk_run_queue_async(sdev->request_queue);
+ else {
+ spin_unlock(shost->host_lock);
+ spin_lock(sdev->request_queue->queue_lock);
+ __blk_run_queue(sdev->request_queue);
+ spin_unlock(sdev->request_queue->queue_lock);
+ spin_lock(shost->host_lock);
>

2011-05-20 00:40:59

by Shaohua Li

Subject: Re: Performance drop on SCSI hard disk

2011/5/20 Alex,Shi <[email protected]>:
> On Fri, 2011-05-20 at 02:27 +0800, Jens Axboe wrote:
>> On 2011-05-19 10:26, Alex,Shi wrote:
>> >
>> >> I will queue up the combined patch, it looks fine from here as well.
>> >>
>> >
>> > When I have some time to study Jens and shaohua's patch today. I find a
>> > simpler way to resolved the re-enter issue on starved_list. Following
>> > Jens' idea, we can just put the starved_list device into kblockd if it
>> > come from __scsi_queue_insert().
>> > It can resolve the re-enter issue and recover performance totally, and
>> > need not a work_struct in every scsi_device. The logic/code also looks a
>> > bit simpler.
>> > What's your opinion of this?
>>
>> Isn't this _identical_ to my original patch, with the added async run of
>> the queue passed in (which is important, an oversight)?
>
> Not exactly same. It bases on your patch, but added a bypass way for
> starved_list device. If a starved_list device come from
> __scsi_queue_insert(), that may caused by our talking recursion, kblockd
> with take over the process. Maybe you oversight this point in original
> patch. :)
>
> The different part from yours is below:
> ---
> static void __scsi_run_queue(struct request_queue *q, bool async)
> {
>        struct scsi_device *sdev = q->queuedata;
>        struct Scsi_Host *shost;
> @@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue
> *q)
>                                       &shost->starved_list);
>                        continue;
>                }
> -
> -               spin_unlock(shost->host_lock);
> -               spin_lock(sdev->request_queue->queue_lock);
> -               __blk_run_queue(sdev->request_queue);
> -               spin_unlock(sdev->request_queue->queue_lock);
> -               spin_lock(shost->host_lock);
> +               if (async)
> +                       blk_run_queue_async(sdev->request_queue);
> +               else {
> +                       spin_unlock(shost->host_lock);
> +                       spin_lock(sdev->request_queue->queue_lock);
> +                       __blk_run_queue(sdev->request_queue);
> +                       spin_unlock(sdev->request_queue->queue_lock);
> +                       spin_lock(shost->host_lock);
>>
I don't quite like this approach. blk_run_queue_async() could
introduce the fairness issue I described in my previous mail, because we drop
the sdev from the starved list but don't run its queue immediately. The
issue existed before, but it's a bug to me.
Alex, is there any real advantage to your patch?

Thanks,
Shaohua

2011-05-20 05:19:09

by Alex Shi

Subject: Re: Performance drop on SCSI hard disk

On Fri, 2011-05-20 at 08:40 +0800, Shaohua Li wrote:
> 2011/5/20 Alex,Shi <[email protected]>:
> > On Fri, 2011-05-20 at 02:27 +0800, Jens Axboe wrote:
> >> On 2011-05-19 10:26, Alex,Shi wrote:
> >> >
> >> >> I will queue up the combined patch, it looks fine from here as well.
> >> >>
> >> >
> >> > When I have some time to study Jens and shaohua's patch today. I find a
> >> > simpler way to resolved the re-enter issue on starved_list. Following
> >> > Jens' idea, we can just put the starved_list device into kblockd if it
> >> > come from __scsi_queue_insert().
> >> > It can resolve the re-enter issue and recover performance totally, and
> >> > need not a work_struct in every scsi_device. The logic/code also looks a
> >> > bit simpler.
> >> > What's your opinion of this?
> >>
> >> Isn't this _identical_ to my original patch, with the added async run of
> >> the queue passed in (which is important, an oversight)?
> >
> > Not exactly same. It bases on your patch, but added a bypass way for
> > starved_list device. If a starved_list device come from
> > __scsi_queue_insert(), that may caused by our talking recursion, kblockd
> > with take over the process. Maybe you oversight this point in original
> > patch. :)
> >
> > The different part from yours is below:
> > ---
> > static void __scsi_run_queue(struct request_queue *q, bool async)
> > {
> > struct scsi_device *sdev = q->queuedata;
> > struct Scsi_Host *shost;
> > @@ -435,30 +437,35 @@ static void scsi_run_queue(struct request_queue
> > *q)
> > &shost->starved_list);
> > continue;
> > }
> > -
> > - spin_unlock(shost->host_lock);
> > - spin_lock(sdev->request_queue->queue_lock);
> > - __blk_run_queue(sdev->request_queue);
> > - spin_unlock(sdev->request_queue->queue_lock);
> > - spin_lock(shost->host_lock);
> > + if (async)
> > + blk_run_queue_async(sdev->request_queue);
> > + else {
> > + spin_unlock(shost->host_lock);
> > + spin_lock(sdev->request_queue->queue_lock);
> > + __blk_run_queue(sdev->request_queue);
> > + spin_unlock(sdev->request_queue->queue_lock);
> > + spin_lock(shost->host_lock);
> >>
> I don't quite like this approach. blk_run_queue_async() could
> introduce fairness issue as I said in previous mail, because we drop
> the sdev from starved list but didn't run its queue immediately. The
> issue exists before, but it's a bug to me.

I understand what you're worried about, but I am not quite clear about the
trigger scenario. Anyway, a potential fairness issue still exists, so
forget my patch.