2011-04-11 14:53:33

by Steven Whitehouse

Subject: Strange block/scsi/workqueue issue

Hi,

Sometime around -rc1 (I'm not sure exactly, but it seemed to start
around then) my test box stopped booting. Lacking a serial console,
I didn't get around to looking further before I went on holiday, but
since the problem is still there in Linus' latest tree, I've been
looking again today.

My original problem was that the messages were scrolling off the screen
before I could read them. Trying to capture them with a video camera
was unsuccessful.

Today I used a serial console, and magically it seemed that the initial
boot issue had gone away - so I guess there is at least an element of
timing dependence. However, not very long after boot, I got the attached
messages. These look suspiciously similar (bearing in mind I could only
see the last few lines) to the messages I was getting with the boot
issue.

I should also say that the box was idle aside from the usual Fedora 14
start-up tasks, so there was not much I/O going on at the time.

Is the workqueue by any chance trying to wait synchronously for itself
to finish?

Steve.

INFO: task kworker/2:1:382 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/2:1 D ffff8800cb5d2500 4528 382 2 0x00000000
ffff8800c88d5880 0000000000000086 00000000001d2500 ffff8800c88d57c8
0000000000000086 ffff8800c88d5fd8 ffff8800c92d8b80 ffff8800c88d5fd8
ffff8800c88d4000 ffff8800c88d4000 ffff8800c88d4000 ffff8800c88d5fd8
Call Trace:
[<ffffffff8107fdea>] ? finish_task_switch+0x3a/0x110
[<ffffffff8167ebe6>] ? _raw_spin_unlock_irq+0x46/0x70
[<ffffffff8107fe1f>] ? finish_task_switch+0x6f/0x110
[<ffffffff8107fdea>] ? finish_task_switch+0x3a/0x110
[<ffffffff8167b8e5>] schedule_timeout+0x295/0x310
[<ffffffff816826b9>] ? sub_preempt_count+0xa9/0xe0
[<ffffffff8167a650>] wait_for_common+0x120/0x170
[<ffffffff810898c0>] ? try_to_wake_up+0x360/0x360
[<ffffffff8167a748>] wait_for_completion+0x18/0x20
[<ffffffff810aba4c>] wait_on_cpu_work+0xec/0x100
[<ffffffff810aa460>] ? do_work_for_cpu+0x30/0x30
[<ffffffff810abb3b>] wait_on_work+0xdb/0x150
[<ffffffff810aba60>] ? wait_on_cpu_work+0x100/0x100
[<ffffffff810abc33>] __cancel_work_timer+0x83/0x130
[<ffffffff810abced>] cancel_delayed_work_sync+0xd/0x10
[<ffffffff813b24b4>] blk_sync_queue+0x24/0x50
[<ffffffff813b24ef>] blk_cleanup_queue+0xf/0x60
[<ffffffff81479a89>] scsi_free_queue+0x9/0x10
[<ffffffff8147d30b>] scsi_device_dev_release_usercontext+0xeb/0x140
[<ffffffff810cbd5d>] ? trace_hardirqs_on_caller+0x14d/0x190
[<ffffffff8147d220>] ? scsi_device_cls_release+0x20/0x20
[<ffffffff810ac826>] execute_in_process_context+0x86/0xa0
[<ffffffff8147d1f7>] scsi_device_dev_release+0x17/0x20
[<ffffffff814609f2>] device_release+0x22/0x90
[<ffffffff8167ef98>] ? retint_restore_args+0x13/0x13
[<ffffffff813c8165>] kobject_release+0x45/0x90
[<ffffffff813c8120>] ? kobject_del+0x40/0x40
[<ffffffff813c9767>] kref_put+0x37/0x70
[<ffffffff813c8027>] kobject_put+0x27/0x60
[<ffffffff81460822>] put_device+0x12/0x20
[<ffffffff81478bd9>] scsi_request_fn+0xb9/0x4a0
[<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
[<ffffffff813b1f66>] blk_delay_work+0x26/0x40
[<ffffffff810aa9c7>] process_one_work+0x197/0x520
[<ffffffff810aa961>] ? process_one_work+0x131/0x520
[<ffffffff813b1f40>] ? blk_make_request+0x90/0x90
[<ffffffff810acfec>] worker_thread+0x15c/0x330
[<ffffffff810ace90>] ? manage_workers.clone.20+0x240/0x240
[<ffffffff810ace90>] ? manage_workers.clone.20+0x240/0x240
[<ffffffff810b1f16>] kthread+0xa6/0xb0
[<ffffffff81687064>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fe1f>] ? finish_task_switch+0x6f/0x110
[<ffffffff8167ebe6>] ? _raw_spin_unlock_irq+0x46/0x70
[<ffffffff8167ef98>] ? retint_restore_args+0x13/0x13
[<ffffffff810b1e70>] ? __init_kthread_worker+0x70/0x70
[<ffffffff81687060>] ? gs_change+0x13/0x13
INFO: lockdep is turned off.


2011-04-11 17:18:12

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

Hello,

(cc'ing James. The original message is http://lkml.org/lkml/2011/4/11/175 )

Please read from the bottom up.

On Mon, Apr 11, 2011 at 03:56:03PM +0100, Steven Whitehouse wrote:
> [<ffffffff8167b8e5>] schedule_timeout+0x295/0x310
> [<ffffffff8167a650>] wait_for_common+0x120/0x170
> [<ffffffff8167a748>] wait_for_completion+0x18/0x20
> [<ffffffff810aba4c>] wait_on_cpu_work+0xec/0x100
> [<ffffffff810abb3b>] wait_on_work+0xdb/0x150
> [<ffffffff810abc33>] __cancel_work_timer+0x83/0x130
> [<ffffffff810abced>] cancel_delayed_work_sync+0xd/0x10

4. which in turn tries to sync cancel q->delay_work. Oops, deadlock.

> [<ffffffff813b24b4>] blk_sync_queue+0x24/0x50

3. and calls into blk_sync_queue()

> [<ffffffff813b24ef>] blk_cleanup_queue+0xf/0x60
> [<ffffffff81479a89>] scsi_free_queue+0x9/0x10
> [<ffffffff8147d30b>] scsi_device_dev_release_usercontext+0xeb/0x140
> [<ffffffff810ac826>] execute_in_process_context+0x86/0xa0

2. It triggers SCSI device release

> [<ffffffff8147d1f7>] scsi_device_dev_release+0x17/0x20
> [<ffffffff814609f2>] device_release+0x22/0x90
> [<ffffffff813c8165>] kobject_release+0x45/0x90
> [<ffffffff813c9767>] kref_put+0x37/0x70
> [<ffffffff813c8027>] kobject_put+0x27/0x60
> [<ffffffff81460822>] put_device+0x12/0x20
> [<ffffffff81478bd9>] scsi_request_fn+0xb9/0x4a0
> [<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
> [<ffffffff813b1f66>] blk_delay_work+0x26/0x40

1. Workqueue starting execution of q->delay_work and scsi_request_fn()
is run from there.

> [<ffffffff810aa9c7>] process_one_work+0x197/0x520
> [<ffffffff810acfec>] worker_thread+0x15c/0x330
> [<ffffffff810b1f16>] kthread+0xa6/0xb0
> [<ffffffff81687064>] kernel_thread_helper+0x4/0x10

So, q->delay_work ends up waiting for itself. I'd like to blame SCSI
(as it also fits my agenda to kill execute_in_process_context ;-) for
diving all the way into blk_cleanup_queue() directly from request_fn.

Does the following patch fix the problem?

Thanks.

Subject: scsi: don't use execute_in_process_context()

SCSI is the only subsystem which uses execute_in_process_context(), and
its use is racy against module unload, i.e. the reap work is not
properly flushed and could still be running after the scsi module is
unloaded.

Although execute_in_process_context() can be more efficient when the
caller already has a context, in this case, the call paths are quite
cold and the difference is practically meaningless. With commit
c8efcc25 (workqueue: allow chained queueing during destruction), the
race condition can easily be fixed by using a dedicated workqueue and
destroying it on module unload.

Create and use scsi_wq instead of execute_in_process_context().

* scsi_device->ew is replaced with release_work. scsi_target->ew is
replaced with reap_work.

* Both works are initialized with the respective release/reap function
during device/target init. scsi_target_reap_usercontext() is moved
upwards to avoid needing forward declaration.

* scsi_alloc_target() now explicitly flushes the reap_work of the
found dying target before putting it instead of depending on
flush_scheduled_work().

For more info on the issues, please read the following thread.

http://thread.gmane.org/gmane.linux.scsi/62923

Signed-off-by: Tejun Heo <[email protected]>
---
 drivers/scsi/scsi.c        |   15 +++++++++++++--
 drivers/scsi/scsi_scan.c   |   26 +++++++++++++-------------
 drivers/scsi/scsi_sysfs.c  |    8 +++++---
 include/scsi/scsi_device.h |    6 ++++--
 4 files changed, 35 insertions(+), 20 deletions(-)

Index: work/drivers/scsi/scsi_scan.c
===================================================================
--- work.orig/drivers/scsi/scsi_scan.c
+++ work/drivers/scsi/scsi_scan.c
@@ -362,6 +362,16 @@ int scsi_is_target_device(const struct d
 }
 EXPORT_SYMBOL(scsi_is_target_device);
 
+static void scsi_target_reap_usercontext(struct work_struct *work)
+{
+        struct scsi_target *starget =
+                container_of(work, struct scsi_target, reap_work);
+
+        transport_remove_device(&starget->dev);
+        device_del(&starget->dev);
+        scsi_target_destroy(starget);
+}
+
 static struct scsi_target *__scsi_find_target(struct device *parent,
                 int channel, uint id)
 {
@@ -427,6 +437,7 @@ static struct scsi_target *scsi_alloc_ta
         starget->state = STARGET_CREATED;
         starget->scsi_level = SCSI_2;
         starget->max_target_blocked = SCSI_DEFAULT_TARGET_BLOCKED;
+        INIT_WORK(&starget->reap_work, scsi_target_reap_usercontext);
 retry:
         spin_lock_irqsave(shost->host_lock, flags);
 
@@ -462,21 +473,11 @@ static struct scsi_target *scsi_alloc_ta
         }
         /* Unfortunately, we found a dying target; need to
          * wait until it's dead before we can get a new one */
+        flush_work(&found_target->reap_work);
         put_device(&found_target->dev);
-        flush_scheduled_work();
         goto retry;
 }
 
-static void scsi_target_reap_usercontext(struct work_struct *work)
-{
-        struct scsi_target *starget =
-                container_of(work, struct scsi_target, ew.work);
-
-        transport_remove_device(&starget->dev);
-        device_del(&starget->dev);
-        scsi_target_destroy(starget);
-}
-
 /**
  * scsi_target_reap - check to see if target is in use and destroy if not
  * @starget: target to be checked
@@ -507,8 +508,7 @@ void scsi_target_reap(struct scsi_target
         if (state == STARGET_CREATED)
                 scsi_target_destroy(starget);
         else
-                execute_in_process_context(scsi_target_reap_usercontext,
-                                           &starget->ew);
+                queue_work(scsi_wq, &starget->reap_work);
 }
 
 /**
Index: work/drivers/scsi/scsi_sysfs.c
===================================================================
--- work.orig/drivers/scsi/scsi_sysfs.c
+++ work/drivers/scsi/scsi_sysfs.c
@@ -300,7 +300,7 @@ static void scsi_device_dev_release_user
         struct list_head *this, *tmp;
         unsigned long flags;
 
-        sdev = container_of(work, struct scsi_device, ew.work);
+        sdev = container_of(work, struct scsi_device, release_work);
 
         parent = sdev->sdev_gendev.parent;
         starget = to_scsi_target(parent);
@@ -343,8 +343,8 @@ static void scsi_device_dev_release_user
 static void scsi_device_dev_release(struct device *dev)
 {
         struct scsi_device *sdp = to_scsi_device(dev);
-        execute_in_process_context(scsi_device_dev_release_usercontext,
-                                   &sdp->ew);
+
+        queue_work(scsi_wq, &sdp->release_work);
 }
 
 static struct class sdev_class = {
@@ -1069,6 +1069,8 @@ void scsi_sysfs_device_initialize(struct
         dev_set_name(&sdev->sdev_dev, "%d:%d:%d:%d",
                      sdev->host->host_no, sdev->channel, sdev->id, sdev->lun);
         sdev->scsi_level = starget->scsi_level;
+        INIT_WORK(&sdev->release_work, scsi_device_dev_release_usercontext);
+
         transport_setup_device(&sdev->sdev_gendev);
         spin_lock_irqsave(shost->host_lock, flags);
         list_add_tail(&sdev->same_target_siblings, &starget->devices);
Index: work/include/scsi/scsi_device.h
===================================================================
--- work.orig/include/scsi/scsi_device.h
+++ work/include/scsi/scsi_device.h
@@ -168,7 +168,7 @@ struct scsi_device {
         struct device           sdev_gendev,
                                 sdev_dev;
 
-        struct execute_work     ew; /* used to get process context on put */
+        struct work_struct      release_work; /* for process context on put */
 
         struct scsi_dh_data     *scsi_dh_data;
         enum scsi_device_state sdev_state;
@@ -259,7 +259,7 @@ struct scsi_target {
 #define SCSI_DEFAULT_TARGET_BLOCKED     3
 
         char                    scsi_level;
-        struct execute_work     ew;
+        struct work_struct      reap_work;
         enum scsi_target_state  state;
         void                    *hostdata; /* available to low-level driver */
         unsigned long           starget_data[0]; /* for the transport */
@@ -277,6 +277,8 @@ static inline struct scsi_target *scsi_t
 #define starget_printk(prefix, starget, fmt, a...)      \
         dev_printk(prefix, &(starget)->dev, fmt, ##a)
 
+extern struct workqueue_struct *scsi_wq;
+
 extern struct scsi_device *__scsi_add_device(struct Scsi_Host *,
                 uint, uint, uint, void *hostdata);
 extern int scsi_add_device(struct Scsi_Host *host, uint channel,
Index: work/drivers/scsi/scsi.c
===================================================================
--- work.orig/drivers/scsi/scsi.c
+++ work/drivers/scsi/scsi.c
@@ -70,6 +70,11 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/scsi.h>
 
+/*
+ * Utility multithreaded workqueue for SCSI.
+ */
+struct workqueue_struct *scsi_wq;
+
 static void scsi_done(struct scsi_cmnd *cmd);
 
 /*
@@ -1306,11 +1311,14 @@ MODULE_PARM_DESC(scsi_logging_level, "a
 
 static int __init init_scsi(void)
 {
-        int error;
+        int error = -ENOMEM;
 
+        scsi_wq = alloc_workqueue("scsi", 0, 0);
+        if (!scsi_wq)
+                return error;
         error = scsi_init_queue();
         if (error)
-                return error;
+                goto cleanup_wq;
         error = scsi_init_procfs();
         if (error)
                 goto cleanup_queue;
@@ -1342,6 +1350,8 @@ cleanup_procfs:
         scsi_exit_procfs();
 cleanup_queue:
         scsi_exit_queue();
+cleanup_wq:
+        destroy_workqueue(scsi_wq);
         printk(KERN_ERR "SCSI subsystem failed to initialize, error = %d\n",
                -error);
         return error;
@@ -1356,6 +1366,7 @@ static void __exit exit_scsi(void)
         scsi_exit_devinfo();
         scsi_exit_procfs();
         scsi_exit_queue();
+        destroy_workqueue(scsi_wq);
 }
 
 subsys_initcall(init_scsi);

2011-04-11 17:29:18

by Jens Axboe

Subject: Re: Strange block/scsi/workqueue issue

On 2011-04-11 19:18, Tejun Heo wrote:
> Hello,
>
> (cc'ing James. The original message is http://lkml.org/lkml/2011/4/11/175 )
>
> Please read from the bottom up.
>
> On Mon, Apr 11, 2011 at 03:56:03PM +0100, Steven Whitehouse wrote:
>> [<ffffffff8167b8e5>] schedule_timeout+0x295/0x310
>> [<ffffffff8167a650>] wait_for_common+0x120/0x170
>> [<ffffffff8167a748>] wait_for_completion+0x18/0x20
>> [<ffffffff810aba4c>] wait_on_cpu_work+0xec/0x100
>> [<ffffffff810abb3b>] wait_on_work+0xdb/0x150
>> [<ffffffff810abc33>] __cancel_work_timer+0x83/0x130
>> [<ffffffff810abced>] cancel_delayed_work_sync+0xd/0x10
>
> 4. which in turn tries to sync cancel q->delay_work. Oops, deadlock.
>
>> [<ffffffff813b24b4>] blk_sync_queue+0x24/0x50
>
> 3. and calls into blk_sync_queue()
>
>> [<ffffffff813b24ef>] blk_cleanup_queue+0xf/0x60
>> [<ffffffff81479a89>] scsi_free_queue+0x9/0x10
>> [<ffffffff8147d30b>] scsi_device_dev_release_usercontext+0xeb/0x140
>> [<ffffffff810ac826>] execute_in_process_context+0x86/0xa0
>
> 2. It triggers SCSI device release
>
>> [<ffffffff8147d1f7>] scsi_device_dev_release+0x17/0x20
>> [<ffffffff814609f2>] device_release+0x22/0x90
>> [<ffffffff813c8165>] kobject_release+0x45/0x90
>> [<ffffffff813c9767>] kref_put+0x37/0x70
>> [<ffffffff813c8027>] kobject_put+0x27/0x60
>> [<ffffffff81460822>] put_device+0x12/0x20
>> [<ffffffff81478bd9>] scsi_request_fn+0xb9/0x4a0
>> [<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
>> [<ffffffff813b1f66>] blk_delay_work+0x26/0x40
>
> 1. Workqueue starting execution of q->delay_work and scsi_request_fn()
> is run from there.
>
>> [<ffffffff810aa9c7>] process_one_work+0x197/0x520
>> [<ffffffff810acfec>] worker_thread+0x15c/0x330
>> [<ffffffff810b1f16>] kthread+0xa6/0xb0
>> [<ffffffff81687064>] kernel_thread_helper+0x4/0x10
>
> So, q->delay_work ends up waiting for itself. I'd like to blame SCSI
> (as it also fits my agenda to kill execute_in_process_context ;-) for
> diving all the way into blk_cleanup_queue() directly from request_fn.
>
> Does the following patch fix the problem?

Thanks, that looks a lot saner. This is/was a time bomb waiting to blow
up.

--
Jens Axboe

2011-04-11 17:49:24

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 02:18 +0900, Tejun Heo wrote:
> Hello,
>
> (cc'ing James. The original message is http://lkml.org/lkml/2011/4/11/175 )
>
> Please read from the bottom up.
>
> On Mon, Apr 11, 2011 at 03:56:03PM +0100, Steven Whitehouse wrote:
> > [<ffffffff8167b8e5>] schedule_timeout+0x295/0x310
> > [<ffffffff8167a650>] wait_for_common+0x120/0x170
> > [<ffffffff8167a748>] wait_for_completion+0x18/0x20
> > [<ffffffff810aba4c>] wait_on_cpu_work+0xec/0x100
> > [<ffffffff810abb3b>] wait_on_work+0xdb/0x150
> > [<ffffffff810abc33>] __cancel_work_timer+0x83/0x130
> > [<ffffffff810abced>] cancel_delayed_work_sync+0xd/0x10
>
> 4. which in turn tries to sync cancel q->delay_work. Oops, deadlock.
>
> > [<ffffffff813b24b4>] blk_sync_queue+0x24/0x50
>
> 3. and calls into blk_sync_queue()
>
> > [<ffffffff813b24ef>] blk_cleanup_queue+0xf/0x60
> > [<ffffffff81479a89>] scsi_free_queue+0x9/0x10
> > [<ffffffff8147d30b>] scsi_device_dev_release_usercontext+0xeb/0x140
> > [<ffffffff810ac826>] execute_in_process_context+0x86/0xa0
>
> 2. It triggers SCSI device release
>
> > [<ffffffff8147d1f7>] scsi_device_dev_release+0x17/0x20
> > [<ffffffff814609f2>] device_release+0x22/0x90
> > [<ffffffff813c8165>] kobject_release+0x45/0x90
> > [<ffffffff813c9767>] kref_put+0x37/0x70
> > [<ffffffff813c8027>] kobject_put+0x27/0x60
> > [<ffffffff81460822>] put_device+0x12/0x20
> > [<ffffffff81478bd9>] scsi_request_fn+0xb9/0x4a0
> > [<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
> > [<ffffffff813b1f66>] blk_delay_work+0x26/0x40
>
> 1. Workqueue starting execution of q->delay_work and scsi_request_fn()
> is run from there.
>
> > [<ffffffff810aa9c7>] process_one_work+0x197/0x520
> > [<ffffffff810acfec>] worker_thread+0x15c/0x330
> > [<ffffffff810b1f16>] kthread+0xa6/0xb0
> > [<ffffffff81687064>] kernel_thread_helper+0x4/0x10
>
> So, q->delay_work ends up waiting for itself. I'd like to blame SCSI
> (as it also fits my agenda to kill execute_in_process_context ;-) for
> diving all the way into blk_cleanup_queue() directly from request_fn.
>
> Does the following patch fix the problem?
>
Unfortunately not:

scsi 0:0:32:0: Enclosure DP BACKPLANE 1.07 PQ: 0 ANSI: 5
scsi 0:2:0:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
scsi 0:2:1:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
------------[ cut here ]------------
WARNING: at lib/kref.c:34 kref_get+0x2d/0x30()
Hardware name: PowerEdge R710
Modules linked in:
Pid: 12, comm: kworker/2:0 Not tainted 2.6.39-rc2+ #188
Call Trace:
[<ffffffff8108fa9a>] warn_slowpath_common+0x7a/0xb0
[<ffffffff8108fae5>] warn_slowpath_null+0x15/0x20
[<ffffffff813c97cd>] kref_get+0x2d/0x30
[<ffffffff813c81ca>] kobject_get+0x1a/0x30
[<ffffffff814607f4>] get_device+0x14/0x20
[<ffffffff81478b57>] scsi_request_fn+0x37/0x4a0
[<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
[<ffffffff813b1f66>] blk_delay_work+0x26/0x40
[<ffffffff810aa9c7>] process_one_work+0x197/0x520
[<ffffffff810aa961>] ? process_one_work+0x131/0x520
[<ffffffff813b1f40>] ? blk_make_request+0x90/0x90
[<ffffffff810acfec>] worker_thread+0x15c/0x330
[<ffffffff810ace90>] ? manage_workers.clone.20+0x240/0x240
[<ffffffff810ace90>] ? manage_workers.clone.20+0x240/0x240
[<ffffffff810b1f16>] kthread+0xa6/0xb0
[<ffffffff816870e4>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fe1f>] ? finish_task_switch+0x6f/0x110
[<ffffffff8167ec66>] ? _raw_spin_unlock_irq+0x46/0x70
[<ffffffff8167f018>] ? retint_restore_args+0x13/0x13
[<ffffffff810b1e70>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816870e0>] ? gs_change+0x13/0x13
---[ end trace 3681e9da2630a94b ]---

Steve.

2011-04-12 00:14:36

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

Hello,

On Mon, Apr 11, 2011 at 06:52:10PM +0100, Steven Whitehouse wrote:
> WARNING: at lib/kref.c:34 kref_get+0x2d/0x30()
> Hardware name: PowerEdge R710
> Modules linked in:
> Pid: 12, comm: kworker/2:0 Not tainted 2.6.39-rc2+ #188
> Call Trace:
> [<ffffffff8108fa9a>] warn_slowpath_common+0x7a/0xb0
> [<ffffffff8108fae5>] warn_slowpath_null+0x15/0x20
> [<ffffffff813c97cd>] kref_get+0x2d/0x30
> [<ffffffff813c81ca>] kobject_get+0x1a/0x30
> [<ffffffff814607f4>] get_device+0x14/0x20
> [<ffffffff81478b57>] scsi_request_fn+0x37/0x4a0
> [<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
> [<ffffffff813b1f66>] blk_delay_work+0x26/0x40
> [<ffffffff810aa9c7>] process_one_work+0x197/0x520
> [<ffffffff810acfec>] worker_thread+0x15c/0x330
> [<ffffffff810b1f16>] kthread+0xa6/0xb0
> [<ffffffff816870e4>] kernel_thread_helper+0x4/0x10
> ---[ end trace 3681e9da2630a94b ]---

Hmm, it could be that the root cause of the problem is a
premature/double put of the scsi_device. Without the patch, that makes
scsi_request_fn() call into the device destruction path prematurely,
triggering the deadlock; with the patch, the deadlock is gone, but the
refcount reaches zero prematurely, triggering the kref warning on the
next request.

The problem doesn't seem widespread so something about the setup is
peculiar. Steven, can you please detail the setup (and steps needed
to trigger the problem) and attach the full boot log? James, any
ideas?

Thanks.

--
tejun

2011-04-12 00:48:03

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 02:18 +0900, Tejun Heo wrote:
> Hello,
>
> (cc'ing James. The original message is http://lkml.org/lkml/2011/4/11/175 )
>
> Please read from the bottom up.
>
> On Mon, Apr 11, 2011 at 03:56:03PM +0100, Steven Whitehouse wrote:
> > [<ffffffff8167b8e5>] schedule_timeout+0x295/0x310
> > [<ffffffff8167a650>] wait_for_common+0x120/0x170
> > [<ffffffff8167a748>] wait_for_completion+0x18/0x20
> > [<ffffffff810aba4c>] wait_on_cpu_work+0xec/0x100
> > [<ffffffff810abb3b>] wait_on_work+0xdb/0x150
> > [<ffffffff810abc33>] __cancel_work_timer+0x83/0x130
> > [<ffffffff810abced>] cancel_delayed_work_sync+0xd/0x10
>
> 4. which in turn tries to sync cancel q->delay_work. Oops, deadlock.
>
> > [<ffffffff813b24b4>] blk_sync_queue+0x24/0x50
>
> 3. and calls into blk_sync_queue()
>
> > [<ffffffff813b24ef>] blk_cleanup_queue+0xf/0x60
> > [<ffffffff81479a89>] scsi_free_queue+0x9/0x10
> > [<ffffffff8147d30b>] scsi_device_dev_release_usercontext+0xeb/0x140
> > [<ffffffff810ac826>] execute_in_process_context+0x86/0xa0
>
> 2. It triggers SCSI device release
>
> > [<ffffffff8147d1f7>] scsi_device_dev_release+0x17/0x20
> > [<ffffffff814609f2>] device_release+0x22/0x90
> > [<ffffffff813c8165>] kobject_release+0x45/0x90
> > [<ffffffff813c9767>] kref_put+0x37/0x70
> > [<ffffffff813c8027>] kobject_put+0x27/0x60
> > [<ffffffff81460822>] put_device+0x12/0x20
> > [<ffffffff81478bd9>] scsi_request_fn+0xb9/0x4a0
> > [<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
> > [<ffffffff813b1f66>] blk_delay_work+0x26/0x40
>
> 1. Workqueue starting execution of q->delay_work and scsi_request_fn()
> is run from there.
>
> > [<ffffffff810aa9c7>] process_one_work+0x197/0x520
> > [<ffffffff810acfec>] worker_thread+0x15c/0x330
> > [<ffffffff810b1f16>] kthread+0xa6/0xb0
> > [<ffffffff81687064>] kernel_thread_helper+0x4/0x10
>
> So, q->delay_work ends up waiting for itself. I'd like to blame SCSI
> (as it also fits my agenda to kill execute_in_process_context ;-) for
> diving all the way into blk_cleanup_queue() directly from request_fn.

Actually, I don't think it's anything to do with the user process stuff.
The problem seems to be that the block delay function ends up being the
last user of the SCSI device, so it does the final put of the sdev when
it's finished processing. This will trigger queue destruction
(blk_cleanup_queue) and so on with your analysis.

The problem seems to be that with the new workqueue changes, the queue
itself may no longer be the last holder of a reference on the sdev
because the queue destruction is in the sdev release function and a
queue cannot now be destroyed from its own delayed work. This is a bit
contrary to the principles SCSI was using, which was that we drive queue
lifetime from the sdev, not vice versa.

The obvious fix seems to be to move queue destruction earlier, but I'm
loth to do that because it will get us back into the old situation where
we no longer have a queue to do the teardown work.

How about moving the blk_sync_queue() call out of blk_cleanup_queue()?
Since that's the direct cause of this.

James

2011-04-12 02:51:54

by Tejun Heo

[permalink] [raw]
Subject: Re: Strange block/scsi/workqueue issue

Hello, James.

On Mon, Apr 11, 2011 at 07:47:56PM -0500, James Bottomley wrote:
> Actually, I don't think it's anything to do with the user process stuff.
> The problem seems to be that the block delay function ends up being the
> last user of the SCSI device, so it does the final put of the sdev when
> it's finished processing. This will trigger queue destruction
> (blk_cleanup_queue) and so on with your analysis.

Hmm... this I can understand.

> The problem seems to be that with the new workqueue changes, the queue
> itself may no longer be the last holder of a reference on the sdev
> because the queue destruction is in the sdev release function and a
> queue cannot now be destroyed from its own delayed work. This is a bit
> contrary to the principles SCSI was using, which was that we drive queue
> lifetime from the sdev, not vice versa.

But I'm confused here. Why does it make any difference whether the
release operation is in the request_fn context or not? What makes
SCSI refcounting different from others?

Thanks.

--
tejun

2011-04-12 04:49:24

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 11:51 +0900, Tejun Heo wrote:
> Hello, James.
>
> On Mon, Apr 11, 2011 at 07:47:56PM -0500, James Bottomley wrote:
> > Actually, I don't think it's anything to do with the user process stuff.
> > The problem seems to be that the block delay function ends up being the
> > last user of the SCSI device, so it does the final put of the sdev when
> > it's finished processing. This will trigger queue destruction
> > (blk_cleanup_queue) and so on with your analysis.
>
> Hmm... this I can understand.
>
> > The problem seems to be that with the new workqueue changes, the queue
> > itself may no longer be the last holder of a reference on the sdev
> > because the queue destruction is in the sdev release function and a
> > queue cannot now be destroyed from its own delayed work. This is a bit
> > contrary to the principles SCSI was using, which was that we drive queue
> > lifetime from the sdev, not vice versa.
>
> But confused here. Why does it make any difference whether the
> release operation is in the request_fn context or not? What makes
> SCSI refcounting different from others?

I didn't say it did. SCSI refcounting is fairly standard.

The problem isn't really anything to do with SCSI ... it's the way block
queue destruction must now be called. The block queue destruction
includes a synchronous flush of the work queue. That means it can't be
called from the executing workqueue without deadlocking. The last put
of a SCSI device destroys the queue. This now means that the last put
of the SCSI device can't be in the block delay work path. However, as
the device shuts down that can very well wind up happening if
blk_delay_queue() ends up being called as the device is dying.

The entangled deadlock seems to have been introduced by commit
3cca6dc1c81e2407928dc4c6105252146fd3924f; prior to that, there was no
synchronous cancel in the destroy path.

A fix might be to shunt more stuff off to workqueues, but that's
producing a more complex system which would be prone to entanglements
that would be even harder to spot.

Perhaps a better solution is just not to use sync cancellations in
block? As long as the work in the queue holds a queue ref, they can be
done asynchronously.

James

2011-04-12 05:02:55

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Mon, 2011-04-11 at 23:49 -0500, James Bottomley wrote:
> The entangled deadlock seems to have been introduced by commit
> 3cca6dc1c81e2407928dc4c6105252146fd3924f; prior to that, there was no
> synchronous cancel in the destroy path.
>
> A fix might be to shunt more stuff off to workqueues, but that's
> producing a more complex system which would be prone to entanglements
> that would be even harder to spot.
>
> Perhaps a better solution is just not to use sync cancellations in
> block? As long as the work in the queue holds a queue ref, they can be
> done asynchronously.

So this is a possible implementation, does this fix the problem?
(compile tested only).

James

---

diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..f600f88 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -219,6 +219,7 @@ static void blk_delay_work(struct work_struct *work)
         spin_lock_irq(q->queue_lock);
         __blk_run_queue(q, false);
         spin_unlock_irq(q->queue_lock);
+        blk_put_queue(q);
 }
 
 /**
@@ -233,7 +234,8 @@ static void blk_delay_work(struct work_struct *work)
  */
 void blk_delay_queue(struct request_queue *q, unsigned long msecs)
 {
-        schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs));
+        if (!blk_get_queue(q))
+                schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs));
 }
 EXPORT_SYMBOL(blk_delay_queue);
 
@@ -271,7 +273,8 @@ EXPORT_SYMBOL(blk_start_queue);
  **/
 void blk_stop_queue(struct request_queue *q)
 {
-        __cancel_delayed_work(&q->delay_work);
+        if (__cancel_delayed_work(&q->delay_work))
+                blk_put_queue(q);
         queue_flag_set(QUEUE_FLAG_STOPPED, q);
 }
 EXPORT_SYMBOL(blk_stop_queue);
@@ -297,7 +300,8 @@ EXPORT_SYMBOL(blk_stop_queue);
 void blk_sync_queue(struct request_queue *q)
 {
         del_timer_sync(&q->timeout);
-        cancel_delayed_work_sync(&q->delay_work);
+        if (__cancel_delayed_work(&q->delay_work))
+                blk_put_queue(q);
         queue_sync_plugs(q);
 }
 EXPORT_SYMBOL(blk_sync_queue);

2011-04-12 05:15:20

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

Hello,

On Mon, Apr 11, 2011 at 11:49:17PM -0500, James Bottomley wrote:
> > But confused here. Why does it make any difference whether the
> > release operation is in the request_fn context or not? What makes
> > SCSI refcounting different from others?
>
> I didn't say it did. SCSI refcounting is fairly standard.
>
> The problem isn't really anything to do with SCSI ... it's the way block
> queue destruction must now be called. The block queue destruction
> includes a synchronous flush of the work queue. That means it can't be
> called from the executing workqueue without deadlocking. The last put
> of a SCSI device destroys the queue. This now means that the last put
> of the SCSI device can't be in the block delay work path. However, as
> the device shuts down that can very well wind up happening if
> blk_delay_queue() ends up being called as the device is dying.

Yeah, I understood that part. I thought you were referring to the
problem caused by the proposed workqueue replacement in the patch when
you talked about workqueue-related issues in the previous message, and
saying that there's an inherent problem with using workqueue directly.

> The entangled deadlock seems to have been introduced by commit
> 3cca6dc1c81e2407928dc4c6105252146fd3924f; prior to that, there was no
> synchronous cancel in the destroy path.
>
> A fix might be to shunt more stuff off to workqueues, but that's
> producing a more complex system which would be prone to entanglements
> that would be even harder to spot.

I don't agree there. To me, the cause for entanglement seems to be
request_fn calling all the way through blocking destruction because it
detected that the final put was called with sleepable context. It's
just weird, and hard to anticipate, for request_fn to call directly
into the sleepable destruction path, whether or not it had sleepable
context. With the yet-to-be-debugged bug caused by the conversion aside,
I think simply using workqueue is the better solution.

> Perhaps a better solution is just not to use sync cancellations in
> block? As long as the work in the queue holds a queue ref, they can be
> done asynchronously.

Hmmm... maybe but at least I prefer doing explicit shutdown/draining
on destruction even if the base data structure is refcounted. Things
become much more predictable that way.

Thanks.

--
tejun

2011-04-12 08:39:28

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 00:02 -0500, James Bottomley wrote:
> On Mon, 2011-04-11 at 23:49 -0500, James Bottomley wrote:
> > The entangled deadlock seems to have been introduced by commit
> > 3cca6dc1c81e2407928dc4c6105252146fd3924f; prior to that, there was no
> > synchronous cancel in the destroy path.
> >
> > A fix might be to shunt more stuff off to workqueues, but that's
> > producing a more complex system which would be prone to entanglements
> > that would be even harder to spot.
> >
> > Perhaps a better solution is just not to use sync cancellations in
> > block? As long as the work in the queue holds a queue ref, they can be
> > done asynchronously.
>
> So this is a possible implementation; does it fix the problem?
> (Compile tested only.)
>
> James
>
> ---
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 90f22cc..f600f88 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -219,6 +219,7 @@ static void blk_delay_work(struct work_struct *work)
> spin_lock_irq(q->queue_lock);
> __blk_run_queue(q, false);
> spin_unlock_irq(q->queue_lock);
> + blk_put_queue(q);
> }
>
> /**
> @@ -233,7 +234,8 @@ static void blk_delay_work(struct work_struct *work)
> */
> void blk_delay_queue(struct request_queue *q, unsigned long msecs)
> {
> - schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs));
> + if (!blk_get_queue(q))
> + schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs));
> }
> EXPORT_SYMBOL(blk_delay_queue);
>
> @@ -271,7 +273,8 @@ EXPORT_SYMBOL(blk_start_queue);
> **/
> void blk_stop_queue(struct request_queue *q)
> {
> - __cancel_delayed_work(&q->delay_work);
> + if (__cancel_delayed_work(&q->delay_work))
> + blk_put_queue(q);
> queue_flag_set(QUEUE_FLAG_STOPPED, q);
> }
> EXPORT_SYMBOL(blk_stop_queue);
> @@ -297,7 +300,8 @@ EXPORT_SYMBOL(blk_stop_queue);
> void blk_sync_queue(struct request_queue *q)
> {
> del_timer_sync(&q->timeout);
> - cancel_delayed_work_sync(&q->delay_work);
> + if (__cancel_delayed_work(&q->delay_work))
> + blk_put_queue(q);
> queue_sync_plugs(q);
> }
> EXPORT_SYMBOL(blk_sync_queue);
>
>

Just done a couple of tests, with the following results:

#1. Including Tejun's patch:

scsi 0:0:32:0: Enclosure DP BACKPLANE 1.07 PQ: 0 ANSI: 5
scsi 0:2:0:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
scsi 0:2:1:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
------------[ cut here ]------------
WARNING: at lib/kref.c:34 kref_get+0x2d/0x30()
Hardware name: PowerEdge R710
Modules linked in:
Pid: 18, comm: kworker/4:0 Not tainted 2.6.39-rc2+ #189
Call Trace:
[<ffffffff8108fa9a>] warn_slowpath_common+0x7a/0xb0
[<ffffffff8108fae5>] warn_slowpath_null+0x15/0x20
[<ffffffff813c983d>] kref_get+0x2d/0x30
[<ffffffff813c823a>] kobject_get+0x1a/0x30
[<ffffffff81460864>] get_device+0x14/0x20
[<ffffffff81478bc7>] scsi_request_fn+0x37/0x4a0
[<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
[<ffffffff813b2481>] blk_delay_work+0x31/0x60
[<ffffffff810aa9c7>] process_one_work+0x197/0x520
[<ffffffff810aa961>] ? process_one_work+0x131/0x520
[<ffffffff813b2450>] ? blk_alloc_queue+0x10/0x10
[<ffffffff810acfec>] worker_thread+0x15c/0x330
[<ffffffff810ace90>] ? manage_workers.clone.20+0x240/0x240
[<ffffffff810ace90>] ? manage_workers.clone.20+0x240/0x240
[<ffffffff810b1f16>] kthread+0xa6/0xb0
[<ffffffff81687164>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fe1f>] ? finish_task_switch+0x6f/0x110
[<ffffffff8167ecd6>] ? _raw_spin_unlock_irq+0x46/0x70
[<ffffffff8167f098>] ? retint_restore_args+0x13/0x13
[<ffffffff810b1e70>] ? __init_kthread_worker+0x70/0x70
[<ffffffff81687160>] ? gs_change+0x13/0x13
---[ end trace c35781f847a41f31 ]---

#2 On its own:

scsi 0:0:32:0: Enclosure DP BACKPLANE 1.07 PQ: 0 ANSI: 5
scsi 0:2:0:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
scsi 0:2:1:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffff810ab4d3>] __queue_work+0x403/0x460
PGD 0
Oops: 0000 [#1] PREEMPT SMP
last sysfs file:
CPU 0
Modules linked in:

Pid: 3, comm: ksoftirqd/0 Not tainted 2.6.39-rc2+ #190 Dell Inc. PowerEdge R710/0N047H
RIP: 0010:[<ffffffff810ab4d3>] [<ffffffff810ab4d3>] __queue_work+0x403/0x460
RSP: 0018:ffff8800c90abb20 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff8800c56a8c58 RCX: ffff8800c56a8c60
RDX: ffff8800c56a8c60 RSI: 0000000000000000 RDI: ffff8800cb00e388
RBP: ffff8800c90abb70 R08: 0000000000000900 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: ffff8800cb00e340
R13: 0000000000000000 R14: ffff8800cb1d6700 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8800cb000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 0000000001c33000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ksoftirqd/0 (pid: 3, threadinfo ffff8800c90aa000, task ffff8800c90a80c0)
Stack:
ffff8800c90abb90 0000000000000086 0000000000000000 0000000000000086
ffff8800c90a80c0 ffff8800c92ebe00 ffff8800c56a8c58 ffff8800c88bac28
ffff8800c90abc30 ffff8800c565c080 ffff8800c90abb80 ffffffff810ab59d
Call Trace:
[<ffffffff810ab59d>] queue_work_on+0x1d/0x30
[<ffffffff810ac749>] queue_work+0x29/0x60
[<ffffffff810ac865>] queue_delayed_work+0x25/0x30
[<ffffffff813aff75>] __blk_run_queue+0xb5/0x110
[<ffffffff813b04ed>] blk_run_queue+0x2d/0x50
[<ffffffff8147715a>] scsi_run_queue+0xea/0x3f0
[<ffffffff814710cf>] ? __scsi_put_command+0x5f/0xa0
[<ffffffff814793cd>] scsi_next_command+0x3d/0x60
[<ffffffff81479602>] scsi_io_completion+0x1b2/0x630
[<ffffffff81470cd7>] scsi_finish_command+0xc7/0x130
[<ffffffff8147936f>] scsi_softirq_done+0x13f/0x160
[<ffffffff813b8252>] blk_done_softirq+0xa2/0xc0
[<ffffffff810970ac>] __do_softirq+0xdc/0x290
[<ffffffff81097395>] run_ksoftirqd+0x135/0x250
[<ffffffff81097260>] ? __do_softirq+0x290/0x290
[<ffffffff810b1f16>] kthread+0xa6/0xb0
[<ffffffff816870e4>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fe1f>] ? finish_task_switch+0x6f/0x110
[<ffffffff8167ec56>] ? _raw_spin_unlock_irq+0x46/0x70
[<ffffffff8167f018>] ? retint_restore_args+0x13/0x13
[<ffffffff810b1e70>] ? __init_kthread_worker+0x70/0x70
[<ffffffff816870e0>] ? gs_change+0x13/0x13
Code: 8b 44 24 48 49 8d 7c 24 48 48 83 e8 08 31 f6 eb 27 0f 1f 80 00 00 00 00 48 8b 00 49 89 c0 41 80 e0 00 a8 04 48 89 f0 49 0f 45 c0
8b 40 08 f6 00 10 74 14 48 8d 41 f8 48 8b 48 08 48 8d 50 08
RIP [<ffffffff810ab4d3>] __queue_work+0x403/0x460
RSP <ffff8800c90abb20>
CR2: 0000000000000008
---[ end trace 52fce2e9afb9bf59 ]---
Kernel panic - not syncing: Fatal exception in interrupt

These are both on the same base kernel as yesterday (i.e. Linus tree
from this time yesterday morning) plus a few GFS2 patches, but they
only affect the GFS2 module which is not loaded.

I don't do anything special to trigger the problem - it happens of its
own accord during normal boot. The initscripts are those from f14.

Steve.


2011-04-12 08:46:37

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 09:14 +0900, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 11, 2011 at 06:52:10PM +0100, Steven Whitehouse wrote:
> > WARNING: at lib/kref.c:34 kref_get+0x2d/0x30()
> > Hardware name: PowerEdge R710
> > Modules linked in:
> > Pid: 12, comm: kworker/2:0 Not tainted 2.6.39-rc2+ #188
> > Call Trace:
> > [<ffffffff8108fa9a>] warn_slowpath_common+0x7a/0xb0
> > [<ffffffff8108fae5>] warn_slowpath_null+0x15/0x20
> > [<ffffffff813c97cd>] kref_get+0x2d/0x30
> > [<ffffffff813c81ca>] kobject_get+0x1a/0x30
> > [<ffffffff814607f4>] get_device+0x14/0x20
> > [<ffffffff81478b57>] scsi_request_fn+0x37/0x4a0
> > [<ffffffff813aff2a>] __blk_run_queue+0x6a/0x110
> > [<ffffffff813b1f66>] blk_delay_work+0x26/0x40
> > [<ffffffff810aa9c7>] process_one_work+0x197/0x520
> > [<ffffffff810acfec>] worker_thread+0x15c/0x330
> > [<ffffffff810b1f16>] kthread+0xa6/0xb0
> > [<ffffffff816870e4>] kernel_thread_helper+0x4/0x10
> > ---[ end trace 3681e9da2630a94b ]---
>
> Hmm, it could be that the root cause of the problem is
> premature/double put of scsi_device. Without the patch, it makes
> scsi_request_fn() call into the device destruction path prematurely,
> triggering deadlock, while after the patch the deadlock is gone but
> the refcount reaches zero prematurely, triggering the kref warning on
> the next request.
>
> The problem doesn't seem widespread so something about the setup is
> peculiar. Steven, can you please detail the setup (and steps needed
> to trigger the problem) and attach the full boot log? James, any
> ideas?
>
> Thanks.
>
The hardware is as follows:

Dell R710 server with two 2GHz 4-core CPUs
Two 146G SAS disks with hardware mirroring as root/OS disk
Two 300G SAS disks with hardware mirroring as GFS2 test disk (note: not
mounted during boot process)
12G RAM (boots with mem=4G since I was originally running tests with
lower memory, but I don't think the memory size affects this at all)

I've attached the boot log from testing your patch yesterday. If you
want the boot logs including James' patch, or the original boot log, I
can send those too,

Steve.


Attachments:
tejun.txt (160.62 kB)

2011-04-12 13:42:36

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 09:42 +0100, Steven Whitehouse wrote:
> #2 On its own:

Right, that's how it's supposed to be applied.

> scsi 0:0:32:0: Enclosure DP BACKPLANE 1.07 PQ: 0 ANSI: 5
> scsi 0:2:0:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
> scsi 0:2:1:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> IP: [<ffffffff810ab4d3>] __queue_work+0x403/0x460
> PGD 0
> Oops: 0000 [#1] PREEMPT SMP
> last sysfs file:
> CPU 0
> Modules linked in:
>
> Pid: 3, comm: ksoftirqd/0 Not tainted 2.6.39-rc2+ #190 Dell Inc. PowerEdge R710/0N047H
> RIP: 0010:[<ffffffff810ab4d3>] [<ffffffff810ab4d3>] __queue_work+0x403/0x460
> RSP: 0018:ffff8800c90abb20 EFLAGS: 00010046
> RAX: 0000000000000000 RBX: ffff8800c56a8c58 RCX: ffff8800c56a8c60
> RDX: ffff8800c56a8c60 RSI: 0000000000000000 RDI: ffff8800cb00e388
> RBP: ffff8800c90abb70 R08: 0000000000000900 R09: 0000000000000000
> R10: 0000000000000001 R11: 0000000000000001 R12: ffff8800cb00e340
> R13: 0000000000000000 R14: ffff8800cb1d6700 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff8800cb000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000008 CR3: 0000000001c33000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process ksoftirqd/0 (pid: 3, threadinfo ffff8800c90aa000, task ffff8800c90a80c0)
> Stack:
> ffff8800c90abb90 0000000000000086 0000000000000000 0000000000000086
> ffff8800c90a80c0 ffff8800c92ebe00 ffff8800c56a8c58 ffff8800c88bac28
> ffff8800c90abc30 ffff8800c565c080 ffff8800c90abb80 ffffffff810ab59d
> Call Trace:
> [<ffffffff810ab59d>] queue_work_on+0x1d/0x30
> [<ffffffff810ac749>] queue_work+0x29/0x60
> [<ffffffff810ac865>] queue_delayed_work+0x25/0x30
> [<ffffffff813aff75>] __blk_run_queue+0xb5/0x110

OK, this is because the patch is incomplete ... it needs an additional
reference which I missed in the run-queue path ... can you apply the
updated patch below on top?

Actually, there are probably more places I missed, sigh.

> [<ffffffff813b04ed>] blk_run_queue+0x2d/0x50
> [<ffffffff8147715a>] scsi_run_queue+0xea/0x3f0
> [<ffffffff814710cf>] ? __scsi_put_command+0x5f/0xa0
> [<ffffffff814793cd>] scsi_next_command+0x3d/0x60
> [<ffffffff81479602>] scsi_io_completion+0x1b2/0x630
> [<ffffffff81470cd7>] scsi_finish_command+0xc7/0x130
> [<ffffffff8147936f>] scsi_softirq_done+0x13f/0x160
> [<ffffffff813b8252>] blk_done_softirq+0xa2/0xc0
> [<ffffffff810970ac>] __do_softirq+0xdc/0x290
> [<ffffffff81097395>] run_ksoftirqd+0x135/0x250
> [<ffffffff81097260>] ? __do_softirq+0x290/0x290
> [<ffffffff810b1f16>] kthread+0xa6/0xb0
> [<ffffffff816870e4>] kernel_thread_helper+0x4/0x10
> [<ffffffff8107fe1f>] ? finish_task_switch+0x6f/0x110
> [<ffffffff8167ec56>] ? _raw_spin_unlock_irq+0x46/0x70
> [<ffffffff8167f018>] ? retint_restore_args+0x13/0x13
> [<ffffffff810b1e70>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff816870e0>] ? gs_change+0x13/0x13
> Code: 8b 44 24 48 49 8d 7c 24 48 48 83 e8 08 31 f6 eb 27 0f 1f 80 00 00 00 00 48 8b 00 49 89 c0 41 80 e0 00 a8 04 48 89 f0 49 0f 45 c0
> 8b 40 08 f6 00 10 74 14 48 8d 41 f8 48 8b 48 08 48 8d 50 08
> RIP [<ffffffff810ab4d3>] __queue_work+0x403/0x460
> RSP <ffff8800c90abb20>
> CR2: 0000000000000008
> ---[ end trace 52fce2e9afb9bf59 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
>
> These are both on the same base kernel as yesterday (i.e. Linus tree
> from this time yesterday morning) plus a few GFS2 patches, but they
> only affect the GFS2 module which is not loaded.
>
> I don't do anything special to trigger the problem - it happens of its
> own accord during normal boot. The initscripts are those from f14.

That's useful ... at least it's not hard to reproduce ... it's just I
don't see anything like this in my boot sequence.

James

---

Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c
+++ linux-2.6/block/blk-core.c
@@ -328,7 +328,7 @@ void __blk_run_queue(struct request_queu
if (!force_kblockd && !queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
q->request_fn(q);
queue_flag_clear(QUEUE_FLAG_REENTER, q);
- } else
+ } else if (!blk_get_queue(q))
queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
}
EXPORT_SYMBOL(__blk_run_queue);

2011-04-12 14:04:07

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 08:42 -0500, James Bottomley wrote:
> On Tue, 2011-04-12 at 09:42 +0100, Steven Whitehouse wrote:
> > #2 On its own:
>
> Right, that's how it's supposed to be applied
>
> > scsi 0:0:32:0: Enclosure DP BACKPLANE 1.07 PQ: 0 ANSI: 5
> > scsi 0:2:0:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
> > scsi 0:2:1:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> > IP: [<ffffffff810ab4d3>] __queue_work+0x403/0x460
> > PGD 0
> > Oops: 0000 [#1] PREEMPT SMP
> > last sysfs file:
> > CPU 0
> > Modules linked in:
> >
> > Pid: 3, comm: ksoftirqd/0 Not tainted 2.6.39-rc2+ #190 Dell Inc. PowerEdge R710/0N047H
> > RIP: 0010:[<ffffffff810ab4d3>] [<ffffffff810ab4d3>] __queue_work+0x403/0x460
> > RSP: 0018:ffff8800c90abb20 EFLAGS: 00010046
> > RAX: 0000000000000000 RBX: ffff8800c56a8c58 RCX: ffff8800c56a8c60
> > RDX: ffff8800c56a8c60 RSI: 0000000000000000 RDI: ffff8800cb00e388
> > RBP: ffff8800c90abb70 R08: 0000000000000900 R09: 0000000000000000
> > R10: 0000000000000001 R11: 0000000000000001 R12: ffff8800cb00e340
> > R13: 0000000000000000 R14: ffff8800cb1d6700 R15: 0000000000000000
> > FS: 0000000000000000(0000) GS:ffff8800cb000000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000000008 CR3: 0000000001c33000 CR4: 00000000000006f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process ksoftirqd/0 (pid: 3, threadinfo ffff8800c90aa000, task ffff8800c90a80c0)
> > Stack:
> > ffff8800c90abb90 0000000000000086 0000000000000000 0000000000000086
> > ffff8800c90a80c0 ffff8800c92ebe00 ffff8800c56a8c58 ffff8800c88bac28
> > ffff8800c90abc30 ffff8800c565c080 ffff8800c90abb80 ffffffff810ab59d
> > Call Trace:
> > [<ffffffff810ab59d>] queue_work_on+0x1d/0x30
> > [<ffffffff810ac749>] queue_work+0x29/0x60
> > [<ffffffff810ac865>] queue_delayed_work+0x25/0x30
> > [<ffffffff813aff75>] __blk_run_queue+0xb5/0x110
>
> OK, this is because the patch is incomplete ... it needs an additional
> reference which I missed in the run-queue path ... can you apply the
> updated patch below on top?
>
> Actually, there are probably more places I missed, sigh.
>
> > [<ffffffff813b04ed>] blk_run_queue+0x2d/0x50
> > [<ffffffff8147715a>] scsi_run_queue+0xea/0x3f0
> > [<ffffffff814710cf>] ? __scsi_put_command+0x5f/0xa0
> > [<ffffffff814793cd>] scsi_next_command+0x3d/0x60
> > [<ffffffff81479602>] scsi_io_completion+0x1b2/0x630
> > [<ffffffff81470cd7>] scsi_finish_command+0xc7/0x130
> > [<ffffffff8147936f>] scsi_softirq_done+0x13f/0x160
> > [<ffffffff813b8252>] blk_done_softirq+0xa2/0xc0
> > [<ffffffff810970ac>] __do_softirq+0xdc/0x290
> > [<ffffffff81097395>] run_ksoftirqd+0x135/0x250
> > [<ffffffff81097260>] ? __do_softirq+0x290/0x290
> > [<ffffffff810b1f16>] kthread+0xa6/0xb0
> > [<ffffffff816870e4>] kernel_thread_helper+0x4/0x10
> > [<ffffffff8107fe1f>] ? finish_task_switch+0x6f/0x110
> > [<ffffffff8167ec56>] ? _raw_spin_unlock_irq+0x46/0x70
> > [<ffffffff8167f018>] ? retint_restore_args+0x13/0x13
> > [<ffffffff810b1e70>] ? __init_kthread_worker+0x70/0x70
> > [<ffffffff816870e0>] ? gs_change+0x13/0x13
> > Code: 8b 44 24 48 49 8d 7c 24 48 48 83 e8 08 31 f6 eb 27 0f 1f 80 00 00 00 00 48 8b 00 49 89 c0 41 80 e0 00 a8 04 48 89 f0 49 0f 45 c0
> > 8b 40 08 f6 00 10 74 14 48 8d 41 f8 48 8b 48 08 48 8d 50 08
> > RIP [<ffffffff810ab4d3>] __queue_work+0x403/0x460
> > RSP <ffff8800c90abb20>
> > CR2: 0000000000000008
> > ---[ end trace 52fce2e9afb9bf59 ]---
> > Kernel panic - not syncing: Fatal exception in interrupt
> >
> > These are both on the same base kernel as yesterday (i.e. Linus tree
> > from this time yesterday morning) plus a few GFS2 patches, but they
> > only affect the GFS2 module which is not loaded.
> >
> > I don't do anything special to trigger the problem - it happens of its
> > own accord during normal boot. The initscripts are those from f14.
>
> That's useful ... at least it's not hard to reproduce ... it's just I
> don't see anything like this in my boot sequence.
>
> James
>
> ---
>
> Index: linux-2.6/block/blk-core.c
> ===================================================================
> --- linux-2.6.orig/block/blk-core.c
> +++ linux-2.6/block/blk-core.c
> @@ -328,7 +328,7 @@ void __blk_run_queue(struct request_queu
> if (!force_kblockd && !queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
> q->request_fn(q);
> queue_flag_clear(QUEUE_FLAG_REENTER, q);
> - } else
> + } else if (!blk_get_queue(q))
> queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
> }
> EXPORT_SYMBOL(__blk_run_queue);
>
>

Messages attached - getting closer I think, but still not quite there.
It is very odd that so far (at least I've seen no other reports) I
seem to be the only one who hits this. I wonder what is different about
my kernel/hardware...

Steve.



Attachments:
james3.txt (26.45 kB)

2011-04-12 15:14:27

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 15:06 +0100, Steven Whitehouse wrote:
> Messages attached - getting closer I think, but still not quite there.
> It is very odd that so far (at least I've seen no other reports) I
> seem to be the only one who hits this. I wonder what is different about
> my kernel/hardware...


> csi: killing requests for dead queue
> scsi: killing requests for dead queue
> BUG: unable to handle kernel scsi: killing requests for dead queue
> NULL pointer dereference at 0000000000000028
> IP: [<ffffffff813b34ed>] blk_peek_request+0x1ad/0x220

Hmm, I don't see where this is ... it's a null pointer deref offset by
0x28 bytes ... but I can't see where ... can you resolve this to a line?

James

2011-04-12 15:15:22

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 14:15 +0900, Tejun Heo wrote:
> > A fix might be to shunt more stuff off to workqueues, but that's
> > producing a more complex system which would be prone to entanglements
> > that would be even harder to spot.
>
> I don't agree there. To me, the cause for entanglement seems to be
> request_fn calling all the way through blocking destruction because it
> detected that the final put was called with sleepable context. It's
> just weird, and hard to anticipate, for request_fn to call directly
> into the sleepable destruction path, whether or not it had sleepable
> context. With the yet-to-be-debugged bug caused by the conversion aside,
> I think simply using workqueue is the better solution.

So your idea is that all final puts should go through a workqueue? Like
I said, that would work, but it's not just SCSI ... any call path that
destroys a queue has to be audited.

The problem is nothing to do with sleeping context ... it's that any
work called by the block workqueue can't destroy that queue. In a
refcounted model, that's a bit nasty.

> > Perhaps a better solution is just not to use sync cancellations in
> > block? As long as the work in the queue holds a queue ref, they can be
> > done asynchronously.
>
> Hmmm... maybe but at least I prefer doing explicit shutdown/draining
> on destruction even if the base data structure is refcounted. Things
> become much more predictable that way.

It is pretty much instantaneous. Unless we're executing, we cancel the
work. If the work is already running, we just let it complete instead
of waiting for it.

Synchronous waits are dangerous because they cause entanglement.

James

2011-04-12 16:01:44

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 10:14 -0500, James Bottomley wrote:
> On Tue, 2011-04-12 at 15:06 +0100, Steven Whitehouse wrote:
> > Messages attached - getting closer I think, but still not quite there.
> > It is very odd that so far (at least I've seen no other reports) I
> > seem to be the only one who hits this. I wonder what is different about
> > my kernel/hardware...
>
>
> > csi: killing requests for dead queue
> > scsi: killing requests for dead queue
> > BUG: unable to handle kernel scsi: killing requests for dead queue
> > NULL pointer dereference at 0000000000000028
> > IP: [<ffffffff813b34ed>] blk_peek_request+0x1ad/0x220
>
> Hmm, I don't see where this is ... it's a null pointer deref offset by
> 0x28 bytes ... but I can't see where ... can you resolve this to a line?
>
> James
>
>

So assuming I've calculated this correctly, it should be just about....

/home/steve/linux-2.6/include/linux/rcupdate.h:677
6de8: a8 08 test $0x8,%al
6dea: 0f 84 65 ff ff ff je 6d55 <blk_peek_request+0xf5>
6df0: e8 00 00 00 00 callq 6df5 <blk_peek_request+0x195>
6df1: R_X86_64_PC32 preempt_schedule-0x4
6df5: e9 5b ff ff ff jmpq 6d55 <blk_peek_request+0xf5>
6dfa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
__elv_next_request():
/home/steve/linux-2.6/block/blk.h:65
if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
return rq;
}

if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
6e00: 49 8b 44 24 18 mov 0x18(%r12),%rax
6e05: 31 f6 xor %esi,%esi
6e07: 48 8b 00 mov (%rax),%rax
6e0a: 4c 89 e7 mov %r12,%rdi <---- here!
6e0d: ff 50 28 callq *0x28(%rax)
6e10: 85 c0 test %eax,%eax
6e12: 0f 85 f4 fe ff ff jne 6d0c <blk_peek_request+0xac>
6e18: 45 31 ed xor %r13d,%r13d
6e1b: e9 70 ff ff ff jmpq 6d90 <blk_peek_request+0x130>
blk_peek_request():
/home/steve/linux-2.6/block/blk-core.c:1912
* so that we don't add it again
*/

The offset I got was 6e0d, but I guess that's the start of the following instruction
by the time the fault is logged. Also I've attached my current .config just in case
that might be a help in reproducing it,

Steve.


Attachments:
test.config (85.21 kB)

2011-04-12 16:27:09

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 17:04 +0100, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2011-04-12 at 10:14 -0500, James Bottomley wrote:
> > On Tue, 2011-04-12 at 15:06 +0100, Steven Whitehouse wrote:
> > > Messages attached - getting closer I think, but still not quite there.
> > > It is very odd that so far (at least I've seen no other reports) that I
> > > seem to be the only one who hits this. I wonder what is different about
> > > my kernel/hardware...
> >
> >
> > > csi: killing requests for dead queue
> > > scsi: killing requests for dead queue
> > > BUG: unable to handle kernel scsi: killing requests for dead queue
> > > NULL pointer dereference at 0000000000000028
> > > IP: [<ffffffff813b34ed>] blk_peek_request+0x1ad/0x220
> >
> > Hmm, I don't see where this is ... it's a null pointer deref offset by
> > 0x28 bytes ... but I can't see where ... can you resolve this to a line?
> >
> > James
> >
> >
>
> So assuming I've calculated this correctly, it should be just about....
>
> /home/steve/linux-2.6/include/linux/rcupdate.h:677
> 6de8: a8 08 test $0x8,%al
> 6dea: 0f 84 65 ff ff ff je 6d55 <blk_peek_request+0xf5>
> 6df0: e8 00 00 00 00 callq 6df5 <blk_peek_request+0x195>
> 6df1: R_X86_64_PC32 preempt_schedule-0x4
> 6df5: e9 5b ff ff ff jmpq 6d55 <blk_peek_request+0xf5>
> 6dfa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> __elv_next_request():
> /home/steve/linux-2.6/block/blk.h:65
> if (!list_empty(&q->queue_head)) {
> rq = list_entry_rq(q->queue_head.next);
> return rq;
> }
>
> if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
> 6e00: 49 8b 44 24 18 mov 0x18(%r12),%rax
> 6e05: 31 f6 xor %esi,%esi
> 6e07: 48 8b 00 mov (%rax),%rax
> 6e0a: 4c 89 e7 mov %r12,%rdi <---- here!
> 6e0d: ff 50 28 callq *0x28(%rax)
> 6e10: 85 c0 test %eax,%eax
> 6e12: 0f 85 f4 fe ff ff jne 6d0c <blk_peek_request+0xac>
> 6e18: 45 31 ed xor %r13d,%r13d
> 6e1b: e9 70 ff ff ff jmpq 6d90 <blk_peek_request+0x130>
> blk_peek_request():
> /home/steve/linux-2.6/block/blk-core.c:1912
> * so that we don't add it again
> */
>
> > The offset I got was 6e0d, but I guess that's the start of the following instruction
> > by the time the fault is logged. Also I've attached my current .config just in case
> > that might be a help in reproducing it,

Good work! It's simpler if you compile with -g ... then the gdb line
command will just tell you all of this. However, it seems clear that
elevator->ops is NULL (elevator_dispatch_fn is at offset 0x28 into that
on a 64-bit system).

That's because blk_cleanup_queue called elevator_exit, so we need an out
in the peek case.

So unwrapping the onion, this is the next level

James

---

Index: linux-2.6/block/blk.h
===================================================================
--- linux-2.6.orig/block/blk.h
+++ linux-2.6/block/blk.h
@@ -62,7 +62,7 @@ static inline struct request *__elv_next
return rq;
}

- if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+ if (!q->elevator->ops || !q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}
}

2011-04-12 16:48:51

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 11:27 -0500, James Bottomley wrote:
> On Tue, 2011-04-12 at 17:04 +0100, Steven Whitehouse wrote:
> > Hi,
> >
> > On Tue, 2011-04-12 at 10:14 -0500, James Bottomley wrote:
> > > On Tue, 2011-04-12 at 15:06 +0100, Steven Whitehouse wrote:
> > > > Messages attached - getting closer I think, but still not quite there.
> > > > It is very odd that so far (at least I've seen no other reports) I
> > > > seem to be the only one who hits this. I wonder what is different about
> > > > my kernel/hardware...
> > >
> > >
> > > > csi: killing requests for dead queue
> > > > scsi: killing requests for dead queue
> > > > BUG: unable to handle kernel scsi: killing requests for dead queue
> > > > NULL pointer dereference at 0000000000000028
> > > > IP: [<ffffffff813b34ed>] blk_peek_request+0x1ad/0x220
> > >
> > > Hmm, I don't see where this is ... it's a null pointer deref offset by
> > > 0x28 bytes ... but I can't see where ... can you resolve this to a line?
> > >
> > > James
> > >
> > >
> >
> > So assuming I've calculated this correctly, it should be just about....
> >
> > /home/steve/linux-2.6/include/linux/rcupdate.h:677
> > 6de8: a8 08 test $0x8,%al
> > 6dea: 0f 84 65 ff ff ff je 6d55 <blk_peek_request+0xf5>
> > 6df0: e8 00 00 00 00 callq 6df5 <blk_peek_request+0x195>
> > 6df1: R_X86_64_PC32 preempt_schedule-0x4
> > 6df5: e9 5b ff ff ff jmpq 6d55 <blk_peek_request+0xf5>
> > 6dfa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> > __elv_next_request():
> > /home/steve/linux-2.6/block/blk.h:65
> > if (!list_empty(&q->queue_head)) {
> > rq = list_entry_rq(q->queue_head.next);
> > return rq;
> > }
> >
> > if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
> > 6e00: 49 8b 44 24 18 mov 0x18(%r12),%rax
> > 6e05: 31 f6 xor %esi,%esi
> > 6e07: 48 8b 00 mov (%rax),%rax
> > 6e0a: 4c 89 e7 mov %r12,%rdi <---- here!
> > 6e0d: ff 50 28 callq *0x28(%rax)
> > 6e10: 85 c0 test %eax,%eax
> > 6e12: 0f 85 f4 fe ff ff jne 6d0c <blk_peek_request+0xac>
> > 6e18: 45 31 ed xor %r13d,%r13d
> > 6e1b: e9 70 ff ff ff jmpq 6d90 <blk_peek_request+0x130>
> > blk_peek_request():
> > /home/steve/linux-2.6/block/blk-core.c:1912
> > * so that we don't add it again
> > */
> >
> > The offset I got was 6e0d, but I guess that's the start of the following instruction
> > by the time the fault is logged. Also I've attached my current .config just in case
> > that might be a help in reproducing it,
>
> Good work! It's simpler if you compile with -g ... then the gdb line
> command will just tell you all of this. However, it seems clear that
> elevator->ops is NULL (elevator_dispatch_fn is at offset 0x28 into that
> on a 64-bit system).
>
> That's because blk_cleanup_queue called elevator_exit, so we need an out
> in the peek case.
>
> So unwrapping the onion, this is the next level
>
> James
>
> ---
>
> Index: linux-2.6/block/blk.h
> ===================================================================
> --- linux-2.6.orig/block/blk.h
> +++ linux-2.6/block/blk.h
> @@ -62,7 +62,7 @@ static inline struct request *__elv_next
> return rq;
> }
>
> - if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
> + if (!q->elevator->ops || !q->elevator->ops->elevator_dispatch_fn(q, 0))
> return NULL;
> }
> }
>
>

Still not quite there, but looking more hopeful now,

Steve.


Attachments:
james4.txt (38.89 kB)

2011-04-12 17:41:38

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 17:51 +0100, Steven Whitehouse wrote:
> Still not quite there, but looking more hopeful now,

Not sure I share your optimism; but this one

> scsi 0:2:1:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
> scsi: killing requests for dead queue
> ------------[ cut here ]------------
> WARNING: at lib/kref.c:34 kref_get+0x2d/0x30()
> Hardware name: PowerEdge R710
> Modules linked in:
> Pid: 386, comm: kworker/6:1 Not tainted 2.6.39-rc2+ #193
> Call Trace:
> [<ffffffff8108fa9a>] warn_slowpath_common+0x7a/0xb0
> [<ffffffff8108fae5>] warn_slowpath_null+0x15/0x20
> [<ffffffff813c984d>] kref_get+0x2d/0x30
> [<ffffffff813c824a>] kobject_get+0x1a/0x30
> [<ffffffff81460874>] get_device+0x14/0x20
> [<ffffffff81478bd7>] scsi_request_fn+0x37/0x4a0

Is definitely a race between the last put of the SCSI device and the
block delayed work. The signal that mediates that race is supposed to
be the q->queuedata being null, but that doesn't get set until some time
into the release function (by which time the ref is already zero).

Closing the window completely involves setting this to NULL before we do
the final put when we know everything else is gone. So, here's the next
incremental.

James

---

Index: linux-2.6/drivers/scsi/scsi_sysfs.c
===================================================================
--- linux-2.6.orig/drivers/scsi/scsi_sysfs.c
+++ linux-2.6/drivers/scsi/scsi_sysfs.c
@@ -323,7 +323,6 @@ static void scsi_device_dev_release_user
}

if (sdev->request_queue) {
- sdev->request_queue->queuedata = NULL;
/* user context needed to free queue */
scsi_free_queue(sdev->request_queue);
/* temporary expedient, try to catch use of queue lock
@@ -937,6 +936,7 @@ void __scsi_remove_device(struct scsi_de
if (sdev->host->hostt->slave_destroy)
sdev->host->hostt->slave_destroy(sdev);
transport_destroy_device(dev);
+ sdev->request_queue->queuedata = NULL;
put_device(dev);
}


2011-04-12 18:30:43

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 12:41 -0500, James Bottomley wrote:
> On Tue, 2011-04-12 at 17:51 +0100, Steven Whitehouse wrote:
> > Still not quite there, but looking more hopeful now,
>
> Not sure I share your optimism; but this one
>
Neither do I any more :-) Looks like we are back in blk_peek_request()
again.

> > scsi 0:2:1:0: Direct-Access DELL PERC 6/i 1.22 PQ: 0 ANSI: 5
> > scsi: killing requests for dead queue
> > ------------[ cut here ]------------
> > WARNING: at lib/kref.c:34 kref_get+0x2d/0x30()
> > Hardware name: PowerEdge R710
> > Modules linked in:
> > Pid: 386, comm: kworker/6:1 Not tainted 2.6.39-rc2+ #193
> > Call Trace:
> > [<ffffffff8108fa9a>] warn_slowpath_common+0x7a/0xb0
> > [<ffffffff8108fae5>] warn_slowpath_null+0x15/0x20
> > [<ffffffff813c984d>] kref_get+0x2d/0x30
> > [<ffffffff813c824a>] kobject_get+0x1a/0x30
> > [<ffffffff81460874>] get_device+0x14/0x20
> > [<ffffffff81478bd7>] scsi_request_fn+0x37/0x4a0
>
> Is definitely a race between the last put of the SCSI device and the
> block delayed work. The signal that mediates that race is supposed to
> be the q->queuedata being null, but that doesn't get set until some time
> into the release function (by which time the ref is already zero).
>
> Closing the window completely involves setting this to NULL before we do
> the final put when we know everything else is gone. So, here's the next
> incremental.
>
> James
>
> ---
>
> Index: linux-2.6/drivers/scsi/scsi_sysfs.c
> ===================================================================
> --- linux-2.6.orig/drivers/scsi/scsi_sysfs.c
> +++ linux-2.6/drivers/scsi/scsi_sysfs.c
> @@ -323,7 +323,6 @@ static void scsi_device_dev_release_user
> }
>
> if (sdev->request_queue) {
> - sdev->request_queue->queuedata = NULL;
> /* user context needed to free queue */
> scsi_free_queue(sdev->request_queue);
> /* temporary expedient, try to catch use of queue lock
> @@ -937,6 +936,7 @@ void __scsi_remove_device(struct scsi_de
> if (sdev->host->hostt->slave_destroy)
> sdev->host->hostt->slave_destroy(sdev);
> transport_destroy_device(dev);
> + sdev->request_queue->queuedata = NULL;
> put_device(dev);
> }
>
>
>

__elv_next_request():
/home/steve/linux-2.6/block/blk.h:60
static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;

while (1) {
if (!list_empty(&q->queue_head)) {
6d59: 49 39 dc cmp %rbx,%r12
6d5c: 0f 85 2e ff ff ff jne 6c90 <blk_peek_request+0x30>
/home/steve/linux-2.6/block/blk.h:65
rq = list_entry_rq(q->queue_head.next);
return rq;
}

if (!q->elevator->ops || !q->elevator->ops->elevator_dispatch_fn
(q, 0))
6d62: 49 8b 44 24 18 mov 0x18(%r12),%rax
6d67: 48 8b 00 mov (%rax),%rax
6d6a: 48 85 c0 test %rax,%rax
6d6d: 74 0c je 6d7b <blk_peek_request+0x11b>
6d6f: 31 f6 xor %esi,%esi
6d71: 4c 89 e7 mov %r12,%rdi <----- here
6d74: ff 50 28 callq *0x28(%rax)
6d77: 85 c0 test %eax,%eax
6d79: 75 da jne 6d55 <blk_peek_request+0xf5>
6d7b: 45 31 ed xor %r13d,%r13d
blk_peek_request():
/home/steve/linux-2.6/block/blk-core.c:1929
break;
}
}

return rq;
}



Boot logs attached as usual,

Steve.


Attachments:
james5.txt (24.42 kB)

2011-04-12 19:57:04

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 19:33 +0100, Steven Whitehouse wrote:
> Hi,
>
> On Tue, 2011-04-12 at 12:41 -0500, James Bottomley wrote:
> > On Tue, 2011-04-12 at 17:51 +0100, Steven Whitehouse wrote:
> > > Still not quite there, but looking more hopeful now,
> >
> > Not sure I share your optimism; but this one
> >
> Neither do I any more :-) Looks like we are back in blk_peek_request()
> again.
[...]
> if (!q->elevator->ops || !q->elevator->ops->elevator_dispatch_fn
> (q, 0))
> 6d62: 49 8b 44 24 18 mov 0x18(%r12),%rax
> 6d67: 48 8b 00 mov (%rax),%rax
> 6d6a: 48 85 c0 test %rax,%rax
> 6d6d: 74 0c je 6d7b <blk_peek_request+0x11b>
> 6d6f: 31 f6 xor %esi,%esi
> 6d71: 4c 89 e7 mov %r12,%rdi <----- here
> 6d74: ff 50 28 callq *0x28(%rax)
> 6d77: 85 c0 test %eax,%eax
> 6d79: 75 da jne 6d55 <blk_peek_request+0xf5>
> 6d7b: 45 31 ed xor %r13d,%r13d

Hmm, wrong signal for no elevator then. How about this?

James

---

Index: BUILD-2.6/block/blk.h
===================================================================
--- BUILD-2.6.orig/block/blk.h 2011-04-12 14:51:35.000000000 -0500
+++ BUILD-2.6/block/blk.h 2011-04-12 14:51:53.000000000 -0500
@@ -62,7 +62,8 @@ static inline struct request *__elv_next
return rq;
}

- if (!q->elevator->ops || !q->elevator->ops->elevator_dispatch_fn(q, 0))
+ if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+ !q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}
}

2011-04-12 20:27:37

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Tue, 2011-04-12 at 14:56 -0500, James Bottomley wrote:
> On Tue, 2011-04-12 at 19:33 +0100, Steven Whitehouse wrote:
> > Hi,
> >
> > On Tue, 2011-04-12 at 12:41 -0500, James Bottomley wrote:
> > > On Tue, 2011-04-12 at 17:51 +0100, Steven Whitehouse wrote:
> > > > Still not quite there, but looking more hopeful now,
> > >
> > > Not sure I share your optimism; but this one
> > >
> > Neither do I any more :-) Looks like we are back in blk_peek_request()
> > again.
> [...]
> > if (!q->elevator->ops || !q->elevator->ops->elevator_dispatch_fn
> > (q, 0))
> > 6d62: 49 8b 44 24 18 mov 0x18(%r12),%rax
> > 6d67: 48 8b 00 mov (%rax),%rax
> > 6d6a: 48 85 c0 test %rax,%rax
> > 6d6d: 74 0c je 6d7b <blk_peek_request+0x11b>
> > 6d6f: 31 f6 xor %esi,%esi
> > 6d71: 4c 89 e7 mov %r12,%rdi <----- here
> > 6d74: ff 50 28 callq *0x28(%rax)
> > 6d77: 85 c0 test %eax,%eax
> > 6d79: 75 da jne 6d55 <blk_peek_request+0xf5>
> > 6d7b: 45 31 ed xor %r13d,%r13d
>
> Hmm, wrong signal for no elevator then. How about this?
>
> James
>
That seems to do the trick... the box has been booted for several
minutes now and no sign of anything untoward so far :-)

Below is the cumulative patch which I now have applied to the kernel.
Many thanks for all your help in debugging this, it's greatly
appreciated :-)

Steve.

diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..7f15eb7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -219,6 +219,7 @@ static void blk_delay_work(struct work_struct *work)
spin_lock_irq(q->queue_lock);
__blk_run_queue(q, false);
spin_unlock_irq(q->queue_lock);
+ blk_put_queue(q);
}

/**
@@ -233,7 +234,8 @@ static void blk_delay_work(struct work_struct *work)
*/
void blk_delay_queue(struct request_queue *q, unsigned long msecs)
{
- schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs));
+ if (!blk_get_queue(q))
+ schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs));
}
EXPORT_SYMBOL(blk_delay_queue);

@@ -271,7 +273,8 @@ EXPORT_SYMBOL(blk_start_queue);
**/
void blk_stop_queue(struct request_queue *q)
{
- __cancel_delayed_work(&q->delay_work);
+ if (__cancel_delayed_work(&q->delay_work))
+ blk_put_queue(q);
queue_flag_set(QUEUE_FLAG_STOPPED, q);
}
EXPORT_SYMBOL(blk_stop_queue);
@@ -297,7 +300,8 @@ EXPORT_SYMBOL(blk_stop_queue);
void blk_sync_queue(struct request_queue *q)
{
del_timer_sync(&q->timeout);
- cancel_delayed_work_sync(&q->delay_work);
+ if (__cancel_delayed_work(&q->delay_work))
+ blk_put_queue(q);
queue_sync_plugs(q);
}
EXPORT_SYMBOL(blk_sync_queue);
@@ -324,7 +328,7 @@ void __blk_run_queue(struct request_queue *q, bool force_kblockd)
if (!force_kblockd && !queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
q->request_fn(q);
queue_flag_clear(QUEUE_FLAG_REENTER, q);
- } else
+ } else if (!blk_get_queue(q))
queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
}
EXPORT_SYMBOL(__blk_run_queue);
diff --git a/block/blk.h b/block/blk.h
index 6126346..4df474d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,8 @@ static inline struct request *__elv_next_request(struct request_queue *q)
return rq;
}

- if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+ if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+ !q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}
}
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index e44ff64..2e85668 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -323,7 +323,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
}

if (sdev->request_queue) {
- sdev->request_queue->queuedata = NULL;
/* user context needed to free queue */
scsi_free_queue(sdev->request_queue);
/* temporary expedient, try to catch use of queue lock
@@ -937,6 +936,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
if (sdev->host->hostt->slave_destroy)
sdev->host->hostt->slave_destroy(sdev);
transport_destroy_device(dev);
+ sdev->request_queue->queuedata = NULL;
put_device(dev);
}


2011-04-12 20:44:03

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Tue, 2011-04-12 at 21:30 +0100, Steven Whitehouse wrote:
> > Hmm, wrong signal for no elevator then. How about this?
> >
> > James
> >
> That seems to do the trick... the box has been booted for several
> minutes now and no sign of anything untoward so far :-)
>
> Below is the cumulative patch which I now have applied to the kernel.
> Many thanks for all your help in debugging this, it's greatly
> appreciated :-)

Heh, you're welcome, but this is for some value of debug. There's still
something strange going on on your box causing extreme delays in the way
we do stuff. I took advantage of the delays to explore some races in
block and SCSI, which is what got fixed ... I've still no idea what the
actual root cause of the delays is.

Now we need to huddle and decide what we're doing in block and SCSI. I
think Tejun's patch, with the moved signal for queue destruction, would
probably work as well.

James

2011-04-13 05:11:46

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

Hey, James.

On Tue, Apr 12, 2011 at 10:15:18AM -0500, James Bottomley wrote:
> So your idea is that all final puts should go through a workqueue? Like
> I said, that would work, but it's not just SCSI ... any call path that
> destroys a queue has to be audited.

Yeap.

> The problem is nothing to do with sleeping context ... it's that any
> work called by the block workqueue can't destroy that queue. In a
> refcounted model, that's a bit nasty.

I can see your point but please read on.

> > Hmmm... maybe but at least I prefer doing explicit shutdown/draining
> > on destruction even if the base data structure is refcounted. Things
> > become much more predictable that way.
>
> It is pretty much instantaneous. Unless we're executing, we cancel the
> work. If the work is already running, we just let it complete instead
> of waiting for it.
>
> Synchronous waits are dangerous because they cause entanglement.

There are two different types of danger involved. One is getting
trapped in a deadlock by recursing and ending up waiting for oneself.
The other is continuing to operate on objects which could be in a
dubious state. I guess my point is that I prefer the former by a large
margin.

The deadlocks are much more reliably reproducible. Lockdep and the soft
hang check can detect them easily, and a single stack dump will point
us right at the problem. The latter case is much trickier. The
problem is more difficult to trigger, and even when it triggers the
effect often isn't obvious. Auditing for correctness is more
difficult too - which fields are safe to access post-mortem? Is there
any chance that the ongoing operation might reach out to hardware
which is already gone or claimed by another software entity?

In this particular case, IMHO it's reasonable for the block layer to
require that the destruction function not be called directly from the
request queue path, although it definitely could have used better
documentation.

Thank you.

--
tejun

2011-04-13 05:18:53

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

Hello,

On Tue, Apr 12, 2011 at 03:43:56PM -0500, James Bottomley wrote:
> Heh, you're welcome, but this is for some value of debug. There's still
> something strange going on on your box causing extreme delays in the way
> we do stuff. I took advantage of the delays to explore some races in
> block and SCSI, which is what got fixed ... I've still no idea what the
> actual root cause of the delays is.

Many thanks for tracking down all the issues. Something is
definitely strange with Steven's setup, though.

> Now we need to huddle and decide what we're doing in block and SCSI. I
> think Tejun's patch, with the moved signal for queue destruction,
> would probably work as well.

Steven, can you be enticed to try the combination? I'll prep a
combined patch and post it but please beware that I'm currently in a
rather zombish state - sleep deprived, jet lagged and malfunctioning
digestion - so the likelihood of doing something stupid is pretty
high.

Thanks.

--
tejun

2011-04-13 06:06:59

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

On Wed, Apr 13, 2011 at 02:18:46PM +0900, Tejun Heo wrote:
> Steven, can you be enticed to try the combination? I'll prep a
> combined patch and post it but please beware that I'm currently in a
> rather zombish state - sleep deprived, jet lagged and malfunctioning
> digestion - so the likelihood of doing something stupid is pretty
> high.

Does the following completely untested version work?

diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..bb27804 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -359,6 +359,14 @@ void blk_put_queue(struct request_queue *q)
*/
void blk_cleanup_queue(struct request_queue *q)
{
+ /* mark @q stopped and dead */
+ mutex_lock(&q->sysfs_lock);
+ spin_lock_irq(&q->queue_lock);
+ queue_flag_set(QUEUE_FLAG_STOPPED, q);
+ queue_flag_set(QUEUE_FLAG_DEAD, q);
+ spin_unlock_irq(&q->queue_lock);
+ mutex_unlock(&q->sysfs_lock);
+
/*
* We know we have process context here, so we can be a little
* cautious and ensure that pending block actions on this device
@@ -368,9 +376,6 @@ void blk_cleanup_queue(struct request_queue *q)
blk_sync_queue(q);

del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
- mutex_lock(&q->sysfs_lock);
- queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
- mutex_unlock(&q->sysfs_lock);

if (q->elevator)
elevator_exit(q->elevator);
@@ -1456,7 +1461,7 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
* generic_make_request and the drivers it calls may use bi_next if this
* bio happens to be merged with someone else, and may change bi_dev and
* bi_sector for remaps as it sees fit. So the values of these fields
- * should NOT be depended on after the call to generic_make_request.
+ * should NOT be depended on after thes call to generic_make_request.
*/
static inline void __generic_make_request(struct bio *bio)
{
diff --git a/block/blk.h b/block/blk.h
index 6126346..4df474d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,8 @@ static inline struct request *__elv_next_request(struct request_queue *q)
return rq;
}

- if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+ if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+ !q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}
}
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2aeb2e9..12ebcbc 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -70,6 +70,11 @@
#define CREATE_TRACE_POINTS
#include <trace/events/scsi.h>

+/*
+ * Utility multithreaded workqueue for SCSI.
+ */
+struct workqueue_struct *scsi_wq;
+
static void scsi_done(struct scsi_cmnd *cmd);

/*
@@ -1306,11 +1311,14 @@ MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");

static int __init init_scsi(void)
{
- int error;
+ int error = -ENOMEM;

+ scsi_wq = alloc_workqueue("scsi", 0, 0);
+ if (!scsi_wq)
+ return error;
error = scsi_init_queue();
if (error)
- return error;
+ goto cleanup_wq;
error = scsi_init_procfs();
if (error)
goto cleanup_queue;
@@ -1342,6 +1350,8 @@ cleanup_procfs:
scsi_exit_procfs();
cleanup_queue:
scsi_exit_queue();
+cleanup_wq:
+ destroy_workqueue(scsi_wq);
printk(KERN_ERR "SCSI subsystem failed to initialize, error = %d\n",
-error);
return error;
@@ -1356,6 +1366,7 @@ static void __exit exit_scsi(void)
scsi_exit_devinfo();
scsi_exit_procfs();
scsi_exit_queue();
+ destroy_workqueue(scsi_wq);
}

subsys_initcall(init_scsi);
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 087821f..4222b58 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -362,6 +362,16 @@ int scsi_is_target_device(const struct device *dev)
}
EXPORT_SYMBOL(scsi_is_target_device);

+static void scsi_target_reap_usercontext(struct work_struct *work)
+{
+ struct scsi_target *starget =
+ container_of(work, struct scsi_target, reap_work);
+
+ transport_remove_device(&starget->dev);
+ device_del(&starget->dev);
+ scsi_target_destroy(starget);
+}
+
static struct scsi_target *__scsi_find_target(struct device *parent,
int channel, uint id)
{
@@ -427,6 +437,7 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
starget->state = STARGET_CREATED;
starget->scsi_level = SCSI_2;
starget->max_target_blocked = SCSI_DEFAULT_TARGET_BLOCKED;
+ INIT_WORK(&starget->reap_work, scsi_target_reap_usercontext);
retry:
spin_lock_irqsave(shost->host_lock, flags);

@@ -462,21 +473,11 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
}
/* Unfortunately, we found a dying target; need to
* wait until it's dead before we can get a new one */
+ flush_work(&found_target->reap_work);
put_device(&found_target->dev);
- flush_scheduled_work();
goto retry;
}

-static void scsi_target_reap_usercontext(struct work_struct *work)
-{
- struct scsi_target *starget =
- container_of(work, struct scsi_target, ew.work);
-
- transport_remove_device(&starget->dev);
- device_del(&starget->dev);
- scsi_target_destroy(starget);
-}
-
/**
* scsi_target_reap - check to see if target is in use and destroy if not
* @starget: target to be checked
@@ -507,8 +508,7 @@ void scsi_target_reap(struct scsi_target *starget)
if (state == STARGET_CREATED)
scsi_target_destroy(starget);
else
- execute_in_process_context(scsi_target_reap_usercontext,
- &starget->ew);
+ queue_work(scsi_wq, &starget->reap_work);
}

/**
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index e44ff64..e030091 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -300,7 +300,7 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
struct list_head *this, *tmp;
unsigned long flags;

- sdev = container_of(work, struct scsi_device, ew.work);
+ sdev = container_of(work, struct scsi_device, release_work);

parent = sdev->sdev_gendev.parent;
starget = to_scsi_target(parent);
@@ -323,7 +323,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
}

if (sdev->request_queue) {
- sdev->request_queue->queuedata = NULL;
/* user context needed to free queue */
scsi_free_queue(sdev->request_queue);
/* temporary expedient, try to catch use of queue lock
@@ -343,8 +342,8 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
static void scsi_device_dev_release(struct device *dev)
{
struct scsi_device *sdp = to_scsi_device(dev);
- execute_in_process_context(scsi_device_dev_release_usercontext,
- &sdp->ew);
+
+ queue_work(scsi_wq, &sdp->release_work);
}

static struct class sdev_class = {
@@ -937,6 +936,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
if (sdev->host->hostt->slave_destroy)
sdev->host->hostt->slave_destroy(sdev);
transport_destroy_device(dev);
+ sdev->request_queue->queuedata = NULL;
put_device(dev);
}

@@ -1069,6 +1069,8 @@ void scsi_sysfs_device_initialize(struct scsi_device *sdev)
dev_set_name(&sdev->sdev_dev, "%d:%d:%d:%d",
sdev->host->host_no, sdev->channel, sdev->id, sdev->lun);
sdev->scsi_level = starget->scsi_level;
+ INIT_WORK(&sdev->release_work, scsi_device_dev_release_usercontext);
+
transport_setup_device(&sdev->sdev_gendev);
spin_lock_irqsave(shost->host_lock, flags);
list_add_tail(&sdev->same_target_siblings, &starget->devices);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 2d3ec50..1d11750 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -168,7 +168,7 @@ struct scsi_device {
struct device sdev_gendev,
sdev_dev;

- struct execute_work ew; /* used to get process context on put */
+ struct work_struct release_work; /* for process context on put */

struct scsi_dh_data *scsi_dh_data;
enum scsi_device_state sdev_state;
@@ -259,7 +259,7 @@ struct scsi_target {
#define SCSI_DEFAULT_TARGET_BLOCKED 3

char scsi_level;
- struct execute_work ew;
+ struct work_struct reap_work;
enum scsi_target_state state;
void *hostdata; /* available to low-level driver */
unsigned long starget_data[0]; /* for the transport */
@@ -277,6 +277,8 @@ static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
#define starget_printk(prefix, starget, fmt, a...) \
dev_printk(prefix, &(starget)->dev, fmt, ##a)

+extern struct workqueue_struct *scsi_wq;
+
extern struct scsi_device *__scsi_add_device(struct Scsi_Host *,
uint, uint, uint, void *hostdata);
extern int scsi_add_device(struct Scsi_Host *host, uint channel,

2011-04-13 09:18:07

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Wed, 2011-04-13 at 15:06 +0900, Tejun Heo wrote:
> On Wed, Apr 13, 2011 at 02:18:46PM +0900, Tejun Heo wrote:
> > Steven, can you be enticed to try the combination? I'll prep a
> > combined patch and post it but please beware that I'm currently in a
> > rather zombish state - sleep deprived, jet lagged and malfunctioning
> > digestion - so the likelihood of doing something stupid is pretty
> > high.
>
> Does the following completely untested version work?
>
Almost. With one small change (q->queue_lock is already a pointer to a
spinlock, so it doesn't require an &) it seems to work for me. Let me
know if you'd like me to do any more tests. I've attached an updated
patch below,

Steve.


diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..90de9ac 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -359,6 +359,14 @@ void blk_put_queue(struct request_queue *q)
*/
void blk_cleanup_queue(struct request_queue *q)
{
+ /* mark @q stopped and dead */
+ mutex_lock(&q->sysfs_lock);
+ spin_lock_irq(q->queue_lock);
+ queue_flag_set(QUEUE_FLAG_STOPPED, q);
+ queue_flag_set(QUEUE_FLAG_DEAD, q);
+ spin_unlock_irq(q->queue_lock);
+ mutex_unlock(&q->sysfs_lock);
+
/*
* We know we have process context here, so we can be a little
* cautious and ensure that pending block actions on this device
@@ -368,9 +376,6 @@ void blk_cleanup_queue(struct request_queue *q)
blk_sync_queue(q);

del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
- mutex_lock(&q->sysfs_lock);
- queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
- mutex_unlock(&q->sysfs_lock);

if (q->elevator)
elevator_exit(q->elevator);
@@ -1456,7 +1461,7 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
* generic_make_request and the drivers it calls may use bi_next if this
* bio happens to be merged with someone else, and may change bi_dev and
* bi_sector for remaps as it sees fit. So the values of these fields
- * should NOT be depended on after the call to generic_make_request.
+ * should NOT be depended on after thes call to generic_make_request.
*/
static inline void __generic_make_request(struct bio *bio)
{
diff --git a/block/blk.h b/block/blk.h
index 6126346..4df474d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,8 @@ static inline struct request *__elv_next_request(struct request_queue *q)
return rq;
}

- if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+ if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+ !q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}
}
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2aeb2e9..12ebcbc 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -70,6 +70,11 @@
#define CREATE_TRACE_POINTS
#include <trace/events/scsi.h>

+/*
+ * Utility multithreaded workqueue for SCSI.
+ */
+struct workqueue_struct *scsi_wq;
+
static void scsi_done(struct scsi_cmnd *cmd);

/*
@@ -1306,11 +1311,14 @@ MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");

static int __init init_scsi(void)
{
- int error;
+ int error = -ENOMEM;

+ scsi_wq = alloc_workqueue("scsi", 0, 0);
+ if (!scsi_wq)
+ return error;
error = scsi_init_queue();
if (error)
- return error;
+ goto cleanup_wq;
error = scsi_init_procfs();
if (error)
goto cleanup_queue;
@@ -1342,6 +1350,8 @@ cleanup_procfs:
scsi_exit_procfs();
cleanup_queue:
scsi_exit_queue();
+cleanup_wq:
+ destroy_workqueue(scsi_wq);
printk(KERN_ERR "SCSI subsystem failed to initialize, error = %d\n",
-error);
return error;
@@ -1356,6 +1366,7 @@ static void __exit exit_scsi(void)
scsi_exit_devinfo();
scsi_exit_procfs();
scsi_exit_queue();
+ destroy_workqueue(scsi_wq);
}

subsys_initcall(init_scsi);
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 087821f..4222b58 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -362,6 +362,16 @@ int scsi_is_target_device(const struct device *dev)
}
EXPORT_SYMBOL(scsi_is_target_device);

+static void scsi_target_reap_usercontext(struct work_struct *work)
+{
+ struct scsi_target *starget =
+ container_of(work, struct scsi_target, reap_work);
+
+ transport_remove_device(&starget->dev);
+ device_del(&starget->dev);
+ scsi_target_destroy(starget);
+}
+
static struct scsi_target *__scsi_find_target(struct device *parent,
int channel, uint id)
{
@@ -427,6 +437,7 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
starget->state = STARGET_CREATED;
starget->scsi_level = SCSI_2;
starget->max_target_blocked = SCSI_DEFAULT_TARGET_BLOCKED;
+ INIT_WORK(&starget->reap_work, scsi_target_reap_usercontext);
retry:
spin_lock_irqsave(shost->host_lock, flags);

@@ -462,21 +473,11 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
}
/* Unfortunately, we found a dying target; need to
* wait until it's dead before we can get a new one */
+ flush_work(&found_target->reap_work);
put_device(&found_target->dev);
- flush_scheduled_work();
goto retry;
}

-static void scsi_target_reap_usercontext(struct work_struct *work)
-{
- struct scsi_target *starget =
- container_of(work, struct scsi_target, ew.work);
-
- transport_remove_device(&starget->dev);
- device_del(&starget->dev);
- scsi_target_destroy(starget);
-}
-
/**
* scsi_target_reap - check to see if target is in use and destroy if not
* @starget: target to be checked
@@ -507,8 +508,7 @@ void scsi_target_reap(struct scsi_target *starget)
if (state == STARGET_CREATED)
scsi_target_destroy(starget);
else
- execute_in_process_context(scsi_target_reap_usercontext,
- &starget->ew);
+ queue_work(scsi_wq, &starget->reap_work);
}

/**
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index e44ff64..e030091 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -300,7 +300,7 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
struct list_head *this, *tmp;
unsigned long flags;

- sdev = container_of(work, struct scsi_device, ew.work);
+ sdev = container_of(work, struct scsi_device, release_work);

parent = sdev->sdev_gendev.parent;
starget = to_scsi_target(parent);
@@ -323,7 +323,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
}

if (sdev->request_queue) {
- sdev->request_queue->queuedata = NULL;
/* user context needed to free queue */
scsi_free_queue(sdev->request_queue);
/* temporary expedient, try to catch use of queue lock
@@ -343,8 +342,8 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
static void scsi_device_dev_release(struct device *dev)
{
struct scsi_device *sdp = to_scsi_device(dev);
- execute_in_process_context(scsi_device_dev_release_usercontext,
- &sdp->ew);
+
+ queue_work(scsi_wq, &sdp->release_work);
}

static struct class sdev_class = {
@@ -937,6 +936,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
if (sdev->host->hostt->slave_destroy)
sdev->host->hostt->slave_destroy(sdev);
transport_destroy_device(dev);
+ sdev->request_queue->queuedata = NULL;
put_device(dev);
}

@@ -1069,6 +1069,8 @@ void scsi_sysfs_device_initialize(struct scsi_device *sdev)
dev_set_name(&sdev->sdev_dev, "%d:%d:%d:%d",
sdev->host->host_no, sdev->channel, sdev->id, sdev->lun);
sdev->scsi_level = starget->scsi_level;
+ INIT_WORK(&sdev->release_work, scsi_device_dev_release_usercontext);
+
transport_setup_device(&sdev->sdev_gendev);
spin_lock_irqsave(shost->host_lock, flags);
list_add_tail(&sdev->same_target_siblings, &starget->devices);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index e483108..23eab47 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -545,18 +545,10 @@ static int gfs2_close(struct inode *inode, struct file *file)
/**
* gfs2_fsync - sync the dirty data for a file (across the cluster)
* @file: the file that points to the dentry (we ignore this)
- * @dentry: the dentry that points to the inode to sync
+ * @datasync: set if we can ignore timestamp changes
*
- * The VFS will flush "normal" data for us. We only need to worry
- * about metadata here. For journaled data, we just do a log flush
- * as we can't avoid it. Otherwise we can just bale out if datasync
- * is set. For stuffed inodes we must flush the log in order to
- * ensure that all data is on disk.
- *
- * The call to write_inode_now() is there to write back metadata and
- * the inode itself. It does also try and write the data, but thats
- * (hopefully) a no-op due to the VFS having already called filemap_fdatawrite()
- * for us.
+ * The VFS will flush data for us. We only need to worry
+ * about metadata here.
*
* Returns: errno
*/
@@ -565,22 +557,20 @@ static int gfs2_fsync(struct file *file, int datasync)
{
struct inode *inode = file->f_mapping->host;
int sync_state = inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC);
- int ret = 0;
-
- if (gfs2_is_jdata(GFS2_I(inode))) {
- gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
- return 0;
- }
+ struct gfs2_inode *ip = GFS2_I(inode);
+ int ret;

- if (sync_state != 0) {
- if (!datasync)
- ret = write_inode_now(inode, 0);
+ if (datasync)
+ sync_state &= ~I_DIRTY_SYNC;

- if (gfs2_is_stuffed(GFS2_I(inode)))
- gfs2_log_flush(GFS2_SB(inode), GFS2_I(inode)->i_gl);
+ if (sync_state) {
+ ret = sync_inode_metadata(inode, 1);
+ if (ret)
+ return ret;
+ gfs2_ail_flush(ip->i_gl);
}

- return ret;
+ return 0;
}

/**
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 25eeb2b..7c1b08f 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -28,33 +28,18 @@
#include "trans.h"

/**
- * ail_empty_gl - remove all buffers for a given lock from the AIL
+ * __gfs2_ail_flush - remove all buffers for a given lock from the AIL
* @gl: the glock
*
* None of the buffers should be dirty, locked, or pinned.
*/

-static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
+static void __gfs2_ail_flush(struct gfs2_glock *gl)
{
struct gfs2_sbd *sdp = gl->gl_sbd;
struct list_head *head = &gl->gl_ail_list;
struct gfs2_bufdata *bd;
struct buffer_head *bh;
- struct gfs2_trans tr;
-
- memset(&tr, 0, sizeof(tr));
- tr.tr_revokes = atomic_read(&gl->gl_ail_count);
-
- if (!tr.tr_revokes)
- return;
-
- /* A shortened, inline version of gfs2_trans_begin() */
- tr.tr_reserved = 1 + gfs2_struct2blk(sdp, tr.tr_revokes, sizeof(u64));
- tr.tr_ip = (unsigned long)__builtin_return_address(0);
- INIT_LIST_HEAD(&tr.tr_list_buf);
- gfs2_log_reserve(sdp, tr.tr_reserved);
- BUG_ON(current->journal_info);
- current->journal_info = &tr;

spin_lock(&sdp->sd_ail_lock);
while (!list_empty(head)) {
@@ -76,7 +61,47 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
}
gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
spin_unlock(&sdp->sd_ail_lock);
+}
+
+
+static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
+{
+ struct gfs2_sbd *sdp = gl->gl_sbd;
+ struct gfs2_trans tr;
+
+ memset(&tr, 0, sizeof(tr));
+ tr.tr_revokes = atomic_read(&gl->gl_ail_count);
+
+ if (!tr.tr_revokes)
+ return;
+
+ /* A shortened, inline version of gfs2_trans_begin() */
+ tr.tr_reserved = 1 + gfs2_struct2blk(sdp, tr.tr_revokes, sizeof(u64));
+ tr.tr_ip = (unsigned long)__builtin_return_address(0);
+ INIT_LIST_HEAD(&tr.tr_list_buf);
+ gfs2_log_reserve(sdp, tr.tr_reserved);
+ BUG_ON(current->journal_info);
+ current->journal_info = &tr;
+
+ __gfs2_ail_flush(gl);
+
+ gfs2_trans_end(sdp);
+ gfs2_log_flush(sdp, NULL);
+}

+void gfs2_ail_flush(struct gfs2_glock *gl)
+{
+ struct gfs2_sbd *sdp = gl->gl_sbd;
+ unsigned int revokes = atomic_read(&gl->gl_ail_count);
+ int ret;
+
+ if (!revokes)
+ return;
+
+ ret = gfs2_trans_begin(sdp, 0, revokes);
+ if (ret)
+ return;
+ __gfs2_ail_flush(gl);
gfs2_trans_end(sdp);
gfs2_log_flush(sdp, NULL);
}
diff --git a/fs/gfs2/glops.h b/fs/gfs2/glops.h
index b3aa2e3..6fce409 100644
--- a/fs/gfs2/glops.h
+++ b/fs/gfs2/glops.h
@@ -23,4 +23,6 @@ extern const struct gfs2_glock_operations gfs2_quota_glops;
extern const struct gfs2_glock_operations gfs2_journal_glops;
extern const struct gfs2_glock_operations *gfs2_glops_list[];

+extern void gfs2_ail_flush(struct gfs2_glock *gl);
+
#endif /* __GLOPS_DOT_H__ */
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 2d3ec50..1d11750 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -168,7 +168,7 @@ struct scsi_device {
struct device sdev_gendev,
sdev_dev;

- struct execute_work ew; /* used to get process context on put */
+ struct work_struct release_work; /* for process context on put */

struct scsi_dh_data *scsi_dh_data;
enum scsi_device_state sdev_state;
@@ -259,7 +259,7 @@ struct scsi_target {
#define SCSI_DEFAULT_TARGET_BLOCKED 3

char scsi_level;
- struct execute_work ew;
+ struct work_struct reap_work;
enum scsi_target_state state;
void *hostdata; /* available to low-level driver */
unsigned long starget_data[0]; /* for the transport */
@@ -277,6 +277,8 @@ static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
#define starget_printk(prefix, starget, fmt, a...) \
dev_printk(prefix, &(starget)->dev, fmt, ##a)

+extern struct workqueue_struct *scsi_wq;
+
extern struct scsi_device *__scsi_add_device(struct Scsi_Host *,
uint, uint, uint, void *hostdata);
extern int scsi_add_device(struct Scsi_Host *host, uint channel,

2011-04-13 13:57:18

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Wed, 2011-04-13 at 10:20 +0100, Steven Whitehouse wrote:
> Hi,
>
> On Wed, 2011-04-13 at 15:06 +0900, Tejun Heo wrote:
> > On Wed, Apr 13, 2011 at 02:18:46PM +0900, Tejun Heo wrote:
> > > Steven, can you be enticed to try the combination? I'll prep a
> > > combined patch and post it but please beware that I'm currently in a
> > > rather zombish state - sleep deprived, jet lagged and malfunctioning
> > > digestion - so the likelihood of doing something stupid is pretty
> > > high.
> >
> > Does the following completely untested version work?
> >
> Almost. With one small change (q->queue_lock is already a pointer to a
> spinlock, so it doesn't require an &) it seems to work for me. Let me
> know if you'd like me to do any more tests. I've attached an updated
> patch below,
>
> Steve.
>
Apologies, some gfs2 bits got included by accident. Here is a cleaned up
version,

Steve.

diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..90de9ac 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -359,6 +359,14 @@ void blk_put_queue(struct request_queue *q)
*/
void blk_cleanup_queue(struct request_queue *q)
{
+ /* mark @q stopped and dead */
+ mutex_lock(&q->sysfs_lock);
+ spin_lock_irq(q->queue_lock);
+ queue_flag_set(QUEUE_FLAG_STOPPED, q);
+ queue_flag_set(QUEUE_FLAG_DEAD, q);
+ spin_unlock_irq(q->queue_lock);
+ mutex_unlock(&q->sysfs_lock);
+
/*
* We know we have process context here, so we can be a little
* cautious and ensure that pending block actions on this device
@@ -368,9 +376,6 @@ void blk_cleanup_queue(struct request_queue *q)
blk_sync_queue(q);

del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
- mutex_lock(&q->sysfs_lock);
- queue_flag_set_unlocked(QUEUE_FLAG_DEAD, q);
- mutex_unlock(&q->sysfs_lock);

if (q->elevator)
elevator_exit(q->elevator);
@@ -1456,7 +1461,7 @@ static inline int bio_check_eod(struct bio *bio, unsigned int nr_sectors)
* generic_make_request and the drivers it calls may use bi_next if this
* bio happens to be merged with someone else, and may change bi_dev and
* bi_sector for remaps as it sees fit. So the values of these fields
- * should NOT be depended on after the call to generic_make_request.
+ * should NOT be depended on after the call to generic_make_request.
*/
static inline void __generic_make_request(struct bio *bio)
{
diff --git a/block/blk.h b/block/blk.h
index 6126346..4df474d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,8 @@ static inline struct request *__elv_next_request(struct request_queue *q)
return rq;
}

- if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+ if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+ !q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}
}
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 2aeb2e9..12ebcbc 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -70,6 +70,11 @@
#define CREATE_TRACE_POINTS
#include <trace/events/scsi.h>

+/*
+ * Utility multithreaded workqueue for SCSI.
+ */
+struct workqueue_struct *scsi_wq;
+
static void scsi_done(struct scsi_cmnd *cmd);

/*
@@ -1306,11 +1311,14 @@ MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");

static int __init init_scsi(void)
{
- int error;
+ int error = -ENOMEM;

+ scsi_wq = alloc_workqueue("scsi", 0, 0);
+ if (!scsi_wq)
+ return error;
error = scsi_init_queue();
if (error)
- return error;
+ goto cleanup_wq;
error = scsi_init_procfs();
if (error)
goto cleanup_queue;
@@ -1342,6 +1350,8 @@ cleanup_procfs:
scsi_exit_procfs();
cleanup_queue:
scsi_exit_queue();
+cleanup_wq:
+ destroy_workqueue(scsi_wq);
printk(KERN_ERR "SCSI subsystem failed to initialize, error = %d\n",
-error);
return error;
@@ -1356,6 +1366,7 @@ static void __exit exit_scsi(void)
scsi_exit_devinfo();
scsi_exit_procfs();
scsi_exit_queue();
+ destroy_workqueue(scsi_wq);
}

subsys_initcall(init_scsi);
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 087821f..4222b58 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -362,6 +362,16 @@ int scsi_is_target_device(const struct device *dev)
}
EXPORT_SYMBOL(scsi_is_target_device);

+static void scsi_target_reap_usercontext(struct work_struct *work)
+{
+ struct scsi_target *starget =
+ container_of(work, struct scsi_target, reap_work);
+
+ transport_remove_device(&starget->dev);
+ device_del(&starget->dev);
+ scsi_target_destroy(starget);
+}
+
static struct scsi_target *__scsi_find_target(struct device *parent,
int channel, uint id)
{
@@ -427,6 +437,7 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
starget->state = STARGET_CREATED;
starget->scsi_level = SCSI_2;
starget->max_target_blocked = SCSI_DEFAULT_TARGET_BLOCKED;
+ INIT_WORK(&starget->reap_work, scsi_target_reap_usercontext);
retry:
spin_lock_irqsave(shost->host_lock, flags);

@@ -462,21 +473,11 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
}
/* Unfortunately, we found a dying target; need to
* wait until it's dead before we can get a new one */
+ flush_work(&found_target->reap_work);
put_device(&found_target->dev);
- flush_scheduled_work();
goto retry;
}

-static void scsi_target_reap_usercontext(struct work_struct *work)
-{
- struct scsi_target *starget =
- container_of(work, struct scsi_target, ew.work);
-
- transport_remove_device(&starget->dev);
- device_del(&starget->dev);
- scsi_target_destroy(starget);
-}
-
/**
* scsi_target_reap - check to see if target is in use and destroy if not
* @starget: target to be checked
@@ -507,8 +508,7 @@ void scsi_target_reap(struct scsi_target *starget)
if (state == STARGET_CREATED)
scsi_target_destroy(starget);
else
- execute_in_process_context(scsi_target_reap_usercontext,
- &starget->ew);
+ queue_work(scsi_wq, &starget->reap_work);
}

/**
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index e44ff64..e030091 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -300,7 +300,7 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
struct list_head *this, *tmp;
unsigned long flags;

- sdev = container_of(work, struct scsi_device, ew.work);
+ sdev = container_of(work, struct scsi_device, release_work);

parent = sdev->sdev_gendev.parent;
starget = to_scsi_target(parent);
@@ -323,7 +323,6 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
}

if (sdev->request_queue) {
- sdev->request_queue->queuedata = NULL;
/* user context needed to free queue */
scsi_free_queue(sdev->request_queue);
/* temporary expedient, try to catch use of queue lock
@@ -343,8 +342,8 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
static void scsi_device_dev_release(struct device *dev)
{
struct scsi_device *sdp = to_scsi_device(dev);
- execute_in_process_context(scsi_device_dev_release_usercontext,
- &sdp->ew);
+
+ queue_work(scsi_wq, &sdp->release_work);
}

static struct class sdev_class = {
@@ -937,6 +936,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
if (sdev->host->hostt->slave_destroy)
sdev->host->hostt->slave_destroy(sdev);
transport_destroy_device(dev);
+ sdev->request_queue->queuedata = NULL;
put_device(dev);
}

@@ -1069,6 +1069,8 @@ void scsi_sysfs_device_initialize(struct scsi_device *sdev)
dev_set_name(&sdev->sdev_dev, "%d:%d:%d:%d",
sdev->host->host_no, sdev->channel, sdev->id, sdev->lun);
sdev->scsi_level = starget->scsi_level;
+ INIT_WORK(&sdev->release_work, scsi_device_dev_release_usercontext);
+
transport_setup_device(&sdev->sdev_gendev);
spin_lock_irqsave(shost->host_lock, flags);
list_add_tail(&sdev->same_target_siblings, &starget->devices);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 2d3ec50..1d11750 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -168,7 +168,7 @@ struct scsi_device {
struct device sdev_gendev,
sdev_dev;

- struct execute_work ew; /* used to get process context on put */
+ struct work_struct release_work; /* for process context on put */

struct scsi_dh_data *scsi_dh_data;
enum scsi_device_state sdev_state;
@@ -259,7 +259,7 @@ struct scsi_target {
#define SCSI_DEFAULT_TARGET_BLOCKED 3

char scsi_level;
- struct execute_work ew;
+ struct work_struct reap_work;
enum scsi_target_state state;
void *hostdata; /* available to low-level driver */
unsigned long starget_data[0]; /* for the transport */
@@ -277,6 +277,8 @@ static inline struct scsi_target *scsi_target(struct scsi_device *sdev)
#define starget_printk(prefix, starget, fmt, a...) \
dev_printk(prefix, &(starget)->dev, fmt, ##a)

+extern struct workqueue_struct *scsi_wq;
+
extern struct scsi_device *__scsi_add_device(struct Scsi_Host *,
uint, uint, uint, void *hostdata);
extern int scsi_add_device(struct Scsi_Host *host, uint channel,

2011-04-13 14:15:28

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Wed, 2011-04-13 at 14:11 +0900, Tejun Heo wrote:
> > > Hmmm... maybe but at least I prefer doing explicit shutdown/draining
> > > on destruction even if the base data structure is refcounted. Things
> > > become much more predictable that way.
> >
> > It is pretty much instantaneous. Unless we're executing, we cancel the
> > work. If the work is already running, we just let it complete instead
> > of waiting for it.
> >
> > Synchronous waits are dangerous because they cause entanglement.
>
> There are two different types of dangers involved. One is of getting
> trapped into deadlock by recursing and ending up waiting for oneself.
> The other of continuing operation on objects which could be in dubious
> state. I guess my point is that I prefer the former by a large
> margin.
>
> The deadlocks are more reliable in reproducibility. Lockdep and soft
> hang check can detect them easily and a single stack dump will point
> us right to where the problem is. The latter is much trickier.

I agree, but this is a bit of a false dichotomy. The hang check will only
detect a thread waiting on itself. Even in the flush model, we still
have to detect inappropriate object use because others may still have
references.

So, in the sync model, on blk_cleanup_queue() you flush the pending
requests and destroy the elevator. However, because the queue is
refcounted, you still have to cope with the case where one of the
reference holders submits more I/O or does anything else with the queue.
This is what I don't like about the sync model: it shuts down various
bits before the refcount goes to zero. Now we don't have a fully
functional queue, so we need state guards on all the entry points to
detect this and error out (that's what some of the QUEUE_FLAG_DEAD
checks we put in the patch do).

In the async model, you can either do as above (state guards on the
entry points) and impose a shutting down state, or you can delay
destruction until final put. The former is an identical solution to the
synchronous one, except that you don't have the flush. The latter loses
the state guard on entry points requirements (because the queue is
always fully functional until final put.
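
The state-guard pattern under discussion can be sketched in miniature as follows. This is an illustrative user-space C model, not the real block-layer API: `mini_queue`, `mini_submit` and `mini_cleanup` are made-up names standing in for a refcounted queue, its entry points, and blk_cleanup_queue().

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the state-guard pattern: a refcounted queue is
 * marked dead at cleanup time, and every entry point checks the flag
 * before touching state that cleanup has torn down.  All names here
 * are illustrative, not the real block API. */

enum { QUEUE_FLAG_DEAD = 1u << 0 };

struct mini_queue {
	unsigned int flags;
	int refcount;
	int pending;	/* stands in for elevator/request state */
};

/* Entry point: reject work once the queue is marked dead. */
static bool mini_submit(struct mini_queue *q)
{
	if (q->flags & QUEUE_FLAG_DEAD)
		return false;	/* state guard: error out, touch nothing */
	q->pending++;
	return true;
}

/* Cleanup runs before the last reference is dropped, so remaining
 * reference holders may still call mini_submit() afterwards; the
 * guard above is what makes that safe. */
static void mini_cleanup(struct mini_queue *q)
{
	q->flags |= QUEUE_FLAG_DEAD;
	q->pending = 0;		/* drain */
}
```

The point of the model is that cleanup and final put are decoupled: late callers are rejected rather than operating on a half-torn-down queue.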

> The
> problem is more difficult to trigger and even when it triggers the
> effect often wouldn't be obvious. Auditing for correctness is more
> difficult too - which fields are safe to access post-mortem?

But in a refcounted model you always have to expect postmortem
operations ... you just have to make sure they're properly rejected.
This is true for both sync or async if you impose a dying or dead state
on the model.

> Is there
> any chance that the ongoing operation might reach out to hardware
> which is already gone or claimed by another software entity?

Yes ... in fact SCSI fully expects this ... that's why there's a dead
queue check in the SCSI request function, using NULL ->queuedata as the
signal (another thing that wasn't working quite right).

We tear down the device in scsi_remove_device() including destroying the
queue. If we go with the sync model for block, I'd actually move
blk_cleanup_queue() into that function. There's no real point delaying
queue destruction to final put of the SCSI object, since we won't accept
I/O after scsi_remove_device() returns.

> In this particular case, IMHO it's reasonable for block layer to
> require that the destruction function not to be called directly from
> request queue path although it definitely could have used better
> documentation.

I think we've both stated our cases, so it's time for Jens to decide
what he wants to do. Given that block already has a state model which
includes some QUEUE_FLAG_DEAD state guard checks, it probably makes
sense to enhance that, rather than to delay all destruction until final
put.

James

2011-04-13 17:01:25

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

While you still have the problematic system, can you try this patch? It
avoids changing anything in block (other than to add a missing state
guard for the elv_next_request). If it works, we can defer the sync vs
async discussion and use it for a -stable fix.

Thanks,

James

---

diff --git a/block/blk.h b/block/blk.h
index c8db371..11d0d25 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -62,7 +62,8 @@ static inline struct request *__elv_next_request(struct request_queue *q)
return rq;
}

- if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
+ if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+ !q->elevator->ops->elevator_dispatch_fn(q, 0))
return NULL;
}
}
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index e44ff64..5aa4246 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -322,14 +322,9 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
kfree(evt);
}

- if (sdev->request_queue) {
- sdev->request_queue->queuedata = NULL;
- /* user context needed to free queue */
- scsi_free_queue(sdev->request_queue);
- /* temporary expedient, try to catch use of queue lock
- * after free of sdev */
- sdev->request_queue = NULL;
- }
+ /* temporary expedient, try to catch use of queue lock after
+ * free of sdev */
+ sdev->request_queue = NULL;

scsi_target_reap(scsi_target(sdev));

@@ -937,6 +932,11 @@ void __scsi_remove_device(struct scsi_device *sdev)
if (sdev->host->hostt->slave_destroy)
sdev->host->hostt->slave_destroy(sdev);
transport_destroy_device(dev);
+ /* Setting this to NULL causes the request function to reject
+ * any I/O requests */
+ sdev->request_queue->queuedata = NULL;
+ /* Freeing the queue signals to block that we're done */
+ scsi_free_queue(sdev->request_queue);
put_device(dev);
}


2011-04-13 19:32:46

by Steven Whitehouse

Subject: Re: Strange block/scsi/workqueue issue

Hi,

On Wed, 2011-04-13 at 12:01 -0500, James Bottomley wrote:
> While you still have the problematic system, can you try this patch? It
> avoids changing anything in block (other than to add a missing state
> guard for the elv_next_request). If it works, we can defer the sync vs
> async discussion and use it for a -stable fix.
>
> Thanks,
>
> James
>
> ---
>
Yes, that seems to work too,

Steve.

> diff --git a/block/blk.h b/block/blk.h
> index c8db371..11d0d25 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -62,7 +62,8 @@ static inline struct request *__elv_next_request(struct request_queue *q)
> return rq;
> }
>
> - if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
> + if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
> + !q->elevator->ops->elevator_dispatch_fn(q, 0))
> return NULL;
> }
> }
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index e44ff64..5aa4246 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -322,14 +322,9 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
> kfree(evt);
> }
>
> - if (sdev->request_queue) {
> - sdev->request_queue->queuedata = NULL;
> - /* user context needed to free queue */
> - scsi_free_queue(sdev->request_queue);
> - /* temporary expedient, try to catch use of queue lock
> - * after free of sdev */
> - sdev->request_queue = NULL;
> - }
> + /* temporary expedient, try to catch use of queue lock after
> + * free of sdev */
> + sdev->request_queue = NULL;
>
> scsi_target_reap(scsi_target(sdev));
>
> @@ -937,6 +932,11 @@ void __scsi_remove_device(struct scsi_device *sdev)
> if (sdev->host->hostt->slave_destroy)
> sdev->host->hostt->slave_destroy(sdev);
> transport_destroy_device(dev);
> + /* Setting this to NULL causes the request function to reject
> + * any I/O requests */
> + sdev->request_queue->queuedata = NULL;
> + /* Freeing the queue signals to block that we're done */
> + scsi_free_queue(sdev->request_queue);
> put_device(dev);
> }
>
>
>

2011-04-13 20:12:40

by Jens Axboe

Subject: Re: Strange block/scsi/workqueue issue

On 2011-04-13 19:01, James Bottomley wrote:
> While you still have the problematic system, can you try this patch? It
> avoids changing anything in block (other than to add a missing state
> guard for the elv_next_request). If it works, we can defer the sync vs
> async discussion and use it for a -stable fix.
>
> Thanks,
>
> James
>
> ---
>
> diff --git a/block/blk.h b/block/blk.h
> index c8db371..11d0d25 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -62,7 +62,8 @@ static inline struct request *__elv_next_request(struct request_queue *q)
> return rq;
> }
>
> - if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
> + if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
> + !q->elevator->ops->elevator_dispatch_fn(q, 0))
> return NULL;
> }
> }
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index e44ff64..5aa4246 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -322,14 +322,9 @@ static void scsi_device_dev_release_usercontext(struct work_struct *work)
> kfree(evt);
> }
>
> - if (sdev->request_queue) {
> - sdev->request_queue->queuedata = NULL;
> - /* user context needed to free queue */
> - scsi_free_queue(sdev->request_queue);
> - /* temporary expedient, try to catch use of queue lock
> - * after free of sdev */
> - sdev->request_queue = NULL;
> - }
> + /* temporary expedient, try to catch use of queue lock after
> + * free of sdev */
> + sdev->request_queue = NULL;
>
> scsi_target_reap(scsi_target(sdev));
>
> @@ -937,6 +932,11 @@ void __scsi_remove_device(struct scsi_device *sdev)
> if (sdev->host->hostt->slave_destroy)
> sdev->host->hostt->slave_destroy(sdev);
> transport_destroy_device(dev);
> + /* Setting this to NULL causes the request function to reject
> + * any I/O requests */
> + sdev->request_queue->queuedata = NULL;
> + /* Freeing the queue signals to block that we're done */
> + scsi_free_queue(sdev->request_queue);
> put_device(dev);
> }

This patch looks pretty clean. Shouldn't you serialize that ->queuedata
= NULL assignment with the queue lock, though?

--
Jens Axboe

2011-04-13 20:17:50

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Wed, 2011-04-13 at 22:12 +0200, Jens Axboe wrote:
> On 2011-04-13 19:01, James Bottomley wrote:
> > [quoted patch trimmed]
>
> This patch looks pretty clean. Shouldn't you serialize that ->queuedata
> = NULL assignment with the queue lock, though?

pointer assignment is atomic, isn't it? As in on a 32 bit arch, it
definitely is, and I thought on 64 bits it also was. So a simultaneous
reader either sees previous value or NULL.
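
The discipline this argument relies on can be sketched as below. This is a plain user-space C model with made-up names (`mini_queue`, `mini_request_fn`); in kernel code the load and store would typically go through ACCESS_ONCE() to stop the compiler from re-reading the pointer, which plain C stands in for here.

```c
#include <assert.h>
#include <stddef.h>

/* The writer clears ->queuedata with a single aligned pointer store,
 * and the reader loads the pointer exactly once into a local before
 * checking it, so it acts on either the old value or NULL, never a
 * torn or re-read mix. */

struct mini_sdev { int id; };

struct mini_queue {
	void *queuedata;
};

/* Request function: snapshot the pointer, then NULL-check the copy. */
static int mini_request_fn(struct mini_queue *q)
{
	struct mini_sdev *sdev = q->queuedata;	/* single load */

	if (!sdev)
		return -1;	/* device gone: reject the I/O */
	return sdev->id;	/* safe: uses the snapshot, not q->queuedata */
}
```

Note the pattern falls apart if the reader dereferences `q->queuedata` twice: the value could change to NULL between the check and the use.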

James


2011-04-22 18:02:08

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

Hello, guys.

On Wed, Apr 13, 2011 at 03:17:46PM -0500, James Bottomley wrote:
> On Wed, 2011-04-13 at 22:12 +0200, Jens Axboe wrote:
> > This patch looks pretty clean. Shouldn't you serialize that ->queuedata
> > = NULL assignment with the queue lock, though?
>
> pointer assignment is atomic, isn't it? As in on a 32 bit arch, it
> definitely is, and I thought on 64 bits it also was. So a simultaneous
> reader either sees previous value or NULL.

What was the conclusion of this thread? James' fix seems minimal and
good for this cycle but I don't see it there yet. Jens, James?

Thanks.

--
tejun

2011-04-22 18:03:18

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

On Wed, Apr 13, 2011 at 03:00:08PM +0100, Steven Whitehouse wrote:
> Apologies, some gfs2 bits got included by accident. Here is a cleaned up
> version,

Great, thanks. I'll clean it up and try to get it included for the
next merge window.

Thank you.

--
tejun

2011-04-22 18:06:23

by James Bottomley

Subject: Re: Strange block/scsi/workqueue issue

On Fri, 2011-04-22 at 20:01 +0200, Tejun Heo wrote:
> Hello, guys.
>
> On Wed, Apr 13, 2011 at 03:17:46PM -0500, James Bottomley wrote:
> > On Wed, 2011-04-13 at 22:12 +0200, Jens Axboe wrote:
> > > This patch looks pretty clean. Shouldn't you serialize that ->queuedata
> > > = NULL assignment with the queue lock, though?
> >
> > pointer assignment is atomic, isn't it? As in on a 32 bit arch, it
> > definitely is, and I thought on 64 bits it also was. So a simultaneous
> > reader either sees previous value or NULL.
>
> What was the conclusion of this thread? James' fix seems minimal and
> good for this cycle but I don't see it there yet. Jens, James?

Sorry, forgot to cc you.

I've proposed the minimal fixes for this cycle here:

http://marc.info/?l=linux-scsi&m=130348665328233
http://marc.info/?l=linux-scsi&m=130348673628282
http://marc.info/?l=linux-scsi&m=130348680328356

And I'll merge at least the scsi one.

But there's been no long term conclusion yet.

James

2011-04-22 18:30:21

by Tejun Heo

Subject: Re: Strange block/scsi/workqueue issue

On Fri, Apr 22, 2011 at 01:06:18PM -0500, James Bottomley wrote:
> I've proposed the minimal fixes for this cycle here:
>
> http://marc.info/?l=linux-scsi&m=130348665328233
> http://marc.info/?l=linux-scsi&m=130348673628282
> http://marc.info/?l=linux-scsi&m=130348680328356
>
> And I'll merge at least the scsi one.

Great, looks good to me too.

> But there's been no long term conclusion yet.

I'll clean up the workqueue conversion patch and try again.

Thank you.

--
tejun

2011-05-31 06:16:07

by Anton V. Boyarshinov

Subject: Re: Strange block/scsi/workqueue issue

Hi

> While you still have the problematic system, can you try this patch?
It seems that the scsi_sysfs.c half of this patch is in 2.6.39, but
not the blk.h part. Now we have frequent panics caused by a NULL
q->elevator->ops in block/blk.h.

The panic is triggered by connecting a USB storage device without a
medium (card reader, Android phone). We have seen it on different
computers with different kernel builds. The panics are most frequent on
kernel builds with forced preemption, under relatively heavy load (for
example, during boot).
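
The failure mode can be modelled in miniature: once the queue is torn down its elevator pointer may be NULL, and the blk.h hunk puts a QUEUE_FLAG_DEAD test first so that `||` short-circuiting guarantees the elevator is never dereferenced. The following is an illustrative user-space C sketch, not the real block code; all names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { QUEUE_FLAG_DEAD = 1u << 0 };

struct mini_elevator { int (*dispatch)(void); };

struct mini_queue {
	unsigned int flags;
	struct mini_elevator *elevator;	/* NULL after teardown */
};

/* Stand-in dispatcher that always has a request to offer. */
static int mini_dispatch_one(void)
{
	return 1;
}

static bool mini_dead(const struct mini_queue *q)
{
	return q->flags & QUEUE_FLAG_DEAD;
}

/* Guarded dispatch: returns -1 ("no request") for a dead queue
 * without ever touching q->elevator, because || evaluates the
 * dead check first and short-circuits. */
static int mini_next_request(struct mini_queue *q)
{
	if (mini_dead(q) || !q->elevator->dispatch())
		return -1;
	return 0;
}
```

Without the first test, the dead-queue case dereferences the NULL elevator pointer, which matches the panic signature described above.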

Regards, Anton

> It avoids changing anything in block (other than to add a missing
> state guard for the elv_next_request). If it works, we can defer the
> sync vs async discussion and use it for a -stable fix.
>
> Thanks,
>
> James
>
> ---
>
> diff --git a/block/blk.h b/
> index c8db371..11d0d25 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -62,7 +62,8 @@ static inline struct request
> *__elv_next_request(struct request_queue *q) return rq;
> }
>
> - if (!q->elevator->ops->elevator_dispatch_fn(q, 0))
> + if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
> + !q->elevator->ops->elevator_dispatch_fn(q, 0))
> return NULL;
> }
> }
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index e44ff64..5aa4246 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -322,14 +322,9 @@ static void
> scsi_device_dev_release_usercontext(struct work_struct *work)
> kfree(evt); }
>
> - if (sdev->request_queue) {
> - sdev->request_queue->queuedata = NULL;
> - /* user context needed to free queue */
> - scsi_free_queue(sdev->request_queue);
> - /* temporary expedient, try to catch use of queue
> lock
> - * after free of sdev */
> - sdev->request_queue = NULL;
> - }
> + /* temporary expedient, try to catch use of queue lock after
> + * free of sdev */
> + sdev->request_queue = NULL;
>
> scsi_target_reap(scsi_target(sdev));
>
> @@ -937,6 +932,11 @@ void __scsi_remove_device(struct scsi_device
> *sdev) if (sdev->host->hostt->slave_destroy)
> sdev->host->hostt->slave_destroy(sdev);
> transport_destroy_device(dev);
> + /* Setting this to NULL causes the request function to reject
> + * any I/O requests */
> + sdev->request_queue->queuedata = NULL;
> + /* Freeing the queue signals to block that we're done */
> + scsi_free_queue(sdev->request_queue);
> put_device(dev);
> }
>
>