2005-03-27 19:22:41

by Chris Rankin

[permalink] [raw]
Subject: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

[gcc-3.4.3, Linux-2.6.11-SMP, Dual P4 Xeon with HT enabled]

Hi,

My Linux 2.6.11 box oopsed when I tried to logout. I have switched to using the anticipatory
scheduler instead.

Cheers,
Chris

NMI Watchdog detected LOCKUP on CPU1, eip c0275cc7, registers:
Modules linked in: snd_pcm_oss snd_mixer_oss snd_usb_audio snd_usb_lib snd_intel8x0 snd_seq_oss
snd_seq_midi snd_emu10k1_synth snd_emu10k1 snd_ac97_codec snd_pcm snd_page_alloc snd_emux_synth
snd_seq_virmidi snd_rawmidi snd_seq_midi_event snd_seq_midi_emul snd_hwdep snd_util_mem snd_seq
snd_seq_device snd_rtctimer snd_timer snd nls_iso8859_1 nls_cp437 vfat fat usb_storage radeon drm
i2c_algo_bit emu10k1_gp gameport deflate zlib_deflate zlib_inflate twofish serpent aes_i586
blowfish des sha256 crypto_null af_key binfmt_misc eeprom i2c_sensor button processor psmouse
pcspkr p4_clockmod speedstep_lib usbserial lp nfsd exportfs md5 ipv6 sd_mod scsi_mod autofs nfs
lockd sunrpc af_packet ohci_hcd parport_pc parport e1000 video1394 raw1394 i2c_i801 i2c_core
ohci1394 ieee1394 ehci_hcd soundcore pwc videodev uhci_hcd usbcore intel_agp agpgart ide_cd cdrom
ext3 jbd
CPU: 1
EIP: 0060:[<c0275cc7>] Not tainted VLI
EFLAGS: 00200086 (2.6.11)
EIP is at _spin_lock+0x7/0xf
eax: f7b8b01c ebx: f7c82b88 ecx: f7c82b94 edx: f6c33714
esi: eb68ad88 edi: f6c33708 ebp: f6c33714 esp: f5b32f70
ds: 007b es: 007b ss: 0068
Process nautilus (pid: 5757, threadinfo=f5b32000 task=f7518020)
Stack: c01f7f79 00200282 f76bda24 f6c323e4 f7518020 00000000 00000000 c01f1d0c
f5b32000 c011d7b3 00000001 00000000 b65ffa40 00000000 f5b32fac 00000000
00000000 00000000 f5b32000 c011d8d6 c0102e7f 00000000 b65ffbf0 b6640bf0
Call Trace:
[<c01f7f79>] cfq_exit_io_context+0x54/0xb3
[<c01f1d0c>] exit_io_context+0x45/0x51
[<c011d7b3>] do_exit+0x205/0x308
[<c011d8d6>] next_thread+0x0/0xc
[<c0102e7f>] syscall_call+0x7/0xb
Code: 05 e8 3a e6 ff ff c3 ba 00 f0 ff ff 21 e2 81 42 14 00 01 00 00 f0 81 28 00 00 00 01 74 05 e8
1d e6 ff ff c3 f0 fe 08 79 09 f3 90 <80> 38 00 7e f9 eb f2 c3 f0 81 28 00 00 00 01 74 05 e8 ff e5
ff
console shuts up ...




2005-03-29 11:32:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Sun, Mar 27 2005, Chris Rankin wrote:
> [gcc-3.4.3, Linux-2.6.11-SMP, Dual P4 Xeon with HT enabled]
>
> Hi,
>
> My Linux 2.6.11 box oopsed when I tried to logout. I have switched to using the anticipatory
> scheduler instead.
>
> Cheers,
> Chris
>
> NMI Watchdog detected LOCKUP on CPU1, eip c0275cc7, registers:
> [module list, registers and stack snipped]
> Call Trace:
> [<c01f7f79>] cfq_exit_io_context+0x54/0xb3
> [<c01f1d0c>] exit_io_context+0x45/0x51
> [<c011d7b3>] do_exit+0x205/0x308
> [<c011d8d6>] next_thread+0x0/0xc
> [<c0102e7f>] syscall_call+0x7/0xb
> [code bytes snipped]
> console shuts up ...

The queue was gone by the time the process exited. What type of storage
do you have attached to the box? At least with SCSI, it has some
problems in this area - it will gladly free the scsi device structure
(where the queue lock is located) while the queue reference count still
hasn't dropped to zero.

--
Jens Axboe

2005-03-29 11:59:51

by Chris Rankin

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

I have one IDE hard disc, but I was using a USB memory stick at one point. (Notice the usb-storage
and vfat modules in my list.) Could that be the troublesome SCSI device?

--- Jens Axboe <[email protected]> wrote:
> On Sun, Mar 27 2005, Chris Rankin wrote:
> > My Linux 2.6.11 box oopsed when I tried to logout. I have switched to
> > using the anticipatory scheduler instead.
> > [oops report snipped]
>
> The queue was gone by the time the process exited. What type of storage
> do you have attached to the box? At least with SCSI, it has some
> problems in this area - it will gladly free the scsi device structure
> (where the queue lock is located) while the queue reference count still
> hasn't dropped to zero.
>
> --
> Jens Axboe
>
>


2005-03-29 12:05:14

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Tue, Mar 29 2005, Chris Rankin wrote:

(please don't top post)

> --- Jens Axboe <[email protected]> wrote:
> > On Sun, Mar 27 2005, Chris Rankin wrote:
> > > [system details and oops report snipped]
> >
> > The queue was gone by the time the process exited. What type of storage
> > do you have attached to the box? At least with SCSI, it has some
> > problems in this area - it will gladly free the scsi device structure
> > (where the queue lock is located) while the queue reference count still
> > hasn't dropped to zero.
>
> I have one IDE hard disc, but I was using a USB memory stick at one
> point. (Notice the usb-storage and vfat modules in my list.) Could
> that be the troublesome SCSI device?

Yes, it probably is. What happens is that you insert the stick and do io
against it, which sets up a process io context for that device. That
context persists until the process exits (or later, if someone still
holds a reference to it), but the queue_lock will be dead when you yank
the usb device.

It is quite a serious problem, not just for CFQ. SCSI referencing is
badly broken there.
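(The lifetime problem described above, as a minimal illustrative sketch -
every name below is made up for the example; this is not the 2.6.11 code:)

#include <linux/spinlock.h>
#include <linux/slab.h>

struct stick_dev {			/* stands in for scsi_device */
	spinlock_t lock;		/* doubles as the request queue lock */
	/* ... */
};

struct proc_io_ctx {			/* stands in for the cfq io context */
	spinlock_t *queue_lock;		/* borrowed pointer into stick_dev */
};

static void yank_device(struct stick_dev *dev)
{
	kfree(dev);			/* lock memory is gone, but any
					 * io context still points at it */
}

static void process_exit_io(struct proc_io_ctx *ioc)
{
	spin_lock(ioc->queue_lock);	/* spins on freed memory: the NMI
					 * watchdog LOCKUP in the oops above */
	/* ... */
	spin_unlock(ioc->queue_lock);
}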

--
Jens Axboe

2005-03-29 12:24:52

by Chris Rankin

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

> > I have one IDE hard disc, but I was using a USB memory stick at one
> > point. (Notice the usb-storage and vfat modules in my list.) Could
> > that be the troublesome SCSI device?

--- Jens Axboe <[email protected]> wrote:
> Yes, it probably is. What happens is that you insert the stick and do io
> against it, which sets up a process io context for that device. That
> context persists until the process exits (or later, if someone still
> holds a reference to it), but the queue_lock will be dead when you yank
> the usb device.
>
> It is quite a serious problem, not just for CFQ. SCSI referencing is
> badly broken there.

That would explain why it was nautilus that caused the oops. Does this mean that the major
distros aren't using CFQ? Because how else could they be avoiding this oops with USB
storage devices?


2005-03-29 12:29:16

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Tue, Mar 29 2005, Chris Rankin wrote:
> > > I have one IDE hard disc, but I was using a USB memory stick at one
> > > point. (Notice the usb-storage and vfat modules in my list.) Could
> > > that be the troublesome SCSI device?
>
> --- Jens Axboe <[email protected]> wrote:
> > Yes, it probably is. What happens is that you insert the stick and do io
> > against it, which sets up a process io context for that device. That
> > context persists until the process exits (or later, if someone still
> > holds a reference to it), but the queue_lock will be dead when you yank
> > the usb device.
> >
> > It is quite a serious problem, not just for CFQ. SCSI referencing is
> > badly broken there.
>
> That would explain why it was nautilus that caused the oops.
> Does this mean that the major distros aren't using CFQ?
> Because how else could they be avoiding this oops with USB storage
> devices?

CFQ with io contexts is relatively new, only there since 2.6.10 or so.
On UP, we don't effectively grab the queue lock, so the problem isn't seen
there.
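
(Roughly why UP is safe: without CONFIG_SMP and spinlock debugging, the
spin_lock() family never dereferences the lock pointer at all - a loose
paraphrase of the headers, not the exact 2.6.11 definitions:)

#ifdef CONFIG_SMP
#define spin_lock(lock)	_spin_lock(lock)	/* really touches *(lock) */
#else
#define spin_lock(lock)	do { preempt_disable(); (void)(lock); } while (0)
#endif

so a stale pointer to a freed lock is never actually followed there.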

You can work around this issue by using a different default io scheduler
at boot time, and then select cfq for your ide hard drive when the
system has booted with:

# echo cfq > /sys/block/hda/queue/scheduler

(substitute hda for any other solid storage device).
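
(One possible sequence - "elevator=" is the 2.6 boot parameter for the
default io scheduler, and "hda" is illustrative. Append to the kernel
command line:

	elevator=anticipatory

then, after boot:

	# cat /sys/block/hda/queue/scheduler
	noop [anticipatory] deadline cfq
	# echo cfq > /sys/block/hda/queue/scheduler
)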

--
Jens Axboe

2005-04-06 12:31:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Tue, Mar 29 2005, Jens Axboe wrote:
> On Tue, Mar 29 2005, Chris Rankin wrote:
> > [earlier discussion snipped]
>
> CFQ with io contexts is relatively new, only there since 2.6.10 or so.
> On UP, we don't effectively grab the queue lock, so the problem isn't seen
> there.
>
> You can work around this issue by using a different default io scheduler
> at boot time, and then select cfq for your ide hard drive when the
> system has booted with:
>
> # echo cfq > /sys/block/hda/queue/scheduler
>
> (substitute hda for any other solid storage device).

Can you check if this work-around solves the problem for you? Thanks!

===== include/linux/blkdev.h 1.162 vs edited =====
--- 1.162/include/linux/blkdev.h 2005-03-29 03:42:37 +02:00
+++ edited/include/linux/blkdev.h 2005-04-06 11:22:44 +02:00
@@ -279,6 +288,7 @@
typedef int (issue_flush_fn) (request_queue_t *, struct gendisk *, sector_t *);
typedef int (prepare_flush_fn) (request_queue_t *, struct request *);
typedef void (end_flush_fn) (request_queue_t *, struct request *);
+typedef void (release_queue_data_fn) (request_queue_t *);

enum blk_queue_state {
Queue_down,
@@ -324,6 +334,7 @@
issue_flush_fn *issue_flush_fn;
prepare_flush_fn *prepare_flush_fn;
end_flush_fn *end_flush_fn;
+ release_queue_data_fn *release_queue_data_fn;

/*
* Auto-unplugging state
===== drivers/scsi/scsi_sysfs.c 1.72 vs edited =====
--- 1.72/drivers/scsi/scsi_sysfs.c 2005-03-28 07:03:53 +02:00
+++ edited/drivers/scsi/scsi_sysfs.c 2005-04-06 11:24:27 +02:00
@@ -175,9 +175,6 @@

scsi_target_reap(scsi_target(sdev));

- kfree(sdev->inquiry);
- kfree(sdev);
-
if (parent)
put_device(parent);
}
===== drivers/scsi/scsi_lib.c 1.153 vs edited =====
--- 1.153/drivers/scsi/scsi_lib.c 2005-03-30 21:49:45 +02:00
+++ edited/drivers/scsi/scsi_lib.c 2005-04-06 11:24:32 +02:00
@@ -1420,6 +1420,18 @@
}
EXPORT_SYMBOL(scsi_calculate_bounce_limit);

+static void scsi_release_queue_data(request_queue_t *q)
+{
+ struct scsi_device *sdev = q->queuedata;
+
+ if (sdev) {
+ kfree(sdev->inquiry);
+ kfree(sdev);
+ }
+
+ q->queuedata = NULL;
+}
+
struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
{
struct Scsi_Host *shost = sdev->host;
@@ -1437,6 +1449,8 @@
blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
blk_queue_segment_boundary(q, shost->dma_boundary);
blk_queue_issue_flush_fn(q, scsi_issue_flush_fn);
+
+ q->release_queue_data_fn = scsi_release_queue_data;

/*
* ordered tags are superior to flush ordering

--
Jens Axboe

2005-04-06 12:52:25

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

> @@ -324,6 +334,7 @@
> issue_flush_fn *issue_flush_fn;
> prepare_flush_fn *prepare_flush_fn;
> end_flush_fn *end_flush_fn;
> + release_queue_data_fn *release_queue_data_fn;
>
> /*
> * Auto-unplugging state

where does this function method actually get called?

2005-04-06 12:55:58

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Wed, Apr 06 2005, Arjan van de Ven wrote:
> > @@ -324,6 +334,7 @@
> > issue_flush_fn *issue_flush_fn;
> > prepare_flush_fn *prepare_flush_fn;
> > end_flush_fn *end_flush_fn;
> > + release_queue_data_fn *release_queue_data_fn;
> >
> > /*
> > * Auto-unplugging state
>
> where does this function method actually get called?

I missed the hunk in ll_rw_blk.c, rmk pointed the same thing out not 5
minutes ago :-)

The patch would not work anyways, as scsi_sysfs.c clears queuedata
unconditionally. This is a better work-around: it just makes the queue
hold a reference to the device as well, only dropping it when the queue is
torn down.

Still not super happy with it, but I don't see how to solve the circular
dependency problem otherwise.


===== drivers/scsi/scsi_lib.c 1.153 vs edited =====
--- 1.153/drivers/scsi/scsi_lib.c 2005-03-30 21:49:45 +02:00
+++ edited/drivers/scsi/scsi_lib.c 2005-04-06 14:41:15 +02:00
@@ -1420,6 +1420,13 @@
}
EXPORT_SYMBOL(scsi_calculate_bounce_limit);

+static void scsi_release_queue_data(request_queue_t *q)
+{
+ struct scsi_device *sdev = q->queuedata;
+
+ put_device(&sdev->sdev_gendev);
+}
+
struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
{
struct Scsi_Host *shost = sdev->host;
@@ -1437,6 +1444,12 @@
blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
blk_queue_segment_boundary(q, shost->dma_boundary);
blk_queue_issue_flush_fn(q, scsi_issue_flush_fn);
+
+ /*
+ * let the queue drop a reference, when it is killed
+ */
+ get_device(&sdev->sdev_gendev);
+ q->release_queue_data_fn = scsi_release_queue_data;

/*
* ordered tags are superior to flush ordering
===== include/linux/blkdev.h 1.162 vs edited =====
--- 1.162/include/linux/blkdev.h 2005-03-29 03:42:37 +02:00
+++ edited/include/linux/blkdev.h 2005-04-06 11:22:44 +02:00
@@ -279,6 +288,7 @@
typedef int (issue_flush_fn) (request_queue_t *, struct gendisk *, sector_t *);
typedef int (prepare_flush_fn) (request_queue_t *, struct request *);
typedef void (end_flush_fn) (request_queue_t *, struct request *);
+typedef void (release_queue_data_fn) (request_queue_t *);

enum blk_queue_state {
Queue_down,
@@ -324,6 +334,7 @@
issue_flush_fn *issue_flush_fn;
prepare_flush_fn *prepare_flush_fn;
end_flush_fn *end_flush_fn;
+ release_queue_data_fn *release_queue_data_fn;

/*
* Auto-unplugging state
===== drivers/block/ll_rw_blk.c 1.288 vs edited =====
--- 1.288/drivers/block/ll_rw_blk.c 2005-03-31 12:47:54 +02:00
+++ edited/drivers/block/ll_rw_blk.c 2005-04-06 11:24:51 +02:00
@@ -1621,6 +1623,9 @@

blk_sync_queue(q);

+ if (q->release_queue_data_fn)
+ q->release_queue_data_fn(q);
+
if (rl->rq_pool)
mempool_destroy(rl->rq_pool);


--
Jens Axboe

2005-04-06 13:39:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

Jens Axboe wrote:
> [nested quotes snipped]
>
> The patch would not work anyways, as scsi_sysfs.c clears queuedata
> unconditionally. This is a better work-around: it just makes the queue
> hold a reference to the device as well, only dropping it when the queue
> is torn down.
>
> Still not super happy with it, but I don't see how to solve the circular
> dependency problem otherwise.
>

Hello, Jens.

I've been thinking about it for a while. The problem is that we're
reference counting two different objects to track the lifetime of one
entity. This happens in both the SCSI upper and mid layers. In the upper
layer, genhd and scsi_disk (or scsi_cd, ...) are ref'ed separately while
they share their destiny (they're not really different entities), and in
the mid layer scsi_device and request_queue do the same thing. The
circular dependency occurs because we separate one entity into two and
reference count the halves separately. The two are actually one and
necessarily want each other. (Until death do them part. Wow, serious. :-)

IMHO, what we need to do is consolidate ref counting such that in each
layer only one object is reference counted, and the other object is
freed when the ref counted object is released. The object of choice
would be genhd in upper layer and request_queue in mid layer. All
ref-counting should be updated to only ref those objects. We'll need to
add a release callback to genhd and make request_queue properly
reference counted.

Conceptually, scsi_disk extends genhd and scsi_device extends
request_queue. So, to go one step further, as what UL represents is
genhd (disk device) and ML request_queue (request-based device),
embedding scsi_disk into genhd and scsi_device into request_queue will
make the architecture clearer. To do this, we'll need something like
alloc_disk_with_udata(int minors, size_t udata_len) and the equivalent
for request_queue.
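
(A hypothetical sketch of the single-refcount idea - alloc_disk_with_udata
and all of the types here are made up, not an existing API:)

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/string.h>

struct disk_x {				/* stands in for genhd */
	struct kref kref;
	void (*release_udata)(void *udata);
	/* ... */
	char udata[0];			/* scsi_disk would live here */
};

struct disk_x *alloc_disk_with_udata(int minors, size_t udata_len)
{
	struct disk_x *d = kmalloc(sizeof(*d) + udata_len, GFP_KERNEL);

	if (!d)
		return NULL;
	memset(d, 0, sizeof(*d) + udata_len);
	kref_init(&d->kref);
	/* minor number setup as in alloc_disk(), elided */
	return d;
}

static void disk_x_release(struct kref *kref)
{
	struct disk_x *d = container_of(kref, struct disk_x, kref);

	if (d->release_udata)
		d->release_udata(d->udata);	/* tear down scsi_disk state */
	kfree(d);				/* one free covers both halves */
}

All users then kref_get()/kref_put() the one object; there is no second
refcount left to get out of sync with it.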

I've done this half-way, but doing it without fixing the SCSI model
seemed silly, so I got into working on the state model instead. (BTW, the
state model is almost done; I'm about to run tests.)

What do you think? Jens?

--
tejun

2005-04-06 17:48:28

by James Bottomley

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Tue, 2005-03-29 at 14:03 +0200, Jens Axboe wrote:
> It is quite a serious problem, not just for CFQ. SCSI referencing is
> badly broken there.

OK ... I accept that with regard to the queue lock.

However, rather than trying to work out a way to tie all the refcounted
objects together, what about the simpler solution of making the lock
bound to the lifetime of the queue?

As far as SCSI is concerned, we could simply move the lock into the
request_queue structure and everything would work, since the device holds
a reference to the queue. The way it would work is that we'd simply
have a lock in the request_queue structure, but it would be up to the
device to pass it in to blk_init_queue. Then we'd alter the scsi_device
sdev_lock to be a pointer to the queue lock? This scheme would also
work for the current users who have a global lock (they simply wouldn't
use the lock in the request_queue).

The only cloud on the horizon with this scheme is that there may
genuinely be places where we want multiple queues to share a non-global
lock: situations where we have shared issue queues (like IDE), or
shared tag resources are a possibility. To cope with those, we'd
probably have to have a separately allocated, reference counted lock.

However, I'm happy to implement the simpler solution (lock in
request_queue) if you agree.

James


2005-04-06 17:59:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Wed, Apr 06 2005, James Bottomley wrote:
> On Tue, 2005-03-29 at 14:03 +0200, Jens Axboe wrote:
> > It is quite a serious problem, not just for CFQ. SCSI referencing is
> > badly broken there.
>
> OK ... I accept that with regard to the queue lock.

It is much deeper than that. The recent hack to kill requests is yet
another example of that. At least this work-around makes it a little
better, but the mid layer assumption that the sdev refcount going to zero implies the
queue going away at the same time is inherently broken.

> However, rather than trying to work out a way to tie all the refcounted
> objects together, what about the simpler solution of making the lock
> bound to the lifetime of the queue?

That's essentially what the work-around does.

> As far as SCSI is concerned, we could simply move the lock into the
> request_queue structure and everything would work, since the device holds
> a reference to the queue. The way it would work is that we'd simply
> have a lock in the request_queue structure, but it would be up to the
> device to pass it in to blk_init_queue. Then we'd alter the scsi_device
> sdev_lock to be a pointer to the queue lock? This scheme would also
> work for the current users who have a global lock (they simply wouldn't
> use the lock in the request_queue).
>
> The only cloud on the horizon with this scheme is that there may
> genuinely be places where we want multiple queues to share a non-global
> lock: situations where we have shared issue queues (like IDE), or
> shared tag resources are a possibility. To cope with those, we'd
> probably have to have a separately allocated, reference counted lock.
>
> However, I'm happy to implement the simpler solution (lock in
> request_queue) if you agree.

I rather like the queue lock being a pointer, so you can share at
whatever level you want. Let's not grow the request_queue a full lock
just to work around a bug elsewhere.

--
Jens Axboe

2005-04-06 18:02:04

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Wed, Apr 06 2005, Tejun Heo wrote:
> [nested quotes snipped]
>
> Hello, Jens.
>
> I've been thinking about it for a while. The problem is that we're
> reference counting two different objects to track the lifetime of one
> entity. This happens in both the SCSI upper and mid layers. In the upper
> layer, genhd and scsi_disk (or scsi_cd, ...) are ref'ed separately while
> they share their destiny (they're not really different entities), and in
> the mid layer scsi_device and request_queue do the same thing. The
> circular dependency occurs because we separate one entity into two and
> reference count the halves separately. The two are actually one and
> necessarily want each other. (Until death do them part. Wow, serious. :-)

That's precisely correct.

> IMHO, what we need to do is consolidate ref counting such that in each
> layer only one object is reference counted, and the other object is
> freed when the ref counted object is released. The object of choice
> would be genhd in upper layer and request_queue in mid layer. All
> ref-counting should be updated to only ref those objects. We'll need to
> add a release callback to genhd and make request_queue properly
> reference counted.
>
> Conceptually, scsi_disk extends genhd and scsi_device extends
> request_queue. So, to go one step further, as what UL represents is
> genhd (disk device) and ML request_queue (request-based device),
> embedding scsi_disk into genhd and scsi_device into request_queue will
> make the architecture clearer. To do this, we'll need something like
> alloc_disk_with_udata(int minors, size_t udata_len) and the equivalent
> for request_queue.
>
> I've done this half-way and then doing it without fixing the SCSI
> model seemed silly so got into working on the state model. (BTW, the
> state model is almost done, I'm about to run tests.)
>
> What do you think? Jens?

It is of course the optimal solution to really solve the hierarchy of
references, but more involved. If you have time / desire to do it, I'd
be happy to review it :-)

For now I'm happy with the work-around. It's not too ugly, and at least
it makes it possible to kill the worse work-around of
scsi_kill_requests().

--
Jens Axboe

2005-04-06 18:20:24

by James Bottomley

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Wed, 2005-04-06 at 19:58 +0200, Jens Axboe wrote:
> I rather like the queue lock being a pointer, so you can share at
> whatever level you want. Let's not grow the request_queue a full lock
> just to work around a bug elsewhere.

I'm not proposing that it not be a pointer, merely that it could be
initialised to point to a lock structure within the request queue.

Doing this looks much simpler than your current patch ... one of the
problems with which is that removing the scsi_driver module gets into
trouble, because we currently have the queue_release in the sdev
release (which won't get called while the queue holds a reference).

I think the correct model for all of this is that the block driver
shouldn't care (or be tied to) the scsi one. Thus, as long as SCSI can
reject requests from a queue whose device has been released (without
checking the device) then everything is fine as long as we sort out the
lock lifetime problem.

James

2005-04-06 19:10:35

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Wed, Apr 06 2005, James Bottomley wrote:
> On Wed, 2005-04-06 at 19:58 +0200, Jens Axboe wrote:
> > I rather like the queue lock being a pointer, so you can share at
> > whatever level you want. Let's not grow the request_queue a full lock
> > just to work around a bug elsewhere.
>
> I'm not proposing that it not be a pointer, merely that it could be
> initialised to point to a lock structure within the request queue.

Sorry that's what I meant, you are adding a lock inside the queue. I
didn't mean removing q->queue_lock or making it a non-pointer.

> Doing this looks much simpler than your current patch ... one of the
> problems with which looks to be that removing the scsi_driver module is
> in trouble because we currently have the queue_release in the sdev
> release (which won't get called while the queue holds a reference).

It is simpler, but it only solves this particular problem of the lock
going away.

> I think the correct model for all of this is that the block driver
> shouldn't care (or be tied to) the scsi one. Thus, as long as SCSI can
> reject requests from a queue whose device has been released (without
> checking the device) then everything is fine as long as we sort out the
> lock lifetime problem.

But you are tying it completely to the SCSI problem, since we have no
locking problems of this sort elsewhere. At least the notifier could
potentially be used for something else. The lock is just hack number two
to work around the problem.

--
Jens Axboe

2005-04-06 20:33:50

by Mike Anderson

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

Tejun Heo [[email protected]] wrote:
> [earlier discussion snipped]
>
> IMHO, what we need to do is consolidate ref counting such that in each
> layer only one object is reference counted, and the other object is
> freed when the ref counted object is released. The object of choice
> would be genhd in upper layer and request_queue in mid layer. All
> ref-counting should be updated to only ref those objects. We'll need to
> add a release callback to genhd and make request_queue properly
> reference counted.
>
> Conceptually, scsi_disk extends genhd and scsi_device extends
> request_queue. So, to go one step further, as what UL represents is
> genhd (disk device) and ML request_queue (request-based device),
> embedding scsi_disk into genhd and scsi_device into request_queue will
> make the architecture clearer. To do this, we'll need something like
> alloc_disk_with_udata(int minors, size_t udata_len) and the equivalent
> for request_queue.
>
> I've done this half-way and then doing it without fixing the SCSI
> model seemed silly so got into working on the state model. (BTW, the
> state model is almost done, I'm about to run tests.)
>
> What do you think? Jens?

Well, I think "extends" is one way to look at the subsystem objects.
Couldn't it also be said that these objects from each subsystem just have
a relationship (parent / child, etc.)? As reference counting has been
implemented in each subsystem, interfaces that cross subsystem
boundaries sometimes (had / have) not been converted to use similar
lifetime rules.

Your solution tries to solve the problem by creating a new, larger
object that contains both of the old objects. Another solution would be to
use consistent lifetime rules and stay with the smaller objects. Unless
going to large objects helps with allocation fragmentation or we get some
other benefit, it would seem that these combined structures may at some
point limit the creation of lighter or more flexible objects.

It would appear another solution is that when you allocate a resource from
another subsystem (i.e. blk_init_queue), both subsystems participate
in the same reference counting model: in the allocation routine you
pass in your object to be reference counted by the allocating subsystem.
Then when it is time to shut down, you do not free the other subsystem's
object directly, but use the normal release routines.
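
(A hypothetical sketch of that model - the "owner" argument and the
blk_init_queue_owned() name are invented, not the real block API:)

#include <linux/device.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/string.h>

struct rq_x {				/* stands in for request_queue */
	spinlock_t *queue_lock;
	struct device *owner;		/* pinned for the queue's lifetime */
	/* ... */
};

struct rq_x *blk_init_queue_owned(spinlock_t *lock, struct device *owner)
{
	struct rq_x *q = kmalloc(sizeof(*q), GFP_KERNEL);

	if (!q)
		return NULL;
	memset(q, 0, sizeof(*q));
	q->queue_lock = lock;
	q->owner = get_device(owner);	/* allocating subsystem takes a ref */
	return q;
}

static void blk_release_queue_owned(struct rq_x *q)
{
	put_device(q->owner);		/* normal release path, never a direct
					 * kfree() of the other subsystem's
					 * object */
	kfree(q);
}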


-andmike
--
Michael Anderson
[email protected]

2005-04-06 21:10:20

by James Bottomley

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Wed, 2005-04-06 at 21:08 +0200, Jens Axboe wrote:
> > I think the correct model for all of this is that the block driver
> > shouldn't care (or be tied to) the scsi one. Thus, as long as SCSI can
> > reject requests from a queue whose device has been released (without
> > checking the device) then everything is fine as long as we sort out the
> > lock lifetime problem.
>
> But you are tying it completely to the SCSI problem, since we have no
> locking problems of this sort elsewhere. At least the notifier could
> potentially be used for something else. The lock is just hack number two
> to work around the problem.

Then help me understand the problem as you see it.

My current understanding is that these problems occur because we've
mixed data in two objects of different lifetimes. So far, this is stack
independent.

My proposal is to correct this by moving the data back to the correct
object, and make any object using it hold a reference, so this would
make the provider of the block request_fn hold a reference to the queue
and initialise the queue lock pointer with the lock currently in the
queue. Drivers that still use a global lock would be unaffected. This
also means that any provider of a request_fn may expect to process (and
reject) requests for a period of time after blk_cleanup_queue().
Really, this refcounting is inherent in blk_init_queue(),
blk_cleanup_queue() so the only additional requirement is sanely
processing queue requests after you think it's been cleaned up. This is
request_fn() provider independent, I think (providers who currently
don't violate the lifetime rules don't need fixing).

I claim that this proposal solves the current problem, and any other
ones we run across that occur because of data mixing.
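
(A sketch of what that asks of a request_fn provider - my_dev is invented
and the error path is simplified, using the 2.6-era elv_next_request() and
end_request() helpers:)

#include <linux/blkdev.h>

static void my_request_fn(request_queue_t *q)
{
	struct request *rq;

	/* called with the queue lock held, possibly after the device died */
	while ((rq = elv_next_request(q)) != NULL) {
		struct my_dev *dev = q->queuedata;

		if (!dev) {
			/* device already released: reject the request
			 * instead of touching dead driver state */
			end_request(rq, 0);
			continue;
		}
		/* ... normal dispatch against dev ... */
	}
}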

Your current patch tries to solve the problem by tying the two objects
together; unifying the lifetimes. I agree this can be done (it was how
the sd/sr close race was fixed). But the current way the freeing
functions thread across the subsystems doesn't look nice and I don't
currently see a way to get the queue freed: We free the queue on scsi
device release; if the queue holds a reference to the scsi_device, the
release function will never be called. Our only other choice is to try
to free the queue on device_del instead, but that's too early, I think.

James


2005-04-07 06:50:48

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Wed, Apr 06 2005, James Bottomley wrote:
> [nested quotes snipped]
>
> Then help me understand the problem as you see it.
>
> My current understanding is that these problems occur because we've
> mixed data in two objects of different lifetimes. So far, this is stack
> independent.

Correct.

> My proposal is to correct this by moving the data back to the correct
> object, and make any object using it hold a reference, so this would
> make the provider of the block request_fn hold a reference to the queue
> and initialise the queue lock pointer with the lock currently in the
> queue. Drivers that still use a global lock would be unaffected. This

But this is the current requirement, as long as you use the queue you
must hold a reference to it.

> also means that any provider of a request_fn may expect to process (and
> reject) requests for a period of time after blk_cleanup_queue().
> Really, this refcounting is inherent in blk_init_queue(),
> blk_cleanup_queue() so the only additional requirement is sanely
> processing queue requests after you think it's been cleaned up. This is
> request_fn() provider independent, I think (providers who currently
> don't violate the lifetime rules don't need fixing).
>
> I claim that this proposal solves the current problem, and any other
> ones we run across that occur because of data mixing.

The lock is just one piece, but I guess it is the only piece remaining
after the ->request_fn switchover for killing requests. So perhaps it is
just easier to embed the lock to kill off that problem.

What do you think of the attached, then? Allow NULL lock to be passed
in, in which case we use the queue private lock (that no one should ever
ever touch). It looks a little confusing that
sdev->request_queue->queue_lock now protects some sdev structures, if
you want we can retain ->sdev_lock but as a pointer to the queue lock
instead.

> Your current patch tries to solve the problem by tying the two objects
> together; unifying the lifetimes. I agree this can be done (it was how
> the sd/sr close race was fixed). But the current way the freeing
> functions thread across the subsystems doesn't look nice and I don't
> currently see a way to get the queue freed: We free the queue on scsi
> device release; if the queue holds a reference to the scsi_device, the
> release function will never be called. Our only other choice is to try
> to free the queue on device_del instead, but that's too early, I think.

Hmm yes, it probably doesn't work. Unless we properly embed the affected
structures, I don't think it can work.


===== drivers/block/ll_rw_blk.c 1.288 vs edited =====
--- 1.288/drivers/block/ll_rw_blk.c 2005-03-31 12:47:54 +02:00
+++ edited/drivers/block/ll_rw_blk.c 2005-04-07 08:38:01 +02:00
@@ -1714,6 +1711,15 @@ request_queue_t *blk_init_queue(request_
if (blk_init_free_list(q))
goto out_init;

+ /*
+ * if caller didn't supply a lock, they get per-queue locking with
+ * our embedded lock
+ */
+ if (!lock) {
+ spin_lock_init(&q->__queue_lock);
+ lock = &q->__queue_lock;
+ }
+
q->request_fn = rfn;
q->back_merge_fn = ll_back_merge_fn;
q->front_merge_fn = ll_front_merge_fn;
===== drivers/scsi/scsi_lib.c 1.153 vs edited =====
--- 1.153/drivers/scsi/scsi_lib.c 2005-03-30 21:49:45 +02:00
+++ edited/drivers/scsi/scsi_lib.c 2005-04-07 08:42:30 +02:00
@@ -360,9 +360,9 @@ void scsi_device_unbusy(struct scsi_devi
shost->host_failed))
scsi_eh_wakeup(shost);
spin_unlock(shost->host_lock);
- spin_lock(&sdev->sdev_lock);
+ spin_lock(sdev->request_queue->queue_lock);
sdev->device_busy--;
- spin_unlock_irqrestore(&sdev->sdev_lock, flags);
+ spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
}

/*
@@ -1425,7 +1425,7 @@ struct request_queue *scsi_alloc_queue(s
struct Scsi_Host *shost = sdev->host;
struct request_queue *q;

- q = blk_init_queue(scsi_request_fn, &sdev->sdev_lock);
+ q = blk_init_queue(scsi_request_fn, NULL);
if (!q)
return NULL;

===== drivers/scsi/scsi_scan.c 1.143 vs edited =====
--- 1.143/drivers/scsi/scsi_scan.c 2005-03-23 22:58:13 +01:00
+++ edited/drivers/scsi/scsi_scan.c 2005-04-07 08:40:53 +02:00
@@ -249,7 +249,6 @@ static struct scsi_device *scsi_alloc_sd
*/
sdev->borken = 1;

- spin_lock_init(&sdev->sdev_lock);
sdev->request_queue = scsi_alloc_queue(sdev);
if (!sdev->request_queue) {
/* release fn is set up in scsi_sysfs_device_initialise, so
===== include/linux/blkdev.h 1.162 vs edited =====
--- 1.162/include/linux/blkdev.h 2005-03-29 03:42:37 +02:00
+++ edited/include/linux/blkdev.h 2005-04-07 08:36:06 +02:00
@@ -355,8 +364,11 @@ struct request_queue
unsigned long queue_flags;

/*
- * protects queue structures from reentrancy
+ * protects queue structures from reentrancy. ->__queue_lock should
+ * _never_ be used directly, it is queue private. always use
+ * ->queue_lock.
*/
+ spinlock_t __queue_lock;
spinlock_t *queue_lock;

/*
===== include/scsi/scsi_device.h 1.33 vs edited =====
--- 1.33/include/scsi/scsi_device.h 2005-03-23 22:58:05 +01:00
+++ edited/include/scsi/scsi_device.h 2005-04-07 08:41:09 +02:00
@@ -44,7 +44,6 @@ struct scsi_device {
struct list_head same_target_siblings; /* just the devices sharing same target id */

volatile unsigned short device_busy; /* commands actually active on low-level */
- spinlock_t sdev_lock; /* also the request queue_lock */
spinlock_t list_lock;
struct list_head cmd_list; /* queue of in use SCSI Command structures */
struct list_head starved_entry;


--
Jens Axboe

2005-04-07 13:19:52

by James Bottomley

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, 2005-04-07 at 08:49 +0200, Jens Axboe wrote:
> On Wed, Apr 06 2005, James Bottomley wrote:
> > My proposal is to correct this by moving the data back to the correct
> > object, and make any object using it hold a reference, so this would
> > make the provider of the block request_fn hold a reference to the queue
> > and initialise the queue lock pointer with the lock currently in the
> > queue. Drivers that still use a global lock would be unaffected. This
>
> But this is the current requirement, as long as you use the queue you
> must hold a reference to it.

Exactly! that's why I think this solution must work independently of
subsystem.

> What do you think of the attached, then? Allow NULL lock to be passed
> in, in which case we use the queue private lock (that no one should ever
> ever touch). It looks a little confusing that
> sdev->request_queue->queue_lock now protects some sdev structures, if
> you want we can retain ->sdev_lock but as a pointer to the queue lock
> instead.

Looks good. How about the attached modification? It makes sdev_lock a
pointer to the queue lock, which we null out when we release it
(not that I don't trust SCSI or anything ;-)

James

===== drivers/block/ll_rw_blk.c 1.287 vs edited =====
--- 1.287/drivers/block/ll_rw_blk.c 2005-03-11 15:32:27 -05:00
+++ edited/drivers/block/ll_rw_blk.c 2005-04-07 09:05:19 -04:00
@@ -1714,6 +1714,15 @@
if (blk_init_free_list(q))
goto out_init;

+ /*
+ * if caller didn't supply a lock, they get per-queue locking with
+ * our embedded lock
+ */
+ if (!lock) {
+ spin_lock_init(&q->__queue_lock);
+ lock = &q->__queue_lock;
+ }
+
q->request_fn = rfn;
q->back_merge_fn = ll_back_merge_fn;
q->front_merge_fn = ll_front_merge_fn;
===== drivers/scsi/scsi_lib.c 1.151 vs edited =====
--- 1.151/drivers/scsi/scsi_lib.c 2005-02-17 14:17:22 -05:00
+++ edited/drivers/scsi/scsi_lib.c 2005-04-07 09:10:31 -04:00
@@ -349,9 +349,9 @@
shost->host_failed))
scsi_eh_wakeup(shost);
spin_unlock(shost->host_lock);
- spin_lock(&sdev->sdev_lock);
+ spin_lock(sdev->sdev_lock);
sdev->device_busy--;
- spin_unlock_irqrestore(&sdev->sdev_lock, flags);
+ spin_unlock_irqrestore(sdev->sdev_lock, flags);
}

/*
@@ -1391,7 +1391,7 @@
struct Scsi_Host *shost = sdev->host;
struct request_queue *q;

- q = blk_init_queue(scsi_request_fn, &sdev->sdev_lock);
+ q = blk_init_queue(scsi_request_fn, NULL);
if (!q)
return NULL;

===== drivers/scsi/scsi_scan.c 1.142 vs edited =====
--- 1.142/drivers/scsi/scsi_scan.c 2005-02-16 13:21:35 -05:00
+++ edited/drivers/scsi/scsi_scan.c 2005-04-07 09:10:01 -04:00
@@ -249,7 +249,6 @@
*/
sdev->borken = 1;

- spin_lock_init(&sdev->sdev_lock);
sdev->request_queue = scsi_alloc_queue(sdev);
if (!sdev->request_queue) {
/* release fn is set up in scsi_sysfs_device_initialise, so
@@ -257,7 +256,8 @@
put_device(&starget->dev);
goto out;
}
-
+ /* Now make the sdev_lock point to the request queue lock */
+ sdev->sdev_lock = sdev->request_queue->queue_lock;
sdev->request_queue->queuedata = sdev;
scsi_adjust_queue_depth(sdev, 0, sdev->host->cmd_per_lun);

===== drivers/scsi/scsi_sysfs.c 1.69 vs edited =====
--- 1.69/drivers/scsi/scsi_sysfs.c 2005-02-16 20:05:37 -05:00
+++ edited/drivers/scsi/scsi_sysfs.c 2005-04-07 09:12:17 -04:00
@@ -168,6 +168,9 @@
list_del(&sdev->starved_entry);
spin_unlock_irqrestore(sdev->host->host_lock, flags);

+ /* After releasing the queue we may no longer access its lock */
+ BUG_ON(spin_is_locked(sdev->sdev_lock));
+ sdev->sdev_lock = NULL;
if (sdev->request_queue)
scsi_free_queue(sdev->request_queue);

===== include/linux/blkdev.h 1.161 vs edited =====
--- 1.161/include/linux/blkdev.h 2005-03-09 03:03:24 -05:00
+++ edited/include/linux/blkdev.h 2005-04-07 09:05:21 -04:00
@@ -355,8 +355,11 @@
unsigned long queue_flags;

/*
- * protects queue structures from reentrancy
+ * protects queue structures from reentrancy. ->__queue_lock should
+ * _never_ be used directly, it is queue private. always use
+ * ->queue_lock.
*/
+ spinlock_t __queue_lock;
spinlock_t *queue_lock;

/*
===== include/scsi/scsi_device.h 1.32 vs edited =====
--- 1.32/include/scsi/scsi_device.h 2005-02-17 14:15:51 -05:00
+++ edited/include/scsi/scsi_device.h 2005-04-07 09:08:25 -04:00
@@ -44,7 +44,7 @@
struct list_head same_target_siblings; /* just the devices sharing same target id */

volatile unsigned short device_busy; /* commands actually active on low-level */
- spinlock_t sdev_lock; /* also the request queue_lock */
+ spinlock_t *sdev_lock; /* pointer to the request queue_lock */
spinlock_t list_lock;
struct list_head cmd_list; /* queue of in use SCSI Command structures */
struct list_head starved_entry;


2005-04-07 13:22:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, Apr 07, 2005 at 09:18:38AM -0400, James Bottomley wrote:
> [nested quotes snipped]
> > What do you think of the attached, then? Allow NULL lock to be passed
> > in, in which case we use the queue private lock (that no one should ever
> > ever touch). It looks a little confusing that
> > sdev->request_queue->queue_lock now protects some sdev structures, if
> > you want we can retain ->sdev_lock but as a pointer to the queue lock
> > instead.
>
> Looks good. How about the attached modification? It makes sdev_lock a
> pointer to the queue lock, which we null out when we release it
> (not that I don't trust SCSI or anything ;-)

Do we really need the sdev_lock pointer? There's just a single place
where we're using it and the code would be much more clear if it had just
one name.

2005-04-07 13:25:03

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, Apr 07 2005, Christoph Hellwig wrote:
> [nested quotes snipped]
>
> Do we really need the sdev_lock pointer? There's just a single place
> where we're using it and the code would be much more clear if it had just
> one name.

A comment would work equally well, and save space of course :-)

--
Jens Axboe

2005-04-07 13:29:51

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, Apr 07 2005, James Bottomley wrote:
> [nested quotes snipped]
> Looks good. How about the attached modification? It makes sdev_lock a
> pointer to the queue lock, which we null out when we release it
> (not that I don't trust SCSI or anything ;-)

That's fine with me, up to you. As I mentioned, it does look less
confusing to have the lock associated with sdev still.

--
Jens Axboe

2005-04-07 13:32:05

by James Bottomley

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, 2005-04-07 at 14:22 +0100, Christoph Hellwig wrote:
> Do we really need the sdev_lock pointer? There's just a single place
> where we're using it and the code would be much more clear if it had just
> one name.

Humour me for a while. I don't believe we have any way the lock can be
used after calling queue free, but nulling the sdev_lock pointer will
surely catch them. If nothing turns up after a few kernel revisions,
feel free to kill it.

James


2005-04-07 13:33:44

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, Apr 07 2005, James Bottomley wrote:
> On Thu, 2005-04-07 at 14:22 +0100, Christoph Hellwig wrote:
> > Do we really need the sdev_lock pointer? There's just a single place
> > where we're using it and the code would be much more clear if it had just
> > one name.
>
> Humour me for a while. I don't believe we have any way the lock can be
> used after calling queue free, but nulling the sdev_lock pointer will
> surely catch them. If nothing turns up after a few kernel revisions,
> feel free to kill it.

I think Christoph's point is: why add sdev_lock as a pointer, instead
of just killing it? It's only used in one location, so it's not really
that confusing (and a comment could fix that).

--
Jens Axboe

2005-04-07 13:40:15

by James Bottomley

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, 2005-04-07 at 15:32 +0200, Jens Axboe wrote:
> I think Christoph's point is: why add sdev_lock as a pointer, instead
> of just killing it? It's only used in one location, so it's not really
> that confusing (and a comment could fix that).

Because any use of sdev->request_queue->queue_lock would likely succeed
even after we've freed the device and released the queue. If it's a
pointer and we null it after free and release, then any later use will
trigger an immediate NULL deref oops.

Since we've had so many nasty problems around refcounting, I just would
like to assure myself that we're doing everything correctly (I really
believe we are, but empirical evidence is also nice).
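
(The poisoning idiom in miniature - the names are invented:)

#include <linux/blkdev.h>

struct mydev {
	struct request_queue *queue;
	spinlock_t *lock;		/* points at queue->queue_lock */
};

static void mydev_release(struct mydev *dev)
{
	blk_cleanup_queue(dev->queue);	/* the queue, and its lock, may
					 * now go away */
	dev->queue = NULL;
	dev->lock = NULL;		/* poison the borrowed pointers */
}

/*
 * any refcounting bug that touches the device afterwards, e.g.
 *
 *	spin_lock(dev->lock);
 *
 * now oopses immediately on a NULL dereference instead of silently
 * taking a lock that lives in freed memory.
 */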

James


2005-04-07 14:47:15

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, Apr 07 2005, James Bottomley wrote:
> On Thu, 2005-04-07 at 15:32 +0200, Jens Axboe wrote:
> > I think Christophs point is that why add sdev_lock as a pointer, instead
> > of just killing it? It's only used in one location, so it's not really
> > that confusing (and a comment could fix that).
>
> Because any use of sdev->request_queue->queue_lock would likely succeed
> even after we've freed the device and released the queue. If it's a
> pointer and we null it after free and release, then any later use will
> trigger an immediate NULL deref oops.

So clear ->request_queue instead.

--
Jens Axboe

2005-04-08 13:06:18

by James Bottomley

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Thu, 2005-04-07 at 16:45 +0200, Jens Axboe wrote:
> So clear ->request_queue instead.


Will do. Did you want me to look after your patch and add this, or do
you want to send it to Linus (after the purdah is over)?

James


2005-04-08 13:18:06

by Jens Axboe

[permalink] [raw]
Subject: Re: [OOPS] 2.6.11 - NMI lockup with CFQ scheduler

On Fri, Apr 08 2005, James Bottomley wrote:
> On Thu, 2005-04-07 at 16:45 +0200, Jens Axboe wrote:
> > So clear ->request_queue instead.
>
>
> Will do. Did you want me to look after your patch and add this, or do
> you want to send it to Linus (after the purdah is over)?

Just queue it with the rest of your changes, that is fine with me.

--
Jens Axboe