2003-03-15 18:48:10

by Zwane Mwaikambo

[permalink] [raw]
Subject: Any hope for ide-scsi (error handling)?

I've been trying to upgrade to 2.5 and one of my stumbling blocks is
cdwriting, i have both a scsi cdwriter and an ide one and haven't managed
to write with either (SCSI bug report later). I know i shouldn't be using
ide-scsi but it gets tiring typing since the gui interfaces to
cdrecord are still playing catchup (sorry for not being hardcore).
Unfortunately due to the nature of IDE it also completely takes out the
harddisk attached to the same cable. I've setup a test rig for this so if
anyone has suggestions i'm up for testing.

Here is a rather long paste of an attempt at writing with ide-scsi. There
is a deadlock at the end of it all.

kernel is 2.5.64 (Alan should i be using 2.5.64-ac?)

cdrecord S CB565DC0 4211491704 1310 1250 1311
(NOTLB)
Call Trace:
[<c011cfb2>] schedule+0x182/0x320
[<c0318a81>] sg_ioctl+0xc71/0xe20

(an ls -l i tried on a harddisk on the same channel)

ls D C02BDC24 4217658044 1313 1180 1249
(NOTLB)
Call Trace:
[<c02bdc24>] do_ide_request+0x14/0x20
[<c027b510>] generic_unplug_device+0xd0/0xe0
[<c011cfb2>] schedule+0x182/0x320
[<c011e779>] io_schedule+0x29/0x40
[<c015c9dd>] __wait_on_buffer+0xad/0xb0

...

hdd: irq timeout: status=0xd0 { Busy }
ide-scsi: abort called for 530
Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Debug: sleeping function called from illegal context at
include/asm/semaphore.h:119
Call Trace:
[<c02dc9ec>] scsi_sleep+0x7c/0xa0
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c01082fe>] __down+0xfe/0x1d0
[<c011d1a0>] default_wake_function+0x0/0x20
[<c010ac1b>] show_trace+0x3b/0x80
[<c010869b>] __down_failed+0xb/0x14
[<c02dd450>] .text.lock.scsi_error+0x97/0xc7
[<c02dc950>] scsi_sleep_done+0x0/0x20
[<c03127fc>] idescsi_abort+0x1bc/0x1d0
[<c02dc1ba>] scsi_try_to_abort_cmd+0x9a/0xf0
[<c02dc30e>] scsi_eh_abort_cmds+0x3e/0x80
[<c02dcef3>] scsi_unjam_host+0xf3/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

ide-scsi: reset called for 530
bad: scheduling while atomic!
Call Trace:
[<c011d143>] schedule+0x313/0x320
[<c012bd84>] schedule_timeout+0x64/0xb0
[<c012bd10>] process_timeout+0x0/0x10
[<c0312973>] idescsi_reset+0x163/0x180
[<c02dc3da>] scsi_try_bus_device_reset+0x8a/0x100
[<c02dc499>] scsi_eh_bus_device_reset+0x49/0xc0
[<c02dcd47>] scsi_eh_ready_devs+0x17/0x60
[<c02dcf07>] scsi_unjam_host+0x107/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

NMI Watchdog detected LOCKUP on CPU0, eip c02be73d, registers:
CPU: 0
EIP: 0060:[<c02be73d>] Tainted: PF
EFLAGS: 00000086
EIP is at .text.lock.ide_io+0x40/0x93
eax: d1e6cdcc ebx: c1616000 ecx: 00000106 edx: 00000001
esi: c12d7500 edi: d1e6cdcc ebp: c1617e08 esp: c1617de0
ds: 007b es: 007b ss: 0068
Process scsi_eh_1 (pid: 12, threadinfo=c1616000 task=c1760080)
Stack: c05c3160 c12d7500 00000282 c1617e08 c03557e1 c05c3160 0018e20c d1e6cdf0
c12d7500 c02bdcc0 c1617e30 c012c272 d1e6cdcc d1e6cdcc fffffffd c1617e50
00000000 00000001 c0561ee8 fffffffd c1617e50 c0127698 c12d7500 00000000
Call Trace:
[<c03557e1>] i8042_timer_func+0x21/0x30
[<c02bdcc0>] ide_timer_expiry+0x0/0x310
[<c012c272>] __run_timers+0xd2/0x1f7
[<c0127698>] do_softirq+0xc8/0xd0
[<c0117307>] smp_apic_timer_interrupt+0x77/0x80
[<c010a9ba>] apic_timer_interrupt+0x1a/0x20
[<c011d036>] schedule+0x206/0x320
[<c012bd84>] schedule_timeout+0x64/0xb0
[<c012bd10>] process_timeout+0x0/0x10
[<c0312973>] idescsi_reset+0x163/0x180
[<c02dc3da>] scsi_try_bus_device_reset+0x8a/0x100
[<c02dc499>] scsi_eh_bus_device_reset+0x49/0xc0
[<c02dcd47>] scsi_eh_ready_devs+0x17/0x60
[<c02dcf07>] scsi_unjam_host+0x107/0x120
[<c02dd026>] scsi_error_handler+0x106/0x140
[<c02dcf20>] scsi_error_handler+0x0/0x140
[<c0107435>] kernel_thread_helper+0x5/0x10

Code: f3 90 80 3d e0 20 56 c0 00 7e f5 e9 b7 f5 ff ff f3 90 80 3d


2003-03-15 19:15:35

by dan carpenter

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Saturday 15 March 2003 07:55 pm, Zwane Mwaikambo wrote:
>
> bad: scheduling while atomic!
> Call Trace:
>

887 spin_lock_irqsave(&ide_lock, flags);
888 while (HWGROUP(drive)->handler) {
889 HWGROUP(drive)->handler = NULL;
890 schedule_timeout(1);
891 }

Here is at least one bad call to schedule() in
static int idescsi_reset (Scsi_Cmnd *cmd)

regards,
dan



2003-03-15 19:46:11

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, 15 Mar 2003, dan carpenter wrote:

> On Saturday 15 March 2003 07:55 pm, Zwane Mwaikambo wrote:
> >
> > bad: scheduling while atomic!
> > Call Trace:
> >
>
> 887 spin_lock_irqsave(&ide_lock, flags);
> 888 while (HWGROUP(drive)->handler) {
> 889 HWGROUP(drive)->handler = NULL;
> 890 schedule_timeout(1);
> 891 }
>
> Here is at least one bad call to schedule() in
> static int idescsi_reset (Scsi_Cmnd *cmd)

Apart from the schedule with the ide_lock held, what is that code actually
doing?

Zwane
--
function.linuxpower.ca

2003-03-15 20:02:14

by dan carpenter

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Saturday 15 March 2003 08:53 pm, Zwane Mwaikambo wrote:
> On Sat, 15 Mar 2003, dan carpenter wrote:
> > On Saturday 15 March 2003 07:55 pm, Zwane Mwaikambo wrote:
> > > bad: scheduling while atomic!
> > > Call Trace:
> >
> > 887 spin_lock_irqsave(&ide_lock, flags);
> > 888 while (HWGROUP(drive)->handler) {
> > 889 HWGROUP(drive)->handler = NULL;
> > 890 schedule_timeout(1);
> > 891 }
> >
> > Here is at least one bad call to schedule() in
> > static int idescsi_reset (Scsi_Cmnd *cmd)
>
> Apart from the schedule with the ide_lock held, what is that code actually
> doing?
>
> Zwane

Hm... Good question. I have no idea what the while loop is for.

regards,
dan carpenter

2003-03-15 20:14:26

by bert hubert

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, Mar 15, 2003 at 03:52:21AM +0100, dan carpenter wrote:
> > >
> > > 887 spin_lock_irqsave(&ide_lock, flags);
> > > 888 while (HWGROUP(drive)->handler) {
> > > 889 HWGROUP(drive)->handler = NULL;
> > > 890 schedule_timeout(1);
> > > 891 }
> > >
> > > Here is at least one bad call to schedule() in
> > > static int idescsi_reset (Scsi_Cmnd *cmd)
> >
> > Apart from the schedule with the ide_lock held, what is that code actually
> > doing?
> >
> > Zwane
>
> Hm... Good question. I have no idea what the while loop is for.

A construct like this was suggested for use in swsusp too to make sure that
only *one* request is outstanding for a controler. This was also mentioned
to be the unclean way and that there are taskfile interfaces which are
cleaner.

Bert.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
http://netherlabs.nl Consulting

2003-03-15 20:16:19

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, 15 Mar 2003, dan carpenter wrote:

> > Apart from the schedule with the ide_lock held, what is that code actually
> > doing?
> >
> > Zwane
>
> Hm... Good question. I have no idea what the while loop is for.

I suppose the magik is in the comments;

/* first null the handler for the drive and let any process
* doing IO (on another CPU) run to (partial) completion
* the lock prevents processing new requests */


--
function.linuxpower.ca

2003-03-15 20:33:50

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, 15 Mar 2003, bert hubert wrote:

> A construct like this was suggested for use in swsusp too to make sure that
> only *one* request is outstanding for a controler. This was also mentioned
> to be the unclean way and that there are taskfile interfaces which are
> cleaner.

hmm..

cpu0:

do_IRQ();
handle_IRQ_event();
...
static ide_startstop_t read_intr (ide_drive_t *drive)
{
msect = drive->mult_count; /* let msect = 0 */
...

if (i > 0) {
if (msect)
goto read_next;
if (HWGROUP(drive)->handler != NULL) <- [1]
BUG();
ide_set_handler(drive, &read_intr, WAIT_CMD, NULL); <- [2]
return ide_started;
}

}

cpu1:
swsusp code;
...
spin_lock_irqsave(&ide_lock, flags);
while (HWGROUP(drive)->handler) {
HWGROUP(drive)->handler = NULL; <- [1]
schedule_timeout(1);
}
<- [2]
...

Ok so at event [1] we have a ->handler set on cpu0 so we pass that check.
Then cpu1 acquires ide_lock and NULLs it out. cpu0 blocks on the lock in
ide_set_handler and when cpu1 releases it it re-assigns ->handler. What
happens then?

Zwane

2003-03-15 20:50:17

by Willem Riede

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On 2003.03.15 15:23 Zwane Mwaikambo wrote:
> On Sat, 15 Mar 2003, dan carpenter wrote:
>
> > > Apart from the schedule with the ide_lock held, what is that code actually
> > > doing?
> > >
> > > Zwane
> >
> > Hm... Good question. I have no idea what the while loop is for.
>
> I suppose the magik is in the comments;
>
> /* first null the handler for the drive and let any process
> * doing IO (on another CPU) run to (partial) completion
> * the lock prevents processing new requests */
>
Indeed. If you get there, the command in progress is hung.
To be able to restart the device, the old command needs to be
aborted. But that is an inherently racy undertaking.

Nominally, I just want to set HWGROUP(drive)->handler = NULL.
But there is a small chance, that there is actually (interrupt)
activity going on for the command, which would result in a new
entry in HWGROUP(drive)->handler popping up after it is cleared.

The loop as programmed significantly increases the odds that
the old command is really aborted.

It may not be elegant to schedule(1) with the lock taken, but it
does work.

However, my latest patch doesn't seem to be applied, since in my
version I have a set_current_state(TASK_INTERRUPTIBLE); before
the schedule.

Regards, Willem Riede.

2003-03-15 20:48:30

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, 15 Mar 2003, Zwane Mwaikambo wrote:

> On Sat, 15 Mar 2003, bert hubert wrote:
>
> > A construct like this was suggested for use in swsusp too to make sure that
> > only *one* request is outstanding for a controler. This was also mentioned
> > to be the unclean way and that there are taskfile interfaces which are
> > cleaner.

[snip]

> Ok so at event [1] we have a ->handler set on cpu0 so we pass that check.
> Then cpu1 acquires ide_lock and NULLs it out. cpu0 blocks on the lock in
> ide_set_handler and when cpu1 releases it it re-assigns ->handler. What
> happens then?

Skip that, i forgot swsusp doesn't do SMP.

Zwane

2003-03-15 21:03:56

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, 15 Mar 2003, Willem Riede wrote:

> Indeed. If you get there, the command in progress is hung.
> To be able to restart the device, the old command needs to be
> aborted. But that is an inherently racy undertaking.
>
> Nominally, I just want to set HWGROUP(drive)->handler = NULL.
> But there is a small chance, that there is actually (interrupt)
> activity going on for the command, which would result in a new
> entry in HWGROUP(drive)->handler popping up after it is cleared.
>
> The loop as programmed significantly increases the odds that
> the old command is really aborted.
>
> It may not be elegant to schedule(1) with the lock taken, but it
> does work.
>
> However, my latest patch doesn't seem to be applied, since in my
> version I have a set_current_state(TASK_INTERRUPTIBLE); before
> the schedule.

Yeah but what happens when a task tries to acquire ide_lock? Incidentally
this one crept in via timer softirq so if ide_timer_expiry gets there
before your scheduled timeout timer?

NMI Watchdog detected LOCKUP on CPU0, eip c02be73d, registers:
CPU: 0
EIP: 0060:[<c02be73d>] Tainted: PF
EFLAGS: 00000086
EIP is at .text.lock.ide_io+0x40/0x93

Call Trace:
[<c03557e1>] i8042_timer_func+0x21/0x30
[<c02bdcc0>] ide_timer_expiry+0x0/0x310

--
function.linuxpower.ca

2003-03-15 21:53:41

by dan carpenter

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Saturday 15 March 2003 10:01 pm, Willem Riede wrote:
> It may not be elegant to schedule(1) with the lock taken, but it
> does work.
>
> However, my latest patch doesn't seem to be applied, since in my
> version I have a set_current_state(TASK_INTERRUPTIBLE); before
> the schedule.

I don't see how it works.

spin_lock_irqsave() increments preempt_count()
in_atomic checks is defined as:
# define in_atomic() ((preempt_count() & ~PREEMPT_ACTIVE) != kernel_locked())
kernel_locked() is defined as:
#define kernel_locked() (current->lock_depth >= 0)

If you call schedule while in_atomic() then it prints out the error
"bad: scheduling while atomic!\n".

As far as I can see set_current_state() doesn't affect
preempt_count() or current->lock_depth. I must be
missing something...

regards,
dan carpenter

2003-03-15 22:18:19

by Willem Riede

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On 2003.03.15 16:11 Zwane Mwaikambo wrote:
> On Sat, 15 Mar 2003, Willem Riede wrote:
>
> > It may not be elegant to schedule(1) with the lock taken, but it
> > does work.
> >
> > However, my latest patch doesn't seem to be applied, since in my
> > version I have a set_current_state(TASK_INTERRUPTIBLE); before
> > the schedule.
>
> Yeah but what happens when a task tries to acquire ide_lock?

It gets to wait until I'm done.

Regards, Willem Riede.

2003-03-15 23:45:59

by Alan

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, 2003-03-15 at 02:52, dan carpenter wrote:
> > > Here is at least one bad call to schedule() in
> > > static int idescsi_reset (Scsi_Cmnd *cmd)
> >
> > Apart from the schedule with the ide_lock held, what is that code actually
> > doing?
> >
> > Zwane
>
> Hm... Good question. I have no idea what the while loop is for.

See 2.4.21-pre5-ac2 or later. You are discussing obsolete and known
broken code otherwise. The -ac code is closer to working and has most
but alas it seems not all the reset stuff fixed. While this works for
2.4 the 2.5 ide-scsi error handling has been rewritten by someone for
the scsi changes and so differs from both the 2.4 style scsi error
handling and reality by some margin. Once the reset code settles down
in 2.4.21-pre5-ac and 2.5.64-ac I'll take a crack at ide-scsi again.

2003-03-16 00:03:29

by Alan

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sat, 2003-03-15 at 21:01, Willem Riede wrote:
> It may not be elegant to schedule(1) with the lock taken, but it
> does work.

You can't sleep holding a lock. You also can't null the hwgroup that
way and you have to deal with some other locking concerns. Have a look
how HDIO_*_RESET is handled in the very latest 2.4/2.5-ac code and
you should be able to follow from that. Note your code paths will be
much like the ioctl as the existing reset code paths are for paths
where we abort from a known safe state (drive initiated or locked).

With the stuff there now you should be able to abort the command
fairly cleanly and then reset the interface.

2003-03-16 00:45:56

by Willem Riede

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On 2003.03.15 20:23 Alan Cox wrote:
> On Sat, 2003-03-15 at 21:01, Willem Riede wrote:
> > It may not be elegant to schedule(1) with the lock taken, but it
> > does work.
>
> You can't sleep holding a lock. You also can't null the hwgroup that
> way and you have to deal with some other locking concerns. Have a look
> how HDIO_*_RESET is handled in the very latest 2.4/2.5-ac code and
> you should be able to follow from that. Note your code paths will be
> much like the ioctl as the existing reset code paths are for paths
> where we abort from a known safe state (drive initiated or locked).
>
> With the stuff there now you should be able to abort the command
> fairly cleanly and then reset the interface.
>
I hear you, and I will take a hard look at that code. But please
realize, that if we get to this code segment, the drive has _not_
responded to the regular command and is in an _unknown_ state.

Thanks, Willem Riede.

2003-03-16 20:20:54

by Alan

[permalink] [raw]
Subject: Re: Any hope for ide-scsi (error handling)?

On Sun, 2003-03-16 at 00:55, Willem Riede wrote:
> I hear you, and I will take a hard look at that code. But please
> realize, that if we get to this code segment, the drive has _not_
> responded to the regular command and is in an _unknown_ state.

Indeed. The error/abort distinction is more about locking/recovery
than drive level semantics. We will still issue an ATAPI reset and
if need be fall back to an ATA reset. What we handle differently is
the assumption that the drive status registers are valid (they are
not), that we are in an interrupt handle an IDE phase (we are not)
and that we want to recovery/reissue or issue the next command
from this context.