2002-03-08 11:15:37

by Zwane Mwaikambo

[permalink] [raw]
Subject: [PATCH][2.5] BUG check in elevator.c:237

Please refer to Subject [PANIC] 2.5.5-pre1 elevator.c for more detailed
explanation.

I don't really like this patch mainly because it _really_ feels like a
bandaid for a larger problem in ide-cd, namely it violating the ide
layers command requirements. But it does stop my box from oopsing and
lets it finish the "dd" i was doing. Oops is at the end.

diffed against 2.5.6-pre3 (one down 2 quadrillion to go ;)

--- linux-2.5.6-pre/drivers/ide/ide-cd.c.orig Fri Mar 8 12:09:10 2002
+++ linux-2.5.6-pre/drivers/ide/ide-cd.c Fri Mar 8 12:04:08 2002
@@ -666,9 +666,11 @@
cdrom_end_request(drive, 0);
}

- /* If we got a CHECK_CONDITION status,
- queue a request sense command. */
- if ((stat & ERR_STAT) != 0)
+ /* If we got a CHECK_CONDITION status, queue a request sense command,
+ however if we're generating spurious errors, make sure we don't
+ attempt to queue an an already started request.
+ */
+ if (((stat & ERR_STAT) != 0) && !(rq->flags & REQ_STARTED))
cdrom_queue_request_sense(drive, NULL, NULL, NULL);
} else
blk_dump_rq_flags(rq, "ide-cd bad flags");



invalid operand: 0000
CPU: 0
EIP: 0010:[<c01ebb68>] Not tainted
EFLAGS: 00010086
eax: 0000001e ebx: cfdd6c50 ecx: ffffff43 edx: 000017b6
esi: c03753c8 edi: c03753c8 ebp: c159ca64 esp: c02fdc68
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, threadinfo=c02fc000 task=c02e3ec0)
Stack: c02bc7c5 000000ed 00000000 c03753c8 c03753c8 00000002 00000002 c01fd4a9
c03753c8 cfdd6c50 c03753c8 c15ab37c 00000000 00000000 c02fdcb0 c02fdcb0
00000000 00000000 c02fdcb0 c02fdcb0 00000000 cfdd3a78 c03753c8 00000001
Call Trace: [<c01fd4a9>] [<c02070ec>] [<c0207471>] [<c0121075>] [<c01213ed>]
[<c010d54b>] [<c010a1ea>] [<c010a434>] [<c01153d3>] [<c01319f8>] [<c022e91b>]
[<c022ea82>] [<c0230a5d>] [<c028fbf0>] [<c028d4d3>] [<c025f478>] [<c025f835>]
[<c0232c09>] [<c024118b>] [<c0241524>] [<c01fd2c6>] [<c0207400>] [<c010a1ea>]
[<c010a3e1>] [<c0106d50>] [<c0106d50>] [<c0106d50>] [<c0106d74>] [<c0106df2>]
[<c0105000>]

Code: 0f 0b 58 5a 8b 43 1c a9 04 00 00 00 75 0a 83 e0 01 8b 44 85
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing

>>EIP; c01ebb68 <elevator_linus_add_request+38/90> <=====
Trace; c01fd4a9 <ide_do_drive_cmd+c9/120>
Trace; c02070ec <cdrom_decode_status+1fc/230>
Trace; c0207471 <cdrom_read_intr+71/330>
Trace; c0121075 <update_process_times+25/30>
Trace; c01213ed <do_timer+3d/70>
Trace; c010d54b <timer_interrupt+13b/150>
Trace; c010a1ea <handle_IRQ_event+3a/70>
Trace; c010a434 <do_IRQ+e4/f0>
Trace; c01153d3 <try_to_wake_up+183/190>
Trace; c01319f8 <kfree+1d8/270>
Trace; c022e91b <kfree_skbmem+b/60>
Trace; c022ea82 <__kfree_skb+112/120>
Trace; c0230a5d <skb_free_datagram+1d/30>
Trace; c028fbf0 <rpc_unlock_task+90/a0>
Trace; c028d4d3 <udp_data_ready+243/280>
Trace; c025f478 <udp_queue_rcv_skb+178/1c0>
Trace; c025f835 <udp_rcv+165/2c0>
Trace; c0232c09 <netif_rx+79/f0>
Trace; c024118b <ip_local_deliver+12b/1c0>
Trace; c0241524 <ip_rcv+304/370>
Trace; c01fd2c6 <ide_intr+d6/150>
Trace; c0207400 <cdrom_read_intr+0/330>
Trace; c010a1ea <handle_IRQ_event+3a/70>
Trace; c010a3e1 <do_IRQ+91/f0>
Trace; c0106d50 <default_idle+0/30>
Trace; c0106d50 <default_idle+0/30>
Trace; c0106d50 <default_idle+0/30>
Trace; c0106d74 <default_idle+24/30>
Trace; c0106df2 <cpu_idle+32/50>
Trace; c0105000 <_stext+0/0>
Code; c01ebb68 <elevator_linus_add_request+38/90>
00000000 <_EIP>:
Code; c01ebb68 <elevator_linus_add_request+38/90> <=====


2002-03-08 12:01:09

by Martin Dalecki

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

Please let me elaborate a bit on this, to give you may be
some hints about where to look for an actual solution of
the problem:

Zwane Mwaikambo wrote:
> Please refer to Subject [PANIC] 2.5.5-pre1 elevator.c for more detailed
> explanation.
>
> I don't really like this patch mainly because it _really_ feels like a
> bandaid for a larger problem in ide-cd, namely it violating the ide
> layers command requirements. But it does stop my box from oopsing and
> lets it finish the "dd" i was doing. Oops is at the end.
>
> diffed against 2.5.6-pre3 (one down 2 quadrillion to go ;)
>
> --- linux-2.5.6-pre/drivers/ide/ide-cd.c.orig Fri Mar 8 12:09:10 2002
> +++ linux-2.5.6-pre/drivers/ide/ide-cd.c Fri Mar 8 12:04:08 2002
> @@ -666,9 +666,11 @@
> cdrom_end_request(drive, 0);
> }
>
> - /* If we got a CHECK_CONDITION status,
> - queue a request sense command. */
> - if ((stat & ERR_STAT) != 0)
> + /* If we got a CHECK_CONDITION status, queue a request sense command,
> + however if we're generating spurious errors, make sure we don't
> + attempt to queue an an already started request.
> + */
> + if (((stat & ERR_STAT) != 0) && !(rq->flags & REQ_STARTED))
> cdrom_queue_request_sense(drive, NULL, NULL, NULL);
> } else
> blk_dump_rq_flags(rq, "ide-cd bad flags");
>

After having a look at the oops I think that this may be very well a
symptom of another problem in the ide-cd.c drivers overall
way of working. Please let me elaborate a bit.

In ide.c there is one central interrupt handler, namely:

void ide_timer_expiry(unsigned long data)

This function is called upon finish of every command.

However for cd-rom there are commands, which can
take quite a long time. Therefore there is the possiblity there
to provide a polling function, which will be engaged after the
interrupt happens in the above function:

/* continue */
if ((wait = expiry(drive)) != 0) {
/* reengage timer */
hwgroup->timer.expires = jiffies + wait;
add_timer(&hwgroup->timer);
spin_unlock_irqrestore(&ide_lock, flags);
return;
}
And plase guess whot? CD-ROM is the only driver which is using
this facility. Please have a look at the last
argument of ide_set_handler(). The second argument is the
interrutp handler for a command. The third is supposed to be
the poll timerout function. But if you look at the
actual poll function found in ide-cd.c (and only there).
You may as well feel to try to just execute its commands directly in
ide_timer_expiry, thus reducing tons of possible races ind the
overall intr handling found currently there.

2002-03-08 12:12:59

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

On Fri, 8 Mar 2002, Martin Dalecki wrote:

> Please let me elaborate a bit on this, to give you may be
> some hints about where to look for an actual solution of
> the problem:

Thanks for taking the time to explain.

> However for cd-rom there are commands, which can
> take quite a long time. Therefore there is the possiblity there
> to provide a polling function, which will be engaged after the
> interrupt happens in the above function:

So are you suggesting perhaps that we change the request servicing to
polling? I'm a bit confused as to how this would fit in with
cdrom_decode_status (which in this case is called from the read_intr). You
might need to whip out a larger clue stick ;)

Thanks again,
Zwane


2002-03-08 12:38:24

by Martin Dalecki

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

Zwane Mwaikambo wrote:
> On Fri, 8 Mar 2002, Martin Dalecki wrote:
>
>
>>Please let me elaborate a bit on this, to give you may be
>>some hints about where to look for an actual solution of
>>the problem:
>>
>
> Thanks for taking the time to explain.
>
>
>>However for cd-rom there are commands, which can
>>take quite a long time. Therefore there is the possiblity there
>>to provide a polling function, which will be engaged after the
>>interrupt happens in the above function:
>>
>
> So are you suggesting perhaps that we change the request servicing to
> polling? I'm a bit confused as to how this would fit in with

At lest we should change the way the transition between intr
controlled mode and polling is done.

> cdrom_decode_status (which in this case is called from the read_intr). You
> might need to whip out a larger clue stick ;)

Well if your error is deterministically reproducable, it's
quite propably I would dare to have a look after it.
Could you just explain how to trigger it (Unfortunately I have
already deleted yours mail about this...)


2002-03-08 12:34:44

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

On Fri, 8 Mar 2002, Zwane Mwaikambo wrote:

> > However for cd-rom there are commands, which can
> > take quite a long time. Therefore there is the possiblity there
> > to provide a polling function, which will be engaged after the
> > interrupt happens in the above function:
>
> So are you suggesting perhaps that we change the request servicing to
> polling? I'm a bit confused as to how this would fit in with
> cdrom_decode_status (which in this case is called from the read_intr).

Ok how about the scenario i'm seeing. The cdrom is spewing constant I/O
errors and we go and check on these, they are actually _not_ in fact
taking a long time and actually preempting each other! Looking at the
code and the usage of ide_preempt, my guess is that checking the status is
a higher priority hence the queue jumping. How can we use a timer for this
one?

Thanks,
Zwane


2002-03-08 12:44:44

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

On Fri, 8 Mar 2002, Martin Dalecki wrote:
> > So are you suggesting perhaps that we change the request servicing to
> > polling? I'm a bit confused as to how this would fit in with
>
> At lest we should change the way the transition between intr
> controlled mode and polling is done.

To something like what some other subsystem drivers do? ie

interrupt triggered
ISR hands off work to a BH

Or is that different from what you had in mind?

> Well if your error is deterministically reproducable, it's
> quite propably I would dare to have a look after it.
> Could you just explain how to trigger it (Unfortunately I have
> already deleted yours mail about this...)

I forwarded the email to you. Its easily reproducible, but i haven't used
any other CD apart from the FreeBSD 4.4-REL install disk which initially
triggered it.

Cheers,
Zwane


2002-03-08 12:51:14

by Martin Dalecki

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

Zwane Mwaikambo wrote:
> On Fri, 8 Mar 2002, Martin Dalecki wrote:
>
>>>So are you suggesting perhaps that we change the request servicing to
>>>polling? I'm a bit confused as to how this would fit in with
>>>
>>At lest we should change the way the transition between intr
>>controlled mode and polling is done.
>>
>
> To something like what some other subsystem drivers do? ie
>
> interrupt triggered
> ISR hands off work to a BH
>
> Or is that different from what you had in mind?

No this is precisely what I had in mind.
Network adapter and SCSI drivers are good points where
to have a look.

2002-03-08 13:05:15

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

Ok cool, regarding the ide-cd current problem, i still don't see how
moving it to a BH would help in this case, the only difference i can see
is that the ISR won't have to be so heavy and we can get more interrupt
handling done (assuming the BH processing keeps up). The problem isn't
that the read_intr is preempting other code and injecting already active
requests into the queue, but that ide-cd is allowing running requests to
be sent multiple times.

If you can just explain marginally a possible workaround other than the
kludge i diffed up i can try investigating further.

Thanks,
Zwane

2002-03-08 14:02:12

by Martin Dalecki

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

Zwane Mwaikambo wrote:
> Ok cool, regarding the ide-cd current problem, i still don't see how
> moving it to a BH would help in this case, the only difference i can see
> is that the ISR won't have to be so heavy and we can get more interrupt
> handling done (assuming the BH processing keeps up). The problem isn't
> that the read_intr is preempting other code and injecting already active
> requests into the queue, but that ide-cd is allowing running requests to
> be sent multiple times.

Yeep...

> If you can just explain marginally a possible workaround other than the
> kludge i diffed up i can try investigating further.

Actually I have right now no *precise* idea (please note the difference
between *no idea* and *no precise idea*) about the optimal fix.
However making things leaner in first step may give you direct insight
into the proper solution as a side effect (OK. I know this is a bit
philosophing, but this method works really frequently... it's called
deduction...).

And it may actually turn out that the diff could be shorter
as some wordish explanation ;-).

2002-03-11 09:46:29

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

On Fri, Mar 08 2002, Martin Dalecki wrote:
> After having a look at the oops I think that this may be very well a
> symptom of another problem in the ide-cd.c drivers overall
> way of working. Please let me elaborate a bit.
>
> In ide.c there is one central interrupt handler, namely:
>
> void ide_timer_expiry(unsigned long data)
>
> This function is called upon finish of every command.

Eh? ide_timer_expiry is the timer handler called if a interrupt timeout
occurs.

> However for cd-rom there are commands, which can
> take quite a long time. Therefore there is the possiblity there
> to provide a polling function, which will be engaged after the
> interrupt happens in the above function:
>
> /* continue */
> if ((wait = expiry(drive)) != 0) {

[snip]

That's nonsense too. I added the expiry hook to let lower levels decide
what should happen when an interrupt timeout occurs. So there's been
_no_ interrupt if we enter this from the timer handler.

> And plase guess whot? CD-ROM is the only driver which is using
> this facility. Please have a look at the last

Right, it was added to handle long commands like format unit etc.

> argument of ide_set_handler(). The second argument is the
> interrutp handler for a command. The third is supposed to be
> the poll timerout function. But if you look at the
> actual poll function found in ide-cd.c (and only there).
> You may as well feel to try to just execute its commands directly in
> ide_timer_expiry, thus reducing tons of possible races ind the
> overall intr handling found currently there.

I don't know what tangent you are going off on here, I think you should
re-read this code a lot more carefully. There's no polling going on
here.

--
Jens Axboe

2002-03-11 10:25:10

by Martin Dalecki

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

Jens Axboe wrote:

> That's nonsense too. I added the expiry hook to let lower levels decide
> what should happen when an interrupt timeout occurs. So there's been
> _no_ interrupt if we enter this from the timer handler.

No interrupt from the same drive right.

>>And plase guess whot? CD-ROM is the only driver which is using
>>this facility. Please have a look at the last
>>
>
> Right, it was added to handle long commands like format unit etc.

Hmm seeks on tapes can take awfully long as well...

>>argument of ide_set_handler(). The second argument is the
>>interrutp handler for a command. The third is supposed to be
>>the poll timerout function. But if you look at the
>>actual poll function found in ide-cd.c (and only there).
>>You may as well feel to try to just execute its commands directly in
>>ide_timer_expiry, thus reducing tons of possible races ind the
>>overall intr handling found currently there.
>>
>
> I don't know what tangent you are going off on here, I think you should
> re-read this code a lot more carefully. There's no polling going on
> here.

I think the term polling used by me is the only problem here ;-).
(I consider every command controll which goes without irq notification
just polling... whatever it polls once or not ;-).

2002-03-11 11:21:14

by Andre Hedrick

[permalink] [raw]
Subject: Re: [PATCH][2.5] BUG check in elevator.c:237

On Mon, 11 Mar 2002, Martin Dalecki wrote:

> Jens Axboe wrote:
>
> > That's nonsense too. I added the expiry hook to let lower levels decide
> > what should happen when an interrupt timeout occurs. So there's been
> > _no_ interrupt if we enter this from the timer handler.
>
> No interrupt from the same drive right.

No reported interrupt from the drive.
If you have gone into one of the old SFF overlap modes then you have
attempted to release service time.

> >>And plase guess whot? CD-ROM is the only driver which is using
> >>this facility. Please have a look at the last
> >>
> >
> > Right, it was added to handle long commands like format unit etc.
>
> Hmm seeks on tapes can take awfully long as well...

See above, streaming media is different.
It is a noise maker both on the bus and in life.

> >>argument of ide_set_handler(). The second argument is the
> >>interrutp handler for a command. The third is supposed to be
> >>the poll timerout function. But if you look at the
> >>actual poll function found in ide-cd.c (and only there).
> >>You may as well feel to try to just execute its commands directly in
> >>ide_timer_expiry, thus reducing tons of possible races ind the
> >>overall intr handling found currently there.
> >>
> >
> > I don't know what tangent you are going off on here, I think you should
> > re-read this code a lot more carefully. There's no polling going on
> > here.
>
> I think the term polling used by me is the only problem here ;-).
> (I consider every command controll which goes without irq notification
> just polling... whatever it polls once or not ;-).

It is not polling, follow Jens' suggestion and read the code.
Rewind, before you started making changes. You will begin to have a
better grasp of the issues. You will need that since the legacy hardware
in the device side will remain for a while.


Regards,

Andre Hedrick
Linux Disk Certification Project Linux ATA Development