Hi Martin,
here at work I have i845 chipset, with one UDMA100 disk connected
to the primary channel, and one UDMA100 disk and one CD-DVD on the
secondary one. CD-DVD driver is not loaded at all, all three devices
are configured for UDMA by kernel.
Today 2.5.29-cset511 died when rebooting to 2.5.29-cset536 (rmap.c:212
BUG(), but I believe that it is fixed by Paulus's page->index patch
(cset520)) and after reboot I'm not able to fsck /dev/hdc1. It dies with
hdc: ide_dma_intr: status=0x58 [ drive ready,seek complete,data request]
hdc: request error, nr. 1
and fsck is stuck in D state, and the channel is stopped :-( First something easy: I think
that we should use ", " as a separator in dump_bits, and if there is
space after opening "[", there should be also space before closing "]".
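A minimal sketch of the formatting proposed above, assuming the standard ATA status register bit layout. The helper name format_status() and the bit-name strings are made up for illustration; this is not the kernel's dump_bits().

```c
#include <stdio.h>
#include <stddef.h>

/* ATA status register bits (standard layout, bit 7 down to bit 0). */
struct status_bit {
	unsigned char mask;
	const char *name;
};

static const struct status_bit ata_status_bits[] = {
	{ 0x80, "busy" },
	{ 0x40, "drive ready" },
	{ 0x20, "device fault" },
	{ 0x10, "seek complete" },
	{ 0x08, "data request" },
	{ 0x04, "corrected error" },
	{ 0x02, "index" },
	{ 0x01, "error" },
};

/*
 * Hypothetical helper: writes the names of the set bits into buf as
 * "[ a, b, c ]", using ", " as separator and a space inside both brackets.
 */
static void format_status(unsigned char status, char *buf, size_t len)
{
	size_t i, used;
	int first = 1;

	used = (size_t)snprintf(buf, len, "[ ");
	for (i = 0; i < sizeof(ata_status_bits) / sizeof(ata_status_bits[0]); i++) {
		if (!(status & ata_status_bits[i].mask))
			continue;
		if (used < len)
			used += (size_t)snprintf(buf + used, len - used, "%s%s",
						 first ? "" : ", ",
						 ata_status_bits[i].name);
		first = 0;
	}
	if (used < len)
		snprintf(buf + used, len - used, "%s]", first ? "" : " ");
}
```

With status 0x58 (0x40 | 0x10 | 0x08) this yields "[ drive ready, seek complete, data request ]".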
Second problem is that read operation which ends with
"drive ready, seek complete, data request" (why it happened in first
place?) will just read one sector from drive (it was DMA transfer,
so drive->mult_count == 0), and then it returns from ata_error
with ATA_OP_CONTINUES. But what continues? Drive told us that
current operation is done, and no new operation was started, so
there is very low chance that some IRQ will ever come, and timer was
just removed by ata_irq_request(), so channel will never awake.
And third, why this happens at all? When I instrumented ide_dma_intr
with printk, udma_stop() returns zero: it means that everything went
fine, UDMA engine asked for interrupt, no error, UDMA engine stopped.
Only reason I can invent is that drive did not clear DRQ bit yet, or
that we programmed UDMA engine with too few bytes to transfer. Either
of these explanations looks strange to me, as this does not explain
why it happens only when both channels are in use simultaneously.
And last thing: problem does not happen when only one of channels is
active, it is triggered only when both channels are active, and
channel #1 is always one which dies. Channel #0 uses IRQ14, channel #1
IRQ15, so there should be no sharing involved.
Thanks,
Petr Vandrovec
[email protected]
Petr Vandrovec wrote:
> Hi Martin,
> here at work I have i845 chipset, with one UDMA100 disk connected
> to the primary channel, and one UDMA100 disk and one CD-DVD on the
> secondary one. CD-DVD driver is not loaded at all, all three devices
> are configured for UDMA by kernel.
>
> Today 2.5.29-cset511 died when rebooting to 2.5.29-cset536 (rmap.c:212
> BUG(), but I believe that it is fixed by Paulus's page->index patch
> (cset520)) and after reboot I'm not able to fsck /dev/hdc1. It dies with
>
> hdc: ide_dma_intr: status=0x58 [ drive ready,seek complete,data request]
> hdc: request error, nr. 1
That usually indicates that some operation was started before
some other one had really finished.
> and fsck is D, and channel is stopped :-( First something easy: I think
> that we should use ", " as a separator in dump_bits, and if there is
> space after opening "[", there should be also space before closing "]".
Yep. No problem.
>
> Second problem is that read operation which ends with
> "drive ready, seek complete, data request" (why it happened in first
> place?) will just read one sector from drive (it was DMA transfer,
> so drive->mult_count == 0), and then it returns from ata_error
> with ATA_OP_CONTINUES. But what continues? Drive told us that
> current operation is done, and no new operation was started, so
> there is very low chance that some IRQ will ever come, and timer was
> just removed by ata_irq_request(), so channel will never awake.
What should continue is the retry of the operation, since otherwise
it will be abandoned in do_ide_request(). However, I will recheck.
> And third, why this happens at all? When I instrumented ide_dma_intr
> with printk, udma_stop() returns zero: it means that everything went
> fine, UDMA engine asked for interrupt, no error, UDMA engine stopped.
> Only reason I can invent is that drive did not clear DRQ bit yet, or
> that we programmed UDMA engine with too few bytes to transfer. Either
> of these explanations looks strange to me, as this does not explain
> why it happens only when both channels are in use simultaneously.
>
> And last thing: problem does not happen when only one of channels is
> active, it is triggered only when both channels are active, and
> channel #1 is always one which dies. Channel #0 uses IRQ14, channel #1
> IRQ15, so there should be no sharing involved.
Hmm, the order of channels matters for the way the queues are fed.
I think we could be experiencing reentrancy problems. Or there
are some errors in ata_irq_handler() in dispatching the incoming
IRQs. It would be a good idea to add an IRQ number parameter to the
IRQ handler type, since this would allow us to detect such situations.
One check that could help would be to determine the drive to service next
based on drive->queue in do_ide_request(), instead of naively looking
through all drives there. At least comparing it to the
queue parameter after selection would make sense.
Do you do unmasking of IRQs? Holding them a bit longer could have some
impact as well...
Thanks for the input, I will have to think through it a bit longer :-).
On 30 Jul 02 at 16:25, Marcin Dalecki wrote:
> > Second problem is that read operation which ends with
> > "drive ready, seek complete, data request" (why it happened in first
> > place?) will just read one sector from drive (it was DMA transfer,
> > so drive->mult_count == 0), and then it returns from ata_error
> > with ATA_OP_CONTINUES. But what continues? Drive told us that
> > current operation is done, and no new operation was started, so
> > there is very low chance that some IRQ will ever come, and timer was
> > just removed by ata_irq_request(), so channel will never awake.
>
> What should continue is the retry of the operation, since otherwise
> it will be abondoned in do_ide_request(). However I will recheck.
It looks to me like we only issue idle immediate and a reset
to the drive. And even if we reset the drive, we do not reissue the
command, let alone set the handler again. And because
ide_dma_intr -> ata_error reports ATA_OP_CONTINUES, ata_irq_request
will think that the handler reissued the command, and it will leave IDE_BUSY set.
So we are left with IDE_BUSY set, idle hardware, no handler and no timer
active, and with one in-flight request lost somewhere in the system.
Maybe the code which reissued the command to the hardware was dropped in
some past change?
Another problem I found: ata_error calls ata_status_poll, which can
call back to ata_error. Hardwiring BUSY_STAT bit to 1 (== unplugging
drive from system, for example) can cause this loop, as far as I can see.
Fortunately on my system it reads 0x7F from status register after disk
unplug, but it still does not look correct.
> > And last thing: problem does not happen when only one of channels is
> > active, it is triggered only when both channels are active, and
> > channel #1 is always one which dies. Channel #0 uses IRQ14, channel #1
> > IRQ15, so there should be no sharing involved.
>
> Do you do unmasking of IRQs? Holding them a bit longer could have some
> impact as well...
It was happening with default configuration, with unmaskirq=1. Now I tried
hdparm -u 0 /dev/hda; hdparm -u 0 /dev/hdc
vmware-config.pl -default & fsck -f /dev/hdc1
and it again died. vmware-config.pl is used as simple compile test,
it happens with 'ls -lRta /' too, but with 'vmware-config.pl' it happens
much faster.
Stack trace when this problem happens is:
ide_dma_intr + b8/cc (here I added printstate() call)
ata_irq_request + 11e/1cc
handle_IRQ_event + 29/4c
do_IRQ + df/190
common_interrupt + 18/20
madvise_willneed + 10/94
radix_tree_lookup + 18/60
do_page_cache_readahead + 92/13c
do_generic_file_read + 57/2a8
generic_file_read + 11c/138
file_read_actor + 0/8c
vfs_read + b4/134
sys_read + 2a/3c
syscall_call + 7/b
It is a UP machine (with an SMP non-preemptible kernel). The stack trace does not
look like it was caused by some race.
Best regards,
Petr Vandrovec
[email protected]
I wrote:
> On 30 Jul 02 at 16:25, Marcin Dalecki wrote:
> > > Second problem is that read operation which ends with
> > > "drive ready, seek complete, data request" (why it happened in first
> > > place?) will just read one sector from drive (it was DMA transfer,
> > > so drive->mult_count == 0), and then it returns from ata_error
> > > with ATA_OP_CONTINUES. But what continues? Drive told us that
> > > current operation is done, and no new operation was started, so
> > > there is very low chance that some IRQ will ever come, and timer was
> > > just removed by ata_irq_request(), so channel will never awake.
> >
> > What should continue is the retry of the operation, since otherwise
> > it will be abondoned in do_ide_request(). However I will recheck.
>
> It is UP machine (with SMP non-preemptible kernel). Stack trace does not
> look like that it was caused by some race.
There is something severely broken... I reenabled the
"ide: unexpected interrupt" message in ata_irq_request, and to my surprise
we get one spurious interrupt for each request we do, on both
channels - primary and secondary.
It looked like this:
udma_pci_init: sending read command to drive
ata_irq_request: IRQ arrived, for us, calling handler
ata_irq_request: handler returned 0
ide: unexpected interrupt 1 15 handler=00000000
callstack: ata_irq_request + 7e/234, handle_IRQ_event + 29/4c,
do_IRQ + df/190, common_interrupt + 18/20, do_softirq + 50/ac,
do_IRQ + 179/190, common_interrupt + 18/20
udma_pci_init: sending read command to drive
ata_irq_request: IRQ arrived, for us, calling handler
ata_irq_request: handler returned 0
ide: unexpected interrupt 1 15 handler=00000000
callstack: same as above
udma_pci_init: sending read command to drive
ata_irq_request: IRQ arrived, for us, calling handler
ata_irq_request: handler returned 0
udma_pci_init: sending read command to drive
ata_irq_request: command immediately queued by do_ide_request
ata_irq_request: IRQ arrived, for us, calling handler
oops: ide_dma_intr: udmastatus=00, diskstatus=58
So we are getting one spurious interrupt for each UDMA request.
As long as we do not issue a new command to the drive immediately, the IRQ
is silently ignored, and everybody is happy (?). But when we
queue a command immediately by calling do_ide_request in
ata_irq_request, sooner or later a spurious interrupt will
arrive with the wrong timing, and we'll think that the command is
done while it is still in progress.
I see the same spurious interrupt problem on the primary channel too,
but somehow the timing is different with UDMA100, and we always find
the command done instead of in progress when the spurious interrupt happens.
Unfortunately ATA/ATAPIv7 says that a single interrupt is triggered
after the command is done and all data transferred, and we do not play
with select bit. But we play with nIEN bit of disk. Do you see
any reason why this should cause spurious interrupt? (system is using
XT-PIC, FYI)
Thanks,
Petr Vandrovec
[email protected]
I wrote:
>
> Unfortunately ATA/ATAPIv7 says that single interrupt is triggered
> after command is done and all data transfered, and we do not play
> with select bit. But we play with nIEN bit of disk. Do you see
> any reason why this should cause spurious interrupt? (system is using
> XT-PIC, FYI)
OK. As I am using only one device on each channel, I commented
out ata_irq_enable(drive, 1) in ide-disk.c when issuing command,
and removed disabling irq in ide_do_request in ide.c when we
do not issue command to the drive, and spurious interrupts disappeared.
So now I'm getting only half of IRQs for channel 0, and system still
works as before ;-)
Unfortunately, problem is still here: when kernel was in idedisk_do_request
performed on channel 0, IRQ for channel 1 arrived, and this irq found
channel 1 DMA engine ready, but drive had DRQ set... oops. Shortly after
that IRQ for channel 1 arrived again, but as it was unexpected, nothing
happened.
I hope that i845 is not simplex device, but first (unexpected) IRQ arrived
just when channel 0 code wrote new value to its IDE_SELECT_REG register.
Now I even disconnected DVD drive, so it is simple two masters, two
channels configuration, but it still happens.
And as always, something else: ata_error does:
OUT_BYTE(WIN_NOP, ch->ports[IDE_CONTROL_OFFSET])
I'd say that it should use 0x00 instead of WIN_NOP, and also that
comment above OUT_BYTE(0x04, ch->ports[IDE_CONTROL_OFFSET]) is bogus.
Command register is IDE_COMMAND, not IDE_CONTROL ;-)
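To make the register distinction concrete, here is a small sketch of how a value for the device control register is composed, assuming the standard ATA bit assignments (nIEN is bit 1, SRST is bit 2). The macro names and devctl_value() are illustrative, not identifiers from the driver.

```c
/* Device control register bits per the ATA specification. */
#define DEVCTL_NIEN 0x02	/* 1 = drive interrupt output disabled */
#define DEVCTL_SRST 0x04	/* 1 = software reset asserted */

/*
 * Hypothetical helper: build the byte to write to the device control
 * register. Note the inverted sense of nIEN: interrupts are enabled
 * when the bit is clear.
 */
static unsigned char devctl_value(int irq_enabled, int soft_reset)
{
	unsigned char v = 0;

	if (!irq_enabled)
		v |= DEVCTL_NIEN;
	if (soft_reset)
		v |= DEVCTL_SRST;
	return v;
}
```

Under these assumptions, 0x04 written to the control port asserts a soft reset and 0x00 releases it with interrupts enabled. WIN_NOP happens to be the command opcode 0x00 as well, so the byte on the wire is the same, but the name belongs to the command register rather than the control register.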
Best regards,
Petr Vandrovec
[email protected]
Petr Vandrovec wrote:
> I wrote:
>
>>Unfortunately ATA/ATAPIv7 says that single interrupt is triggered
>>after command is done and all data transfered, and we do not play
>>with select bit. But we play with nIEN bit of disk. Do you see
>>any reason why this should cause spurious interrupt? (system is using
>>XT-PIC, FYI)
>
>
> OK. As I am using only one device on each channel, I commented
> out ata_irq_enable(drive, 1) in ide-disk.c when issuing command,
> and removed disabling irq in ide_do_request in ide.c when we
> do not issue command to the drive, and spurious interrupts disappeared.
> So now I'm getting only half of IRQs for channel 0, and system still
> works as before ;-)
Well, OK, this was my next idea, but apparently you already did the
experiment on your own. Thanks for the result. I'm still scratching my
head, and I have already observed this myself before.
It's always funny to see what happens when one stops a driver
from deliberately disabling IRQs for eons of jiffies :-).
> Unfortunately, problem is still here: when kernel was in idedisk_do_request
> performed on channel 0, IRQ for channel 1 arrived, and this irq found
> channel 1 DMA engine ready, but drive had DRQ set... oops. Shortly after
> that IRQ for channel 1 arrived again, but as it was unexpected, nothing
> happened.
>
> I hope that i845 is not simplex device, but first (unexpected) IRQ arrived
> just when channel 0 code wrote new value to its IDE_SELECT_REG register.
> Now I even disconnected DVD drive, so it is simple two masters, two
> channels configuration, but it still happens.
One idea and one experiment I was already thinking about is
to change do_ide_request to actually *not* deliberately select which
device to handle. (The big for loop found there...)
One can instead search for a device on the channel which matches
the queue for which do_ide_request() was called.
	for (unit = 0; unit < MAX_DEVICES; ++unit) {
		....
		if (tmp->queue == q) {
			drive = tmp;
			break;
		}
	}
	if (!drive)
		BUG();
Just please forget temporarily that there is a mechanism for "sleeping".
It is bogus anyway (it doesn't give time back to anybody), and the only
consumers of it are ide-cd (easily removed there) and ide-tape.c (don't
care, the driver was never usable in 2.5.xx).
> And as always, something else: ata_error does:
>
> OUT_BYTE(WIN_NOP, ch->ports[IDE_CONTROL_OFFSET])
>
> I'd say that it should use 0x00 instead of WIN_NOP, and also that
> comment above OUT_BYTE(0x04, ch->ports[IDE_CONTROL_OFFSET]) is bogus.
> Command register is IDE_COMMAND, not IDE_CONTROL ;-)
Yes, I already know about this; I will remove the comment.
(Must have forgotten about it.)
Petr Vandrovec wrote:
> I wrote:
>
>>On 30 Jul 02 at 16:25, Marcin Dalecki wrote:
>>
>>>>Second problem is that read operation which ends with
>>>>"drive ready, seek complete, data request" (why it happened in first
>>>>place?) will just read one sector from drive (it was DMA transfer,
>>>>so drive->mult_count == 0), and then it returns from ata_error
>>>>with ATA_OP_CONTINUES. But what continues? Drive told us that
>>>>current operation is done, and no new operation was started, so
>>>>there is very low chance that some IRQ will ever come, and timer was
>>>>just removed by ata_irq_request(), so channel will never awake.
>>>
>>>What should continue is the retry of the operation, since otherwise
>>>it will be abondoned in do_ide_request(). However I will recheck.
>>
>>It is UP machine (with SMP non-preemptible kernel). Stack trace does not
>>look like that it was caused by some race.
>
>
> There is something severely broken... I reenabled
> ide: unexpected interrupt in ata_irq_request and to my surprise here
> we get one suprious interrupt for each request we do, on both
> channels - primary and secondary.
>
> It looked:
>
> udma_pci_init: sending read command to drive
> ata_irq_request: IRQ arrived, for us, calling handler
> ata_irq_request: handler returned 0
> ide: unexpected interrupt 1 15 handler=00000000
> callstack: ata_irq_request + 7e/234, handle_IRQ_event + 29/4c,
> do_IRQ + df/190, common_interrupt + 18/20, do_softirq + 50/ac,
> do_IRQ + 179/190, common_interrupt + 18/20
> udma_pci_init: sending read command to drive
> ata_irq_request: IRQ arrived, for us, calling handler
> ata_irq_request: handler returned 0
> ide: unexpected interrupt 1 15 handler=00000000
> callstack: same as above
> udma_pci_init: sending read command to drive
> ata_irq_request: IRQ arrived, for us, calling handler
> ata_irq_request: handler returned 0
> udma_pci_init: sending read command to drive
> ata_irq_request: command immediately queued by do_ide_request
> ata_irq_request: IRQ arrived, for us, calling handler
> oops: ide_dma_intr: udmastatus=00, diskstatus=58
>
> So we are getting one spurious interrupt for each UDMA request.
> Until we do not issue new command to the drive immediately, IRQ
> is silently ignored, and everybody is happy (?). But when we
> queue command immediately by call to do_ide_request in
> ata_irq_request, sooner or later spurious interrupt will
> arrive with wrong timming, and we'll think that command is
> done while it is still in progress.
>
> I see same spurious interrupt problem on primary channel too,
> but somehow timming is different with UDMA100, and we always find
> command done instead of in progress when spurious interrupt happens.
>
> Unfortunately ATA/ATAPIv7 says that single interrupt is triggered
> after command is done and all data transfered, and we do not play
> with select bit. But we play with nIEN bit of disk. Do you see
> any reason why this should cause spurious interrupt? (system is using
> XT-PIC, FYI)
What I actually try to do is to keep the nIEN bit set at the
times we don't do any transfer to the disk in question,
precisely to prevent the disk from spewing IRQs at times
when it should not. And yes, this bit acts in a reversed manner.
But I'm sure you already know this.
You could of course try to make the ata_irq_enable()
function a no-op and see whether this changes anything.
(Me: Scratching my head with a puzzled expression on the face...;-)
On Wed, Jul 31 2002, Marcin Dalecki wrote:
> >Unfortunately, problem is still here: when kernel was in idedisk_do_request
> >performed on channel 0, IRQ for channel 1 arrived, and this irq found
> >channel 1 DMA engine ready, but drive had DRQ set... oops. Shortly after
> >that IRQ for channel 1 arrived again, but as it was unexpected, nothing
> >happened.
> >
> >I hope that i845 is not simplex device, but first (unexpected) IRQ arrived
> >just when channel 0 code wrote new value to its IDE_SELECT_REG register.
> >Now I even disconnected DVD drive, so it is simple two masters, two
> >channels configuration, but it still happens.
>
> One idea and one experiment I was already thinking about is
> to change do_ide_request to actually *not* select delibreately which
> device do handle. (The big for loop found there...)
> One can instead search for a device on the channel which is matching
> the queue for which do_ide_request() was called.
>
> for (unit = 0; unit < MAX_DEVICES; ++unit) {
> ....
> if (tmp->queue == q) {
> drive = tmp;
> break;
> }
> }
> if (!drive)
> BUG();
hey that sucks :-)
seriously, the better way to do this would be to change the q->queuedata
to be a pointer to drive instead of the channel.
that would work, but I think it would seriously starve the other device
on the same channel.
--
Jens Axboe
On Thu, Aug 01 2002, Marcin Dalecki wrote:
> >hey that sucks :-)
>
> Since IDE 111 not any more...
Yeah I just saw that 110 was the 'broken' solution, 111 made it right.
Good.
> >seriously, the better way to do this would be to change the q->queuedata
> >to be a pointer to drive instead of the channel.
>
> ... becouse this is already *done* there :-).
:-)
> >that would work, but I think it would seriously starve the other device
> >on the same channel.
>
> We starve anyway, becouse the kernel isn't real time and we can't
> guarantee "sleeping" for some maximum time and comming back.
> We don't reschedule the kernel during this kind of "sleeping".
> And we can't know that a command on the "mate" will not take
> extraordinary amounts of time. It's only a problem if mixing travan
> tapes with disks on a channel.
I'm thinking about the alternation of the devices so one device can't
starve the other device off the channel.
--
Jens Axboe
Jens Axboe wrote:
> On Wed, Jul 31 2002, Marcin Dalecki wrote:
>
>>>Unfortunately, problem is still here: when kernel was in idedisk_do_request
>>>performed on channel 0, IRQ for channel 1 arrived, and this irq found
>>>channel 1 DMA engine ready, but drive had DRQ set... oops. Shortly after
>>>that IRQ for channel 1 arrived again, but as it was unexpected, nothing
>>>happened.
>>>
>>>I hope that i845 is not simplex device, but first (unexpected) IRQ arrived
>>>just when channel 0 code wrote new value to its IDE_SELECT_REG register.
>>>Now I even disconnected DVD drive, so it is simple two masters, two
>>>channels configuration, but it still happens.
>>
>>One idea and one experiment I was already thinking about is
>>to change do_ide_request to actually *not* select delibreately which
>>device do handle. (The big for loop found there...)
>>One can instead search for a device on the channel which is matching
>>the queue for which do_ide_request() was called.
>>
>>for (unit = 0; unit < MAX_DEVICES; ++unit) {
>> ....
>> if (tmp->queue == q) {
>> drive = tmp;
>> break;
>> }
>>}
>>if (!drive)
>> BUG();
>
>
> hey that sucks :-)
Since IDE 111 not any more...
> seriously, the better way to do this would be to change the q->queuedata
> to be a pointer to drive instead of the channel.
... because this is already *done* there :-).
> that would work, but I think it would seriously starve the other device
> on the same channel.
We starve anyway, because the kernel isn't real time and we can't
guarantee "sleeping" for some maximum time and coming back.
We don't reschedule the kernel during this kind of "sleeping".
And we can't know that a command on the "mate" will not take
extraordinary amounts of time. It's only a problem if mixing travan
tapes with disks on a channel.
Jens Axboe wrote:
>>>that would work, but I think it would seriously starve the other device
>>>on the same channel.
>>
>>We starve anyway, becouse the kernel isn't real time and we can't
>>guarantee "sleeping" for some maximum time and comming back.
>>We don't reschedule the kernel during this kind of "sleeping".
>>And we can't know that a command on the "mate" will not take
>>extraordinary amounts of time. It's only a problem if mixing travan
>>tapes with disks on a channel.
>
>
> I'm thinking about the alternation of the devices so one device can't
> starve the other device off the channel.
Ah so you are thinking about two equally powered devices
competing for the channel. Something I would call the "sumo fight"
situation. Well disks didn't use the "sleeping" mechanism at all anyway
and the chances someone would do cp from CD-ROM to CD-ROM are low.
Finally I think that the proper granularity of scheduling requests to
the drive is, well, the request layer. The queue processing layer should
handle this because otherwise we would have two "competing" optimization
mechanisms. And there we are indeed able to actually relinquish some CPU
time. If you look at a request processing optimization as a low pass
signal filter it's immediately obvious that the effects of chaining them
can be, well at least "counter intuitive".
On Thu, Aug 01 2002, Marcin Dalecki wrote:
> Jens Axboe wrote:
>
> >>>that would work, but I think it would seriously starve the other device
> >>>on the same channel.
> >>
> >>We starve anyway, becouse the kernel isn't real time and we can't
> >>guarantee "sleeping" for some maximum time and comming back.
> >>We don't reschedule the kernel during this kind of "sleeping".
> >>And we can't know that a command on the "mate" will not take
> >>extraordinary amounts of time. It's only a problem if mixing travan
> >>tapes with disks on a channel.
> >
> >
> >I'm thinking about the alternation of the devices so one device can't
> >starve the other device off the channel.
>
> Ah so you are thinking about two equally powered devices
> competing for the channel. Something I would call the "sumo fight"
> situation. Well disks didn't use the "sleeping" mechanism at all anyway
> and the chances someone would do cp from CD-ROM to CD-ROM are low.
>
> Finally I think that the proper granularity of scheduling requests to
> the drive is, well, the request layer. The queue processing layer should
> handle this becouse otherwise we would have two "competing" optimization
> mechanisms. And there we are indeed able to actually relinquish some CPU
> time. If you look at an request processing optimization as a low pass
> signal filter it's immediately obvious that the effects of chaining them
> can be, well at least "counter intuitive".
Actually, I'm thinking of a much simpler scenario: basically any two
devices on the same channel, both with pending requests on the queue.
This could be a hard drive and a cd writer, for instance. If you have 60
requests pending for the hard drive, queue gets unplugged, you start the
first one. Correct me if I'm wrong, but now you pass back the drive to
the request handler when the first request completes, and you select a
new request from that very same drive without considering device
starvation? Any run of the cd writer queue would do nothing, since it
would just find the channel busy.
This sort of thing cannot be solved at the block layer. The two queues
are independent seen from that layer, the channel-busy dependency cannot
be solved there.
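The alternation idea can be sketched as a round-robin pick among the devices sharing a channel. Everything here is a toy model: struct drive_model, struct channel_model and pick_next_drive() are hypothetical names, "pending" stands in for a real request queue, and nothing below is taken from the driver.

```c
#define MAX_DEVICES 2	/* master and slave on one channel */

/* Toy stand-ins for a drive and its channel. */
struct drive_model {
	int present;	/* device exists on the channel */
	int pending;	/* number of queued requests */
};

struct channel_model {
	struct drive_model drives[MAX_DEVICES];
	int last_served;	/* unit that got the channel last time */
};

/*
 * Hypothetical selection: start looking at the unit after the one served
 * last, so that a device with a long queue cannot monopolize the channel.
 * Returns the unit to serve (and consumes one of its requests), or -1 if
 * no device has anything pending.
 */
static int pick_next_drive(struct channel_model *ch)
{
	int i;

	for (i = 1; i <= MAX_DEVICES; i++) {
		int unit = (ch->last_served + i) % MAX_DEVICES;
		struct drive_model *d = &ch->drives[unit];

		if (d->present && d->pending > 0) {
			d->pending--;
			ch->last_served = unit;
			return unit;
		}
	}
	return -1;
}
```

With 60 requests queued on unit 0 and 2 on unit 1, successive picks alternate 0, 1, 0, 1 and then fall back to unit 0 alone once unit 1 is drained, instead of running all 60 hard drive requests first.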
--
Jens Axboe
On 31 Jul 02 at 22:01, Marcin Dalecki wrote:
>
> Well OK this was my next idea, but apparently you already did the
> experient on your own. Thanks for the result. I'm still scratching my
> head and I have already observed this before myself.
> It's always funny to see what happens when one stops a driver
> from deliberately disabling IRQs for eons of jiffies :-).
I finally managed to compile older kernels, and I found that
2.5.27 (and 2.4.19-rc1 and 2.5.26) works fine (modulo an endless loop
in ide_do_request... but it takes at least 5 minutes to trigger it),
while 2.5.28 dies within one second with UDMA status 0x25 (irq requested,
transfer in progress) and IDE status 0x58 (drq asserted).
Because the only changes in the IDE system between 2.5.27 and 2.5.28 are
renaming __save_flags => local_save_flags, fixing get_request for
ioctl commands (so 2.5.28 should be correct while 2.5.27 is not),
and moving some ioctls around, it looks like the problem is triggered
by something else.
I currently suspect the IRQ handling changes, but maybe someone has a
better idea? Also, I cannot reproduce the problem with a Seagate UDMA66
drive switched to UDMA33 mode, so it looks like the problem is
timing/firmware (Toshiba MK6409MAV) dependent.
And I did all these tests with a UP kernel, just to eliminate cli/sti
changes.
Petr Vandrovec
[email protected]
On Thu, Aug 01, 2002 at 07:07:21PM +0200, Petr Vandrovec wrote:
> On 31 Jul 02 at 22:01, Marcin Dalecki wrote:
> >
> > Well OK this was my next idea, but apparently you already did the
> > experient on your own. Thanks for the result. I'm still scratching my
> > head and I have already observed this before myself.
> > It's always funny to see what happens when one stops a driver
> > from deliberately disabling IRQs for eons of jiffies :-).
>
> I currently suspect IRQ handling changes, but maybe someone has
> better idea? Also, I cannot reproduce problem with Seagate UDMA66
> drive switched to UDMA33 mode, so it looks like that problem is
> timming/firmware (Toshiba MK6409MAV) dependent.
I'd like to apologize to Ingo, his changes were completely innocent.
Problem was triggered by Al's 'block device size cleanups' (currently
cset 1.403.160.5 on bkbits).
Before this change, my system was using a 4KB block size when reading
from /dev/hdc1, because the blk_size[][] entry (which is in 1kB units) of this
partition was a multiple of 4, and so i_size % 4096 was 0. But after
Al's change the partition size is read from gendisk, and not from blk_size,
and the gendisk partition size is in 512-byte units: and, as you can
probably guess, now my partition had i_size % 4096 == 512, and so only
a 512-byte block size was chosen. And with a 512-byte block size my
hard disk refuses to cooperate.
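The effect of the size units on the chosen block size can be reproduced with a small model. This is a sketch that assumes the selection loop doubles the block size from the hardware sector size up to the page size and stops at the first bit set in the device size; pick_block_size() and MODEL_PAGE_SIZE are illustrative names, not the kernel's.

```c
#define MODEL_PAGE_SIZE 4096u	/* assumed page size for this model */

/*
 * Hypothetical model of the block size choice: the result is the largest
 * power of two, between hardsect_size and the page size, that divides
 * the device size evenly.
 */
static unsigned int pick_block_size(unsigned long long size_bytes,
				    unsigned int hardsect_size)
{
	unsigned int bsize = hardsect_size;

	while (bsize < MODEL_PAGE_SIZE) {
		if (size_bytes & bsize)
			break;	/* size is not a multiple of 2 * bsize */
		bsize <<= 1;
	}
	return bsize;
}
```

In this model a device of 8192 bytes gets a 4096-byte block size, while 8192 + 512 bytes (an odd number of 512-byte sectors) drops straight to 512, which is the situation described above.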
I was trying to find a reason in the code why a 512-byte block size should
not work, but I was not able to find any. Maybe the I/O gurus here
will know?
For now, I'm using the patch below. It fixes the problem for me; block size = 1024
is sufficient in my configuration. If you have any insight into what the
problem could be, please tell me. The problem apparently is not i_size not being
a multiple of 1024: without changing bsize the problem still occurs, even if
I shrink i_size down to be a multiple of 4K.
After some more testing I found that my other drive (120GB WD) handles
bsize=512 quite happily, so it looks like just my Toshiba disk
does not like 512B back-to-back transfers?! Are there any plans to
read from block devices in 4KB blocks for all reads/writes except for
the last partial page?
Thanks,
Petr Vandrovec
[email protected]
--- linux-2.5.29-c548/fs/block_dev.c.orig 2002-07-31 12:48:23.000000000 +0200
+++ linux-2.5.29-c548/fs/block_dev.c 2002-08-01 23:20:43.000000000 +0200
@@ -608,6 +608,11 @@
 			break;
 		bsize <<= 1;
 	}
+	if (bsize == 512) {
+		printk(KERN_ERR "Found 512b device! Using larger block size...\n");
+		bdev->bd_inode->i_size -= 512;
+		bsize = 1024;
+	}
 	bdev->bd_block_size = bsize;
 	bdev->bd_inode->i_blkbits = blksize_bits(bsize);
 	if (p->queue)
Petr Vandrovec wrote:
> On Thu, Aug 01, 2002 at 07:07:21PM +0200, Petr Vandrovec wrote:
>
>>On 31 Jul 02 at 22:01, Marcin Dalecki wrote:
>>
>>>Well OK this was my next idea, but apparently you already did the
>>>experient on your own. Thanks for the result. I'm still scratching my
>>>head and I have already observed this before myself.
>>>It's always funny to see what happens when one stops a driver
>>>from deliberately disabling IRQs for eons of jiffies :-).
>>
>>I currently suspect IRQ handling changes, but maybe someone has
>>better idea? Also, I cannot reproduce problem with Seagate UDMA66
>>drive switched to UDMA33 mode, so it looks like that problem is
>>timming/firmware (Toshiba MK6409MAV) dependent.
>
>
> I'd like to apologize to Ingo, his changes were completely innocent.
> Problem was triggered by Al's 'block device size cleanups' (currently
> cset 1.403.160.5 on bkbits).
>
> Before this change, my system was using 4KB block size when reading
> from /dev/hdc1, because of blk_size[][] (which is in 1kB units) of this
> partition was multiple of 2, and so i_size % 4096 was 0. But after
> Al's change partition size is read from gendisk, and not from blk_size,
> and gendisk partition size is in 512 bytes units: and, as you can
> probably guess, now my partition had i_size % 4096 == 512, and so only
> 512 byte block size was choosen. And with 512 bytes block size my
> harddisk refuses to cooperate.
>
> I was trying to find reason in code, why 512 byte block size should
> not work, but I was not able to reveal any. Maybe I/O gurus here
> will know?
Petr. First - I wish to express my respect (for whatever it's
worth). Once again you are fscking sharp and to the point in your problem
analysis.
From the few things I know about the situation, the SATA
people are having great problems with the mediocre physical sector size
and they are pushing hard toward bigger sector
sizes. This may explain a bit why there is some probability that one
should be alert in this area.
Would you mind sending me hdparm -i /dev/hdx and hdparm -I /dev/hdx
for documentation purposes? The host controller chip could be the
one to blame as well.
I fear the need for yet another blacklist.
On 2 Aug 02 at 0:13, Marcin Dalecki wrote:
>
> Would you mind sending me hdparm -i /dev/hdx and hdparm -I /dev/hdx
> for documentation purposes? The host controller chip could be the
> one to blame as well.
>
> I fear the need for yet another black list.
Here they are. This is with i845 (82801BA rev B5) host chip. I'll check
i440BX tomorrow and PDC20265 on sunday. I believe that PDC20265
worked, because I did not notice the problem at home, only at work.
Petr Vandrovec
/dev/hdc:
Model=TOSHIBA MK6409MAV, FwRev=F1.03 A, SerialNo=58S40974
Config={ Fixed }
RawCHS=13424/15/63, TrkSize=0, SectSize=0, ECCbytes=36
BuffType=unknown, BuffSize=0kB, MaxMultSect=16, MultSect=off
CurCHS=13424/15/63, CurSects=12685680, LBA=yes, LBAsects=12685680
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: sdma0 sdma1 sdma2 mdma0 mdma1 mdma2 udma0 udma1 *udma2
AdvancedPM=no WriteCache=enabled
Drive Supports : Reserved : ATA-1 ATA-2 ATA-3 ATA-4
/dev/hdc:
non-removable ATA device, with non-removable media
Model Number: TOSHIBA MK6409MAV
Serial Number: 58S40974
Firmware Revision: F1.03 A
Standards:
Supported: 1 2 3 4
Likely used: 4
Configuration:
Logical max current
cylinders 13424 13424
heads 15 15
sectors/track 63 63
bytes/track: 0 (obsolete)
bytes/sector: 0 (obsolete)
current sector capacity: 12685680
LBA user addressable sectors = 12685680
Capabilities:
LBA, IORDY(can be disabled)
ECC bytes: 36 Queue depth: 1
Standby timer values: spec'd by Vendor, no device specific minimum
r/w multiple sector transfer: Max = 16 Current = 16
DMA: sdma0 sdma1 sdma2 mdma0 mdma1 mdma2 udma0 udma1 *udma2
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* NOP cmd
* READ BUFFER cmd
* WRITE BUFFER cmd
* Host Protected Area feature set
* look-ahead
* write cache
* Power Management feature set
Security Mode feature set
* SMART feature set
Security:
supported
not enabled
not locked
frozen
not expired: security count
not supported: enhanced erase
22min for SECURITY ERASE UNIT.
00:1f.1 IDE interface: Intel Corp. 82801BA IDE U100 (rev 05) (prog-if 80 [Master])
Subsystem: Intel Corp.: Unknown device 5054
00:1f.1 Class 0101: 8086:244b (rev 05) (prog-if 80 [Master])
Subsystem: 8086:5054
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Region 4: I/O ports at ffa0 [size=16]
00: 86 80 4b 24 05 00 80 02 05 80 01 01 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: a1 ff 00 00 00 00 00 00 00 00 00 00 86 80 54 50
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40: 47 e3 47 e3 00 00 00 00 05 00 01 02 00 00 00 00
50: 00 00 00 00 10 14 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 47 0f 00 00 00 00 00 00
pcibus = 33333
00:1f.1 vendor=8086 device=244b class=0101 irq=0 base4=ffa1
----------PIIX BusMastering IDE Configuration---------------
Driver Version:                     1.3
South Bridge:                       9291
Revision:                           IDE 0x5
Highest DMA rate:                   UDMA100
BM-DMA base:                        0xffa0
PCI clock:                          33.3MHz
-----------------------Primary IDE-------Secondary IDE------
Enabled:                      yes                 yes
Simplex only:                  no                  no
Cable Type:                   80w                 40w
-------------------drive0----drive1----drive2----drive3-----
Prefetch+Post:        yes       yes       yes       yes
Transfer Mode:       UDMA       PIO      UDMA       PIO
Address Setup:       90ns      90ns      90ns      90ns
Cmd Active:         360ns     360ns     360ns     360ns
Cmd Recovery:       540ns     540ns     540ns     540ns
Data Active:         90ns     360ns      90ns     360ns
Data Recovery:       30ns     540ns      30ns     540ns
Cycle Time:          22ns     900ns      60ns     900ns
Transfer Rate:   88.8MB/s   2.2MB/s  33.3MB/s   2.2MB/s
(whee, Intel defines UDMA(100) to 88.8MB/s?)
On Fri, 2 Aug 2002, Marcin Dalecki wrote:
> > I'd like to apologize to Ingo, his changes were completely innocent.
> > Problem was triggered by Al's 'block device size cleanups' (currently
> > cset 1.403.160.5 on bkbits).
> >
> > Before this change, my system was using 4KB block size when reading
> > from /dev/hdc1, because of blk_size[][] (which is in 1kB units) of this
> > partition was multiple of 2, and so i_size % 4096 was 0. But after
> > Al's change partition size is read from gendisk, and not from blk_size,
> > and gendisk partition size is in 512 bytes units: and, as you can
> > probably guess, now my partition had i_size % 4096 == 512, and so only
> > 512 byte block size was chosen. And with 512 bytes block size my
> > harddisk refuses to cooperate.
Uh-oh...
Let me see if I got it straight:
a) your disk doesn't work with half-Kb requests
b) you have a partition with odd number of sectors
c) hardsect_size is set to half-Kb
d) old code worked since it rounded size to multiple of kilobyte.
Correct?
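
A minimal sketch of the selection behaviour described above, assuming the
block size starts at the hardware sector size and doubles for as long as the
device size in bytes stays an exact multiple. pick_block_size() is a
hypothetical name used only for illustration; it is not copied from the
kernel tree.

```c
#include <assert.h>

/*
 * Hypothetical helper (illustration only): return the largest power-of-two
 * block size, starting at hardsect_size and capped at page_size, that still
 * divides size_bytes evenly. size_bytes is assumed to be a multiple of
 * hardsect_size.
 */
static unsigned int pick_block_size(unsigned long long size_bytes,
                                    unsigned int hardsect_size,
                                    unsigned int page_size)
{
    unsigned int bsize = hardsect_size;

    /* Both sizes must be non-zero powers of two for the bit test below. */
    assert(hardsect_size != 0 && (hardsect_size & (hardsect_size - 1)) == 0);
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    while (bsize < page_size) {
        if (size_bytes & bsize)   /* not a multiple of 2 * bsize: stop */
            break;
        bsize <<= 1;
    }
    return bsize;
}
```

With the partition size Petr posts in this thread (12685617 sectors of 512
bytes), pick_block_size(12685617ULL * 512, 512, 4096) stops immediately at
512. If the size is first rounded down to whole kilobytes, as the old
blk_size[][] accounting did, pick_block_size(12685616ULL * 512, 512, 4096)
reaches 4096.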
On 1 Aug 02 at 18:39, Alexander Viro wrote:
> On Fri, 2 Aug 2002, Marcin Dalecki wrote:
>
> > > I'd like to apologize to Ingo, his changes were completely innocent.
> > > Problem was triggered by Al's 'block device size cleanups' (currently
> > > cset 1.403.160.5 on bkbits).
> > >
> > > Before this change, my system was using 4KB block size when reading
> > > from /dev/hdc1, because of blk_size[][] (which is in 1kB units) of this
> > > partition was multiple of 2, and so i_size % 4096 was 0. But after
> > > Al's change partition size is read from gendisk, and not from blk_size,
> > > and gendisk partition size is in 512 bytes units: and, as you can
> > > probably guess, now my partition had i_size % 4096 == 512, and so only
> > > 512 byte block size was chosen. And with 512 bytes block size my
> > > harddisk refuses to cooperate.
>
> Uh-oh...
>
> Let me see if I got it straight:
>
> a) your disk doesn't work with half-Kb requests
> b) you have a partition with odd number of sectors
> c) hardsect_size is set to half-Kb
> d) old code worked since it rounded size to multiple of kilobyte.
>
> Correct?
Yes, exactly. Replacing disk is not an option...
Petr Vandrovec
On Fri, 2 Aug 2002, Petr Vandrovec wrote:
> > Uh-oh...
> >
> > Let me see if I got it straight:
> >
> > a) your disk doesn't work with half-Kb requests
> > b) you have a partition with odd number of sectors
> > c) hardsect_size is set to half-Kb
> > d) old code worked since it rounded size to multiple of kilobyte.
> >
> > Correct?
>
> Yes, exactly. Replacing disk is not an option...
OK. At the very least we need a way for driver to tell what the sector
size is. And that can be a problem - AFAICS IDE shares the queue for
master and slave and sector size is queue property.
BTW, what type of partition table do you have there?
On Thu, 1 Aug 2002, Linus Torvalds wrote:
> You probably saw this. Looks like blocksize has been buggered somehow.
> Apparently Petr has a 1kB blocksize optical device..
Yeah - with partition boundaries set not on a physical sector boundary ;-/
He's actually lucky that beginning of partition was not in the middle of
a physical sector...
Looks like we need
a) accurate hardsect_size for these beasts (which is a problem
with current setup, since it's per-queue and not per-device; master and
slave can have different hardsect sizes).
b) extra check in check_partitions() that would verify that
partition doesn't end in the middle of a sector (and round it down
if it does).
Basically, old code worked by accident on that setup - Petr had half-Kb
at the end of partition inaccessible and do_open() didn't notice that.
Now it does and tries to give such access. Disk is not happy...
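
Point b) above could look roughly like the sketch below. It is only an
illustration under stated assumptions: partition lengths are counted in
512-byte units, the true hardware sector size is known and is a multiple of
512, and both function names are hypothetical (this is not the actual
check_partitions() code).

```c
#include <assert.h>

/*
 * Hypothetical helpers (illustration only).
 * nr_sects:      partition length in 512-byte units
 * hardsect_size: hardware sector size in bytes, a multiple of 512
 */
static int ends_mid_sector(unsigned long long nr_sects,
                           unsigned int hardsect_size)
{
    assert(hardsect_size >= 512 && hardsect_size % 512 == 0);
    return (nr_sects % (hardsect_size / 512)) != 0;
}

static unsigned long long round_down_to_hardsect(unsigned long long nr_sects,
                                                 unsigned int hardsect_size)
{
    assert(hardsect_size >= 512 && hardsect_size % 512 == 0);
    /* Drop the partial hardware sector at the end, if there is one. */
    return nr_sects - nr_sects % (hardsect_size / 512);
}
```

For a 1024-byte hardware sector, a length of 12685617 units would be trimmed
to 12685616; for a device that really has 512-byte sectors nothing changes.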
On Fri, 2 Aug 2002, Petr Vandrovec wrote:
> Normal DOS partition, with 512 byte block size, as this is 512B block
> device, at least I believed to it until now. As start=63, it apparently
> also handles 1024B requests on odd address (I believe that sfdisk -d dumps
> start 0-based).
>
> # partition table of /dev/hdc
> unit: sectors
>
> /dev/hdc1 : start= 63, size=12685617, Id=83, bootable
Blacklist time. That, or decrementing size to 12685616, depending on whether
you want that last half-Kb or not.
On 1 Aug 02 at 18:52, Alexander Viro wrote:
> On Fri, 2 Aug 2002, Petr Vandrovec wrote:
>
> > > Uh-oh...
> > >
> > > Let me see if I got it straight:
> > >
> > > a) your disk doesn't work with half-Kb requests
> > > b) you have a partition with odd number of sectors
> > > c) hardsect_size is set to half-Kb
> > > d) old code worked since it rounded size to multiple of kilobyte.
> > >
> > > Correct?
> >
> > Yes, exactly. Replacing disk is not an option...
>
> OK. At the very least we need a way for driver to tell what the sector
> size is. And that can be a problem - AFAICS IDE shares the queue for
> master and slave and sector size is queue property.
>
> BTW, what type of partition table do you have there?
Normal DOS partition, with 512 byte block size, as this is a 512B block
device, or at least I believed so until now. As start=63, it apparently
also handles 1024B requests on an odd address (I believe that sfdisk -d dumps
start 0-based).
# partition table of /dev/hdc
unit: sectors
/dev/hdc1 : start= 63, size=12685617, Id=83, bootable
/dev/hdc2 : start= 0, size= 0, Id= 0
/dev/hdc3 : start= 0, size= 0, Id= 0
/dev/hdc4 : start= 0, size= 0, Id= 0
Petr Vandrovec
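
The numbers in the dump above can be checked by hand: start 63 plus size
12685617 gives 12685680, which matches the LBAsects value in the hdparm
output, so the partition runs to the end of the disk and has an odd sector
count. A small sketch of that arithmetic follows; the helper names are
hypothetical and a 512-byte sector unit is assumed.

```c
#include <assert.h>

/* Hypothetical helpers (illustration only); sectors are 512-byte units. */

/* Number of whole units of unit_bytes that fit into nr_sects sectors. */
static unsigned long long whole_units(unsigned long long nr_sects,
                                      unsigned int unit_bytes)
{
    assert(unit_bytes != 0);
    return (nr_sects * 512ULL) / unit_bytes;
}

/* Bytes left over at the end after the last whole unit. */
static unsigned long long tail_bytes(unsigned long long nr_sects,
                                     unsigned int unit_bytes)
{
    assert(unit_bytes != 0);
    return (nr_sects * 512ULL) % unit_bytes;
}
```

For 12685617 sectors, whole_units(..., 1024) is 6342808 and
whole_units(..., 4096) is 1585702, and in both cases tail_bytes() is 512:
exactly one trailing sector that only a 512-byte request can reach.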
On Fri, 2 Aug 2002, Petr Vandrovec wrote:
> On 1 Aug 02 at 18:45, Alexander Viro wrote:
> >
> > On Thu, 1 Aug 2002, Linus Torvalds wrote:
> >
> > > You probably saw this. Looks like blocksize has been buggered somehow.
> > > Apparently Petr has a 1kB blocksize optical device..
>
> Just to correct you: it is normal magnetic disk with 512 byte sectors,
> from notebook. It works with 512B UDMA requests if we talk to the drive
> slowly, with pauses here and there. If we talk to it back-to-back, it
> dies. Apparently it forgets that it is doing UDMA transfers and tries
> to do normal PIO or MDMA or what - host terminates transfer in the middle,
> and disk is signaling that it has more data to go.
_Ouch_. Then I have to agree with Martin - it's a blacklist time. There's
not much partition code could do with that - you really have a partition
with a chunk that _can't_ be handled by 1Kb request.
Old code (pretty much by accident) hid it from you, so I'd suggest just
decrementing partition size - it's not that you had anything in that last
half-Kb. At least nothing that could be accessed by old kernels.
On 1 Aug 02 at 18:45, Alexander Viro wrote:
>
> On Thu, 1 Aug 2002, Linus Torvalds wrote:
>
> > You probably saw this. Looks like blocksize has been buggered somehow.
> > Apparently Petr has a 1kB blocksize optical device..
Just to correct you: it is normal magnetic disk with 512 byte sectors,
from notebook. It works with 512B UDMA requests if we talk to the drive
slowly, with pauses here and there. If we talk to it back-to-back, it
dies. Apparently it forgets that it is doing UDMA transfers and tries
to do normal PIO or MDMA or what - host terminates transfer in the middle,
and disk is signaling that it has more data to go.
> Looks like we need
> a) accurate hardsect_size for these beasts (which is a problem
> with current setup, since it's per-queue and not per-device; master and
> slave can have different hardsect sizes).
> b) extra check in check_partitions() that would verify that
> partition doesn't end in the middle of a sector (and round it down
> if it does).
It will not help. Device is reporting 512B sectors, and it even supports
them in PIO.
Petr Vandrovec
On 1 Aug 02 at 19:05, Alexander Viro wrote:
>
> On Fri, 2 Aug 2002, Petr Vandrovec wrote:
>
> > Normal DOS partition, with 512 byte block size, as this is 512B block
> > device, at least I believed to it until now. As start=63, it apparently
> > also handles 1024B requests on odd address (I believe that sfdisk -d dumps
> > start 0-based).
> >
> > # partition table of /dev/hdc
> > unit: sectors
> >
> > /dev/hdc1 : start= 63, size=12685617, Id=83, bootable
>
> Blacklist time. That, or decrementing size to 12685616, depending on whether
> you want that last half-Kb or not.
Last half-KB is useless, as filesystem on it is ext2 with 4KB blocks...
Only problem is that previously stable system was now dying in e2fsck. I'll
try to invent some solution before 2.6 ;-)
Maybe fix to e2fsck would be sufficient, I always thought that it reads disk
in blocksize (4KB) chunks, so disk will not see 512B request. But
apparently it either reads partition in 512B chunks, or block layer does not
do merging correctly.
Petr Vandrovec
On Fri, 2 Aug 2002, Petr Vandrovec wrote:
>
> Just to correct you: it is normal magnetic disk with 512 byte sectors,
> from notebook. It works with 512B UDMA requests if we talk to the drive
> slowly, with pauses here and there. If we talk to it back-to-back, it
> dies.
Ugh.
You apparently use udma2 - can you try forcing it to udma0/1 or the other
DMA modes? It may just be that the drive simply cannot take udma2
reliably.
Linus
Alexander Viro wrote:
>
> On Thu, 1 Aug 2002, Linus Torvalds wrote:
>
>
>>You probably saw this. Looks like blocksize has been buggered somehow.
>>Apparently Petr has a 1kB blocksize optical device..
>
>
> Yeah - with partition boundaries set not on a physical sector boundary ;-/
>
> He's actually lucky that beginning of partition was not in the middle of
> a physical sector...
>
> Looks like we need
> a) accurate hardsect_size for these beasts (which is a problem
> with current setup, since it's per-queue and not per-device; master and
> slave can have different hardsect sizes).
FYI: In the ATA driver area all queues *are* explicitly per device.
> b) extra check in check_partitions() that would verify that
> partition doesn't end in the middle of a sector (and round it down
> if it does).
>
> Basically, old code worked by accident on that setup - Petr had half-Kb
> at the end of partition inaccessible and do_open() didn't notice that.
> Now it does and tries to give such access. Disk is not happy...
Alexander Viro wrote:
>
> On Fri, 2 Aug 2002, Petr Vandrovec wrote:
>
>
>>>Uh-oh...
>>>
>>>Let me see if I got it straight:
>>>
>>>a) your disk doesn't work with half-Kb requests
>>>b) you have a partition with odd number of sectors
>>>c) hardsect_size is set to half-Kb
>>>d) old code worked since it rounded size to multiple of kilobyte.
>>>
>>>Correct?
>>
>>Yes, exactly. Replacing disk is not an option...
>
>
> OK. At the very least we need a way for driver to tell what the sector
> size is. And that can be a problem - AFAICS IDE shares the queue for
> master and slave and sector size is queue property.
Wrong. It is sharing the queue lock not the queue itself.
On Fri, 2002-08-02 at 00:13, Petr Vandrovec wrote:
> Last half-KB is useless, as filesystem on it is ext2 with 4KB blocks...
> Only problem is that previously stable system was now dying in e2fsck. I'll
> try to invent some solution before 2.6 ;-)
Guess where EFI puts partition tables 8(