2005-11-23 10:27:12

by Chris Ross

[permalink] [raw]
Subject: Re: Kernel panic reading bad disk sector



Russell King - ARM Linux escreveu:
> On Wed, Nov 23, 2005 at 09:25:40AM +0000, Chris Ross wrote:
>>Greg Ungerer escreveu:
>>>Chris Ross wrote:
>>>
>>>>According System.map it is in the function ide_dma_timeout_retry.
>>>
>>>Ok, that is good information. I would try and figure out which
>>>line of code in there is dereferencing a NULL pointer.
>>
>>It would seem to be this line
>>
>> rq->errors = 0;

because rq is set to NULL by earlier the line

ret = DRIVER(drive)->error(drive, "dma timeout retry",
hwif->INB(IDE_STATUS_REG));


> I'd strongly suggest that you talk to IDE folk about this - I
> suspect HWGROUP(drive)->rq should never be NULL while a request
> is being handled on drive.

Which list specifically? I've taken your advice and "promoted" this to
LKML so if that was wrong please correct it politely.

For those just tuning in this is about an ARM system with a Promise
20275 IDE controller which suffers a kernel panic when attempting to
read from a bad sector on the disk.

Regards,
Chris R.


2005-11-23 10:49:31

by Chris Ross

[permalink] [raw]
Subject: Re: Kernel panic reading bad disk sector



Chris Ross escreveu:
> For those just tuning in this is about an ARM system with a Promise
> 20275 IDE controller which suffers a kernel panic when attempting to
> read from a bad sector on the disk.

And the vital piece of information I missed - on kernels 2.4.31-uc0 and
2.4.27-uc1 at least. At this stage it's believed to be from upstream -
i.e. it's probably also in vanilla 2.4.32 although I haven't tested that.

Regards,
Chris R.

2005-11-23 12:52:26

by Chris Ross

[permalink] [raw]
Subject: Re: Kernel panic reading bad disk sector


Chris Ross escreveu:
> Russell King - ARM Linux escreveu:
>> On Wed, Nov 23, 2005 at 09:25:40AM +0000, Chris Ross wrote:
>>> Greg Ungerer escreveu:
>>>> Chris Ross wrote:
>>>>
>>>>> According System.map it is in the function ide_dma_timeout_retry.
>>>>
>>>> Ok, that is good information. I would try and figure out which
>>>> line of code in there is dereferencing a NULL pointer.
>>>
>>> It would seem to be this line
>>>
>>> rq->errors = 0;
>
> because rq is set to NULL by earlier the line
>
> ret = DRIVER(drive)->error(drive, "dma timeout retry",
> hwif->INB(IDE_STATUS_REG));

Which looks like the the correct thing to do. In idedisk_error once the
threshold for the maximum number of retries has been reached the request
is ended because it cannot be serviced

if (rq->errors >= ERROR_MAX)
DRIVER(drive)->end_request(drive, 0);

in idedisk_end_request the request is explicitly set to NULL because it
is now ended, in the code block...

if (!end_that_request_first(rq, uptodate, drive->name)) {
add_blkdev_randomness(MAJOR(rq->rq_dev));
blkdev_dequeue_request(rq);
HWGROUP(drive)->rq = NULL;
end_that_request_last(rq);
ret = 0;
}

Which means that ide_dma_timeout_retry should take account of the fact
that the request might no longer be valid before using it.

In other words it should be...

/* Check whether the request ended early due to disk errors */
if( rq ) {
rq->errors = 0;
rq->sector = rq->bh->b_rsector;
rq->current_nr_sectors = rq->bh->b_size >> 9;
rq->hard_cur_sectors = rq->current_nr_sectors;
rq->buffer = rq->bh->b_data;
}


If anyone has a better solution I would be glad to hear it. Failing that
I'll submit this in normal kernel patch format as soon as I've worked
out how...

Regards,
Chris R.