LinuxLists.cc - LibPATA code issues / 2.6.15.4

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
..
> ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata3: status=0x51 { DriveReady SeekComplete Error }
> ata3: error=0x04 { DriveStatusError }

I wonder if the FUA logic is inserting cache-flush commands
and perhaps the drive is rejecting those?

Jeff, we really ought to be including the failed ATA opcode
in those error messages!!

Cheers
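
For anyone reading the log lines above: the names in braces come from decoding the ATA status register bit by bit. The following is a standalone userspace sketch, not the libata code. The helper name ata_status_to_string is hypothetical, and the labels other than DriveReady, SeekComplete, Error and Busy (which appear in logs in this thread) are assumed for illustration. It also assumes that once BSY is set the other bits are not reported.

```c
#include <stdio.h>
#include <string.h>

/* Standard ATA status register bit: device is busy. */
#define ATA_BUSY 0x80

/* Labels for the remaining status bits. DriveReady, SeekComplete and Error
 * appear in the logs in this thread; the other labels are assumptions. */
static const struct {
    unsigned char mask;
    const char *name;
} status_bits[] = {
    { 0x40, "DriveReady" },
    { 0x20, "DeviceFault" },
    { 0x10, "SeekComplete" },
    { 0x08, "DataRequest" },
    { 0x04, "CorrectedError" },
    { 0x02, "Index" },
    { 0x01, "Error" },
};

/* Hypothetical helper: render stat as "{ Name Name ... }" into buf. */
static char *ata_status_to_string(unsigned char stat, char *buf, size_t len)
{
    size_t i, used;

    if (len == 0)
        return buf;

    /* Assumption: with BSY set, the other bits are not meaningful. */
    if (stat & ATA_BUSY) {
        snprintf(buf, len, "{ Busy }");
        return buf;
    }

    snprintf(buf, len, "{ ");
    for (i = 0; i < sizeof(status_bits) / sizeof(status_bits[0]); i++) {
        if (stat & status_bits[i].mask) {
            used = strlen(buf);
            snprintf(buf + used, len - used, "%s ", status_bits[i].name);
        }
    }
    used = strlen(buf);
    snprintf(buf + used, len - used, "}");
    return buf;
}
```

With this helper, status 0x51 decodes to DriveReady, SeekComplete and Error, which matches the second log line quoted above.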

2006-02-14 16:27:52

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:

> Justin Piszcz wrote:
> ..
>
>> ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>
>
> I wonder if the FUA logic is inserting cache-flush commands
> and perhaps the drive is rejecting those?
>
> Jeff, we really ought to be including the failed ATA opcode
> in those error messages!!
>
If such a thing were available as a patch then I too would apply it and
hopefully could provide useful feedback.

David
PS My problems:

http://marc.theaimsgroup.com/?l=linux-kernel&m=113769509617034&w=2
http://marc.theaimsgroup.com/?l=linux-ide&m=113828551519727&w=2
http://marc.theaimsgroup.com/?l=linux-ide&m=113829573105369&w=2
http://marc.theaimsgroup.com/?l=linux-ide&m=113933732903205&w=2

2006-02-14 17:12:26

Subject: Re: LibPATA code issues / 2.6.15.4

I would like to try the patch too, if available.

I got these errors when nothing (apparently) was going on.

[25158.676998] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[25158.677005] ata3: status=0x51 { DriveReady SeekComplete Error }
[25158.677009] ata3: error=0x04 { DriveStatusError }
[27306.663556] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[27306.663563] ata3: status=0x51 { DriveReady SeekComplete Error }
[27306.663567] ata3: error=0x04 { DriveStatusError }

On Tue, 14 Feb 2006, David Greaves wrote:

> Mark Lord wrote:
>
>> Justin Piszcz wrote:
>> ..
>>
>>> ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>>> ata3: status=0x51 { DriveReady SeekComplete Error }
>>> ata3: error=0x04 { DriveStatusError }
>>
>>
>> I wonder if the FUA logic is inserting cache-flush commands
>> and perhaps the drive is rejecting those?
>>
>> Jeff, we really ought to be including the failed ATA opcode
>> in those error messages!!
>>
> If such a thing were available as a patch then I too would apply it and
> hopefully could provide useful feedback.
>
> David
> PS My problems:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113769509617034&w=2
> http://marc.theaimsgroup.com/?l=linux-ide&m=113828551519727&w=2
> http://marc.theaimsgroup.com/?l=linux-ide&m=113829573105369&w=2
> http://marc.theaimsgroup.com/?l=linux-ide&m=113933732903205&w=2
>
>

2006-02-14 18:00:42

Subject: Re: LibPATA code issues / 2.6.15.4

On Tuesday 14 February 2006 12:12, Justin Piszcz wrote:
> I would like to try the patch too, if available.

Something like this: (for 2.6.16-rc3-git2, but should be okay on 2.6.15 also).

Untested: include the original SCSI opcode in printk's for libata SCSI errors,
to help understand where the errors are coming from.

Signed-Off-By: Mark Lord <[email protected]>

--- linux/drivers/scsi/libata-scsi.c.orig 2006-02-12 19:27:25.000000000 -0500
+++ linux/drivers/scsi/libata-scsi.c 2006-02-14 12:54:17.000000000 -0500
@@ -420,6 +420,7 @@
  *	@sk: the sense key we'll fill out
  *	@asc: the additional sense code we'll fill out
  *	@ascq: the additional sense code qualifier we'll fill out
+ *	@opcode: the original SCSI command opcode byte
  *
  *	Converts an ATA error into a SCSI error. Fill out pointers to
  *	SK, ASC, and ASCQ bytes for later use in fixed or descriptor
@@ -429,7 +430,7 @@
  *	spin_lock_irqsave(host_set lock)
  */
 void ata_to_sense_error(unsigned id, u8 drv_stat, u8 drv_err, u8 *sk, u8 *asc,
-			u8 *ascq)
+			u8 *ascq, u8 opcode)
 {
 	int i;

@@ -508,8 +509,8 @@
 		}
 	}
 	/* No error? Undecoded? */
-	printk(KERN_WARNING "ata%u: no sense translation for status: 0x%02x\n",
-	       id, drv_stat);
+	printk(KERN_WARNING "ata%u: no sense translation for op=0x%02x status: 0x%02x\n",
+	       id, opcode, drv_stat);

 	/* For our last chance pick, use medium read error because
 	 * it's much more common than an ATA drive telling you a write
@@ -520,8 +521,8 @@
 	*ascq = 0x04; /* "auto-reallocation failed" */

 translate_done:
-	printk(KERN_ERR "ata%u: translated ATA stat/err 0x%02x/%02x to "
-	       "SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, drv_stat, drv_err,
+	printk(KERN_ERR "ata%u: translated op=0x%02x ATA stat/err 0x%02x/%02x to "
+	       "SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, opcode, drv_stat, drv_err,
 	       *sk, *asc, *ascq);
 	return;
 }
@@ -562,7 +563,7 @@
 	 */
 	if (tf->command & (ATA_BUSY | ATA_DF | ATA_ERR | ATA_DRQ)) {
 		ata_to_sense_error(qc->ap->id, tf->command, tf->feature,
-				   &sb[1], &sb[2], &sb[3]);
+				   &sb[1], &sb[2], &sb[3], cmd->cmnd[0]);
 		sb[1] &= 0x0f;
 	}

@@ -637,7 +638,7 @@
 	 */
 	if (tf->command & (ATA_BUSY | ATA_DF | ATA_ERR | ATA_DRQ)) {
 		ata_to_sense_error(qc->ap->id, tf->command, tf->feature,
-				   &sb[2], &sb[12], &sb[13]);
+				   &sb[2], &sb[12], &sb[13], cmd->cmnd[0]);
 		sb[2] &= 0x0f;
 	}
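
To illustrate what the changed message looks like, here is a minimal userspace sketch that reuses the format string from the "translated" printk in the patch. It is not kernel code: format_translated_msg is a hypothetical stand-in that uses snprintf instead of printk, and it leaves out the KERN_ERR level and the trailing newline.

```c
#include <stdio.h>

/* Hypothetical userspace stand-in for the patched "translated" printk().
 * Returns the number of characters snprintf() would have written. */
static int format_translated_msg(char *buf, size_t len, unsigned id,
                                 unsigned char opcode,
                                 unsigned char drv_stat, unsigned char drv_err,
                                 unsigned char sk, unsigned char asc,
                                 unsigned char ascq)
{
    /* Same field order as the patched printk: the SCSI opcode comes first,
     * followed by the ATA status/error pair and the SCSI sense triple. */
    return snprintf(buf, len,
                    "ata%u: translated op=0x%02x ATA stat/err 0x%02x/%02x to "
                    "SCSI SK/ASC/ASCQ 0x%x/%02x/%02x",
                    id, (unsigned)opcode, (unsigned)drv_stat,
                    (unsigned)drv_err, (unsigned)sk, (unsigned)asc,
                    (unsigned)ascq);
}
```

The only difference from the unpatched message is the extra op= field, so existing log parsers that match on "translated ATA stat/err" would need adjusting.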

2006-02-14 18:06:05

Subject: Re: LibPATA code issues / 2.6.15.4

Thanks, I will reboot later tonight and see what type of error codes it
gives me.

Against 2.6.15.4:

# patch -p1 < /tmp/a
patching file drivers/scsi/libata-scsi.c
Hunk #1 succeeded at 404 (offset -16 lines).
Hunk #2 succeeded at 414 (offset -16 lines).
Hunk #3 succeeded at 493 (offset -16 lines).
Hunk #4 succeeded at 505 (offset -16 lines).
Hunk #5 succeeded at 547 (offset -16 lines).
Hunk #6 succeeded at 622 (offset -16 lines).
#

2006-02-14 23:58:39

Subject: Re: LibPATA code issues / 2.6.15.4

FYI:

I made a 100GB file, ran md5sum on it, copied it to the 'problem' drive and
ran md5sum again: the MD5 sums are the same.

box:/x8# /usr/bin/time dd if=/dev/zero of=100gb bs=1M count=100000 ; /usr/bin/time md5sum 100gb; /usr/bin/time cp 100gb /x4 ; cd /x4 ; /usr/bin/time md5sum 100gb
100000+0 records in
100000+0 records out
104857600000 bytes transferred in 4735.034107 seconds (22145057 bytes/sec)
0.29user 245.59system 1:18:55elapsed 5%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+210minor)pagefaults 0swaps
1e95cd44e2cb773f483ea7b2f676258d 100gb
248.24user 98.17system 32:50.97elapsed 17%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+188minor)pagefaults 0swaps
14.75user 341.92system 35:25.25elapsed 16%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (4major+183minor)pagefaults 0swaps
1e95cd44e2cb773f483ea7b2f676258d 100gb
246.95user 110.41system 28:06.49elapsed 21%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (1major+190minor)pagefaults 0swaps
box:/x4#

Also, all SMART tests passed with flying colors..

(FYI)

On Tue, 14 Feb 2006, Mark Lord wrote:

> Justin Piszcz wrote:
> ..
>> ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>
> I wonder if the FUA logic is inserting cache-flush commands
> and perhaps the drive is rejecting those?
>
> Jeff, we really ought to be including the failed ATA opcode
> in those error messages!!
>
> Cheers
>

2006-02-17 08:45:39

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> Jeff, we really ought to be including the failed ATA opcode
> in those error messages!!

Submit a patch...

Jeff

2006-02-17 14:59:41

Subject: Re: LibPATA code issues / 2.6.15.4

On Friday 17 February 2006 03:45, Jeff Garzik wrote:
>Submit a patch...

You mean, something like this one?
Untested at present, as I was hoping to hear
back from one of the original problem reporters
after they tested it.

Cheers!

-------- Original Message --------
Subject: Re: LibPATA code issues / 2.6.15.4
Date: Tue, 14 Feb 2006 13:00:36 -0500
From: Mark Lord <[email protected]>
To: Justin Piszcz <[email protected]>
CC: David Greaves <[email protected]>, Jeff Garzik <[email protected]>,
[email protected], IDE/ATA development list
<[email protected]>
References: <Pine.LNX.4.64.0602140439580.3567@p34>
<[email protected]> <Pine.LNX.4.64.0602141211350.10793@p34>

On Tuesday 14 February 2006 12:12, Justin Piszcz wrote:
> I would like to try the patch too, if available.

Something like this: (for 2.6.16-rc3-git2, but should be okay on 2.6.15 also).

Untested: include the original SCSI opcode in printk's for libata SCSI errors,
to help understand where the errors are coming from.

Signed-Off-By: Mark Lord <[email protected]>

--- linux/drivers/scsi/libata-scsi.c.orig 2006-02-12 19:27:25.000000000 -0500
+++ linux/drivers/scsi/libata-scsi.c 2006-02-14 12:54:17.000000000 -0500
@@ -420,6 +420,7 @@
  *	@sk: the sense key we'll fill out
  *	@asc: the additional sense code we'll fill out
  *	@ascq: the additional sense code qualifier we'll fill out
+ *	@opcode: the original SCSI command opcode byte
  *
  *	Converts an ATA error into a SCSI error. Fill out pointers to
  *	SK, ASC, and ASCQ bytes for later use in fixed or descriptor
@@ -429,7 +430,7 @@
  *	spin_lock_irqsave(host_set lock)
  */
 void ata_to_sense_error(unsigned id, u8 drv_stat, u8 drv_err, u8 *sk, u8 *asc,
-			u8 *ascq)
+			u8 *ascq, u8 opcode)
 {
 	int i;

@@ -508,8 +509,8 @@
 		}
 	}
 	/* No error? Undecoded? */
-	printk(KERN_WARNING "ata%u: no sense translation for status: 0x%02x\n",
-	       id, drv_stat);
+	printk(KERN_WARNING "ata%u: no sense translation for op=0x%02x status: 0x%02x\n",
+	       id, opcode, drv_stat);

 	/* For our last chance pick, use medium read error because
 	 * it's much more common than an ATA drive telling you a write
@@ -520,8 +521,8 @@
 	*ascq = 0x04; /* "auto-reallocation failed" */

 translate_done:
-	printk(KERN_ERR "ata%u: translated ATA stat/err 0x%02x/%02x to "
-	       "SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, drv_stat, drv_err,
+	printk(KERN_ERR "ata%u: translated op=0x%02x ATA stat/err 0x%02x/%02x to "
+	       "SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, opcode, drv_stat, drv_err,
 	       *sk, *asc, *ascq);
 	return;
 }
@@ -562,7 +563,7 @@
 	 */
 	if (tf->command & (ATA_BUSY | ATA_DF | ATA_ERR | ATA_DRQ)) {
 		ata_to_sense_error(qc->ap->id, tf->command, tf->feature,
-				   &sb[1], &sb[2], &sb[3]);
+				   &sb[1], &sb[2], &sb[3], cmd->cmnd[0]);
 		sb[1] &= 0x0f;
 	}

@@ -637,7 +638,7 @@
 	 */
 	if (tf->command & (ATA_BUSY | ATA_DF | ATA_ERR | ATA_DRQ)) {
 		ata_to_sense_error(qc->ap->id, tf->command, tf->feature,
-				   &sb[2], &sb[12], &sb[13]);
+				   &sb[2], &sb[12], &sb[13], cmd->cmnd[0]);
 		sb[2] &= 0x0f;
 	}

2006-02-17 15:01:10

Subject: Re: LibPATA code issues / 2.6.15.4

I have patched the kernel and rebooted it with your patch, but, of course,
with my luck it has not given me any errors since, even when repeating
major file copies, bonnie++ and iozone!! :(

2006-02-18 20:43:32

by Sander

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote (ao):
> On Friday 17 February 2006 03:45, Jeff Garzik wrote:
> >Submit a patch...
>
> You mean, something like this one?
> Untested at present, as I was hoping to hear
> back from one of the original problem reporters
> after they tested it.

Not the original reporter, but your patch Works For Me.
I get these:

[ 633.449961] md: md1: sync done.
[ 633.456070] RAID5 conf printout:
[ 633.456117] --- rd:9 wd:9 fd:0
[ 633.456164] disk 0, o:1, dev:sda2
[ 633.456208] disk 1, o:1, dev:sdb2
[ 633.456250] disk 2, o:1, dev:sdc2
[ 633.456298] disk 3, o:1, dev:sdd2
[ 633.456340] disk 4, o:1, dev:sde2
[ 633.456383] disk 5, o:1, dev:sdf2
[ 633.456427] disk 6, o:1, dev:sdg2
[ 633.456470] disk 7, o:1, dev:sdh2
[ 633.456514] disk 8, o:1, dev:sdi2
[ 787.639858] kjournald starting. Commit interval 5 seconds
[ 787.657991] EXT3 FS on md1, internal journal
[ 787.658023] EXT3-fs: mounted filesystem with writeback data mode.
[ 1872.338185] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[ 1872.338239] ata6: status=0xd0 { Busy }
[ 5749.285084] ata8: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[ 5749.285138] ata8: status=0xd0 { Busy }
[ 5906.008461] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[ 5906.008515] ata6: status=0xd0 { Busy }
[ 9892.904205] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[ 9892.904259] ata6: status=0xd0 { Busy }
[10146.084687] ata5: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[10146.084740] ata5: status=0xd0 { Busy }
[10293.949040] ata5: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[10293.949093] ata5: status=0xd0 { Busy }

Can you tell from this what they mean?

This is with 2.6.16-rc3, your patch, and running nine Maxtors disks
over onboard nForce4 and MV88SX6081 8-port SATA II PCI-X Controller (rev 09).

for i in `seq 10`
do dd if=/dev/zero of=bigfile.$i bs=1024k count=10000
done
md5sum bigfile.*

The errors mostly seem to happen during the md5sum (not during the dd).

I do not see data corruption or slowdown.

I do need a chunksize of 512k for the raid5. With anything lower (I tried
the default 64k, 128k, 256k, 512k and 4096k) I get data corruption and
the errors reported in:
http://marc.theaimsgroup.com/?l=linux-ide&m=114016903530007&w=2

Thanks!

Sander

--
Humilis IT Services and Solutions
http://www.humilis.net

2006-02-18 21:42:54

Subject: Re: LibPATA code issues / 2.6.15.4

Sander wrote:
> Mark Lord wrote (ao):
>> On Friday 17 February 2006 03:45, Jeff Garzik wrote:
>>> Submit a patch...
>> You mean, something like this one?
...
> [ 633.449961] md: md1: sync done.
> [ 633.456070] RAID5 conf printout:
> [ 633.456117] --- rd:9 wd:9 fd:0
...
> [ 1872.338185] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
> [ 1872.338239] ata6: status=0xd0 { Busy }
> [ 5749.285084] ata8: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
> [ 5749.285138] ata8: status=0xd0 { Busy }
> [ 5906.008461] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
> [ 5906.008515] ata6: status=0xd0 { Busy }
...
> This is with 2.6.16-rc3, your patch, and running nine Maxtors disks
> over onboard nForce4 and MV88SX6081 8-port SATA II PCI-X Controller (rev 09).
>
> for i in `seq 10`
> do dd if=/dev/zero of=bigfile.$i bs=1024k count=10000
> done
> md5sum bigfile.*
>
> The errors mostly seem to happen during the md5sum (not during the dd).

SCSI opcode 0x2a is WRITE_10, so the errors are being reported
in response to the writes to bigfile.$i. But these are different
from the previously reported error status values -- I wonder why
it's getting "Busy" back as a status here ??
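
A small lookup along these lines can help when reading the op= field in the patched messages. It is only a sketch: the function names are made up for illustration, and the table lists a handful of common opcodes from the SCSI command set, not everything libata handles.

```c
#include <stddef.h>

/* A few common SCSI opcodes; deliberately incomplete. */
static const struct {
    unsigned char opcode;
    const char *name;
} scsi_opcodes[] = {
    { 0x00, "TEST_UNIT_READY" },
    { 0x03, "REQUEST_SENSE" },
    { 0x12, "INQUIRY" },
    { 0x25, "READ_CAPACITY" },
    { 0x28, "READ_10" },
    { 0x2a, "WRITE_10" },
    { 0x35, "SYNCHRONIZE_CACHE" },
};

/* Hypothetical helper: map an opcode byte to a readable name. */
static const char *scsi_opcode_name(unsigned char opcode)
{
    size_t i;

    for (i = 0; i < sizeof(scsi_opcodes) / sizeof(scsi_opcodes[0]); i++) {
        if (scsi_opcodes[i].opcode == opcode)
            return scsi_opcodes[i].name;
    }
    return "UNKNOWN";
}

/* Hypothetical helper: true for the WRITE(6), WRITE(10), WRITE(12) and
 * WRITE(16) opcodes. */
static int scsi_opcode_is_write(unsigned char opcode)
{
    return opcode == 0x0a || opcode == 0x2a ||
           opcode == 0xaa || opcode == 0x8a;
}
```

With this table, the 0x2a in Sander's log maps to WRITE_10, and an op=0x28 would map to READ_10.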

2006-02-18 21:52:06

Subject: Re: LibPATA code issues / 2.6.15.4

$ for i in `seq 10`
> do dd if=/dev/zero of=bigfile.$i bs=1024k count=10000
> done
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 190.997693 seconds (54899930 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 212.242724 seconds (49404568 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 189.324450 seconds (55385134 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 190.280352 seconds (55106898 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 191.567239 seconds (54736708 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 183.640928 seconds (57099254 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 179.974098 seconds (58262606 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 190.126087 seconds (55151611 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 192.227807 seconds (54548612 bytes/sec)
10000+0 records in
10000+0 records out
10485760000 bytes transferred in 185.309607 seconds (56585086 bytes/sec)
war@p34:/x4$ md5sum bigfile.*
26f56024ac39cdc54b228820107f040d bigfile.1
26f56024ac39cdc54b228820107f040d bigfile.10
26f56024ac39cdc54b228820107f040d bigfile.2
26f56024ac39cdc54b228820107f040d bigfile.3
26f56024ac39cdc54b228820107f040d bigfile.4
26f56024ac39cdc54b228820107f040d bigfile.5
26f56024ac39cdc54b228820107f040d bigfile.6
26f56024ac39cdc54b228820107f040d bigfile.7
26f56024ac39cdc54b228820107f040d bigfile.8
26f56024ac39cdc54b228820107f040d bigfile.9

No errors in dmesg yet (for my issue).

On Sat, 18 Feb 2006, Mark Lord wrote:

> Sander wrote:
>> Mark Lord wrote (ao):
>>> On Friday 17 February 2006 03:45, Jeff Garzik wrote:
>>>> Submit a patch...
>>> You mean, something like this one?
> ...
>> [ 633.449961] md: md1: sync done.
>> [ 633.456070] RAID5 conf printout:
>> [ 633.456117] --- rd:9 wd:9 fd:0
> ...
>> [ 1872.338185] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
>> SK/ASC/ASCQ 0xb/47/00
>> [ 1872.338239] ata6: status=0xd0 { Busy }
>> [ 5749.285084] ata8: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
>> SK/ASC/ASCQ 0xb/47/00
>> [ 5749.285138] ata8: status=0xd0 { Busy }
>> [ 5906.008461] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
>> SK/ASC/ASCQ 0xb/47/00
>> [ 5906.008515] ata6: status=0xd0 { Busy }
> ...
>> This is with 2.6.16-rc3, your patch, and running nine Maxtors disks
>> over onboard nForce4 and MV88SX6081 8-port SATA II PCI-X Controller (rev
>> 09).
>>
>> for i in `seq 10`
>> do dd if=/dev/zero of=bigfile.$i bs=1024k count=10000
>> done
>> md5sum bigfile.*
>>
>> The errors mostly seem to happen during the md5sum (not during the dd).
>
> SCSI opcode 0x2a is WRITE_10, so the errors are being reported
> in response to the writes to bigfile.$i. But these are different
> from the previously reported error status values -- I wonder why
> it's getting "Busy" back as a status here ??
>

2006-02-19 07:14:14

by Sander

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote (ao):
> Sander wrote:
> >Mark Lord wrote (ao):
> >>On Friday 17 February 2006 03:45, Jeff Garzik wrote:
> >>>Submit a patch...
> >>You mean, something like this one?
> ...
> >[ 633.449961] md: md1: sync done.
> >[ 633.456070] RAID5 conf printout:
> >[ 633.456117] --- rd:9 wd:9 fd:0
> ...
> >[ 1872.338185] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
> >SK/ASC/ASCQ 0xb/47/00
> >[ 1872.338239] ata6: status=0xd0 { Busy }
> >[ 5749.285084] ata8: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
> >SK/ASC/ASCQ 0xb/47/00
> >[ 5749.285138] ata8: status=0xd0 { Busy }
> >[ 5906.008461] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
> >SK/ASC/ASCQ 0xb/47/00
> >[ 5906.008515] ata6: status=0xd0 { Busy }
> ...
> >This is with 2.6.16-rc3, your patch, and running nine Maxtors disks
> >over onboard nForce4 and MV88SX6081 8-port SATA II PCI-X Controller (rev
> >09).
> >
> >for i in `seq 10`
> >do dd if=/dev/zero of=bigfile.$i bs=1024k count=10000
> >done
> >md5sum bigfile.*
> >
> >The errors mostly seem to happen during the md5sum (not during the dd).
>
> SCSI opcode 0x2a is WRITE_10, so the errors are being reported
> in response to the writes to bigfile.$i.

Ah, my bad then.

> But these are different from the previously reported error status
> values -- I wonder why it's getting "Busy" back as a status here ??

Well, as I wrote, I am not the original reporter whose thread you
responded to with your patch. I just thought I could use it to get
better error messages for my bug reports.

I am using the sata_mv driver, which is beta. That might explain why it
behaves not totally as expected in your eyes. I have no clue anyway :-)

I hope my reports are of some use to Jeff wrt the sata_mv driver.

Thank you for your response.

Sander

--
Humilis IT Services and Solutions
http://www.humilis.net

2006-02-19 15:31:02

Subject: Re: LibPATA code issues / 2.6.15.4

Sander wrote:
> Mark Lord wrote (ao):
>> Sander wrote:
>>> Mark Lord wrote (ao):
>>>> On Friday 17 February 2006 03:45, Jeff Garzik wrote:
>>>>> Submit a patch...
>>>> You mean, something like this one?
>> ...
>>> [ 633.449961] md: md1: sync done.
>>> [ 633.456070] RAID5 conf printout:
>>> [ 633.456117] --- rd:9 wd:9 fd:0
>> ...
>>> [ 1872.338185] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
>>> SK/ASC/ASCQ 0xb/47/00
>>> [ 1872.338239] ata6: status=0xd0 { Busy }
>>> [ 5749.285084] ata8: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
>>> SK/ASC/ASCQ 0xb/47/00
>>> [ 5749.285138] ata8: status=0xd0 { Busy }
>>> [ 5906.008461] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
>>> SK/ASC/ASCQ 0xb/47/00
>>> [ 5906.008515] ata6: status=0xd0 { Busy }
...
>> SCSI opcode 0x2a is WRITE_10, so the errors are being reported
>> in response to the writes to bigfile.$i.
...
> I am using the sata_mv driver, which is beta. That might explain why it
> behaves not totally as expected in your eyes. I have no clue anyway :-)

Ahh.. that's useful to know. I expect to be taking a long hard look
at the innards of the sata_mv code in the near future, so whatever is
wrong here just might get fixed soon.

Cheers

2006-02-19 17:16:49

by Sander

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote (ao):
> Sander wrote:
> >Mark Lord wrote (ao):
> >>Sander wrote:
> >>>Mark Lord wrote (ao):
> >>>>On Friday 17 February 2006 03:45, Jeff Garzik wrote:
> >>>>>Submit a patch...
> >>>>You mean, something like this one?
> >>...
> >>>[ 633.449961] md: md1: sync done.
> >>>[ 633.456070] RAID5 conf printout:
> >>>[ 633.456117] --- rd:9 wd:9 fd:0
> >>...
> >>>[ 1872.338185] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
> >>>SK/ASC/ASCQ 0xb/47/00
> >>>[ 1872.338239] ata6: status=0xd0 { Busy }
> >>>[ 5749.285084] ata8: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
> >>>SK/ASC/ASCQ 0xb/47/00
> >>>[ 5749.285138] ata8: status=0xd0 { Busy }
> >>>[ 5906.008461] ata6: translated op=0x2a ATA stat/err 0xd0/00 to SCSI
> >>>SK/ASC/ASCQ 0xb/47/00
> >>>[ 5906.008515] ata6: status=0xd0 { Busy }
> ...
> >>SCSI opcode 0x2a is WRITE_10, so the errors are being reported
> >>in response to the writes to bigfile.$i.
> ...
> >I am using the sata_mv driver, which is beta. That might explain why it
> >behaves not totally as expected in your eyes. I have no clue anyway :-)
>
> Ahh.. that's useful to know.

I'm sorry for omitting that information in my previous mail.

> I expect to be taking a long hard look at the innards of the sata_mv
> code in the near future, so whatever is wrong here just might get
> fixed soon.

Consider me your happy and willing patch test victim :-)

I can easily reproduce data corruption with sata_mv.

FWIW, I like this card very much. It is cheap, seems to perform well,
and Marvell seems to be Linux friendly, providing the docs (according to
http://linux-ata.org/sata-status.html#marvell).

I'm not subscribed to linux-ide, but am to linux-kernel. If you post it
there (or cc me) I'll see and try it.

Sander

--
Humilis IT Services and Solutions
http://www.humilis.net

2006-02-23 23:39:28

Subject: Re: LibPATA code issues / 2.6.15.4

I have reproduced the error with the patched kernel!

Here it is:

[263864.109854] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[263864.109861] ata3: status=0x51 { DriveReady SeekComplete Error }
[263864.109866] ata3: error=0x04 { DriveStatusError }

Here is how I got it to error:

$ for i in `seq 1 1000`; do dd if=/dev/zero of=file.$i bs=1M count=$i; done

Now, how to fix? :)

2006-02-25 11:34:08

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:

>On Tuesday 14 February 2006 12:12, Justin Piszcz wrote:
>
>
>>I would like to try the patch too, if available.
>>
>>
>
>Something like this: (for 2.6.16-rc3-git2, but should be okay on 2.6.15 also).
>
>Untested: include the original SCSI opcode in printk's for libata SCSI errors,
>to help understand where the errors are coming from.
>
>Signed-Off-By: Mark Lord <[email protected]>
>
>
Thanks Mark - I've finally gotten this patch applied.

With smartd disabled and no smart commands issued, a readonly badblocks
scan of /dev/sdb2 shows no problems and now gives:
Feb 25 10:38:31 haze kernel: ata2: status=0x51 { DriveReady SeekComplete
Error }
Feb 25 10:38:32 haze kernel: ata2: no sense translation for op=0x28
status: 0x51
Feb 25 10:38:32 haze kernel: ata2: status=0x51 { DriveReady SeekComplete
Error }
Feb 25 10:38:35 haze kernel: ata2: no sense translation for op=0x28
status: 0x51
hundreds of times.

and during boot I can get:
ata2: no sense translation for op=0x28 status: 0x51
ata2: translated op=0x28 ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
Installing knfsd (copyright (C) 1996 [email protected]).
ata2: no sense translation for op=0x28 status: 0x51
ata2: translated op=0x28 ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x28 status: 0x51
ata2: translated op=0x28 ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }

Subsequently a
smartctl -data -a /dev/sdb
shows no errors.
So could this be a faulty disk that SMART reports as OK, with no read
or write errors logged?

The other problem I noticed was that
smartctl -o on -data /dev/sda
still just gives:
Feb 25 10:51:47 haze kernel: ata1: PIO error
Feb 25 10:51:47 haze kernel: ata1: status=0x51 { DriveReady SeekComplete
Error }
Feb 25 10:51:47 haze kernel: ata1: error=0x04 { DriveStatusError }
Feb 25 10:51:47 haze kernel: ata1: PIO error
Feb 25 10:51:47 haze kernel: ata1: status=0x51 { DriveReady SeekComplete
Error }
Feb 25 10:51:47 haze kernel: ata1: error=0x04 { DriveStatusError }
Feb 25 10:51:47 haze kernel: ata1: PIO error
many times.

I get similar problems for all the drives under both sata_sil and sata_via.

Linux haze 2.6.15patchsata #6 PREEMPT Fri Feb 24 19:15:07 UTC 2006 i686
GNU/Linux

libata version 1.20 loaded.
sata_sil 0000:00:0a.0: version 0.9
ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 16 (level, low) -> IRQ 17
ata1: SATA max UDMA/100 cmd 0xF8804080 ctl 0xF880408A bmdma 0xF8804000
irq 17
ata2: SATA max UDMA/100 cmd 0xF88040C0 ctl 0xF88040CA bmdma 0xF8804008
irq 17
ata1: dev 0 cfg 49:2f00 82:7869 83:7d09 84:4043 85:7869 86:3c01 87:4043
88:203f
ata1: dev 0 ATA-7, max UDMA/100, 390721968 sectors: LBA48
ata1: dev 0 configured for UDMA/100
scsi0 : sata_sil
ata2: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4063 85:7c69 86:3e01 87:4063
88:007f
ata2: dev 0 ATA-7, max UDMA/133, 398297088 sectors: LBA48
ata2: dev 0 configured for UDMA/100
scsi1 : sata_sil
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
sata_via 0000:00:0f.0: version 1.1
ACPI: PCI Interrupt 0000:00:0f.0[B] -> GSI 20 (level, low) -> IRQ 16
sata_via 0000:00:0f.0: routed to hard irq line 0
ata3: SATA max UDMA/133 cmd 0x9800 ctl 0x9402 bmdma 0x8400 irq 16
ata4: SATA max UDMA/133 cmd 0x9000 ctl 0x8802 bmdma 0x8408 irq 16
ata3: dev 0 cfg 49:2f00 82:346b 83:7d01 84:4003 85:3468 86:3c01 87:4003
88:407f
ata3: dev 0 ATA-6, max UDMA/133, 312581808 sectors: LBA48
ata3: dev 0 configured for UDMA/133
scsi2 : sata_via
ata4: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4063 85:7c68 86:3e01 87:4063
88:407f
ata4: dev 0 ATA-7, max UDMA/133, 398297088 sectors: LBA48
ata4: dev 0 configured for UDMA/133
scsi3 : sata_via
Vendor: ATA Model: ST3160023AS Rev: 3.18
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 390721968 512-byte hdwr sectors (200050 MB)
SCSI device sda: drive cache: write back
sda: sda1
sd 0:0:0:0: Attached scsi disk sda
SCSI device sdb: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdb: drive cache: write back
SCSI device sdb: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdb: drive cache: write back
sdb: sdb1 sdb2
sd 1:0:0:0: Attached scsi disk sdb
SCSI device sdc: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sdc: drive cache: write back
SCSI device sdc: 312581808 512-byte hdwr sectors (160042 MB)
SCSI device sdc: drive cache: write back
sdc: sdc1 sdc2 sdc3 sdc4
sd 2:0:0:0: Attached scsi disk sdc
SCSI device sdd: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdd: drive cache: write back
SCSI device sdd: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sdd: drive cache: write back
sdd: sdd1 sdd2
sd 3:0:0:0: Attached scsi disk sdd
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: Attached scsi generic sg1 type 0
sd 2:0:0:0: Attached scsi generic sg2 type 0
sd 3:0:0:0: Attached scsi generic sg3 type 0

David

--

2006-02-25 15:32:32

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
> I have reproduced the error with the patched kernel!
>
> Here it is:
>
> [263864.109854] ata3: translated ATA stat/err 0x51/04 to SCSI
> SK/ASC/ASCQ 0xb/00/00
> [263864.109861] ata3: status=0x51 { DriveReady SeekComplete Error }
> [263864.109866] ata3: error=0x04 { DriveStatusError }

Nope.. patch not present, as otherwise the line above would have
read something like this:

> [263864.109854] ata3: translated op=0x21 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00

So we didn't get the extra info since the patch wasn't present.

Cheers

2006-02-25 15:58:39

Subject: Re: LibPATA code issues / 2.6.15.4

The kernel is patched, if you did not get what you wanted maybe the patch
does not work in some instances or there is a bug?

On Sat, 25 Feb 2006, Mark Lord wrote:

> Justin Piszcz wrote:
>> I have reproduced the error with the patched kernel!
>>
>> Here it is:
>>
>> [263864.109854] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ
>> 0xb/00/00
>> [263864.109861] ata3: status=0x51 { DriveReady SeekComplete Error }
>> [263864.109866] ata3: error=0x04 { DriveStatusError }
>
> Nope.. patch not present, as otherwise the line above would have
> read something like this:
>
>> [263864.109854] ata3: translated op=0x21 ATA stat/err 0x51/04 to SCSI
> SK/ASC/ASCQ 0xb/00/00
>
> So we didn't get the extra info since the patch wasn't present.
>
> Cheers
>

2006-02-25 16:11:49

by Jesper Juhl

Subject: Re: LibPATA code issues / 2.6.15.4

On 2/25/06, Justin Piszcz <[email protected]> wrote:

Please don't top-post.

> The kernel is patched, if you did not get what you wanted maybe the patch
> does not work in some instances or there is a bug?
>

You may have patched a kernel source with Mark's patch, but you are
very clearly not running a kernel built from that patched source.

As can be seen from (for example) this bit from Mark's patch

translate_done:
- printk(KERN_ERR "ata%u: translated ATA stat/err 0x%02x/%02x to "
- "SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, drv_stat, drv_err,
+ printk(KERN_ERR "ata%u: translated op=0x%02x ATA stat/err 0x%02x/%02x to "
+ "SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, opcode, drv_stat, drv_err,
*sk, *asc, *ascq);

the patch changes the text being printed. In this case the text
"ata%u: translated ATA stat/err ..." is changed into "ata%u:
translated op=0x%02x ATA stat/err ..."

And if we look at the output you posted :

> >> Here it is:
> >>
> >> [263864.109854] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ
> >> 0xb/00/00

That string is clearly from an un-patched kernel as Mark also pointed
out in his reply to you.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-02-25 16:20:04

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
..
> Thanks Mark - I've finally gotten this patch applied.
>
> With smartd disabled and no smart commands issued, a readonly badblocks
> scan of /dev/sdb2 shows no problems and now gives:
> Feb 25 10:38:31 haze kernel: ata2: status=0x51 { DriveReady SeekComplete
> Error }
> Feb 25 10:38:32 haze kernel: ata2: no sense translation for op=0x28
> status: 0x51
> Feb 25 10:38:32 haze kernel: ata2: status=0x51 { DriveReady SeekComplete
> Error }
> Feb 25 10:38:35 haze kernel: ata2: no sense translation for op=0x28
> status: 0x51
> hundreds of times.
..

Mmmm.. okay, it's happening due to a SCSI READ_10 opcode,
which means it isn't being triggered by any of the FUA stuff.

But there's still no obvious reason for the error.
The drive is basically just saying "command rejected",
and libata-scsi is translating that into "medium error"
for some unknown reason.
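
The two translations seen in this thread (0x51/04 to 0xb/00/00, and
0x51/00 to 0x3/11/04 after "no sense translation") can be reproduced with
a much simplified sketch. It assumes only what the logs and the patch
context show: an aborted command maps to ABORTED COMMAND, and anything
unrecognised falls through to the "last chance" medium error. The
`demo_*` names are hypothetical, the bit names other than the three seen
in the logs are my own labels, and this is not libata's real table
(for instance it does not cover the Busy case).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* ATA status register bits. */
#define ST_ERR  0x01  /* Error */
#define ST_DRQ  0x08  /* data request */
#define ST_DSC  0x10  /* SeekComplete */
#define ST_DF   0x20  /* device fault */
#define ST_DRDY 0x40  /* DriveReady */
#define ST_BSY  0x80  /* Busy */
/* ATA error register bit: command aborted ("DriveStatusError" in the logs). */
#define ER_ABRT 0x04

struct demo_sense { uint8_t sk, asc, ascq; };

/* Fill buf with the names of the status bits that are set. */
void demo_status_names(uint8_t stat, char *buf, size_t len)
{
    static const struct { uint8_t bit; const char *name; } names[] = {
        { ST_BSY, "Busy" }, { ST_DRDY, "DriveReady" }, { ST_DF, "DeviceFault" },
        { ST_DSC, "SeekComplete" }, { ST_DRQ, "DataRequest" }, { ST_ERR, "Error" },
    };
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        if (!(stat & names[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;              /* output truncated, stop here */
        used += (size_t)n;
    }
}

/* Simplified status/error to SCSI sense translation (two cases only). */
struct demo_sense demo_to_sense(uint8_t stat, uint8_t err)
{
    struct demo_sense s = { 0, 0, 0 };

    if (!(stat & (ST_BSY | ST_DF | ST_ERR | ST_DRQ)))
        return s;               /* no error reported */
    if (err & ER_ABRT) {
        s.sk = 0x0b;            /* ABORTED COMMAND, 0xb/00/00 */
        return s;
    }
    /* Nothing recognised: the fallback that produces "medium error". */
    s.sk = 0x03;
    s.asc = 0x11;
    s.ascq = 0x04;
    return s;
}
```

With an empty error register there is nothing to match, so the fallback
fires; that would explain why a rejected command ends up reported as
"Unrecovered read error - auto reallocate failed".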

Unfortunately, the design of the current libata is such that
we no longer have access to the actual ATA opcode that was rejected.
It gets overwritten by the returned drive status on completion.

So.. I need to generate another patch for you now, to save/show
the real ATA opcode that was used to cause the errors.
My theory is that we'll discover that it is one that your drive
legitimately is rejecting (unsupported LBA48 or something..).

But we won't know until we see the output.

Second patch is attached: apply *in addition* to the first one.

Cheers

Attachments:

12_libata_ata_opcode.patch (5.84 kB)

2006-02-25 16:21:31

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
> The kernel is patched, if you did not get what you wanted maybe the
> patch does not work in some instances or there is a bug?

No, the output would be there if those messages came from the patched kernel.
(read the patch and see what I mean..).

Cheers

2006-02-25 17:45:10

Subject: Re: LibPATA code issues / 2.6.15.4

Second patch fails for me.

On a clean 2.6.15.4 source tree:

p34:/usr/src# ls -ld linux
lrwxrwxrwx 1 root src 14 2006-02-25 12:41 linux -> linux-2.6.15.4/

The one from your e-mail earlier:
p34:/usr/src/linux# patch -p1 < /tmp/patch1
patching file drivers/scsi/libata-scsi.c
Hunk #1 succeeded at 404 (offset -16 lines).
Hunk #2 succeeded at 414 (offset -16 lines).
Hunk #3 succeeded at 493 (offset -16 lines).
Hunk #4 succeeded at 505 (offset -16 lines).
Hunk #5 succeeded at 547 (offset -16 lines).
Hunk #6 succeeded at 622 (offset -16 lines).

p34:/usr/src/linux# patch -p1 < /tmp/12_libata_ata_opcode.patch
patching file drivers/scsi/libata-core.c
Hunk #1 succeeded at 245 (offset -8 lines).
Hunk #2 succeeded at 267 (offset -8 lines).
Hunk #3 succeeded at 288 (offset -8 lines).
Hunk #4 succeeded at 310 (offset -8 lines).
Hunk #5 succeeded at 500 (offset -8 lines).
Hunk #6 FAILED at 626.
1 out of 6 hunks FAILED -- saving rejects to file
drivers/scsi/libata-core.c.rej
patching file drivers/scsi/libata-scsi.c
Hunk #1 succeeded at 414 (offset -24 lines).
Hunk #2 succeeded at 493 (offset -24 lines).
Hunk #3 FAILED at 505.
Hunk #4 succeeded at 547 (offset -24 lines).
Hunk #5 succeeded at 622 (offset -24 lines).
Hunk #6 succeeded at 1308 (offset -29 lines).
1 out of 6 hunks FAILED -- saving rejects to file
drivers/scsi/libata-scsi.c.rej
patching file include/linux/ata.h
Hunk #1 succeeded at 239 (offset -5 lines).
patching file include/linux/libata.h
Hunk #1 succeeded at 368 (offset -52 lines).
Hunk #2 succeeded at 452 (offset -60 lines).
p34:/usr/src/linux#

Should I be using 2.6.16-rcX?

On Sat, 25 Feb 2006, Mark Lord wrote:

> David Greaves wrote:
> ..
>> Thanks Mark - I've finally gotten this patch applied.
>>
>> With smartd disabled and no smart commands issued, a readonly badblocks
>> scan of /dev/sdb2 shows no problems and now gives:
>> Feb 25 10:38:31 haze kernel: ata2: status=0x51 { DriveReady SeekComplete
>> Error }
>> Feb 25 10:38:32 haze kernel: ata2: no sense translation for op=0x28
>> status: 0x51
>> Feb 25 10:38:32 haze kernel: ata2: status=0x51 { DriveReady SeekComplete
>> Error }
>> Feb 25 10:38:35 haze kernel: ata2: no sense translation for op=0x28
>> status: 0x51
>> hundreds of times.
> ..
>
> Mmmm.. okay, it's happening due to a SCSI READ_10 opcode,
> which means it isn't being triggered by any of the FUA stuff.
>
> But there's still no obvious reason for the error.
> The drive is basically just saying "command rejected",
> and libata-scsi is translating that into "medium error"
> for some unknown reason.
>
> Unfortunately, the design of the current libata is such that
> we no longer have access to the actual ATA opcode that was rejected.
> It gets overwritten by the returned drive status on completion.
>
> So.. I need to generate another patch for you now, to save/show
> the real ATA opcode that was used to cause the errors.
> My theory is that we'll discover that it is one that your drive
> legitimately is rejecting (unsupported LBA48 or something..).
>
> But we won't know until we see the output.
>
> Second patch is attached: apply *in addition* to the first one.
>
> Cheers
>
>

2006-02-25 18:28:14

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
> Second patch fails for me.
..
> Should I be using 2.6.16-rcX?

Mmm... that's what I'm using (plus other patches),
so, yes.. give that a try. 2.6.16 does seem to
be shaping up to be a nice kernel.

Cheers

2006-02-25 18:56:04

Subject: Re: LibPATA code issues / 2.6.15.4

I will give 2.6.16-rcX a try shortly, here is the error again (with a
freshly patched 2.6.15.4) just to rule out any problems with the first
time that I patched:

[ 1037.451784] ata3: translated op=0x2a ATA stat/err 0x51/04 to SCSI
SK/ASC/ASCQ 0xb/00/00
[ 1037.451791] ata3: status=0x51 { DriveReady SeekComplete Error }
[ 1037.451796] ata3: error=0x04 { DriveStatusError }
[ 1517.050496] ata3: no sense translation for op=0x2a status: 0x51
[ 1517.050504] ata3: translated op=0x2a ATA stat/err 0x51/00 to SCSI
SK/ASC/ASCQ 0x3/11/04
[ 1517.050506] ata3: status=0x51 { DriveReady SeekComplete Error }

On Sat, 25 Feb 2006, Mark Lord wrote:

> Justin Piszcz wrote:
>> Second patch fails for me.
> ..
>> Should I be using 2.6.16-rcX?
>
> Mmm... that's what I'm using (plus other patches),
> so, yes.. give that a try. 2.6.16 does seem to
> be shaping up to be a nice kernel.
>
> Cheers
>

2006-02-25 19:29:23

Subject: Re: LibPATA code issues / 2.6.15.4

Which kernel did you run your patch against?

With 2.6.16-rc4....

First patch looks good..

p34:/usr/src/linux# patch -p1 < /tmp/patch1
patching file drivers/scsi/libata-scsi.c

p34:/usr/src/linux# patch -p1 < /tmp/12_libata_ata_opcode.patch
patching file drivers/scsi/libata-core.c
Hunk #1 succeeded at 245 (offset -8 lines).
Hunk #2 succeeded at 267 (offset -8 lines).
Hunk #3 succeeded at 288 (offset -8 lines).
Hunk #4 succeeded at 310 (offset -8 lines).
Hunk #5 succeeded at 500 (offset -8 lines).
Hunk #6 succeeded at 626 (offset -8 lines).
patching file drivers/scsi/libata-scsi.c
Hunk #1 succeeded at 430 (offset -8 lines).
Hunk #2 succeeded at 509 (offset -8 lines).
Hunk #3 FAILED at 521.
Hunk #4 succeeded at 563 (offset -8 lines).
Hunk #5 succeeded at 638 (offset -8 lines).
Hunk #6 succeeded at 1329 (offset -8 lines).
1 out of 6 hunks FAILED -- saving rejects to file
drivers/scsi/libata-scsi.c.rej
patching file include/linux/ata.h
patching file include/linux/libata.h
Hunk #1 succeeded at 373 (offset -47 lines).
Hunk #2 succeeded at 463 (offset -49 lines).
p34:/usr/src/linux# ls -ld /usr/src/linux
lrwxrwxrwx 1 root src 16 2006-02-25 14:24 /usr/src/linux ->
linux-2.6.16-rc4/
p34:/usr/src/linux#

Here is the *.rej file:

# cat libata-scsi.c.rej
***************
*** 521,528 ****
*ascq = 0x04; /* "auto-reallocation failed" */

translate_done:
- DPRINTK(KERN_ERR "ata%u: translated op=0x%02x ATA stat/err 0x%02x/%02x to "
"SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, opcode, drv_stat, drv_err,
*sk, *asc, *ascq);
return;
}
--- 521,528 ----
*ascq = 0x04; /* "auto-reallocation failed" */

translate_done:
+ DPRINTK(KERN_ERR "ata%u: translated op=0x%02x cmd=0x%02x ATA stat/err 0x%02x/%02x to "
"SCSI SK/ASC/ASCQ 0x%x/%02x/%02x\n", id, opcode, cmd, drv_stat, drv_err,
*sk, *asc, *ascq);
return;
}

On Sat, 25 Feb 2006, Mark Lord wrote:

> Justin Piszcz wrote:
>> Second patch fails for me.
> ..
>> Should I be using 2.6.16-rcX?
>
> Mmm... that's what I'm using (plus other patches),
> so, yes.. give that a try. 2.6.16 does seem to
> be shaping up to be a nice kernel.
>
> Cheers
>

2006-02-25 19:46:57

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:

> Justin Piszcz wrote:
>
>> Should I be using 2.6.16-rcX?
>
>
> Mmm... that's what I'm using (plus other patches),
> so, yes.. give that a try. 2.6.16 does seem to
> be shaping up to be a nice kernel.

OK, failed for me too - I updated to 2.6.16-rc4 and it still failed
(despite -F) so I fixed it by hand (printk -> DPRINTK).

anyway:
Linux haze 2.6.16-rc4patched #1 PREEMPT Sat Feb 25 19:29:11 UTC 2006
i686 GNU/Linux

ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
sd 1:0:0:0: SCSI error: return code = 0x8000002
sdb: Current: sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sdb, sector 398283329
raid1: Disk failure on sdb2, disabling device.
Operation continuing on 1 devices

and later...

device-mapper: 4.5.0-ioctl (2005-10-04) initialised: [email protected]
XFS mounting filesystem dm-0
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata1: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
ata1: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata1: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
ata1: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata1: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
ata1: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata1: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 390716735
raid5: Disk failure on sda1, disabling device. Operation continuing on 2
devices
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
sd 1:0:0:0: SCSI error: return code = 0x8000002
sdb: Current: sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sdb, sector 390716735
raid5: Disk failure on sdb1, disabling device. Operation continuing on 1
devices
RAID5 conf printout:
--- rd:3 wd:1 fd:2
disk 0, o:1, dev:sdd1
disk 1, o:0, dev:sdb1
disk 2, o:0, dev:sda1
xfs_force_shutdown(dm-0,0x1) called from line 338 of file
fs/xfs/xfs_rw.c. Return address = 0xc020c0e9
Filesystem "dm-0": I/O Error Detected. Shutting down filesystem: dm-0
Please umount the filesystem, and rectify the problem(s)
I/O error in filesystem ("dm-0") meta-data dev dm-0 block
0x640884a ("xlog_bwrite") error 5 buf count 262144
XFS: failed to locate log tail
XFS: log mount/recovery failed: error 5
XFS: log mount failed
RAID5 conf printout:
--- rd:3 wd:1 fd:2
disk 0, o:1, dev:sdd1
disk 1, o:0, dev:sdb1
RAID5 conf printout:
--- rd:3 wd:1 fd:2
disk 0, o:1, dev:sdd1
disk 1, o:0, dev:sdb1
RAID5 conf printout:
--- rd:3 wd:1 fd:2
disk 0, o:1, dev:sdd1

So I guess my raid just blew up too... hope there's no corruption!

David

(PS Hi Mark, this is lbt from the Empeg BBS :) )

--

2006-02-25 19:53:44

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:

> Which kernel did you run your patch against?
>
> With 2.6.16-rc4....
>
> First patch looks good..
>
Justin, I'll help you out off-list :)

David

2006-02-26 02:28:01

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
>
> Linux haze 2.6.16-rc4patched #1 PREEMPT Sat Feb 25 19:29:11 UTC 2006
> i686 GNU/Linux
>
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: error=0x04 { DriveStatusError }
> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
> ata2: status=0x51 { DriveReady SeekComplete Error }
> sd 1:0:0:0: SCSI error: return code = 0x8000002
> sdb: Current: sense key: Medium Error
> Additional sense: Unrecovered read error - auto reallocate failed
> end_request: I/O error, dev sdb, sector 398283329
> raid1: Disk failure on sdb2, disabling device.
> Operation continuing on 1 devices

Oh good, *now* we've gotten somewhere!!

Albert / Jens / Jeff:

The command failing above is SCSI WRITE_10, which is being
translated into ATA_CMD_WRITE_FUA_EXT by libata.
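
For anyone decoding the op=/cmd= values by hand, a small lookup sketch
follows. The numeric values are the standard SCSI and ATA opcodes; the
function names are hypothetical and the tables only cover opcodes that
appear in this thread plus the plain write and flush commands.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Name for the SCSI opcode printed as op=0x.. */
const char *demo_scsi_op_name(uint8_t op)
{
    switch (op) {
    case 0x28: return "READ_10";
    case 0x2a: return "WRITE_10";
    default:   return "unknown";
    }
}

/* Name for the ATA command printed as cmd=0x.. or "command 0x.. timeout" */
const char *demo_ata_cmd_name(uint8_t cmd)
{
    switch (cmd) {
    case 0x25: return "READ DMA EXT";
    case 0x35: return "WRITE DMA EXT";
    case 0x3d: return "WRITE DMA FUA EXT";
    case 0xb0: return "SMART";
    case 0xe7: return "FLUSH CACHE";
    case 0xea: return "FLUSH CACHE EXT";
    default:   return "unknown";
    }
}

/* Check the pair from the log line "op=0x2a cmd=0x3d". */
void demo_selfcheck(void)
{
    assert(strcmp(demo_scsi_op_name(0x2a), "WRITE_10") == 0);
    assert(strcmp(demo_ata_cmd_name(0x3d), "WRITE DMA FUA EXT") == 0);
}
```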

This command fails -- unrecognized by the drive in question.
But libata reports it (most incorrectly) as a "medium error",
and the drive is taken out of service from its RAID.
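
One way to cross-check whether a drive claims FUA support is IDENTIFY
DEVICE word 84, which the boot log prints as "84:xxxx" in each "cfg"
line. The sketch below assumes my reading of the ATA spec is right: the
word is valid when bit 14 is set and bit 15 is clear, and bit 6 then
advertises WRITE DMA FUA EXT. The function name is hypothetical and the
values are copied from David's boot log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if IDENTIFY word 84 is valid and advertises WRITE DMA FUA EXT. */
bool demo_id_claims_fua(uint16_t word84)
{
    if ((word84 & 0xC000) != 0x4000)
        return false;               /* word contents not valid */
    return (word84 & (1u << 6)) != 0;
}

/* Values from the "cfg ... 84:xxxx" lines in the boot log. */
void demo_selfcheck(void)
{
    assert(demo_id_claims_fua(0x4043));   /* ata1 */
    assert(demo_id_claims_fua(0x4063));   /* ata2 and ata4 */
    assert(!demo_id_claims_fua(0x4003));  /* ata3 */
}
```

If that reading holds, the drives on ata1, ata2 and ata4 advertise the
command even though it gets aborted, so checking the IDENTIFY bit alone
may not be enough to avoid the failure.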

Bad, bad, and worse.

Libata should really recover from this, by recognizing that
the command was rejected, and replacing it with a simple
WRITE_EXT instead. Possibly followed by FLUSH_CACHE.

So.. I've forgotten who put FUA into libata, but hopefully
it's one of the folks on the CC: list, and that nice person
can now generate a patch to fix this bug somehow.

Cheers

2006-02-26 09:56:19

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:

>> sdb: Current: sense key: Medium Error
>> Additional sense: Unrecovered read error - auto reallocate failed
>> end_request: I/O error, dev sdb, sector 398283329
>> raid1: Disk failure on sdb2, disabling device.
>> Operation continuing on 1 devices
>
>
> Oh good, *now* we've gotten somewhere!!
>
> Albert / Jens / Jeff:
>
> The command failing above is SCSI WRITE_10, which is being
> translated into ATA_CMD_WRITE_FUA_EXT by libata.
>
> This command fails -- unrecognized by the drive in question.
> But libata reports it (most incorrectly) as a "medium error",
> and the drive is taken out of service from its RAID.
>
> Bad, bad, and worse.
>
> Libata should really recover from this, by recognizing that
> the command was rejected, and replacing it with a simple
> WRITE_EXT instead. Possibly followed by FLUSH_CACHE.
>
> So.. I've forgotten who put FUA into libata, but hopefully
> it's one of the folks on the CC: list, and that nice person
> can now generate a patch to fix this bug somehow.

Thanks Mark

I'm glad it's a bug and not bad hardware.

I am quite concerned that the basic effect of just booting a practically
vanilla 2.6.16-rc4 like this was to fry my raid array.

Luckily it dropped 2 (of 3) disks so quickly that the event counter was
the same allowing an easy rebuild.

2.6.15 has similar issues but they seem to happen *very* infrequently by
comparison - this hit me several times during a single boot.

Should Linus (cc'ed) hold off on 2.6.16 because of this or not?

David

2006-02-26 12:27:08

by James Courtier-Dutton

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> David Greaves wrote:
>>
>> Linux haze 2.6.16-rc4patched #1 PREEMPT Sat Feb 25 19:29:11 UTC 2006
>> i686 GNU/Linux
>>
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: error=0x04 { DriveStatusError }
>> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> sd 1:0:0:0: SCSI error: return code = 0x8000002
>> sdb: Current: sense key: Medium Error
>> Additional sense: Unrecovered read error - auto reallocate failed
>> end_request: I/O error, dev sdb, sector 398283329
>> raid1: Disk failure on sdb2, disabling device.
>> Operation continuing on 1 devices
>
> Oh good, *now* we've gotten somewhere!!
>
> Albert / Jens / Jeff:
>
> The command failing above is SCSI WRITE_10, which is being
> translated into ATA_CMD_WRITE_FUA_EXT by libata.
>
> This command fails -- unrecognized by the drive in question.
> But libata reports it (most incorrectly) as a "medium error",
> and the drive is taken out of service from its RAID.
>
> Bad, bad, and worse.
>

I have what looks like similar problems. The issue I have is that I
don't think the problem is ONLY libata related.
I have two linux PCs. One called "games", the other called "localhost".
The problem happens quite quickly on the old "games" machine, but I can
run for days/weeks until I see the problem on the "localhost".
It might be happening on the "localhost", but I am just not noticing.
The difference being that if reiserfs sees this error, it cannot
recover, and I have reiserfs on the "games" machine. The "localhost"
only uses ext3, and ext3 recovers gracefully from this problem.
Can I use libata on this old "games" machine? It is an old Pentium 3
machine.
In any case, the "games" machine is currently switched off until I can
find a kernel that works, so I will happily test different kernels and
patches, if people have suggestions.

I have two desktop linux machines. One is an old Pentium 3 which shows
the following errors(no libata involved):
Linux version 2.6.15-rc4 (root@games) (gcc version 4.0.3 20051111
(prerelease) (Debian 4.0.2-4)
) #1 Sat Dec 3 18:47:19 GMT 2005
Dec 16 22:51:57 games kernel: hdc: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=53058185, sector=53057951
Dec 16 22:52:32 games kernel: ide: failed opcode was: unknown
Dec 16 22:52:32 games kernel: end_request: I/O error, dev hdc, sector
53057951
Dec 16 22:52:32 games kernel: hdc: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x10 {
SectorIdNotFound }, LBAsect=53058185, sector=53057959
Dec 16 22:52:32 games kernel: ide: failed opcode was: unknown

The other has the following errors:
Linux version 2.6.15.1 (root@localhost) (gcc version 3.4.5 (Gentoo
3.4.5, ssp-3.4.5-1.0, pi
e-8.7.9)) #3 SMP PREEMPT Fri Feb 3 23:19:05 GMT 2006
Feb 10 23:30:07 localhost kernel: ata3: command 0xb0 timeout, stat 0xd0
host_stat 0x0
Feb 10 23:30:07 localhost kernel: ata3: translated ATA stat/err 0xd0/00
to SCSI SK/ASC/ASCQ 0xb/47/00
Feb 10 23:30:07 localhost kernel: ata3: status=0xd0 { Busy }
Feb 10 23:30:07 localhost kernel: ATA: abnormal status 0xD0 on port
0xF880E087
Feb 10 23:30:07 localhost last message repeated 3 times
Feb 10 23:30:10 localhost kernel: ata3: PIO error
Feb 10 23:30:10 localhost kernel: ata3: status=0x50 { DriveReady
SeekComplete }
Feb 11 10:18:10 localhost kernel: ata2: command 0xb0 timeout, stat 0xd0
host_stat 0x0
Feb 11 10:18:10 localhost kernel: ata2: translated ATA stat/err 0xd0/00
to SCSI SK/ASC/ASCQ 0xb/47/00
Feb 11 10:18:10 localhost kernel: ata2: status=0xd0 { Busy }
Feb 11 10:18:10 localhost kernel: ATA: abnormal status 0xD0 on port 0x177
Feb 11 10:18:10 localhost last message repeated 3 times

2006-02-26 12:55:05

Subject: Re: LibPATA code issues / 2.6.15.4

James Courtier-Dutton wrote:

> I have two desktop linux machines. One is an old Pentium 3 which shows
> the following errors(no libata involved):
> Linux version 2.6.15-rc4 (root@games) (gcc version 4.0.3 20051111
> (prerelease) (Debian 4.0.2-4)
> ) #1 Sat Dec 3 18:47:19 GMT 2005
> Dec 16 22:51:57 games kernel: hdc: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=53058185, sector=53057951
> Dec 16 22:52:32 games kernel: ide: failed opcode was: unknown
> Dec 16 22:52:32 games kernel: end_request: I/O error, dev hdc, sector
> 53057951
> Dec 16 22:52:32 games kernel: hdc: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x10 {
> SectorIdNotFound }, LBAsect=53058185, sector=53057959
> Dec 16 22:52:32 games kernel: ide: failed opcode was: unknown

This looks like a simple bad disk drive. Notice that the sectors are
quite close.
If you like you can move the drive to a working machine and run a
badblocks on it.
do 'man badblocks' before you start.
Is it SMART capable? What does
smartctl -a /dev/hdc
show?

ddrescue may be your friend if you need to recover data.

Reply offlist if this is the case.

> The other has the following errors:
> Linux version 2.6.15.1 (root@localhost) (gcc version 3.4.5 (Gentoo
> 3.4.5, ssp-3.4.5-1.0, pi
> e-8.7.9)) #3 SMP PREEMPT Fri Feb 3 23:19:05 GMT 2006
> Feb 10 23:30:07 localhost kernel: ata3: command 0xb0 timeout, stat
> 0xd0 host_stat 0x0
> Feb 10 23:30:07 localhost kernel: ata3: translated ATA stat/err
> 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
> Feb 10 23:30:07 localhost kernel: ata3: status=0xd0 { Busy }
> Feb 10 23:30:07 localhost kernel: ATA: abnormal status 0xD0 on port
> 0xF880E087
> Feb 10 23:30:07 localhost last message repeated 3 times
> Feb 10 23:30:10 localhost kernel: ata3: PIO error
> Feb 10 23:30:10 localhost kernel: ata3: status=0x50 { DriveReady
> SeekComplete }
> Feb 11 10:18:10 localhost kernel: ata2: command 0xb0 timeout, stat
> 0xd0 host_stat 0x0
> Feb 11 10:18:10 localhost kernel: ata2: translated ATA stat/err
> 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
> Feb 11 10:18:10 localhost kernel: ata2: status=0xd0 { Busy }
> Feb 11 10:18:10 localhost kernel: ATA: abnormal status 0xD0 on port 0x177
> Feb 11 10:18:10 localhost last message repeated 3 times

Have you got smartd running?
I get a similar problem running some smartctl commands (-s on and -o on).
I suspect this is a libata ATA passthru problem - but I'm *guessing* :)

check the last messages in dmesg then run
smartctl -data -s on /dev/sd...
smartctl -data -o on /dev/sd...
See if there are new messages in dmesg

David

--

2006-02-26 13:56:15

Subject: Re: LibPATA code issues / 2.6.15.4

James Courtier-Dutton wrote:
>
> I have what looks like similar problems. The issue I have is that I

Nope. Different issues.

> ) #1 Sat Dec 3 18:47:19 GMT 2005
> Dec 16 22:51:57 games kernel: hdc: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=53058185, sector=53057951

The disk really does have bad sectors in this case (above).

> The other has the following errors:
> Linux version 2.6.15.1 (root@localhost) (gcc version 3.4.5 (Gentoo
> 3.4.5, ssp-3.4.5-1.0, pi
> e-8.7.9)) #3 SMP PREEMPT Fri Feb 3 23:19:05 GMT 2006
> Feb 10 23:30:07 localhost kernel: ata3: command 0xb0 timeout, stat 0xd0
> host_stat 0x0
> Feb 10 23:30:07 localhost kernel: ata3: translated ATA stat/err 0xd0/00
> to SCSI SK/ASC/ASCQ 0xb/47/00
> Feb 10 23:30:07 localhost kernel: ata3: status=0xd0 { Busy }
> Feb 10 23:30:07 localhost kernel: ATA: abnormal status 0xD0 on port
> 0xF880E087
> Feb 10 23:30:07 localhost last message repeated 3 times
> Feb 10 23:30:10 localhost kernel: ata3: PIO error
> Feb 10 23:30:10 localhost kernel: ata3: status=0x50 { DriveReady
> SeekComplete }
> Feb 11 10:18:10 localhost kernel: ata2: command 0xb0 timeout, stat 0xd0
> host_stat 0x0
> Feb 11 10:18:10 localhost kernel: ata2: translated ATA stat/err 0xd0/00
> to SCSI SK/ASC/ASCQ 0xb/47/00
> Feb 11 10:18:10 localhost kernel: ata2: status=0xd0 { Busy }
> Feb 11 10:18:10 localhost kernel: ATA: abnormal status 0xD0 on port 0x177
> Feb 11 10:18:10 localhost last message repeated 3 times

PIO errors? Are you using Alan Cox's experimental PATA code for libata?

-ml

2006-02-26 14:04:24

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
> Mark Lord wrote:
>
>>> sdb: Current: sense key: Medium Error
>>> Additional sense: Unrecovered read error - auto reallocate failed
>>> end_request: I/O error, dev sdb, sector 398283329
>>> raid1: Disk failure on sdb2, disabling device.
>>> Operation continuing on 1 devices
..
>> The command failing above is SCSI WRITE_10, which is being
>> translated into ATA_CMD_WRITE_FUA_EXT by libata.
>>
>> This command fails -- unrecognized by the drive in question.
>> But libata reports it (most incorrectly) as a "medium error",
>> and the drive is taken out of service from its RAID.
>>
>> Bad, bad, and worse.
..
> Thanks Mark
>
> I'm glad it's a bug and not bad hardware.
>
> I am quite concerned that the basic effect of just booting a practically
> vanilla 2.6.16-rc4 like this was to fry my raid array.
>
> Luckily it dropped 2 (of 3) disks so quickly that the event counter was
> the same allowing an easy rebuild.
>
> 2.6.15 has similar issues but they seem to happen *very* infrequently by
> comparison - this hit me several times during a single boot.
>
> Should Linus (cc'ed) hold off on 2.6.16 because of this or not?

Well, no doubt whatsoever about it being a "regression",
since the FUA code is *new* in 2.6.16 (not present in 2.6.15).

The FUA code should either get fixed, or removed from 2.6.16.

Cheers

2006-02-26 14:30:33

by James Courtier-Dutton

Subject: Kernel SeekCompleteErrors... Different from Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> James Courtier-Dutton wrote:
>>
>> I have what looks like similar problems. The issue I have is that I
>
> Nope. Different issues.
I have changed the Subject line to indicate this, so that any future
responses can be told apart.

>
>> ) #1 Sat Dec 3 18:47:19 GMT 2005
>> Dec 16 22:51:57 games kernel: hdc: dma_intr: status=0x51 { DriveReady
>> SeekComplete Error }
>> Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x40 {
>> UncorrectableError }, LBAsect=53058185, sector=53057951
>
> The disk really does have bad sectors in this case (above).
The disk has NO bad sectors. It has been checked using two different tests:
1) SeaTools (the Seagate test tool; it passed the deep test, which reads
all sectors).
2) dd of the entire HD image onto another HD.
No sector errors were encountered in either case.

>
>
>> The other has the following errors:
>> Linux version 2.6.15.1 (root@localhost) (gcc version 3.4.5 (Gentoo
>> 3.4.5, ssp-3.4.5-1.0, pi
>> e-8.7.9)) #3 SMP PREEMPT Fri Feb 3 23:19:05 GMT 2006
>> Feb 10 23:30:07 localhost kernel: ata3: command 0xb0 timeout, stat
>> 0xd0 host_stat 0x0
>> Feb 10 23:30:07 localhost kernel: ata3: translated ATA stat/err
>> 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
>> Feb 10 23:30:07 localhost kernel: ata3: status=0xd0 { Busy }
>> Feb 10 23:30:07 localhost kernel: ATA: abnormal status 0xD0 on port
>> 0xF880E087
>> Feb 10 23:30:07 localhost last message repeated 3 times
>> Feb 10 23:30:10 localhost kernel: ata3: PIO error
>> Feb 10 23:30:10 localhost kernel: ata3: status=0x50 { DriveReady
>> SeekComplete }
>> Feb 11 10:18:10 localhost kernel: ata2: command 0xb0 timeout, stat
>> 0xd0 host_stat 0x0
>> Feb 11 10:18:10 localhost kernel: ata2: translated ATA stat/err
>> 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
>> Feb 11 10:18:10 localhost kernel: ata2: status=0xd0 { Busy }
>> Feb 11 10:18:10 localhost kernel: ATA: abnormal status 0xD0 on port
>> 0x177
>> Feb 11 10:18:10 localhost last message repeated 3 times
>
> PIO errors? Are you using Alan Cox's experimental PATA code for libata?
>
> -ml
>
No, this is Linux kernel 2.6.15.1 with no patches.

I cut and pasted the Linux version number to the top of each trace
output in my original email.

2006-02-26 17:03:50

Subject: Re: Kernel SeekCompleteErrors... Different from Re: LibPATA code issues / 2.6.15.4

James Courtier-Dutton wrote:
> Mark Lord wrote:
>> James Courtier-Dutton wrote:
>>>
>>> I have what looks like similar problems. The issue I have is that I
>>
>> Nope. Different issues.
> I have changed the Subject line to indicate this so any future responses
> can be indicated.
>
>>
>>> ) #1 Sat Dec 3 18:47:19 GMT 2005
>>> Dec 16 22:51:57 games kernel: hdc: dma_intr: status=0x51 { DriveReady
>>> SeekComplete Error }
>>> Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x40 {
>>> UncorrectableError }, LBAsect=53058185, sector=53057951
>>
>> The disk really does have bad sectors in this case (above).
> The disk has NO bad sectors. It has been checked using two different tests.

The *only* test that matters is to enable S.M.A.R.T.,
and read out the error logs from it.

"smartctl" is the tool.

Cheers

2006-02-26 17:13:21

by Dr. David Alan Gilbert

Subject: Re: Kernel SeekCompleteErrors... Different from Re: LibPATA code issues / 2.6.15.4

* Mark Lord ([email protected]) wrote:
> >>James Courtier-Dutton wrote:
> >>>
> >>>I have what looks like similar problems. The issue I have is that I
> >>
> >>Nope. Different issues.
> >I have changed the Subject line to indicate this so any future responses
> >can be indicated.
> >
> >>
> >>>) #1 Sat Dec 3 18:47:19 GMT 2005
> >>>Dec 16 22:51:57 games kernel: hdc: dma_intr: status=0x51 { DriveReady
> >>>SeekComplete Error }
> >>>Dec 16 22:52:32 games kernel: hdc: dma_intr: error=0x40 {
> >>>UncorrectableError }, LBAsect=53058185, sector=53057951
> >>
> >>The disk really does have bad sectors in this case (above).
> >The disk has NO bad sectors. It has been checked using two different tests.
>
> The *only* test that matters is to enable S.M.A.R.T.,
> and read out the error logs from it.

I have seen a set of drives that has reported UncorrectableErrors
and:
* Shows the Uncorrectable error in the SMART log
* Passes a full SMART test
* Shows no remapped sectors
* Passes the vendor's drive test
* Now fully passes a dd if=/dev/hdx of=/dev/null with no errors.

They were a set of 250GB SATA drives by the same vendor; I've taken
them out one at a time as each did the same thing and replaced them
with another vendor's drives. They were all in use in a RAID-1 MD
configuration (under heavy load).

I do wonder about the 'uncorrectable error rate' that vendors report;
it doesn't seem very large - but I'll admit to not understanding its
units. Are soft non-repeatable uncorrectable errors expected in
principle? (Pointers to a good explanation of what this actually
means would be appreciated).

I do wonder how often this happens to people and if the read succeeds
again they just blame it on software.

Dave
--
-----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K| Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC & HPPA | In Hex /
\ _________________________|_____ http://www.treblig.org |_______/

2006-02-26 17:41:08

by Alan

Subject: Re: Kernel SeekCompleteErrors... Different from Re: LibPATA code issues / 2.6.15.4

On Sun, 2006-02-26 at 17:13 +0000, Dr. David Alan Gilbert wrote:
> > The *only* test that matters is to enable S.M.A.R.T.,
> > and read out the error logs from it.

SMART is unreliable for many cases

> I have seen a set of drives that has reported UncorrectableErrors
> and:
> * Shows the Uncorrectable error in the SMART log
> * Passes a full SMART test
> * Shows no remapped sectors
> * Passes the vendor's drive test
> * Now fully passes a dd if=/dev/hdx of=/dev/null with no errors.

The very early SATA code didn't decode the errors from the drive fully so
could produce bogus reports. The current code decodes it and also
displays the ATA level diagnostics so should be reliable.

2006-02-26 20:36:33

Subject: Re: Kernel SeekCompleteErrors... Different from Re: LibPATA code issues / 2.6.15.4

Alan Cox wrote:
>
> The very early SATA code didn't decode the errors from the drive fully so
> could produce bogus reports. The current code decodes it and also
> displays the ATA level diagnostics so should be reliable.

It still is unreliable, as is being discussed in another thread.

libata wrongly says "medium error" any time it issues a command
that the drive rejects (unsupported, invalid parameters, etc..).

This is biting a few people in 2.6.16-rc*, due to the FUA stuff.

2006-02-27 11:45:07

by Alan

Subject: Re: Kernel SeekCompleteErrors... Different from Re: LibPATA code issues / 2.6.15.4

On Sun, 2006-02-26 at 15:36 -0500, Mark Lord wrote:
> It still is unreliable, as is being discussed in another thread.
>
> libata wrongly says "medium error" any time it issues a command
> that the drive rejects (unsupported, invalid parameters, etc..).

It seems to still get a single case wrong. But it does still report the
ATA state correctly.

> This is biting a few people in 2.6.16-rc*, due to the FUA stuff.

It is driven by a table in

libata-scsi.c:ata_to_sense_error()

so if you can figure out the wrong entry and tweak the table that would
be great

2006-02-27 13:40:31

Subject: Re: Kernel SeekCompleteErrors... Different from Re: LibPATA code issues / 2.6.15.4

Alan Cox wrote:
> On Sun, 2006-02-26 at 15:36 -0500, Mark Lord wrote:
>> It still is unreliable, as is being discussed in another thread.
>>
>> libata wrongly says "medium error" any time it issues a command
>> that the drive rejects (unsupported, invalid parameters, etc..).
>
> It seems to still get a single case wrong. But it does still report the
> ATA state correctly.
>
>> This is biting a few people in 2.6.16-rc*, due to the FUA stuff.
>
> It is driven by a table in
>
> libata-scsi.c:ata_to_sense_error()
>
> so if you can figure out the wrong entry and tweak the table that would be great

It's the fall-through case, where the table is not used.

	/* No error? Undecoded? */
	printk(KERN_WARNING "ata%u: no sense translation for op=0x%02x status: 0x%02x\n",
	       id, opcode, drv_stat);

	/* For our last chance pick, use medium read error because
	 * it's much more common than an ATA drive telling you a write
	 * has failed.
	 */
	*sk = MEDIUM_ERROR;
	*asc = 0x11; /* "unrecovered read error" */
	*ascq = 0x04; /* "auto-reallocation failed" */

Cheers

2006-02-27 21:34:33

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
>> Mark Lord wrote:
>>
>>>> sdb: Current: sense key: Medium Error
>>>> Additional sense: Unrecovered read error - auto reallocate failed
>>>> end_request: I/O error, dev sdb, sector 398283329
>>>> raid1: Disk failure on sdb2, disabling device.
>>>> Operation continuing on 1 devices
> ..
>>> The command failing above is SCSI WRITE_10, which is being
>>> translated into ATA_CMD_WRITE_FUA_EXT by libata.
>>>
>>> This command fails -- unrecognized by the drive in question.
>>> But libata reports it (most incorrectly) as a "medium error",
>>> and the drive is taken out of service from its RAID.
>>>
>>> Bad, bad, and worse.

.. hold off on 2.6.16 because of this or not?

>
> Well, no doubt whatsoever about it being a "regression",
> since the FUA code is *new* in 2.6.16 (not present in 2.6.15).
>
> The FUA code should either get fixed, or removed from 2.6.16.

Actually, now that I've done a little more digging, this FUA stuff
is inherently dangerous as implemented. At least a few SATA controllers
include pipelines and whatnot that rely upon recognizing the (S)ATA
opcodes being used. And I sincerely doubt that any of those will
recognize the very newish (and aptly named..) FUA opcodes.

These may be unsafe in general, unless we tag controllers as
FUA-capable and NON-FUA-capable, in addition to tagging the drives.

:/

2006-02-28 01:33:09

by Tejun Heo

Subject: Re: LibPATA code issues / 2.6.15.4

Hello, Mark.

Mark Lord wrote:
>
> .. hold off on 2.6.16 because of this or not?
>

It certainly is dangerous. I guess we should turn off FUA for the time
being. Barrier auto-fallback was once implemented, but it didn't seem
like a good idea as it was too complex and hid low-level bugs from the
higher levels. The consensus seems to be to develop a blacklist of drives
which lie about FUA support (currently only one drive). The official kernel
doesn't seem to be the correct place to grow the blacklist; maybe we
should do it from -mm?

>>
>> Well, no doubt whatsoever about it being a "regression",
>> since the FUA code is *new* in 2.6.16 (not present in 2.6.15).
>>
>> The FUA code should either get fixed, or removed from 2.6.16.
>
>
> Actually, now that I've done a little more digging, this FUA stuff
> is inherently dangerous as implemented. At least a few SATA controllers
> include pipelines and whatnot that rely upon recognizing the (S)ATA
> opcodes being used. And I sincerely doubt that any of those will
> recognize the very newish (and aptly named..) FUA opcodes.
>
> These may be unsafe in general, unless we tag controllers as
> FUA-capable and NON-FUA-capable, in addition to tagging the drives.

All sii controllers and piix/ahci seem to handle FUA pretty ok. And
yeah, we may have to create controller blacklist too.

BTW, can you let me know what drive we're talking about now (model name
and firmware revision)?

--
tejun

2006-02-28 01:47:44

by Linus Torvalds

Subject: Re: LibPATA code issues / 2.6.15.4

On Tue, 28 Feb 2006, Tejun Heo wrote:

> Hello, Mark.
>
> Mark Lord wrote:
> >
> > .. hold off on 2.6.16 because of this or not?
> >
>
> It certainly is dangerous. I guess we should turn off FUA for the time being.
> Barrier auto-fallback was once implemented, but it didn't seem like a good idea
> as it was too complex and hid low-level bugs from the higher levels. The consensus
> seems to be to develop a blacklist of drives which lie about FUA support
> (currently only one drive). The official kernel doesn't seem to be the correct
> place to grow the blacklist; maybe we should do it from -mm?

For 2.6.16, the only sane solution for now is to just turn it off.

Somebody want to send me a patch that does that, along with an ack from
Mark (and whoever else sees this) that it fixes his/their problems?

Linus

2006-02-28 02:07:36

Subject: Re: LibPATA code issues / 2.6.15.4

Please pull from 'upstream-fixes' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

to receive the following updates:

drivers/scsi/libata-core.c | 4 ++++
drivers/scsi/libata-scsi.c | 2 ++
drivers/scsi/libata.h | 1 +
3 files changed, 7 insertions(+)

Jeff Garzik:
[libata] Disable FUA by default

diff --git a/drivers/scsi/libata-core.c b/drivers/scsi/libata-core.c
index 5f1d758..ab3c9a4 100644
--- a/drivers/scsi/libata-core.c
+++ b/drivers/scsi/libata-core.c
@@ -82,6 +82,10 @@ int atapi_enabled = 0;
module_param(atapi_enabled, int, 0444);
MODULE_PARM_DESC(atapi_enabled, "Enable discovery of ATAPI devices (0=off, 1=on)");

+int fua = 0;
+module_param(fua, int, 0444);
+MODULE_PARM_DESC(fua, "FUA support (0=off, 1=on)");
+
MODULE_AUTHOR("Jeff Garzik");
MODULE_DESCRIPTION("Library module for ATA devices");
MODULE_LICENSE("GPL");
diff --git a/drivers/scsi/libata-scsi.c b/drivers/scsi/libata-scsi.c
index 07b1e7c..5ce33ae 100644
--- a/drivers/scsi/libata-scsi.c
+++ b/drivers/scsi/libata-scsi.c
@@ -1708,6 +1708,8 @@ static int ata_dev_supports_fua(u16 *id)
 {
 	unsigned char model[41], fw[9];
 
+	if (!fua)
+		return 0;
 	if (!ata_id_has_fua(id))
 		return 0;

diff --git a/drivers/scsi/libata.h b/drivers/scsi/libata.h
index e03ce48..abfd18f 100644
--- a/drivers/scsi/libata.h
+++ b/drivers/scsi/libata.h
@@ -41,6 +41,7 @@ struct ata_scsi_args {

/* libata-core.c */
extern int atapi_enabled;
+extern int fua;
extern struct ata_queued_cmd *ata_qc_new_init(struct ata_port *ap,
struct ata_device *dev);
extern int ata_rwcmd_protocol(struct ata_queued_cmd *qc);

Attachments:

libata.txt (1.61 kB)

2006-02-28 02:15:36

by Linus Torvalds

Subject: Re: LibPATA code issues / 2.6.15.4

On Mon, 27 Feb 2006, Jeff Garzik wrote:
>
> I've had this waiting in the wings, in fact... [see attached]

I really hate having a _global_ variable called "fua". That's just bad
taste. I would suggest calling it "atapi_forced_unit_attention_enabled",
but maybe that is going a bit overboard. It's definitely better than just
"fua", though.

Linus

2006-02-28 02:52:43

Subject: Re: LibPATA code issues / 2.6.15.4

Linus Torvalds wrote:
>
> On Mon, 27 Feb 2006, Jeff Garzik wrote:
>
>>I've had this waiting in the wings, in fact... [see attached]
>
>
> I really hate having a _global_ variable called "fua". That's just bad
> taste. I would suggest calling it "atapi_forced_unit_attention_enabled",
> but maybe that is going a bit overboard. It's definitely better than just
> "fua", though.

<shrug> It will go away when things are fixed, and only users who are
testing will even bother with it.

Looking over the module subsystem, it looks like one could use
module_param_named() to achieve proper namespace separation (C versus
module opt) -- then you could call it libata_fua -- but for a temporary
module option it seems like more trouble than it's worth.

Jeff

2006-02-28 03:36:50

Subject: Re: LibPATA code issues / 2.6.15.4

Please pull from 'upstream-fixes' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

to receive the following updates:

drivers/scsi/libata-core.c | 4 ++++
drivers/scsi/libata-scsi.c | 2 ++
drivers/scsi/libata.h | 1 +
3 files changed, 7 insertions(+)

Jeff Garzik:
[libata] Disable FUA

diff --git a/drivers/scsi/libata-core.c b/drivers/scsi/libata-core.c
index 5f1d758..4f91b0d 100644
--- a/drivers/scsi/libata-core.c
+++ b/drivers/scsi/libata-core.c
@@ -82,6 +82,10 @@ int atapi_enabled = 0;
module_param(atapi_enabled, int, 0444);
MODULE_PARM_DESC(atapi_enabled, "Enable discovery of ATAPI devices (0=off, 1=on)");

+int libata_fua = 0;
+module_param_named(fua, libata_fua, int, 0444);
+MODULE_PARM_DESC(fua, "FUA support (0=off, 1=on)");
+
MODULE_AUTHOR("Jeff Garzik");
MODULE_DESCRIPTION("Library module for ATA devices");
MODULE_LICENSE("GPL");
diff --git a/drivers/scsi/libata-scsi.c b/drivers/scsi/libata-scsi.c
index 07b1e7c..59503c9 100644
--- a/drivers/scsi/libata-scsi.c
+++ b/drivers/scsi/libata-scsi.c
@@ -1708,6 +1708,8 @@ static int ata_dev_supports_fua(u16 *id)
 {
 	unsigned char model[41], fw[9];
 
+	if (!libata_fua)
+		return 0;
 	if (!ata_id_has_fua(id))
 		return 0;

diff --git a/drivers/scsi/libata.h b/drivers/scsi/libata.h
index e03ce48..fddaf47 100644
--- a/drivers/scsi/libata.h
+++ b/drivers/scsi/libata.h
@@ -41,6 +41,7 @@ struct ata_scsi_args {

/* libata-core.c */
extern int atapi_enabled;
+extern int libata_fua;
extern struct ata_queued_cmd *ata_qc_new_init(struct ata_port *ap,
struct ata_device *dev);
extern int ata_rwcmd_protocol(struct ata_queued_cmd *qc);

Attachments:

libata.txt (1.63 kB)

2006-02-28 04:12:00

Subject: Re: LibPATA code issues / 2.6.15.4

Jeff Garzik wrote:
> Linus Torvalds wrote:
..
>> I really hate having a _global_ variable called "fua". That's just bad
>> taste. I would suggest calling it "atapi_forced_unit_attention_enabled"

Heh heh..
It's actually short for "Force Unit Access",
though oddly enough I don't think the patch
mentions that in the MODULE_PARM_DESC().

> Here's the cleaner namespace version...

David, do you want to ack this one for us?

Cheers

2006-02-28 04:16:44

Subject: Re: LibPATA code issues / 2.6.15.4

Tejun Heo wrote:
..
>> These may be unsafe in general, unless we tag controllers as
>> FUA-capable and NON-FUA-capable, in addition to tagging the drives.
>
> All sii controllers and piix/ahci seem to handle FUA pretty ok. And
> yeah, we may have to create controller blacklist too.

Or maybe a whitelist instead, since nearly all existing hardware
pre-dates FUA commands.

Or maybe just have a libata function to test whether the FUA commands
actually work or not, before enabling them for general use.
*That* could be a much better approach, given the large number of
possible drive/controller combos, and it cuts down on the maintenance
headache of having to list everything on a list somewhere.

> BTW, can you let me know what drive we're talking about now (model name
> and firmware revision)?

David: we need to see the output from "hdparm --Istdout /dev/sda"
(or whichever drive it was that was failing on your system).

Cheers

2006-02-28 08:04:08

by Jens Axboe

Subject: Re: LibPATA code issues / 2.6.15.4

On Mon, Feb 27 2006, Linus Torvalds wrote:
>
>
> On Tue, 28 Feb 2006, Tejun Heo wrote:
>
> > Hello, Mark.
> >
> > Mark Lord wrote:
> > >
> > > .. hold off on 2.6.16 because of this or not?
> > >
> >
> > It certainly is dangerous. I guess we should turn off FUA for the
> > time being. Barrier auto-fallback was once implemented, but it
> > didn't seem like a good idea as it was too complex and hid low-level
> > bugs from the higher levels. The consensus seems to be to develop a
> > blacklist of drives which lie about FUA support (currently only one
> > drive). The official kernel doesn't seem to be the correct place to grow
> > the blacklist; maybe we should do it from -mm?
>
> For 2.6.16, the only sane solution for now is to just turn it off.
>
> Somebody want to send me a patch that does that, along with an ack from
> Mark (and whoever else sees this) that it fixes his/their problems?

That's the best solution right now. I guess there's no way around a
blacklist for FUA support and we need time to grow that :-(
And proper fallback to non-FUA writes with disabling FUA based barriers
as well.

Mark, what drive model+firmware are you using?

--
Jens Axboe

2006-02-28 10:27:26

by Alan

Subject: Re: LibPATA code issues / 2.6.15.4

On Mon, 2006-02-27 at 21:07 -0500, Jeff Garzik wrote:
> led, "Enable discovery of ATAPI devices (0=off, 1=on)");
>
> +int fua = 0;
> +module_param(fua, int, 0444);
> +MODULE_PARM_DESC(fua, "FUA support (0=off, 1=on)");
> +

Not a good name for a global.

2006-02-28 10:28:29

by Alan

Subject: Re: LibPATA code issues / 2.6.15.4

On Mon, 2006-02-27 at 23:16 -0500, Mark Lord wrote:
> Or maybe a whitelist instead, since nearly all existing hardware
> pre-dates FUA commands.

For controllers just add it as a host flag and it can be handled the
same way as LBA48 is right now. It may also be that some hosts can issue
FUA with a bit of bandaging (state machine resets, PIO, etc.)

Alan

2006-02-28 10:30:10

Subject: Re: LibPATA code issues / 2.6.15.4

On Tue, 28 Feb 2006, Alan Cox wrote:

> On Mon, 2006-02-27 at 23:16 -0500, Mark Lord wrote:
>> Or maybe a whitelist instead, since nearly all existing hardware
>> pre-dates FUA commands.
>
> For controllers just add it as a host flag and it can be handled the
> same way as LBA48 is right now. It may also be that some hosts can issue
> FUA with a bit of bandaging (state machine resets, PIO, etc.)
>
> Alan
>

While I have not yet been able to reproduce the problem with the verbose
patch, here is the hdparm -I:

/dev/sdc:

ATA device, with non-removable media
Model Number: WDC WD4000KD-00NAB0
Serial Number: WD-WMAMY1020930
Firmware Revision: 01.06A01
Standards:
Supported: 7 6 5 4
Likely used: 7
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 781422768
device size with M = 1024*1024: 381554 MBytes
device size with M = 1000*1000: 400088 MBytes (400 GB)
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, with device specific
minimum
R/W multiple sector transfer: Max = 16 Current = 0
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* NOP cmd
* READ BUFFER cmd
* WRITE BUFFER cmd
* Host Protected Area feature set
* Look-ahead
* Write cache
* Power Management feature set
Security Mode feature set
* SMART feature set
* FLUSH CACHE EXT command
* Mandatory FLUSH CACHE command
* Device Configuration Overlay feature set
* 48-bit Address feature set
Automatic Acoustic Management feature set
SET MAX security extension
* DOWNLOAD MICROCODE cmd
* General Purpose Logging feature set
* SMART self-test
* SMART error logging
Security:
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
Checksum: correct

2006-02-28 10:39:20

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:

> Tejun Heo wrote:
>
>> BTW, can you let me know what drive we're talking about now (model
>> name and firmware revision)?
>
>
> David: we need to see the output from "hdparm --Istdout /dev/sda"
> (or whichever drive it was that was failing on your system).
>
> Cheers
>
So here's the info for sda and sdb (see below for related log data).

/dev/sda:
IO_support = 0 (default 16-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 24321/255/63, sectors = 390721968, start = 0
0040 3fff c837 0010 0000 0000 003f 0000
0000 0000 4234 3033 3852 5248 2020 2020
2020 2020 2020 2020 0003 4000 0004 4241
4e43 3139 3830 4d61 7874 6f72 2036 4232
3030 4d30 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 2f00 4000 0200 0000 0007 3fff 0010
003f fc10 00fb 0100 ffff 0fff 0000 0007
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 0000 0002 0000 0000 0000
00fe 001e 7869 7d09 4043 7869 3c01 4043
203f 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 f1b0 1749 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0113 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 d3a5

/dev/sdb:
IO_support = 0 (default 16-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 24792/255/63, sectors = 398297088, start = 0
0040 3fff c837 0010 0000 0000 003f 0000
0000 0000 4234 3152 5641 3148 2020 2020
2020 2020 2020 2020 0003 4000 0004 4241
4e43 3142 5930 4d61 7874 6f72 2036 4232
3030 4d30 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 2f00 4000 0200 0000 0007 3fff 0010
003f fc10 00fb 0100 ffff 0fff 0000 0007
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 001f 0102 0000 0000 0000
00fe 001e 7c6b 7f09 4063 7c69 3e01 4063
207f 0000 0000 0000 fffe 0000 c0fe 0000
0000 0000 0000 0000 8800 17bd 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0001 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0113 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 d8a5

The info below is from the log I saved when booted with 2.6.16-rc4.
I got these errors:

sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 390716735
raid5: Disk failure on sda1, disabling device. Operation continuing on 2
devices
ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
ata2: status=0x51 { DriveReady SeekComplete Error }
sd 1:0:0:0: SCSI error: return code = 0x8000002
sdb: Current: sense key: Medium Error
Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sdb, sector 390716735
raid5: Disk failure on sdb1, disabling device. Operation continuing on 1
devices

They are both attached to:
libata version 1.20 loaded.
sata_sil 0000:00:0a.0: version 0.9
ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 16 (level, low) -> IRQ 17
ata1: SATA max UDMA/100 cmd 0xF8804080 ctl 0xF880408A bmdma 0xF8804000
irq 17
ata2: SATA max UDMA/100 cmd 0xF88040C0 ctl 0xF88040CA bmdma 0xF8804008
irq 17
ata1: SATA link up 1.5 Gbps (SStatus 113)
ata1: dev 0 cfg 49:2f00 82:7869 83:7d09 84:4043 85:7869 86:3c01 87:4043
88:203f
ata1: dev 0 ATA-7, max UDMA/100, 390721968 sectors: LBA48
ata1: dev 0 configured for UDMA/100
scsi0 : sata_sil
ata2: SATA link up 1.5 Gbps (SStatus 113)
ata2: dev 0 cfg 49:2f00 82:7c6b 83:7f09 84:4063 85:7c69 86:3e01 87:4063
88:007f
ata2: dev 0 ATA-7, max UDMA/133, 398297088 sectors: LBA48
ata2: dev 0 configured for UDMA/100
scsi1 : sata_sil
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
Type: Direct-Access ANSI SCSI revision: 05

Are there any other tests I should run, like swapping the disks to the
other controller (sata_via) and seeing what happens, with and without
the patch?

David

--

2006-02-28 14:37:16

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
>
> /dev/sda:
..
> 0040 3fff c837 0010 0000 0000 003f 0000
> 0000 0000 4234 3033 3852 5248 2020 2020
> 2020 2020 2020 2020 0003 4000 0004 4241
> 4e43 3139 3830 4d61 7874 6f72 2036 4232
> 3030 4d30 2020 2020 2020 2020 2020 2020
> 2020 2020 2020 2020 2020 2020 2020 8010
> 0000 2f00 4000 0200 0000 0007 3fff 0010
> 003f fc10 00fb 0100 ffff 0fff 0000 0007
> 0003 0078 0078 0078 0078 0000 0000 0000
> 0000 0000 0000 0000 0002 0000 0000 0000
> 00fe 001e 7869 7d09 4043 7869 3c01 4043
> 203f 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 f1b0 1749 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0113 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 0000
> 0000 0000 0000 0000 0000 0000 0000 d3a5
..
hdparm-6.4 says:

Model Number: Maxtor 6B200M0
Serial Number: B4038RRH
Firmware Revision: BANC1980

Commands/features:
    Enabled Supported:
       *    NOP cmd
       *    READ BUFFER cmd
       *    WRITE BUFFER cmd
       *    Look-ahead
       *    Write cache
       *    Power Management feature set
       *    SMART feature set
       *    FLUSH_CACHE_EXT
       *    Mandatory FLUSH_CACHE
       *    Device Configuration Overlay feature set
       *    48-bit Address feature set
            SET_MAX security extension
            Advanced Power Management feature set
       *    DOWNLOAD_MICROCODE
       *    WRITE_{DMA|MULTIPLE}_FUA_EXT
       *    SMART self-test
       *    SMART error logging

So, yes, the drive is either lying about "* WRITE_{DMA|MULTIPLE}_FUA_EXT",
or it didn't like the parameters it was given, or the SATA/IDE controller
chip didn't like the command.
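
For anyone who wants to check the FUA claim against the raw words rather than
the hdparm summary, here is a minimal sketch. It assumes the usual ATA layout
in which IDENTIFY word 84 is valid when bit 14 is set and bit 15 is clear, and
in which bit 6 advertises WRITE DMA/MULTIPLE FUA EXT. The function name
has_fua is made up for this example; the word values are copied from the
dumps posted in this thread.

```python
# Sketch only: test the FUA capability bit in IDENTIFY DEVICE word 84.
# Assumption: word 84 is valid when bits 15:14 read 01, and bit 6 means
# "WRITE DMA FUA EXT / WRITE MULTIPLE FUA EXT supported".

def has_fua(word84: int) -> bool:
    """Return True if IDENTIFY word 84 advertises the FUA write commands."""
    if (word84 & 0xC000) != 0x4000:
        return False  # word 84 does not carry valid information
    return bool(word84 & (1 << 6))


# Row holding words 80..87 of the /dev/sda dump quoted above.
row_80_87 = "00fe 001e 7869 7d09 4043 7869 3c01 4043"
words = [int(w, 16) for w in row_80_87.split()]
word84 = words[84 - 80]

print(hex(word84), has_fua(word84))
```

This only tells you what the drive advertises; it says nothing about whether
the controller passes the opcode through.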

Cheers

2006-02-28 14:38:50

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
..
> sd 0:0:0:0: SCSI error: return code = 0x8000002
> sda: Current: sense key: Medium Error
> Additional sense: Unrecovered read error - auto reallocate failed
> end_request: I/O error, dev sda, sector 390716735
> raid5: Disk failure on sda1, disabling device. Operation continuing on 2
> devices
> ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51
> ata2: status=0x51 { DriveReady SeekComplete Error }
> sd 1:0:0:0: SCSI error: return code = 0x8000002
> sdb: Current: sense key: Medium Error
> Additional sense: Unrecovered read error - auto reallocate failed
> end_request: I/O error, dev sdb, sector 390716735
> raid5: Disk failure on sdb1, disabling device. Operation continuing on 1
> devices
..

The error handling still sucks, regardless of FUA.
All of this nonsense about "Medium Error" is pure bogosity here.
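
To see what the quoted "op=0x2a cmd=0x3d" actually refers to, a small lookup
sketch follows. The op=/cmd= fields are the ones printed by the patched kernel
in the log above; the two tables are deliberately partial and contain only
opcodes relevant to this thread (SCSI opcode first, ATA command second). The
function name describe_ops is hypothetical.

```python
import re

# Partial tables, limited to opcodes that matter in this thread.
SCSI_OPS = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x85: "ATA PASS-THROUGH(16)",
}
ATA_CMDS = {
    0x3D: "WRITE DMA FUA EXT",
    0xB0: "SMART",
    0xCE: "WRITE MULTIPLE FUA EXT",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}

OP_CMD = re.compile(r"op=0x([0-9a-f]+) cmd=0x([0-9a-f]+)")


def describe_ops(line: str):
    """Return (SCSI opcode name, ATA command name) for a log line, or None."""
    match = OP_CMD.search(line)
    if match is None:
        return None
    op = int(match.group(1), 16)
    cmd = int(match.group(2), 16)
    return SCSI_OPS.get(op, "unknown"), ATA_CMDS.get(cmd, "unknown")


line = "ata2: no sense translation for op=0x2a cmd=0x3d status: 0x51"
print(describe_ops(line))
```

Read that way, the failing request is a SCSI write that was issued to the
drive as a FUA write, which is not what "Unrecovered read error" suggests.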

Cheers

2006-02-28 15:12:43

by Alan

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Tue, 2006-02-28 at 09:38 -0500, Mark Lord wrote:
>
> The error handling still sucks, regardless of FUA.
> All of this nonsense about "Medium Error" is pure bogosity here.

I've flipped my tree to report Aborted Command. Not sure there is a
better SCSI sense match for "it broke and I don't know why".

2006-02-28 15:31:57

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
>
> scsi1 : sata_sil
> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
> Type: Direct-Access ANSI SCSI revision: 05
> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
> Type: Direct-Access ANSI SCSI revision: 05

I wonder if the non-FUA component here is the sata_sil,
rather than the two Maxtor drives.

Also, your drives have different firmware,
but both have trouble with FUA here.

(sdb is slightly newer, and larger, than sda).

Cheers

2006-02-28 15:34:38

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> David Greaves wrote:
>
>>
>> scsi1 : sata_sil
>> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
>> Type: Direct-Access ANSI SCSI revision: 05
>> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
>> Type: Direct-Access ANSI SCSI revision: 05
>
>
> I wonder if the non-FUA component here is the sata_sil,
> rather than the two Maxtor drives.
>
> Also, your drives have different firmware,
> but both have trouble with FUA here.

sata_sil is indeed a piece of hardware that needs to know the opcodes
ahead of time...

Jeff

2006-02-28 16:57:37

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

those drives should support all FUA opcodes properly, both queued and unqueued

On 2/28/06, Jeff Garzik <[email protected]> wrote:
> Mark Lord wrote:
> > David Greaves wrote:
> >
> >>
> >> scsi1 : sata_sil
> >> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
> >> Type: Direct-Access ANSI SCSI revision: 05
> >> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
> >> Type: Direct-Access ANSI SCSI revision: 05
> >
> >
> > I wonder if the non-FUA component here is the sata_sil,
> > rather than the two Maxtor drives.
> >
> > Also, your drives have different firmware,
> > but both have trouble with FUA here.
>
> sata_sil is indeed a piece of hardware that needs to know the opcodes
> ahead of time...
>
> Jeff

2006-03-01 01:04:36

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Eric D. Mudama wrote:
> those drives should support all FUA opcodes properly, both queued and unqueued

His first drive (sda) does not support queued commands at all,
but the newer firmware in his second drive (sdb) does support NCQ.

Both drives support FUA.
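
The NCQ difference can be read straight from the IDENTIFY words too. A minimal
sketch, assuming the Serial ATA layout in which word 76 bit 8 advertises NCQ
and word 75 bits 4:0 hold the queue depth minus one. The helper names are
invented for the example; the two rows are words 72..79 from the /dev/sda and
/dev/sdd dumps posted in this thread.

```python
# Sketch only: read NCQ support and queue depth from IDENTIFY DEVICE words.
# Assumptions: word 76 bit 8 = NCQ supported (0x0000 and 0xffff mean the word
# is not valid); word 75 bits 4:0 = maximum queue depth minus one.

def has_ncq(word76: int) -> bool:
    if word76 in (0x0000, 0xFFFF):
        return False
    return bool(word76 & (1 << 8))


def queue_depth(word75: int) -> int:
    return (word75 & 0x1F) + 1


def words_72_79(row: str) -> list:
    return [int(w, 16) for w in row.split()]


sda = words_72_79("0000 0000 0000 0000 0002 0000 0000 0000")
sdd = words_72_79("0000 0000 0000 001f 0102 0000 0000 0000")

for name, row in (("sda", sda), ("sdd", sdd)):
    word75, word76 = row[75 - 72], row[76 - 72]
    if has_ncq(word76):
        print(name, "NCQ, queue depth", queue_depth(word75))
    else:
        print(name, "no NCQ")
```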

cheers

2006-03-01 11:37:14

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Tue, 28 Feb 2006, Mark Lord wrote:

> Eric D. Mudama wrote:
>> those drives should support all FUA opcodes properly, both queued and
>> unqueued
>
> His first drive (sda) does not support queued commands at all,
> but the newer firmware in his second drive (sdb) does support NCQ.
>
> Both drives support FUA.
>
> cheers
>

To trust or not to trust?

I have a 400GB SATA drive: WDC WD4000KD-00N. With these errors in dmesg
that have been mentioned throughout the thread, should I trust Linux using
this drive, or should I remove it/wait until a patch is released to
address this issue?

Also, it has been noted in the forums (storagereview.com, I believe) that
these drives do NOT work on the Intel ICH5 controller, and this turned out
to be true: when I put it on the Intel ICH5, the box stalls for 2-3
minutes and then does not see the drive. On the Silicon Image, Inc.
SiI 3112 chipset or the Promise SATA/150 TX2 it works okay, but it
produces those errors in dmesg.

My question: long and short SMART self-tests both report that the drive is
physically OK. Even so, should I avoid using this drive for anything
important under Linux? Comments?

Justin.

2006-03-01 13:17:46

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Tue, 28 Feb 2006, Mark Lord wrote:

> Eric D. Mudama wrote:
>> those drives should support all FUA opcodes properly, both queued and
>> unqueued
>
> His first drive (sda) does not support queued commands at all,
> but the newer firmware in his second drive (sdb) does support NCQ.
>
> Both drives support FUA.
>
> cheers
>

Could someone *PLEASE* produce a *unified* patch that is compatible with
2.6.16-rc5 or 2.6.15.4 so I can reproduce the error?

Mark had two patches, and I have had the most PIA time getting them to
apply cleanly, build properly, etc.

With 2.6.16-rc5:

# make bzImage
CHK include/linux/version.h
scripts/kconfig/conf -s arch/i386/Kconfig
#
# using defaults found in .config
#
SPLIT include/linux/autoconf.h -> include/config/*
CHK include/linux/compile.h
CHK usr/initramfs_list
GEN .version
CHK include/linux/compile.h
UPD include/linux/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
drivers/built-in.o: In function `ata_to_sense_error': undefined reference to `print'
drivers/built-in.o: In function `ata_to_sense_error': undefined reference to `print'
make: *** [.tmp_vmlinux1] Error 1
Command exited with non-zero status 2

2006-03-01 17:33:06

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Alan Cox wrote:

>On Tue, 2006-02-28 at 09:38 -0500, Mark Lord wrote:
>
>
>>The error handling still sucks, regardless of FUA.
>>All of this nonsense about "Medium Error" is pure bogosity here.
>>
>>
>
>I've flipped my tree to report Aborted Command. Not sure there is a
>better scsi sense match for "it broke and I dont know why"
>
>
As a user I prefer
It Broke And I Dont Know Why
to
Aborted Command

(honesty is the best policy)

I certainly hate Medium Error as modern hard disks seem to be flakier
than ever.

David

2006-03-01 17:40:51

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Jeff Garzik wrote:

> Mark Lord wrote:
>
>> David Greaves wrote:
>>
>>>
>>> scsi1 : sata_sil
>>> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
>>> Type: Direct-Access ANSI SCSI revision: 05
>>> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
>>> Type: Direct-Access ANSI SCSI revision: 05
>>
>>
>>
>> I wonder if the non-FUA component here is the sata_sil,
>> rather than the two Maxtor drives.
>>
>> Also, your drives have different firmware,
>> but both have trouble with FUA here.
>
>
> sata_sil is indeed a piece of hardware that needs to know the opcodes
> ahead of time...
>
> Jeff
>
I actually have 3 of those drives - one runs through sata_via and
doesn't have the same problem.

(the sata_via ones *do* have :
ata3: status=0x50 { DriveReady SeekComplete }
ata3: PIO error
problems with SMART)

David

2006-03-01 17:46:04

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
>
> I actually have 3 of those drives - one runs through sata_via and
> doesn't have the same problem.
>
> (the sata_via ones *do* have :
> ata3: status=0x50 { DriveReady SeekComplete }
> ata3: PIO error
> problems with SMART)

And once again, not enough information in the error messages
for anyone to actually do anything about it (not David's fault).

What command do you use to get that bug to pop up?

BTW:
hdparm-6.5 is now available (sourceforge),
and should show all of the fancy features
of your drives for comparison between versions.

Cheers

2006-03-01 18:12:47

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:

> David Greaves wrote:
>
>>
>> I actually have 3 of those drives - one runs through sata_via and
>> doesn't have the same problem.
>>
>> (the sata_via ones *do* have :
>> ata3: status=0x50 { DriveReady SeekComplete }
>> ata3: PIO error
>> problems with SMART)
>
>
> And once again, not enough information in the error messages
> for anyone to actually do anything about it (not David's fault).
>
> What command do you use to get that bug to pop up?

(FYI I'm running 2.6.15 with both 'info' patches 'cos I'm scared of
2.6.16-rc4!)

haze:/usr/src# smartctl -data -s on /dev/sdc
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

No messages in dmesg

haze:/usr/src# smartctl -data -o on /dev/sdc
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Error SMART Enable Automatic Offline failed: Input/output error
Smartctl: SMART Enable Automatic Offline Failed.

dmesg contains this message repeated 31 times:
ata3: PIO error
ata3: status=0x50 { DriveReady SeekComplete }

haze:/usr/src# smartctl -data -o off /dev/sdc
succeeds but gives me:

ata3: status=0x50 { DriveReady SeekComplete }
ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }
ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }
ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x04 { DriveStatusError }

haze:/usr/src# smartctl -data -o on /dev/sdd
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Error SMART Enable Automatic Offline failed: Input/output error
Smartctl: SMART Enable Automatic Offline Failed.

ata4: PIO error
ata4: status=0x50 { DriveReady SeekComplete }

# smartctl -data -o off /dev/sdd
ata4: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata4: status=0x51 { DriveReady SeekComplete Error }
ata4: error=0x04 { DriveStatusError }
ata4: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata4: status=0x51 { DriveReady SeekComplete Error }
ata4: error=0x04 { DriveStatusError }
ata4: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata4: status=0x51 { DriveReady SeekComplete Error }
ata4: error=0x04 { DriveStatusError }

haze:/usr/src# hdparm --Istdout /dev/sdc

/dev/sdc:
IO_support = 0 (default 16-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 19457/255/63, sectors = 312581808, start = 0
0c5a 3fff c837 0010 0000 0000 003f 0000
0000 0000 334a 5332 4b53 4c33 2020 2020
2020 2020 2020 2020 0000 4000 0004 332e
3138 2020 2020 5354 3331 3630 3032 3341
5320 2020 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 2f00 0000 0200 0200 0007 3fff 0010
003f fc10 00fb 0110 ffff 0fff 0000 0007
0003 0078 0078 00f0 0078 0000 0000 0000
0000 0000 0000 0000 0002 0000 0000 0000
007e 001b 346b 7d01 4003 3468 3c01 4003
407f 0000 0000 fefe 0000 0000 fe00 0000
0000 0000 0000 0000 9eb0 12a1 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0001 9eb0 12a1 9eb0 12a1 2020 0002 42b6
8000 008a 3c06 3c0a ffff 07c6 0100 0800
0ff0 1000 0002 0030 0000 0000 0000 fe06
0000 0002 0050 008a 954f 0000 0023 000b
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 7ea5

haze:/usr/src# hdparm --Istdout /dev/sdd

/dev/sdd:
IO_support = 0 (default 16-bit)
readonly = 0 (off)
readahead = 256 (on)
geometry = 24792/255/63, sectors = 398297088, start = 0
0040 3fff c837 0010 0000 0000 003f 0000
0000 0000 4234 3152 5643 3248 2020 2020
2020 2020 2020 2020 0003 4000 0004 4241
4e43 3142 5930 4d61 7874 6f72 2036 4232
3030 4d30 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 2f00 4000 0200 0000 0007 3fff 0010
003f fc10 00fb 0110 ffff 0fff 0000 0007
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 001f 0102 0000 0000 0000
00fe 001e 7c6b 7f09 4063 7c68 3e01 4063
407f 0000 0000 0000 fffe 0000 c0fe 0000
0000 0000 0000 0000 8800 17bd 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0001 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0113 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 a6a5

David

>
> BTW:
> hdparm-6.5 is now available (sourceforge),
> and should show all of the fancy features
> of your drives for comparison between versions.

OK - soonish...

2006-03-01 18:30:12

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
>
> haze:/usr/src# smartctl -data -o off /dev/sdc
> succeeds but gives me:
>
> ata3: status=0x50 { DriveReady SeekComplete }
> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata3: status=0x51 { DriveReady SeekComplete Error }
> ata3: error=0x04 { DriveStatusError }
> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata3: status=0x51 { DriveReady SeekComplete Error }
> ata3: error=0x04 { DriveStatusError }
> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata3: status=0x51 { DriveReady SeekComplete Error }
> ata3: error=0x04 { DriveStatusError }

"DriveStatusError" is "Command Aborted" in ac-speak.
From the man page for smartctl, we read:

>-o VALUE Enables or disables SMART automatic offline test ...
>Note that the SMART automatic offline test command is listed as "Obsolete" in every
>version of the ATA and ATA/ATAPI Specifications. It was originally part of the
>SFF-8035i Revision 2.0 specification, but was never part of any ATA specification.

There's a chance that your drives simply do not fully support this feature,
and are rejecting attempts to use it.
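
For reference, the status/error bytes in those lines decode bit by bit. A
rough sketch follows. The labels DriveReady, SeekComplete, Error and
DriveStatusError are the ones printed in the logs above; the other labels are
ATA register mnemonics that I am assuming here, and some of them differ
between ATA revisions. The decode helper is made up for this example.

```python
# Sketch only: decode ATA status and error register bytes into bit labels.
# Labels seen in this thread's logs are kept as printed; the rest are
# assumed ATA mnemonics and vary between revisions of the standard.

STATUS_BITS = [
    (0x80, "BSY"),
    (0x40, "DriveReady"),
    (0x20, "DF"),
    (0x10, "SeekComplete"),
    (0x08, "DRQ"),
    (0x04, "CORR"),
    (0x02, "IDX"),
    (0x01, "Error"),
]

ERROR_BITS = [
    (0x80, "ICRC"),
    (0x40, "UNC"),
    (0x20, "MC"),
    (0x10, "IDNF"),
    (0x08, "MCR"),
    (0x04, "DriveStatusError"),  # ABRT: the command was aborted
    (0x02, "NM"),
    (0x01, "AMNF"),
]


def decode(value: int, table) -> list:
    """Return the labels of all bits set in value, highest bit first."""
    return [name for mask, name in table if value & mask]


print("status=0x51", decode(0x51, STATUS_BITS))
print("status=0x50", decode(0x50, STATUS_BITS))
print("error=0x04 ", decode(0x04, ERROR_BITS))
```

So 0x51/04 is an ordinary "ready, but the last command was aborted", while
the 0x50 seen with the "PIO error" lines has no error bit set at all.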

By the way, the latest 2.6.16-rc5-git4 is available,
and has FUA turned off by default now. So it should
work with your drives, and *you* are expected to verify
that for us all now.

Cheers

-ml

2006-03-01 18:32:23

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Wed, 1 Mar 2006, Mark Lord wrote:

> David Greaves wrote:
>>
>> haze:/usr/src# smartctl -data -o off /dev/sdc
>> succeeds but gives me:
>>
>> ata3: status=0x50 { DriveReady SeekComplete }
>> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>
> "DriveStatusError" is "Command Aborted" in ac-speak.
> From the man page for smartctl, we read:
>
>> -o VALUE Enables or disables SMART automatic offline test ...
>> Note that the SMART automatic offline test command is listed as "Obsolete"
> in every
>> version of the ATA and ATA/ATAPI Specifications. It was originally part
> of the
>> SFF-8035i Revision 2.0 specification, but was never part of any ATA
> specification.
>
> There's a chance that your drives simply do not fully support this feature,
> and are rejecting attempts to use it.
>
> By the way, the latest 2.6.16-rc5-git4 is available,
> and has FUA turned off by default now. So it should
> work with your drives, and *you* are expected to verify
> that for us all now.
>
> Cheers
>
> -ml
>

When running that command, I get it too:

[4294684.510000] ACPI: PCI Interrupt 0000:02:06.0[A] -> GSI 22 (level, low) -> IRQ 17
[4294686.762000] process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
[4295292.736000] +++PATCH: Original kernel error:
[4295292.736000] ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] +++PATCH: Mark Lord's extended verbosity patch:
[4295292.736000] ata3: translated op=0x85 cmd=0xb0 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4295292.736000] ata3: error=0x04 { DriveStatusError }
[4295292.736000] +++PATCH: Original kernel error:
[4295292.736000] ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] +++PATCH: Mark Lord's extended verbosity patch:
[4295292.736000] ata3: translated op=0x85 cmd=0xb0 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4295292.736000] ata3: error=0x04 { DriveStatusError }
[4295292.736000] +++PATCH: Original kernel error:
[4295292.736000] ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] +++PATCH: Mark Lord's extended verbosity patch:
[4295292.736000] ata3: translated op=0x85 cmd=0xb0 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4295292.736000] ata3: error=0x04 { DriveStatusError }
[4295292.736000] +++PATCH: Original kernel error:
[4295292.736000] ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] +++PATCH: Mark Lord's extended verbosity patch:
[4295292.736000] ata3: translated op=0x85 cmd=0xb0 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4295292.736000] ata3: error=0x04 { DriveStatusError }
[4295292.736000] +++PATCH: Original kernel error:
[4295292.736000] ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] +++PATCH: Mark Lord's extended verbosity patch:
[4295292.736000] ata3: translated op=0x85 cmd=0xb0 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4295292.736000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4295292.736000] ata3: error=0x04 { DriveStatusError }
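
If anyone wants to compare these reports across kernels without reading them
by eye, a rough sketch of a parser for the "translated" lines follows. It
assumes each kernel message sits on one line (mail-wrapped lines have to be
rejoined first) and that the op= and cmd= fields are optional, since only the
patched kernels print them. PATTERN and parse_translated are names made up
for this example.

```python
import re

# Sketch only: pull the numeric fields out of a libata "translated" message.
PATTERN = re.compile(
    r"(?P<port>ata\d+): translated "
    r"(?:op=0x(?P<op>[0-9a-f]+) )?"
    r"(?:cmd=0x(?P<cmd>[0-9a-f]+) )?"
    r"ATA stat/err 0x(?P<stat>[0-9a-f]+)/(?P<err>[0-9a-f]+) "
    r"to SCSI SK/ASC/ASCQ "
    r"0x(?P<sk>[0-9a-f]+)/(?P<asc>[0-9a-f]+)/(?P<ascq>[0-9a-f]+)"
)

NUMERIC_FIELDS = ("op", "cmd", "stat", "err", "sk", "asc", "ascq")


def parse_translated(line: str):
    """Return a dict of fields for a 'translated' line, or None otherwise."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = {"port": match.group("port")}
    for key in NUMERIC_FIELDS:
        text = match.group(key)
        # Fields missing from unpatched kernels are reported as None.
        fields[key] = int(text, 16) if text is not None else None
    return fields


patched = ("[4295292.736000] ata3: translated op=0x85 cmd=0xb0 "
           "ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00")
print(parse_translated(patched))
```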

2006-03-01 18:35:45

by Alan

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Wed, 2006-03-01 at 17:33 +0000, David Greaves wrote:
> As a user I prefer
> It Broke And I Dont Know Why
> to
> Aborted Command

So what's the SCSI sense encoding for that?

2006-03-01 18:48:40

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:

> By the way, the latest 2.6.16-rc5-git4 is available,
> and has FUA turned off by default now. So it should
> work with your drives, and *you* are expected to verify
> that for us all now.

Yeah, I know - I've got it on the machine... but it's my wife's machine.
I've asked nicely but she's editing a Hercule Poirot video so I'm not
allowed to reboot it for a while...

I've told her I'm not making pancakes until I've tested it so expect a
report Real Soon Now...

David

2006-03-01 19:02:31

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

> those drives should support all FUA opcodes properly, both queued and unqueued
>
> On 2/28/06, Jeff Garzik <[email protected]> wrote:
> > Mark Lord wrote:
> > > David Greaves wrote:
> > >
> > >>
> > >> scsi1 : sata_sil
> > >> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
> > >> Type: Direct-Access ANSI SCSI revision: 05
> > >> Vendor: ATA Model: Maxtor 6B200M0 Rev: BANC
> > >> Type: Direct-Access ANSI SCSI revision: 05

How about the drives that got blacklisted following :
http://bugzilla.kernel.org/show_bug.cgi?id=5914 ?
and
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951 ?

Device Model: Maxtor 6L300S0
Firmware Version: BANC1G10

on

Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)

Regards,

--
Nicolas Mailhot

Attachments:

signature.asc (199.00 B)
This is a digitally signed message part

2006-03-01 19:06:16

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Wed, 1 Mar 2006, Mark Lord wrote:

> David Greaves wrote:
>>
>> haze:/usr/src# smartctl -data -o off /dev/sdc
>> succeeds but gives me:
>>
>> ata3: status=0x50 { DriveReady SeekComplete }
>> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>> ata3: translated op=0x85 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata3: status=0x51 { DriveReady SeekComplete Error }
>> ata3: error=0x04 { DriveStatusError }
>
> "DriveStatusError" is "Command Aborted" in ac-speak.
> From the man page for smartctl, we read:
>
>> -o VALUE Enables or disables SMART automatic offline test ...
>> Note that the SMART automatic offline test command is listed as "Obsolete"
> in every
>> version of the ATA and ATA/ATAPI Specifications. It was originally part
> of the
>> SFF-8035i Revision 2.0 specification, but was never part of any ATA
> specification.
>
> There's a chance that your drives simply do not fully support this feature,
> and are rejecting attempts to use it.
>
> By the way, the latest 2.6.16-rc5-git4 is available,
> and has FUA turned off by default now. So it should
> work with your drives, and *you* are expected to verify
> that for us all now.
>
> Cheers
>
> -ml
>

> By the way, the latest 2.6.16-rc5-git4 is available,

I am using 2.6.16-rc5-git4, and after running:

# smartctl -data -o off /dev/sdc

I get:

[4294785.192000] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4294785.192000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4294785.192000] ata3: error=0x04 { DriveStatusError }
[4294785.192000] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4294785.192000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4294785.192000] ata3: error=0x04 { DriveStatusError }
[4294785.192000] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4294785.192000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4294785.192000] ata3: error=0x04 { DriveStatusError }
[4294785.192000] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4294785.192000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4294785.192000] ata3: error=0x04 { DriveStatusError }
[4294785.192000] ata3: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
[4294785.192000] ata3: status=0x51 { DriveReady SeekComplete Error }
[4294785.192000] ata3: error=0x04 { DriveStatusError }

Did you mean you wanted us to test it like we normally do, ie, copy
files/md5sum them on the disk and see if we can make it occur again, or?

Justin.

2006-03-01 19:22:11

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Nicolas Mailhot wrote:
>>
> How about the drives that got blacklisted following :
> http://bugzilla.kernel.org/show_bug.cgi?id=5914 ?
> and
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951 ?
>
> Device Model: Maxtor 6L300S0
> Firmware Version: BANC1G10
>
> on Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)

Mmm.. somebody with one of those controllers should check
to see if *any* drives work with FUA, and blacklist the controller
instead of the drives if everything is failing.

Cheers

2006-03-01 19:27:49

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
>
> I am using 2.6.16-rc5-git4, and after running:
>
> # smartctl -data -o off /dev/sdc
>
> I get:
>
> [4294785.192000] ata3: translated ATA stat/err 0x51/04 to SCSI
> SK/ASC/ASCQ 0xb/00/00
> [4294785.192000] ata3: status=0x51 { DriveReady SeekComplete Error }
> [4294785.192000] ata3: error=0x04 { DriveStatusError }

That's probably just your drive reporting "unsupported sub-command".
Nothing serious -- the man page for smartctl even mentions the possibility.

Cheers

2006-03-01 19:35:25

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
>
> Did you mean you wanted us to test it like we normally do, ie, copy
> files/md5sum them on the disk and see if we can make it occur again, or?

Yes. The S.M.A.R.T. stuff doesn't matter nearly as much as normal I/O.

And Justin, can you get those S.M.A.R.T. errors to pop up on 2.6.15 as well?

2006-03-01 19:38:18

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Wed, 1 Mar 2006, Mark Lord wrote:

> Justin Piszcz wrote:
>>
>> Did you mean you wanted us to test it like we normally do, ie, copy
>> files/md5sum them on the disk and see if we can make it occur again, or?
>
> Yes. The S.M.A.R.T. stuff doesn't matter nearly as much as normal I/O.
>
> And Justin, can you get those S.M.A.R.T. errors to pop up on 2.6.15 as well?
>

I have not tested that; I can test later if necessary. Right now I am
running some I/O tests against the disk to see if I can get it to error
again with 2.6.16-rc5-git4, which is probably going to take quite a while.

Justin.

2006-03-01 19:42:12

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
>
>
> On Wed, 1 Mar 2006, Mark Lord wrote:
>
>> Justin Piszcz wrote:
>>
>>>
>>> Did you mean you wanted us to test it like we normally do, ie, copy
>>> files/md5sum them on the disk and see if we can make it occur again, or?
>>
>>
>> Yes. The S.M.A.R.T. stuff doesn't matter nearly as much as normal I/O.
>>
>> And Justin, can you get those S.M.A.R.T. errors to pop up on 2.6.15 as
>> well?
>>
>
> Have not tested, can test later if necessary, running some I/O tests to
> the disk which is probably going to take quite a while to see if I can
> get it to error again with 2.6.16-rc5-git4.

If there are FUA problems, it would be immediately apparent on the first
write...

Jeff

2006-03-01 19:48:50

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:

>Mark Lord wrote:
>
>
>
>>By the way, the latest 2.6.16-rc5-git4 is available,
>>and has FUA turned off by default now. So it should
>>work with your drives, and *you* are expected to verify
>>that for us all now.
>>
>>
>Yeah, I know - I've got it on the machine... but it's my wife's machine.
>I've asked nicely but she's editing a Hercule Poirot video so I'm not
>allowed to reboot it for a while...
>
>I've told her I'm not making pancakes until I've tested it so expect a
>report Real Soon Now...
>
>
OK that worked (the pancakes - the kernel's not doing so well...)

haze:~# uname -a
Linux haze 2.6.16-rc5-git4 #2 PREEMPT Wed Mar 1 19:07:58 UTC 2006 i686 GNU/Linux

The boot is pretty clean.
I ran an xfs_repair -n on the LVM volume and got the following errors.
The repair reported a clean filesystem and the drive was not kicked out of
the RAID array, so that's a big improvement.

I was not able to trigger similar messages on ata1, but a simple dd
doesn't trigger the messages on ata2 either. (For various reasons,
xfs_repair wouldn't run on ata1; I thought I'd leave it and report this
first.)

ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: error=0x04 { DriveStatusError }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }
ata2: no sense translation for status: 0x51
ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata2: status=0x51 { DriveReady SeekComplete Error }

David

--

2006-03-01 20:14:11

by Phillip Susi

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

Alan Cox wrote:
> On Wed, 2006-03-01 at 17:33 +0000, David Greaves wrote:
>> As a user I prefer
>> It Broke And I Dont Know Why
>> to
>> Aborted Command
>
> So whats the SCSI sense encoding for that ?
>

Wouldn't that just be 0/0/0? IIRC the standard defines that as "NO
ADDITIONAL SENSE DATA" which sounds to me like another way of saying "I
don't know what went wrong, but that didn't work".
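
For comparison, here are the sense triples that have come up in this thread
next to the 0/0/0 suggestion, as a small lookup sketch. The tables are partial
and the names are the SPC ones as far as I recall them (the standard's wording
for 00/00 is "NO ADDITIONAL SENSE INFORMATION"); describe_sense is a name made
up for the example.

```python
# Sketch only: name a SCSI (sense key, ASC, ASCQ) triple.
# Partial tables covering just the values discussed in this thread.

SENSE_KEYS = {
    0x0: "NO SENSE",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0xB: "ABORTED COMMAND",
}

ADDITIONAL_SENSE = {
    (0x00, 0x00): "NO ADDITIONAL SENSE INFORMATION",
    (0x11, 0x04): "UNRECOVERED READ ERROR - AUTO REALLOCATE FAILED",
}


def describe_sense(sk: int, asc: int, ascq: int) -> str:
    key = SENSE_KEYS.get(sk, "unknown sense key")
    extra = ADDITIONAL_SENSE.get((asc, ascq), "unknown ASC/ASCQ")
    return f"{key}: {extra}"


for triple in ((0xB, 0x00, 0x00), (0x3, 0x11, 0x04), (0x0, 0x00, 0x00)):
    print("0x%x/%02x/%02x" % triple, "->", describe_sense(*triple))
```

Note that 0/0/0 carries sense key NO SENSE, so whether upper layers would
treat it as a failure at all is a separate question from how it reads.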

2006-03-01 23:14:06

[permalink] [raw]

Subject: Re: LibPATA code issues / 2.6.15.4

On Wednesday, 1 March 2006 at 14:22 -0500, Mark Lord wrote:
> Nicolas Mailhot wrote:
> >>
> > How about the drives that got blacklisted following :
> > http://bugzilla.kernel.org/show_bug.cgi?id=5914 ?
> > and
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951 ?
> >
> > Device Model: Maxtor 6L300S0
> > Firmware Version: BANC1G10
> >
> > on Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
>
> Mmm.. somebody with one of those controllers should check
> to see if *any* drives work with FUA, and blacklist the controller
> instead of the drives if everything is failing.

I'm someone with such a controller (that's my bug there),
but I only have these drives,
so I can only confirm that the combo is deadly.
(I could possibly try to plug one into the nforce4 controller; not sure if
extracting the box from the tangle of cables and hardware it's part of
is worth it. sata_nv is rev-eng, while the siI docs are public, right?)

I do suspect Eric D. Mudama knows if the problem is on the hard-drive
side though

Regards,

--
Nicolas Mailhot

2006-03-01 23:31:59

Subject: Re: LibPATA code issues / 2.6.15.4

Nicolas Mailhot wrote:
> is worth it. sata_nv is rev-eng, while the siI docs are public, right?)

sata_nv was written by NVIDIA.

Jeff

2006-03-02 01:19:32

Subject: Re: LibPATA code issues / 2.6.15.4

On 3/1/06, Nicolas Mailhot wrote:
> On Wednesday, 1 March 2006 at 14:22 -0500, Mark Lord wrote:
> > Nicolas Mailhot wrote:
> > >>
> > > How about the drives that got blacklisted following :
> > > http://bugzilla.kernel.org/show_bug.cgi?id=5914 ?
> > > and
> > > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951 ?
> > >
> > > Device Model: Maxtor 6L300S0
> > > Firmware Version: BANC1G10
> > >
> > > on Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
> >
> > Mmm.. somebody with one of those controllers should check
> > to see if *any* drives work with FUA, and blacklist the controller
> > instead of the drives if everything is failing.
>
> I'm someone with such a controller (that's my bug there),
> but I only have these drives,
> so I can only confirm that the combo is deadly.
> (I could possibly try to plug one into the nforce4 controller; not sure if
> extracting the box from the tangle of cables and hardware it's part of
> is worth it. sata_nv is rev-eng, while the siI docs are public, right?)
>
> I do suspect Eric D. Mudama knows if the problem is on the hard-drive
> side though
>
> Regards,
>
> --
> Nicolas Mailhot
>

I didn't know offhand so we plugged in a bus analyzer and took a look
here in the lab... We didn't have a 3114 lying around, but issuing the
Write DMA FUA (0x3D) opcode on a 3112 resulted in a D0h soft hang. I
think they're related (4-port vs 2-port).

Looking at the bus trace, the command is issued on the SATA bus, the
drive generates a DMA Activate FIS which is accepted by the 3112, and
then the 3112 generates a Data Payload FIS (46h) with no contents.

The first DWORD of the payload is a HOLD primitive, to which the
device promptly responds with HOLDA, and the two are in a soft bus
lock and will sit forever. No data is ever generated by the host
(stopped capture after 4 seconds).

I believe this core should not be part of the FUA whitelist. If I
remember correctly, there are other implementations out there with
similar limitations to opcodes this "new" to ATA.

--eric

2006-03-02 01:39:47

Subject: Re: LibPATA code issues / 2.6.15.4

On 3/1/06, Eric D. Mudama wrote:
> I believe this core should not be part of the FUA whitelist. If I
> remember correctly, there are other implementations out there with
> similar limitations to opcodes this "new" to ATA.

That being said, I see from

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951

that a blacklisting of some Maxtor drives for this issue has
supposedly occurred or been pushed and accepted "upstream" in git ....
For the obvious (selfish) reasons, I'd like to minimize the number of
Maxtor drives that are blacklisted, as I don't believe this is a drive
issue at all.

If there's a drive model out there reporting support for FUA but
screwing it up, I'm all ears as that's something I need to know about.
If basic adapter functional testing is required for some of these
low-level commands, then that might be something I can help with too
(on a very limited scale), since we have access to ~100 different
chipsets.

--eric

2006-03-02 01:56:21

Subject: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Eric D. Mudama wrote:
> I didn't know offhand so we plugged in a bus analyzer and took a look
> here in the lab... We didn't have a 3114 lying around, but issuing the
> Write DMA FUA (0x3D) opcode on a 3112 resulted in a D0h soft hang. I
> think they're related (4-port vs 2-port).

Looking at the public docs posted at
http://gkernel.sourceforge.net/specs/sii/ ... FUA is not in the list of
supported opcodes (Table 10-1).

The 311x does have a facility that allows the driver to specify the
command protocol associated with an unknown-to-the-chip opcode. Someone
sufficiently interested could investigate using the VS Unlock and VS Set
Command Protocol commands to patch in support (section 10.4.*).

For libata, I think an ATA_FLAG_NO_FUA would be appropriate for
situations like this... assume FUA is supported in the controller, and
set a flag where it is not. Most chips will support FUA, either by
design or by sheer luck. The ones that do not support FUA are the
controllers that snoop the ATA command opcode, and internally choose the
protocol based on that opcode. For such hardware, unknown opcodes will
inevitably cause problems.
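
A minimal sketch of how such a flag could gate FUA, assuming hypothetical names throughout: the sketch_port and sketch_dev structs, the sketch_fua_usable() helper and the bit value given to ATA_FLAG_NO_FUA are invented for illustration and are not taken from libata.

```c
#include <stdbool.h>

/* Hypothetical flag bit; the value is chosen only for this sketch. */
#define ATA_FLAG_NO_FUA (1UL << 0)

/* Stand-in for per-port controller state (not the real libata struct). */
struct sketch_port {
    unsigned long flags;   /* set by the controller driver at probe time */
};

/* Stand-in for per-device state (not the real libata struct). */
struct sketch_dev {
    bool reports_fua;      /* drive advertises FUA support */
    bool blacklisted;      /* drive model is on a FUA blacklist */
};

/*
 * Assume the controller passes FUA opcodes through unless its driver
 * says otherwise, then fall back to what the drive itself reports.
 */
bool sketch_fua_usable(const struct sketch_port *ap,
                       const struct sketch_dev *dev)
{
    if (ap->flags & ATA_FLAG_NO_FUA)
        return false;      /* controller snoops opcodes and chokes on FUA */
    if (dev->blacklisted)
        return false;      /* drive known to mishandle FUA */
    return dev->reports_fua;
}
```

The point of putting the check on the port rather than the device is that a drive which handles FUA correctly stays usable with FUA on any other controller.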

Jeff

2006-03-02 01:58:10

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Jeff Garzik wrote:
> For libata, I think an ATA_FLAG_NO_FUA would be appropriate for
> situations like this... assume FUA is supported in the controller, and
> set a flag where it is not. Most chips will support FUA, either by
> design or by sheer luck. The ones that do not support FUA are the
> controllers that snoop the ATA command opcode, and internally choose the
> protocol based on that opcode. For such hardware, unknown opcodes will
> inevitably cause problems.

This also begs the question... what controller was being used, when the
single Maxtor device listed in the blacklist was added? Perhaps it was
a problem with the controller, not the device.

Jeff

2006-03-02 02:20:26

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On 3/1/06, Jeff Garzik wrote:
> This also begs the question... what controller was being used, when the
> single Maxtor device listed in the blacklist was added? Perhaps it was
> a problem with the controller, not the device.
>
> Jeff

As reported here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951

the controller was a 3114, and the bug was "fixed" by blacklisting his
Maxtor drive's FUA support. I'd like Maxtor drives to be
un-blacklisted if possible.

--eric

2006-03-02 02:46:28

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Eric D. Mudama wrote:
> On 3/1/06, Jeff Garzik wrote:
>
>>This also begs the question... what controller was being used, when the
>>single Maxtor device listed in the blacklist was added? Perhaps it was
>>a problem with the controller, not the device.
>>
>> Jeff
>
>
> As reported here:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951
>
> the controller was a 3114, and the bug was "fixed" by blacklisting his
> Maxtor drive's FUA support. I'd like Maxtor drives to be
> un-blacklisted if possible.

If it's a 3114 I agree un-blacklisting is the way to go... but it's not
clear to me whether the problematic configuration included sata_sil or
sata_nv. Since I'm apparently blind :) which part of the bug points
conclusively to sata_sil?

Jeff

2006-03-02 03:00:21

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On 3/1/06, Jeff Garzik wrote:
> Eric D. Mudama wrote:
> > On 3/1/06, Jeff Garzik wrote:
> >
> >>This also begs the question... what controller was being used, when the
> >>single Maxtor device listed in the blacklist was added? Perhaps it was
> >>a problem with the controller, not the device.
> >>
> >> Jeff
> >
> >
> > As reported here:
> >
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951
> >
> > the controller was a 3114, and the bug was "fixed" by blacklisting his
> > Maxtor drive's FUA support. I'd like Maxtor drives to be
> > un-blacklisted if possible.
>
> If it's a 3114 I agree un-blacklisting is the way to go... but it's not
> clear to me whether the problematic configuration included sata_sil or
> sata_nv. Since I'm apparently blind :) which part of the bug points
> conclusively to sata_sil?
>
> Jeff

The "failing dmesg" has the plextor connected to sata_nv, and the two
Maxtor drives connected to sata_sil, if I read it correctly. They're
ata5/ata6 ports, mapped as sda/sdb.

Nicolas' comment in the thread "Re: LibPATA code issues / 2.6.15.4"
seemed to say it was the same adapter:

http://marc.theaimsgroup.com/?l=linux-kernel&m=114123989405668&w=2

--eric

2006-03-02 03:06:35

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Eric D. Mudama wrote:
> The "failing dmesg" has the plextor connected to sata_nv, and the two
> Maxtor drives connected to sata_sil, if I read it correctly. They're
> ata5/ata6 ports, mapped as sda/sdb.
>
> Nicolas' comment in the thread "Re: LibPATA code issues / 2.6.15.4"
> seemed to say it was the same adapter:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=114123989405668&w=2

Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is the
way to go...

Jeff

2006-03-02 03:13:49

by Tejun Heo

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Jeff Garzik wrote:
> Eric D. Mudama wrote:
>
>> The "failing dmesg" has the plextor connected to sata_nv, and the two
>> Maxtor drives connected to sata_sil, if I read it correctly. They're
>> ata5/ata6 ports, mapped as sda/sdb.
>>
>> Nicolas' comment in the thread "Re: LibPATA code issues / 2.6.15.4"
>> seemed to say it was the same adapter:
>>
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=114123989405668&w=2
>
>
> Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is the
> way to go...
>

Agreed. I'm currently implementing VDMA on sata_sil and will get to FUA
via explicit protocol soon.

--
tejun

2006-03-02 03:15:57

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Jeff Garzik wrote:
..
> Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is the
> way to go...

Might as well add sata_mv to that blacklist as well.

And while I'm at it, the pdc_adma and sata_qstor controllers/drivers are fine with FUA.

-ml

2006-03-02 03:18:30

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Mark Lord wrote:
> Jeff Garzik wrote:
> ..
>
>> Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is
>> the way to go...
>
>
> Might as well add sata_mv to that blacklist as well.

Have you confirmed that it doesn't work with FUA?

We recently patched sata_mv to add ATA_CMD_WRITE_FUA_EXT, in response to
a nasty bug report, and ISTR the complainer went away.

> And while I'm at it, the pdc_adma and sata_qstor controllers/drivers are
> fine with FUA.

Verified or just guessing?

Jeff

2006-03-02 06:23:39

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On 3/1/06, Jeff Garzik wrote:
> Mark Lord wrote:
> > Jeff Garzik wrote:
> > ..
> >
> >> Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is
> >> the way to go...
> >
> >
> > Might as well add sata_mv to that blacklist as well.
>
> Have you confirmed that it doesn't work with FUA?

I'll see if I can find one of these around the lab tomorrow and test
the raw command support. If that's fine at a basic level, it might be
a bug in the driver?

2006-03-02 07:23:16

by Jens Axboe

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On Wed, Mar 01 2006, Jeff Garzik wrote:
> Jeff Garzik wrote:
> >For libata, I think an ATA_FLAG_NO_FUA would be appropriate for
> >situations like this... assume FUA is supported in the controller, and
> >set a flag where it is not. Most chips will support FUA, either by
> >design or by sheer luck. The ones that do not support FUA are the
> >controllers that snoop the ATA command opcode, and internally choose the
> >protocol based on that opcode. For such hardware, unknown opcodes will
> >inevitably cause problems.
>
> This also begs the question... what controller was being used, when the
> single Maxtor device listed in the blacklist was added? Perhaps it was
> a problem with the controller, not the device.

Yeah which explains it a lot better as well... The FUA drive problem
never made much sense to me.

--
Jens Axboe

2006-03-02 08:57:34

by Sander

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Jeff Garzik wrote (ao):
> Mark Lord wrote:
> >Jeff Garzik wrote:
> >..
> >
> >>Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is
> >>the way to go...
> >
> >
> >Might as well add sata_mv to that blacklist as well.
>
> Have you confirmed that it doesn't work with FUA?
>
> We recently patched sata_mv to add ATA_CMD_WRITE_FUA_EXT, in response to
> a nasty bug report, and ISTR the complainer went away.

That is correct. I was that complainer and reported that the patch works
for me: http://lkml.org/lkml/2006/2/15/175

Also, the patch went into the next -rc kernel that time.

Sander

PS, can I get you guys interested in the sata_mv driver? I would really
love to use this Marvell controller:
http://www.ussg.iu.edu/hypermail/linux/kernel/0602.2/0914.html

I'd be very happy to test any patches and will report how they do.

--
Humilis IT Services and Solutions
http://www.humilis.net

2006-03-02 09:01:00

by Sander

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Eric D. Mudama wrote (ao):
> On 3/1/06, Jeff Garzik wrote:
> > Mark Lord wrote:
> > > Jeff Garzik wrote:
> > > ..
> > >
> > >> Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is
> > >> the way to go...
> > >
> > >
> > > Might as well add sata_mv to that blacklist as well.
> >
> > Have you confirmed that it doesn't work with FUA?
>
> I'll see if I can find one of these around the lab tomorrow and test
> the raw command support. If that's fine at a basic level, it might be
> a bug in the driver?

If you tell me what to do (what to type in etc) I can save you from
looking for one. I have a:

Marvell Technology Group Ltd. MV88SX6081 8-port SATA II PCI-X Controller
(rev 09)

I can connect a Maxtor MaXLine Pro 500, a Maxtor DiamondMax11 and a WD
Raptor 74GB to test if necessary.

Sander

--
Humilis IT Services and Solutions
http://www.humilis.net

2006-03-02 11:53:04

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Eric D. Mudama wrote:
> On 3/1/06, Jeff Garzik wrote:
>
>>Mark Lord wrote:
>>
>>>Jeff Garzik wrote:
>>>..
>>>
>>>
>>>>Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is
>>>>the way to go...
>>>
>>>
>>>Might as well add sata_mv to that blacklist as well.
>>
>>Have you confirmed that it doesn't work with FUA?
>
>
> I'll see if I can find one of these around the lab tomorrow and test
> the raw command support. If that's fine at a basic level, it might be
> a bug in the driver?

Quite possibly. Anything goes with sata_mv at the moment... I've done
my best to cover most of the errata and get it working, but there are
still some key errata workarounds missing. It's still marked "HIGHLY
EXPERIMENTAL" in the Kconfig ;-)

Jeff

2006-03-02 16:06:32

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On Thursday, 2 March 2006 at 02:58, Jeff Garzik wrote:
> Jeff Garzik wrote:
>> For libata, I think an ATA_FLAG_NO_FUA would be appropriate for
>> situations like this... assume FUA is supported in the controller, and
>> set a flag where it is not. Most chips will support FUA, either by
>> design or by sheer luck. The ones that do not support FUA are the
>> controllers that snoop the ATA command opcode, and internally choose the
>> protocol based on that opcode. For such hardware, unknown opcodes will
>> inevitably cause problems.
>
> This also begs the question... what controller was being used, when the
> single Maxtor device listed in the blacklist was added? Perhaps it was
> a problem with the controller, not the device.

The controller in the bugzilla entry is a SiI 3114.
It was a quick fix and I did expect a more thorough investigation later
(probably in the 2.6.17 time frame). Though it seems FUA-related problems are so
numerous that FUA itself will be blacklisted for 2.6.16, so the limited
blacklist is no longer needed.

The thread leading to the blacklist is referenced in the bugzilla entry.

--
Nicolas Mailhot

2006-03-02 16:08:20

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On Thursday, 2 March 2006 at 03:46, Jeff Garzik wrote:
> Eric D. Mudama wrote:

> If it's a 3114 I agree un-blacklisting is the way to go... but it's not
> clear to me whether the problematic configuration included sata_sil or
> sata_nv. Since I'm apparently blind :) which part of the bug points
> conclusively to sata_sil?

It's sata-sil
I'm 100% sure it's how I cabled the system
sata-nv only got a plextor drive attached
(pata-nv has two pata drives on too)

--
Nicolas Mailhot

2006-03-02 16:11:15

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On Thursday, 2 March 2006 at 04:00, Eric D. Mudama wrote:

> The "failing dmesg" has the plextor connected to sata_nv, and the two
> Maxtor drives connected to sata_sil, if I read it correctly. They're
> ata5/ata6 ports, mapped as sda/sdb.
>
> Nicolas' comment in the thread "Re: LibPATA code issues / 2.6.15.4"
> seemed to say it was the same adapter:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=114123989405668&w=2

Not only is it the same adapter model, but we're talking about the same
physical system. I opened the original bug, posted on lkml, etc.

--
Nicolas Mailhot

2006-03-02 16:14:52

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On Thursday, 2 March 2006 at 04:06, Jeff Garzik wrote:
> Eric D. Mudama wrote:
>> The "failing dmesg" has the plextor connected to sata_nv, and the two
>> Maxtor drives connected to sata_sil, if I read it correctly. They're
>> ata5/ata6 ports, mapped as sda/sdb.
>>
>> Nicolas' comment in the thread "Re: LibPATA code issues / 2.6.15.4"
>> seemed to say it was the same adapter:
>>
>> http://marc.theaimsgroup.com/?l=linux-kernel&m=114123989405668&w=2
>
> Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is the
> way to go...

Please add the ATA_FLAG_NO_FUA flag first and only *afterwards* unblacklist
the drive, as I have no wish to go through fsck stress again.

--
Nicolas Mailhot

2006-03-02 16:20:01

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

On Thursday, 2 March 2006 at 03:20, Eric D. Mudama wrote:
> On 3/1/06, Jeff Garzik wrote:
>> This also begs the question... what controller was being used, when the
>> single Maxtor device listed in the blacklist was added? Perhaps it was
>> a problem with the controller, not the device.
>>
>> Jeff
>
> As reported here:
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=177951
>
> the controller was a 3114, and the bug was "fixed" by blacklisting his
> Maxtor drive's FUA support. I'd like Maxtor drives to be
> un-blacklisted if possible.

BTW Eric, you should know:
- these specific drives (and the Maxtor PATA drives they replaced) were
bought because I knew you were hanging around on the lists
- I fully intended to ask you if the blacklisting was valid after the
FUA dust had settled a little

Regards,

--
Nicolas Mailhot

2006-03-02 16:38:00

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Nicolas Mailhot wrote:
> The controller in the bugzilla entry is a SiI 3114.
> It was a quick fix and I did expect more thorough investigation later
> (probably 2.6.17 frame). Though it seems FUA-related problems are so
> numerous FUA itself will be blacklisted for 2.6.16, so the limited
> blacklist is no longer needed.

Well, we're looking for a long term solution :)

Disabling FUA by default in 2.6.16 is a temporary solution.

Jeff

2006-03-03 00:34:28

Subject: Re: FUA and 311x (was Re: LibPATA code issues / 2.6.15.4)

Jeff Garzik wrote:
> Mark Lord wrote:
>> Jeff Garzik wrote:
>> ..
>>
>>> Sounds like un-blacklisting the drive, and adding ATA_FLAG_NO_FUA is
>>> the way to go...
>>
>>
>> Might as well add sata_mv to that blacklist as well.
>
> Have you confirmed that it doesn't work with FUA?

Ooops. Defective memory here.

The Marvell documentation for the 6081/6041 does indeed state
that the FUA DMA commands *are* supported (queued or non-queued).

So it should be okay, at least for those two specific chips.

2006-03-03 19:38:36

Subject: Re: LibPATA code issues / 2.6.15.4

On Wed, 1 Mar 2006, David Greaves wrote:

> David Greaves wrote:
>
>> Mark Lord wrote:
>>
>>
>>
>>> By the way, the latest 2.6.16-rc5-git4 is available,
>>> and has FUA turned off by default now. So it should
>>> work with your drives, and *you* are expected to verify
>>> that for us all now.
>>>
>>>
>> Yeah, I know - I've got it on the machine... but it's my wife's machine.
>> I've asked nicely but she's editing a Hercule Poirot video so I'm not
>> allowed to reboot it for a while...
>>
>> I've told her I'm not making pancakes until I've tested it so expect a
>> report Real Soon Now...
>>
>>
> OK that worked (the pancakes - the kernel's not doing so well...)
>
> haze:~# uname -a
> Linux haze 2.6.16-rc5-git4 #2 PREEMPT Wed Mar 1 19:07:58 UTC 2006 i686
> GNU/Linux
>
> The boot is pretty clean.
> I ran an xfs_repair -n on the lvm volume and got the following errors.
> The repair reported a clean filesystem and the drive was not booted from
> the raid so that's a big improvement.
>
> I was not able to trigger similar messages on ata1 but a simple dd
> doesn't trigger the messages on ata2 either (and for various reasons,
> xfs_repair wouldn't run on ata1 - I thought I'd leave it and report this
> first)
>
> ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: error=0x04 { DriveStatusError }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
>
> David
>
> --
>

As of 2.6.16-rc5-git4, I have written 281GB so far over a period of 48+
hours with no errors yet :)

Will keep you updated if I see any errors, but so far, so good!

Thanks,

Justin.

2006-03-03 22:45:47

Subject: Re: LibPATA code issues / 2.6.15.4

Just FYI - I'm away (in Canada) for 2 weeks so can't do any additional
testing until I return.

David

--

2006-03-04 14:26:06

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:
> Just FYI - I'm away (in Canada) for 2 weeks so can't do any additional
> testing until I return.

Am I correct, in that your last test on rc5-git4 was a failure?

But without the "opcode" display in the error messages,
so we have no idea exactly what caused the errors (again!)?

[Whatcha doin up here?]

Cheers

2006-03-05 11:43:35

Subject: Re: LibPATA code issues / 2.6.15.4

On Wed, 1 Mar 2006, David Greaves wrote:

> David Greaves wrote:
>
>> Mark Lord wrote:
>>
>>
>>
>>> By the way, the latest 2.6.16-rc5-git4 is available,
>>> and has FUA turned off by default now. So it should
>>> work with your drives, and *you* are expected to verify
>>> that for us all now.
>>>
>>>
>> Yeah, I know - I've got it on the machine... but it's my wife's machine.
>> I've asked nicely but she's editing a Hercule Poirot video so I'm not
>> allowed to reboot it for a while...
>>
>> I've told her I'm not making pancakes until I've tested it so expect a
>> report Real Soon Now...
>>
>>
> OK that worked (the pancakes - the kernel's not doing so well...)
>
> haze:~# uname -a
> Linux haze 2.6.16-rc5-git4 #2 PREEMPT Wed Mar 1 19:07:58 UTC 2006 i686
> GNU/Linux
>
> The boot is pretty clean.
> I ran an xfs_repair -n on the lvm volume and got the following errors.
> The repair reported a clean filesystem and the drive was not booted from
> the raid so that's a big improvement.
>
> I was not able to trigger similar messages on ata1 but a simple dd
> doesn't trigger the messages on ata2 either (and for various reasons,
> xfs_repair wouldn't run on ata1 - I thought I'd leave it and report this
> first)
>
> ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: error=0x04 { DriveStatusError }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
> ata2: no sense translation for status: 0x51
> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
> ata2: status=0x51 { DriveReady SeekComplete Error }
>
> David
>
> --
>

Using 2.6.16-rc5-git4 and removing a directory of around 5.0GB of files
while streaming a 1MB/s video stream on another (SATA disk), the I/O
seemed to freeze up for a moment and I got this error:

[4342671.839000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x22

Only 1 in dmesg, any idea what causes this error?
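
For reference, the stat byte in these messages can be unpacked bit by bit. Below is a minimal sketch, assuming the classic ATA status register layout; the function and table names are hypothetical, and the labels for bits that do not appear in the logs above are my own guesses at naming, not necessarily what libata prints.

```c
#include <stddef.h>
#include <stdio.h>

struct bit_name {
    unsigned char mask;
    const char *name;
};

/* ATA status register bits, highest bit first. */
static const struct bit_name status_bits[] = {
    { 0x80, "Busy" },
    { 0x40, "DriveReady" },
    { 0x20, "DeviceFault" },
    { 0x10, "SeekComplete" },
    { 0x08, "DataRequest" },
    { 0x04, "CorrectedError" },
    { 0x02, "Index" },
    { 0x01, "Error" },
};

/* Write the space-separated names of all set bits into buf. */
void decode_ata_status(unsigned char stat, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof status_bits / sizeof status_bits[0]; i++) {
        int n;

        if (!(stat & status_bits[i].mask))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " " : "", status_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;          /* output truncated, stop here */
        used += (size_t)n;
    }
}
```

Decoding 0x51 gives DriveReady, SeekComplete and Error, matching the earlier log lines. Decoding 0x50 gives only DriveReady and SeekComplete: the error bit is clear, so in the timeout case the drive itself was not reporting a failed command.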

2006-03-05 12:41:32

Subject: Re: LibPATA code issues / 2.6.15.4

On Sun, 5 Mar 2006, Justin Piszcz wrote:

> On Wed, 1 Mar 2006, David Greaves wrote:
>
>> David Greaves wrote:
>>
>>> Mark Lord wrote:
>>>
>>>
>>>
>>>> By the way, the latest 2.6.16-rc5-git4 is available,
>>>> and has FUA turned off by default now. So it should
>>>> work with your drives, and *you* are expected to verify
>>>> that for us all now.
>>>>
>>>>
>>> Yeah, I know - I've got it on the machine... but it's my wife's machine.
>>> I've asked nicely but she's editing a Hercule Poirot video so I'm not
>>> allowed to reboot it for a while...
>>>
>>> I've told her I'm not making pancakes until I've tested it so expect a
>>> report Real Soon Now...
>>>
>>>
>> OK that worked (the pancakes - the kernel's not doing so well...)
>>
>> haze:~# uname -a
>> Linux haze 2.6.16-rc5-git4 #2 PREEMPT Wed Mar 1 19:07:58 UTC 2006 i686
>> GNU/Linux
>>
>> The boot is pretty clean.
>> I ran an xfs_repair -n on the lvm volume and got the following errors.
>> The repair reported a clean filesystem and the drive was not booted from
>> the raid so that's a big improvement.
>>
>> I was not able to trigger similar messages on ata1 but a simple dd
>> doesn't trigger the messages on ata2 either (and for various reasons,
>> xfs_repair wouldn't run on ata1 - I thought I'd leave it and report this
>> first)
>>
>> ata2: translated ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: error=0x04 { DriveStatusError }
>> ata2: no sense translation for status: 0x51
>> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for status: 0x51
>> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for status: 0x51
>> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for status: 0x51
>> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for status: 0x51
>> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for status: 0x51
>> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>> ata2: no sense translation for status: 0x51
>> ata2: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
>> ata2: status=0x51 { DriveReady SeekComplete Error }
>>
>> David
>>
>> --
>>
>
> Using 2.6.16-rc5-git4 and removing a directory of around 5.0GB of files while
> streaming a 1MB/s video stream on another (SATA disk), the I/O seemed to
> freeze up for a moment and I got this error:
>
> [4342671.839000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x22
>
> Only 1 in dmesg, any idea what causes this error?
>
>

The drive it occurred on was a 74GB raptor on an ICH5 controller.

[4294673.245000] Vendor: ATA Model: WDC WD740GD-00FL Rev: 33.0
0000:00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA
Controller (rev 02)

2006-03-05 22:58:30

Subject: Re: LibPATA code issues / 2.6.15.4

Justin Piszcz wrote:
>
>> Using 2.6.16-rc5-git4 and removing a directory of around 5.0GB of
>> files while streaming a 1MB/s video stream on another (SATA disk), the
>> I/O seemed to freeze up for a moment and I got this error:
>>
>> [4342671.839000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x22
>>
>> Only 1 in dmesg, any idea what causes this error?
>
> The drive it occurred on was a 74GB raptor on an ICH5 controller.
>
> [4294673.245000] Vendor: ATA Model: WDC WD740GD-00FL Rev: 33.0
> 0000:00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA
> Controller (rev 02)

SCSI opcode 0x35 is SYNCHRONIZE_CACHE.

Pity we don't know exactly what that got translated to by libata.
It would have been either a FLUSH_CACHE of some kind,
or possibly(?) one of the _FUA_ commands.

Cheers

2006-03-05 23:00:41

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> Justin Piszcz wrote:
>>
>>> Using 2.6.16-rc5-git4 and removing a directory of around 5.0GB of
>>> files while streaming a 1MB/s video stream on another (SATA disk),
>>> the I/O seemed to freeze up for a moment and I got this error:
>>>
>>> [4342671.839000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x22
>>>
>>> Only 1 in dmesg, any idea what causes this error?
>>
>> The drive it occurred on was a 74GB raptor on an ICH5 controller.
>>
>> [4294673.245000] Vendor: ATA Model: WDC WD740GD-00FL Rev: 33.0
>> 0000:00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA
>> Controller (rev 02)
>
> SCSI opcode 0x35 is SYNCHRONIZE_CACHE.

Oh, wait a sec.. on that path, libata actually does show the ATA opcode,
which would have been WRITE_DMA_EXT. Not an FUA command.

Dunno what it's complaining about, though.
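
The confusion above comes from 0x35 meaning different things in the two command sets. A small lookup table makes the distinction explicit. This is only an illustrative sketch covering the opcodes relevant to this thread; the opcode values are from the SCSI and ATA command sets, and `describe` is a hypothetical helper.

```python
# SCSI opcodes relevant to this thread.
SCSI_OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x35: "SYNCHRONIZE CACHE(10)",
}

# ATA opcodes relevant to this thread.
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0x3D: "WRITE DMA FUA EXT",
    0xCE: "WRITE MULTIPLE FUA EXT",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
}


def describe(opcode, table):
    """Return a readable name for an opcode, or a hex placeholder if unknown."""
    return table.get(opcode, "unknown (0x%02x)" % opcode)


# The same byte, read as a SCSI opcode and as an ATA opcode.
print(describe(0x35, SCSI_OPCODES))
print(describe(0x35, ATA_OPCODES))
print(describe(0x25, ATA_OPCODES))
```

Since the "command 0x.. timeout" message reports the ATA opcode, the second table is the one to consult for those lines.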

2006-03-05 23:19:47

Subject: Re: LibPATA code issues / 2.6.15.4

On Sun, 5 Mar 2006, Mark Lord wrote:

> Mark Lord wrote:
>> Justin Piszcz wrote:
>>>
>>>> Using 2.6.16-rc5-git4 and removing a directory of around 5.0GB of files
>>>> while streaming a 1MB/s video stream on another (SATA disk), the I/O
>>>> seemed to freeze up for a moment and I got this error:
>>>>
>>>> [4342671.839000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x22
>>>>
>>>> Only 1 in dmesg, any idea what causes this error?
>>>
>>> The drive it occurred on was a 74GB raptor on an ICH5 controller.
>>>
>>> [4294673.245000] Vendor: ATA Model: WDC WD740GD-00FL Rev: 33.0
>>> 0000:00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA
>>> Controller (rev 02)
>>
>> SCSI opcode 0x35 is SYNCHRONIZE_CACHE.
>
> Oh, wait a sec.. on that path, libata actually does show the ATA opcode,
> which would have been WRITE_DMA_EXT. Not an FUA command.
>
> Dunno what it's complaining about, though.
>

Well I know what it was now...

The hard drive (Raptor/74GB) failed...

[4294685.928000] process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
[4342671.839000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x22
[4347012.243000] ata1: command 0x25 timeout, stat 0x50 host_stat 0x20
[4347157.486000] ata1: command 0x25 timeout, stat 0x80 host_stat 0x22
[4347157.486000] ata1: translated ATA stat/err 0x80/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[4347157.486000] ata1: status=0x80 { Busy }
[4347157.486000] sd 0:0:0:0: SCSI error: return code = 0x8000002
[4347157.486000] sda: Current: sense key=0xb
[4347157.486000] ASC=0x47 ASCQ=0x0
[4347157.486000] end_request: I/O error, dev sda, sector 27646928
[4347157.486000] Buffer I/O error on device sda, logical block 3455866
[4347157.486000] ATA: abnormal status 0x80 on port 0xC007
[4347157.486000] ATA: abnormal status 0x80 on port 0xC007
[4347157.486000] ATA: abnormal status 0x80 on port 0xC007
[4347187.486000] ata1: command 0x25 timeout, stat 0x50 host_stat 0x21
[4347407.657000] ATA: abnormal status 0x80 on port 0xC007
[4347407.657000] ATA: abnormal status 0x80 on port 0xC007
[4347407.657000] ATA: abnormal status 0x80 on port 0xC007
[4347437.656000] ata1: command 0x35 timeout, stat 0x80 host_stat 0x21
[4347437.656000] ata1: translated ATA stat/err 0x80/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[4347437.656000] ata1: status=0x80 { Busy }
[4347437.656000] sd 0:0:0:0: SCSI error: return code = 0x8000002
[4347437.656000] sda: Current: sense key=0xb
[4347437.656000] ASC=0x47 ASCQ=0x0
[4347437.656000] end_request: I/O error, dev sda, sector 76339746
[4347437.656000] ATA: abnormal status 0x80 on port 0xC007
[4347437.656000] ATA: abnormal status 0x80 on port 0xC007
[4347437.656000] ATA: abnormal status 0x80 on port 0xC007
[4347467.656000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x21
[4347467.656000] Device sda2 - XFS write error in file system meta-data block 0x449af90 in sda2
[4347467.656000] ata1: command 0x35 timeout, stat 0x50 host_stat 0x21
[4347467.656000] Device sda2 - XFS write error in file system meta-data block 0x449af90 in sda2
[4347497.656000] ata1: command 0x25 timeout, stat 0x50 host_stat 0x21
[4347527.663000] ata1: command 0x25 timeout, stat 0x50 host_stat 0x22
[4347527.663000] Unable to handle kernel paging request at virtual address 858f9a70
[4347527.663000] printing eip:
[4347527.663000] c021ff87
[4347527.663000] *pde = 00000000
[4347527.663000] Oops: 0000 [#1]
[4347527.663000] PREEMPT SMP
[4347527.663000] CPU: 0
[4347527.663000] EIP: 0060:[<c021ff87>] Not tainted VLI
[4347527.663000] EFLAGS: 00210282 (2.6.16-rc5-git4 #3)
[4347527.663000] EIP is at xfs_dir2_block_lookup_int+0xb0/0x1e9
[4347527.663000] eax: 9b86a560 ebx: 00000000 ecx: cdc352b0 edx: 00000000
[4347527.663000] esi: 177504f0 edi: 5e5cb7f4 ebp: 00000000 esp: f6c8bd18
[4347527.663000] ds: 007b es: 007b ss: 0068
[4347527.663000] Process nfsd (pid: 1359, threadinfo=f6c8a000 task=f7c14030)
[4347527.663000] Stack: <0>00000000 c91fa944 00000000 021a0480 00000000 f6c8bd64 00000000 f6c8bd84
[4347527.663000] f6c8bd88 f6c8bdac c73e7438 f6f916c0 00000004 f7dbc800 00000000 f3aa2000
[4347527.663000] 61a5869b c91fa9ac f7db9380 c73e7438 00000000 c91fa944 f6c8bdac 00000000
[4347527.663000] Call Trace:
[4347527.663000] [<c02200da>] xfs_dir2_block_lookup+0x1a/0xa1
[4347527.663000] [<c021f721>] xfs_dir2_lookup+0xd3/0x151
[4347527.663000] [<c035e9d3>] ip_output+0x171/0x2de
[4347527.663000] [<c035e1c9>] ip_finish_output+0x0/0x22d
[4347527.663000] [<c024e836>] xfs_dir_lookup_int+0x40/0x125
[4347527.663000] [<c0150b0d>] cache_alloc_refill+0xf1/0x50c
[4347527.663000] [<c0252b39>] xfs_lookup+0x5f/0x88
[4347527.663000] [<c02613cc>] linvfs_lookup+0x52/0x99
[4347527.663000] [<c0161563>] __lookup_hash+0xc4/0xf3
[4347527.663000] [<c016160f>] lookup_one_len+0x7d/0x84
[4347527.663000] [<c01ad6c7>] nfsd_lookup+0xc0/0x4b2
[4347527.663000] [<c01b4bcd>] nfsd3_proc_lookup+0xa5/0xf3
[4347527.663000] [<c01a9497>] nfsd_dispatch+0x9c/0x214
[4347527.663000] [<c039fb21>] svc_process+0x3bf/0x69e
[4347527.663000] [<c01a97bc>] nfsd+0x1ad/0x331
[4347527.663000] [<c01a960f>] nfsd+0x0/0x331
[4347527.663000] [<c0100e95>] kernel_thread_helper+0x5/0xb
[4347527.663000] Code: 89 44 24 40 89 c2 0f ca 8d 04 d5 00 00 00 00 29 c6 8d 42 ff 8b 4c 24 24 8b 79 14 31 d2 eb 07 8d 51 01 39 c2 7f 17 8d 0c 02 d1 f9 <8b> 1c ce 0f cb 39 df 74 2a 77 e9 8d 41 ff 39 c2 7e e9 8b 74 24
[4347527.663000]
[4347527.663000] <4>ATA: abnormal status 0x80 on port 0xC007
[4347567.674000] ATA: abnormal status 0x80 on port 0xC007
[4347567.674000] ATA: abnormal status 0x80 on port 0xC007
[4347597.674000] ata1: command 0x35 timeout, stat 0x80 host_stat 0x21
[4347597.674000] ata1: translated ATA stat/err 0x80/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[4347597.674000] ata1: status=0x80 { Busy }
[4347597.674000] sd 0:0:0:0: SCSI error: return code = 0x8000002
[4347597.674000] sda: Current: sense key=0xb
[4347597.674000] ASC=0x47 ASCQ=0x0
[4347597.674000] end_request: I/O error, dev sda, sector 4401810
[4347597.674000] ATA: abnormal status 0x80 on port 0xC007
[4347597.674000] ATA: abnormal status 0x80 on port 0xC007
[4347597.674000] ATA: abnormal status 0x80 on port 0xC007
[4347627.674000] ata1: command 0x35 timeout, stat 0x80 host_stat 0x21
[4347627.674000] ata1: translated ATA stat/err 0x80/00 to SCSI SK/ASC/ASCQ 0xb/47/00
[4347627.674000] ata1: status=0x80 { Busy }
[4347627.674000] sd 0:0:0:0: SCSI error: return code = 0x8000002
[4347627.674000] sda: Current: sense key=0xb
[4347627.674000] ASC=0x47 ASCQ=0x0
[4347627.674000] end_request: I/O error, dev sda, sector 110074018
[4347627.674000] ATA: abnormal status 0x80 on port 0xC007
[4347627.674000] ATA: abnormal status 0x80 on port 0xC007
[4347627.674000] ATA: abnormal status 0x80 on port 0xC007

..

ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66006018
Buffer I/O error on device sda2, logical block 61604208
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66006019
Buffer I/O error on device sda2, logical block 61604209
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66006020
Buffer I/O error on device sda2, logical block 61604210
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66006021
Buffer I/O error on device sda2, logical block 61604211
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66006018
Buffer I/O error on device sda2, logical block 61604208
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x40 { UncorrectableError }
SCSI error : <0 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66006019

..

I later ran mkfs.ext2 -c /dev/sda and it kept returning errors such as
these:

ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
ata3: status=0x51 { DriveReady SeekComplete Error }
ata3: error=0x40 { UncorrectableError }
SCSI error : <2 0 0 0> return code = 0x8000002
sda: Current: sense key=0x3
ASC=0x11 ASCQ=0x4
end_request: I/O error, dev sda, sector 66006016

I ran WD's tool on the drive, and it confirmed that the drive had problems.

Luckily I have a spare raptor and restored from backup and I am now back
up and running with no errors yet.

Justin.

2006-03-05 23:39:49

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> SCSI opcode 0x35 is SYNCHRONIZE_CACHE.
>
> Pity we don't know exactly what that got translated to by libata.

Gave up on reading code? If not, we know exactly what it was translated
into.

Jeff
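
For anyone who does not want to dig through libata-scsi.c: as far as the flush translation in kernels of this era goes, SYNCHRONIZE CACHE is turned into FLUSH CACHE EXT when the device uses 48-bit addressing and advertises FLUSH CACHE EXT, and into plain FLUSH CACHE otherwise. The sketch below only illustrates that decision under that assumption; `translate_sync_cache` and its two boolean parameters are hypothetical names, not the kernel's interface.

```python
ATA_CMD_FLUSH = 0xE7      # FLUSH CACHE
ATA_CMD_FLUSH_EXT = 0xEA  # FLUSH CACHE EXT


def translate_sync_cache(lba48_enabled, has_flush_ext):
    """Pick the ATA opcode for a SCSI SYNCHRONIZE CACHE request.

    Assumed rule: use the EXT variant only when the device is driven in
    48-bit mode and reports FLUSH CACHE EXT support in its IDENTIFY data.
    """
    if lba48_enabled and has_flush_ext:
        return ATA_CMD_FLUSH_EXT
    return ATA_CMD_FLUSH


# A 48-bit drive that advertises FLUSH_CACHE_EXT gets the EXT opcode.
print(hex(translate_sync_cache(True, True)))
# A drive without 48-bit addressing falls back to FLUSH CACHE.
print(hex(translate_sync_cache(False, False)))
```

Neither result is a _FUA_ opcode, so under this rule a SYNCHRONIZE CACHE would not be expected to turn into one of the FUA writes.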

2006-03-06 06:13:48

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> David Greaves wrote:
>> Just FYI - I'm away (in Canada) for 2 weeks so can't do any additional
>> testing until I return.
>
> Am I correct, in that your last test on rc5-git4 was a failure?
It was *much* better than rc4 but it did have an error.
I *think* the problem I'm seeing is likely to be similar to the one I
originally reported (on 2.6.15 IIRC)
Same sporadic warning/error which didn't usually trigger the
raid-boot-the-disk behaviour that the FUA code seemed to.
> But without the "opcode" display in the error messages,
> so we have no idea exactly what caused the errors (again!)?
Yes. I thought the/a opcode-verbose patch was in there but I guess not.
I don't have remote console access to the machine so wouldn't be able to
carry out reliable kernel tests - sorry.
Of course I'll do this as soon as I return.
>
> [Whatcha doin up here?]
[:) 2 weeks skiing in Whistler (this time - 10 days Canadian canoeing in
Algonquin last time!)
Canada's great !!]

David

2006-03-07 16:57:48

by Bill Davidsen

Subject: Re: LibPATA code issues / 2.6.15.4

Mark Lord wrote:
> David Greaves wrote:
>>
>> /dev/sda:
[...snip...]
> ..
> hdparm-6.4 says:

Is there a version of that which will build on x86? I grabbed the
version offered at freshmeat, but it won't compile on any x86 distro or
gcc version to which I have access. RH8, RH9, FC1, FC3, FC4, ubuntu...
with or without using the suggested alternate header.
>
> Model Number: Maxtor 6B200M0
> Serial Number: B4038RRH
> Firmware Revision: BANC1980
>
> Commands/features:
> Enabled Supported:
> * NOP cmd
> * READ BUFFER cmd
> * WRITE BUFFER cmd
> * Look-ahead
> * Write cache
> * Power Management feature set
> * SMART feature set
> * FLUSH_CACHE_EXT
> * Mandatory FLUSH_CACHE
> * Device Configuration Overlay feature set
> * 48-bit Address feature set
> SET_MAX security extension
> Advanced Power Management feature set
> * DOWNLOAD_MICROCODE
> * WRITE_{DMA|MULTIPLE}_FUA_EXT
> * SMART self-test
> * SMART error logging
>
> So, yes, the drive is either lying about "* WRITE_{DMA|MULTIPLE}_FUA_EXT",
> or it didn't like the parameters it was given, or the SATA/IDE controller
> chip didn't like the command.
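
When comparing several drives it can help to pull the starred (enabled) entries out of the `hdparm -I` feature list automatically. The following is a rough sketch that works on the text as quoted in this mail, including the `> ` quote prefixes; `enabled_features` is a hypothetical helper, and the sample is a shortened copy of the listing above.

```python
SAMPLE = """\
> * Write cache
> * FLUSH_CACHE_EXT
> * Mandatory FLUSH_CACHE
> * 48-bit Address feature set
> SET_MAX security extension
> Advanced Power Management feature set
> * WRITE_{DMA|MULTIPLE}_FUA_EXT
"""


def enabled_features(text):
    """Return the feature names that carry a leading '*' marker."""
    enabled = []
    for line in text.splitlines():
        # Drop mail quote markers and indentation before looking for '*'.
        stripped = line.lstrip("> \t")
        if stripped.startswith("*"):
            enabled.append(stripped[1:].strip())
    return enabled


features = enabled_features(SAMPLE)
print(features)
print("FUA advertised:", "WRITE_{DMA|MULTIPLE}_FUA_EXT" in features)
```

This only reports what the drive claims in its IDENTIFY data; as noted above, the claim may not match how the drive or the controller actually behaves.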

2006-03-08 02:57:14

Subject: Re: LibPATA code issues / 2.6.15.4

Bill Davidsen wrote:
>
> Is there a version of that which will build on x86? I grabbed the
> version offered at freshmeat, but it won't compile on any x86 distro or
> gcc version to which I have access. RH8, RH9, FC1, FC3, FC4, ubuntu...
> with or without using the suggested alternate header.

hdparm-6.5 is the current version now. Both it, and 6.4,
build/install/run cleanly on Ubuntu-5.10, Debian-Sarge,
and SLES9-SP3.

You seem to be having trouble on only Redhat distros..
I guess they've done something unfriendly again.

Care to be more specific about what Redhat is doing?

Cheers

2006-03-08 03:19:12

by Dave Jones

Subject: Re: LibPATA code issues / 2.6.15.4

On Tue, Mar 07, 2006 at 09:57:07PM -0500, Mark Lord wrote:
> Bill Davidsen wrote:
> >
> >Is there a version of that which will build on x86? I grabbed the
> >version offered at freshmeat, but it won't compile on any x86 distro or
> >gcc version to which I have access. RH8, RH9, FC1, FC3, FC4, ubuntu...
> >with or without using the suggested alternate header.
>
> hdparm-6.5 is the current version now. Both it, and 6.4,
> build/install/run cleanly on Ubuntu-5.10, Debian-Sarge,
> and SLES9-SP3.
>
> You seem to be having trouble on only Redhat distros..
> I guess they've done something unfriendly again.
>
> Care to be more specific about what Redhat is doing?

looks like our userspace includes aren't up to date with some of the kernel
changes, so currently they're lacking the ide_task_request_t and related
taskfile bits.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=184349

Dave

--
http://www.codemonkey.org.uk

2006-03-08 03:23:35

Subject: Re: LibPATA code issues / 2.6.15.4

Dave Jones wrote:
>
> looks like our userspace includes aren't up to date with some of the kernel
> changes, so currently they're lacking the ide_task_request_t and related
> taskfile bits.
>
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=184349

Ahh.. Thanks, Dave.

hdparm-6.6 being released *now*, with that stuff #ifdef'd out when
the necessary header structs are missing.

It builds/runs for me, on RHEL4 at least.

Cheers

2006-03-08 15:37:49

by Bill Davidsen

Subject: Re: LibPATA code issues / 2.6.15.4

On Tue, 7 Mar 2006, Mark Lord wrote:

> Bill Davidsen wrote:
> >
> > Is there a version of that which will build on x86? I grabbed the
> > version offered at freshmeat, but it won't compile on any x86 distro or
> > gcc version to which I have access. RH8, RH9, FC1, FC3, FC4, ubuntu...
> > with or without using the suggested alternate header.
>
> hdparm-6.5 is the current version now. Both it, and 6.4,
> build/install/run cleanly on Ubuntu-5.10, Debian-Sarge,
> and SLES9-SP3.
>
> You seem to be having trouble on only Redhat distros..
> I guess they've done something unfriendly again.
>
> Care to be more specific about what Redhat is doing?

I'll mail you the first few hundred errors from the compiler after I go
find 6.5 and try that. My ubuntu tester reported similar results, so I'm
not sure what we are doing.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with little computers since 1979

2006-03-08 16:43:10

by Alan

Subject: Re: LibPATA code issues / 2.6.15.4

On Mer, 2006-03-01 at 15:12 -0500, Phillip Susi wrote:
> >> It Broke And I Dont Know Why
> >> to
> >> Aborted Command
> >
> > So whats the SCSI sense encoding for that ?
> >
>
> Wouldn't that just be 0/0/0? IIRC the standard defines that as "NO
> ADDITIONAL SENSE DATA" which sounds to me like another way of saying "I
> don't know what went wrong, but that didn't work".

The 0/0/0 sense is already used. The question is what error do you use
with that sense. At the moment I'm using aborted command.
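
To make the SK/ASC/ASCQ triples in the logs easier to read, here is a small lookup sketch. It covers only the sense values that appear in this thread, with names as given in the SCSI standards; `describe_sense` is a hypothetical helper and the tables are deliberately incomplete.

```python
# Sense keys seen in this thread.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x3: "MEDIUM ERROR",
    0xB: "ABORTED COMMAND",
}

# Additional sense code / qualifier pairs seen in this thread.
ASC_ASCQ = {
    (0x00, 0x00): "NO ADDITIONAL SENSE INFORMATION",
    (0x11, 0x04): "UNRECOVERED READ ERROR - AUTO REALLOCATE FAILED",
    (0x47, 0x00): "SCSI PARITY ERROR",
}


def describe_sense(sk, asc, ascq):
    """Return 'sense key / additional sense' text for one triple."""
    key = SENSE_KEYS.get(sk, "unknown sense key 0x%x" % sk)
    extra = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ 0x%02x/0x%02x" % (asc, ascq))
    return key + " / " + extra


print(describe_sense(0xB, 0x00, 0x00))  # the 0xb/00/00 case discussed here
print(describe_sense(0x3, 0x11, 0x04))
print(describe_sense(0xB, 0x47, 0x00))
```

The first line is the combination under discussion: an aborted command with no additional sense information.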

2006-03-21 18:11:37

Subject: Re: LibPATA code issues / 2.6.15.4

David Greaves wrote:

> Mark Lord wrote:
>
>> David Greaves wrote:
>>
>>> Just FYI - I'm away (in Canada) for 2 weeks so can't do any additional
>>> testing until I return.
>>
>>
>> Am I correct, in that your last test on rc5-git4 was a failure?
>
> It was *much* better than rc4 but it did have an error.
> I *think* the problem I'm seeing is likely to be similar to the one I
> originally reported (on 2.6.15 IIRC)
> Same sporadic warning/error which didn't usually trigger the
> raid-boot-the-disk behaviour that the FUA code seemed to.
>
>> But without the "opcode" display in the error messages,
>> so we have no idea exactly what caused the errors (again!)?
>
> Yes. I thought the/a opcode-verbose patch was in there but I guess not.
> I don't have remote console access to the machine so wouldn't be able
> to carry out reliable kernel tests - sorry.
> Of course I'll do this as soon as I return.

Hi

Back now :)

I've upgraded to 2.6.16 and applied your verbosity patches.

I've persuaded my array to re-assemble and during the resync I got these
messages:

dmesg:
ata1: translated op=0x28 cmd=0x25 ATA stat/err 0x51/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x51 { DriveReady SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
...(18mins later)
ata1: no sense translation for op=0x28 cmd=0x25 status: 0x51
ata1: translated op=0x28 cmd=0x25 ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata1: status=0x51 { DriveReady SeekComplete Error }
smartd is not running
This did not cause the raid subsystem to boot the disk (thank goodness!)

David
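
With the verbosity patch applied, each "translated" line carries both the SCSI opcode (`op=`) and the ATA opcode (`cmd=`), so the lines can be picked apart mechanically when collecting reports. The following is a minimal sketch based only on the message format shown above; `parse_translated` and the field names are hypothetical, and the pattern allows arbitrary whitespace between fields so that lines wrapped by a mail client still match.

```python
import re

# Matches the verbose "translated" message from the dmesg output above.
PATTERN = re.compile(
    r"ata(?P<port>\d+):\s+translated\s+op=0x(?P<op>[0-9a-fA-F]+)\s+"
    r"cmd=0x(?P<cmd>[0-9a-fA-F]+)\s+ATA\s+stat/err\s+"
    r"0x(?P<stat>[0-9a-fA-F]+)/(?P<err>[0-9a-fA-F]+)\s+to\s+SCSI\s+"
    r"SK/ASC/ASCQ\s+0x(?P<sk>[0-9a-fA-F]+)/(?P<asc>[0-9a-fA-F]+)/"
    r"(?P<ascq>[0-9a-fA-F]+)"
)


def parse_translated(text):
    """Return the fields of one 'translated' message as integers, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = {}
    for name, value in match.groupdict().items():
        # The port number is decimal; every other field is hexadecimal.
        fields[name] = int(value) if name == "port" else int(value, 16)
    return fields


line = ("ata1: translated op=0x28 cmd=0x25 ATA stat/err 0x51/04 to SCSI\n"
        "SK/ASC/ASCQ 0xb/00/00")
print(parse_translated(line))
```

For the first message above this gives SCSI opcode 0x28 (READ(10)) and ATA opcode 0x25 (READ DMA EXT), so the failing command was a plain read rather than a flush or FUA write.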

2006-03-22 15:23:37